Highly Adaptive Linear Actor-Critic for Lightweight Energy-Harvesting IoT Applications

: Reinforcement learning (RL) has received much attention in recent years due to its adaptability to unpredictable events such as harvested energy and workload, especially in the context of edge computing for Internet-of-Things (IoT) nodes. Due to limited resources in IoT nodes, it is difﬁcult to achieve self-adaptability. This paper studies online reactivity issues of ﬁxed learning rate in the linear actor-critic (LAC) algorithm for transmission duty-cycle control. We propose the LAC-AB algorithm that introduces into the LAC algorithm an adaptive learning rate called Adam for actor update to achieve better adaptability. We introduce a deﬁnition of “convergence” when quantitative analysis of convergence is performed. Simulation results using real-life one-year solar irradiance data indicate that, unlike the conventional setups of two decay rate β 1 , β 2 of Adam, smaller β 1 such as 0.2–0.4 are suitable for power-failure-sensitive applications and 0.5–0.7 for latency-sensitive applications with β 2 ∈ [ 0.1,0.3 ] . LAC-AB improves the time of reactivity by 68.5–88.1% in our application; it also ﬁne-tunes the initial learning rate for the initial state and improves the time of ﬁne-tuning by 78.2–84.3%, compared to the LAC. Besides, the number of power failures is drastically reduced to zero or a few occurrences over 300 simulations.


Introduction
Energy-harvesting Internet-of-Things (EH-IoT) in wireless communications is a hot topic today, as it potentially enables perpetual operations of nodes in wireless sensor networks.Nonetheless, myriad uncertainties exist not only in the nature but also in the applications.Under such uncertainties, the power-performance trade-offs, i.e., striking the balance between energy saving and quality of service (QoS) should be addressed in an automatic way.
One popular solution is to use reinforcement learning (RL) that adapts itself to the environment at run-time with no a priori information about their changes (i.e., state transition probabilities) to produce the optimal actions.Nonetheless, the hyperparameter setting is sensitive to the environment, and, therefore, the system often fails self-adaptations.Many researchers resort to neural network-based RLs to ensure more scalability in such a situation.However, extra training data may be required while they are more compute-and memory-intensive compared to linear approximation-based RLs.Ideally, we would rather find a low-cost solution, i.e., µW-range, RL method for the upcoming µW-range IoT end devices [1].The current trend is moving more processing from the cloud to nodes, leading to edge computing.Thus, the power manager of a node itself should be lightweight to be consistent with the low power requirement induced by the low energy budget of the node.
Taking into account the potential continuous action and state spaces (e.g., energy storage device), linear function approximations can be exploited [2,3].The use of continuous spaces that can scale to discrete problems is also an advantage [4].In a previous work, we also investigated the linear actor-critic (LAC) method and employed a data buffer and an energy buffer to represent the system state.However, with such a state representation that induces power-performance trade-offs, the fixed learning rate hinders the adaptation to new situations, e.g., workload changes, since they directly impact the amplitude of the oscillation of the rewards or the variance of the gradients of the learned parameters.As such, to gain robustness to drastic changes in the gradients, we incorporate into the LAC an adaptive learning rate such as the Adam optimizer [5] (hereafter, Adam).Adam is originally designed to tackle the sparse gradient issue in training neural networks with large decay rates (e.g., β 1 = 0.9 and β 2 = 0.999).In this paper, we show that it can provide fast reactivity to the environment with smaller decay rates.We define reactivity as the adaptation from a previous near-optimality to a new one induced especially by non-trivial environmental changes.Further, we attempt to evaluate convergence speed in a quantitative manner, which is application-dependent and hardly conducted previously.Our contributions are summarized as follows: 1.
To provide better adaptability, we combined and evaluated the LAC algorithm with Adam (LAC-A) using smaller decay factors for transmission (TX) duty-cycle optimization in an application of sensor data TX in a point-to-point network.

2.
With the use of smaller decay rates in Adam, we can exclude the initialization bias correction terms to reduce the calculations, and we call the algorithm LAC-AB (LAC with Adam Biased).

3.
We defined the time of convergence quantitatively based on the mean and variance information for evaluating the speed of convergence of different approaches.4.
Simulation results show that, for our application use case, smaller decay rates β 1 such as 0.2-0.4 are better for power-failure-sensitive applications and 0.5-0.7 for latency-sensitive applications with β 2 ∈ [0.1, 0.3] in the LAC-AB algorithm.5.
We show that LAC-AB with such decay rates helps achieve more reactivity and stability to drastic gradient changes, such as doubled workload, and enables finetuning the actions to the initial state more quickly.
The rest of the paper is organized as follows.We present the state-of-the-art and clarify the differences from our approach in Section 2. Section 3 summarizes the system model used in our study.In Section 4, the detailed settings for the simulations of TX duty-cycle modulation are listed.Section 5 concisely explains the basics of RL algorithm.In Section 6, we introduce the use of Adam to deal with the reactivity issue along with the table of the whole LAC-AB algorithm.We then give the definition of time of convergence in Section 7.With respect to simulation studies, we firstly show the reactivity problem of the LAC algorithm using fixed learning rate in Section 8.1.Next, Section 8.2 demonstrates how effective the LAC-AB algorithm is compared to LAC and how we can evaluate its convergence adequately, as presented in the following subsection.Further simulation results are shown in Section 8.3 to study the combination of the two decay rates in terms of convergence speed, power failure and latency.Finally, Section 9 concludes this paper with some future work directions.

Related Work
Many research works in the literature of energy management in both lower-power IoT and EH-IoT exist.Roose et al. [6] proposed a hardware solution for a sense and compress application.The method proposed loads tuning parameters obtained by offline evolutionary algorithm onto statistics (sample variance)-based runtime control.It reduces the online resources, but the runtime adaptability might not be enough in the case of unexpected events that were not in the training dataset.A stepwise online adjustment is one solution presented in [7] that works with a certain discrete action space, which can be a limit for applications that require continuous spaces.The use of prediction of harvested energy is also a downside, as it is likely to produce errors that can make decision-making worse.
Energy management can be addressed through the solution of an optimization problem.For instance, the authors of [8,9] formulated the optimization problems with some constraints and solve it using Lyapunov optimization method.However, a non-convex problem is difficult and costly to solve, with possibly multiple optima.Bhat et al. [10,11] introduced hardware solutions for their optimization problems with the help of relaxed Karush-Kuhn-Tucker (KKT) conditions or of simplex algorithm for solving their linear programming.They needed to prepare design points [11] or energy harvesting model for predictions offline [10] in their methods.Note that our approach is independent of such offline analysis.Vigorito et al. [12] formulated a linear-quadratic problem by defining the ENO-Max condition in the TX duty-cycle modulation.This method is model-free and agnostic of prior knowledge on energy harvesting.Another model-free algorithm, tabular Q-learning [13], is designed as a dedicated hardware to address adaptive power management.However, it cannot scale to large state-action spaces and continuous states and actions.Note that all the aforementioned methods do not take data queues into account.
Masadeh et al. [14] presented an actor-critic RL method for TX output power control in energy-harvesting communications system.Their actor learns the parameters for the mean and standard deviation of a normal distribution, while the critic is constructed by a two-layer neural network, which can be costly for resource-constrained devices.Likewise, in [15][16][17], three layer neural networks are employed, which is again computation-and memory-intensive.For the purpose of low-cost implementation, linear function approximation is adopted in [2,18].However, these papers do not discuss the reactivity of their algorithm to environmental changes, or covariate shift, i.e., re-optimization to a new state.All of the above works use fixed learning rate, which cannot guarantee enough reactivity.By contrast, we aim at devising a lightweight reactive control algorithm.
The authors of [19][20][21] adopted deep learning-based RL (DRL) solutions with Adam.The effect of this adaptive algorithm, however, is to overcome the sparse gradient issue that frequents neural networks.To this end, larger decay rates in Adam are employed, which attains stable and fast convergence in training neural networks.Nonetheless, such setups of decay rates essentially lead to slow adaptation to new situations since they put less weight on recent changes and retain longer past gradient information.Besides, initialization bias correction is required.
Differently from those existing methods, we propose applying Adam to a low-cost LAC algorithm for fast adaptability.That is, thanks to the use of linear approximations, we can perform Adam with smaller decay rates such as β 1 , β 2 ∈ [0.1, 0.7], which can make our RL more reactive.Such small decay rates are also suitable, in that LAC is fully dependent on the latest action to learn, i.e., it is an on-policy RL.These choices of values are further elucidated by the use of exponentially weighted moving average (EWMA) in the literature of online workload change detection [22][23][24].Comparisons among the above state-of-the-art solutions and our approach are summarized in Table 1.

System Model
Adaptive control in EH-IoT has already been addressed in previous works (see, e.g., [18]).In the current paper, we study the exact same system.However, the system is now described to make the paper self-content.

Energy Harvesting and State-of-Charge Model
Several types of energy harvesting sources exist, e.g., solar, wind, vibration, thermal and piezoelectric.Here, we focus on solar energy-harvesting.The scavenged power P harv t is formally calculated by: with the solar irradiance I t [W/m 2 ], the size of the photovoltaic (PV) cell A[m 2 ], the conversion efficiency (rate) η, and the tracking factor (TF) of maximum power point tracking.
In the present work, we employ the harvest-use-store scheme [25,26].
A supercapacitor is considered an optimal solution as an energy storage in energyharvesting applications because of the well-balanced trade-offs among physical lifetime, energy density, maximum charge cycles, and self-discharging rate [27].Let E max and E f ail denote the maximum and minimum (i.e., failing-threshold) energy levels.The state-ofcharge (SoC) of the residual energy E t is represented as: Note that a severe self-discharging is a known issue in supercapacitors.Indeed, the self-discharge rate τ during time ∆t can be up to 20% per day [27].With its capacity C and voltage level V sc t , the leak power P leak due to self-discharging is given by: In our simulations, ∆t is equal to 1 min, and the discharging rate is simplified as , which stands for 20% per day.

Application Data and State-of-Buffer model
In this paper, an embedded sensor is assumed to generate data that are stored into the TX buffer.The sensor type is not considered here, even if the sensor technology and type of information sensed have a strong influence on the power consumption, and thus on the required harvested energy.Data acquisition can be periodic, aperiodic, or completely random.We assume here that the arrival of data to the TX buffer is generated based on the Poisson distribution with the average of λ [pkts/min] [28].
We consider a TX buffer with maximum capacity B max .The buffer stores data that are newly generated or that require retransmission because of sending issue.The State-of-Buffer (SoB) φ SoB t of the current buffer level B t is defined as:

Power Consumption and Transmission Model
Depending on the wireless protocol, different controllable variables exist such as the duty-cycle, the output power, the modulation, the spreading factor, the bandwidth, and the cyclic redundancy check.In this paper, we assume a continuous duty-cycle control of a CC2500 transceiver module whose output power is fixed to +1 dBm that consumes 21.5 mA according to the data sheet [29].Considering the nA-and mA-order of current in deep-sleep mode and in active mode, the impact of TX duty-cycle control is much larger than that of TX output power control.
We assume the cycle period T cycle = 1 min that consists of the active time T active and the sleep time T sleep .The TX duty-cycle D t is defined as the ratio of T active to T cycle .Therefore, we have T active = D t • T cycle .Taking into account η act and η sl p that are the efficiency of the DC-DC regulator in active and sleep mode [7], respectively, the TX power consumption during a cycle P cycle is expressed as: where P active is the power consumption in the active mode.T active is comprised of the sum of time-on-air of frame TX and acknowledgements (RX) [30].Note that the transceiver wake-up time overhead is therefore ignored in this paper.T ack denotes the time required to receive an acknowledgement packet: it is assumed to be equal to the acknowledgement frame's time-on-air.Under these assumptions, P active is obtained by: where P tx and P rx are the power consumption during packet transmission and acknowledgement reception, respectively.While the overall power consumption of an IoT node typically comprises sensing, processing and communication power, only the last one was considered in our simulations.In addition, this work neglects the power consumption overhead of the proposed actor-critic-based controller.Note that a combined path-loss and shadowing model [31] in outdoor environment is assumed, and the packet error ratio was constantly zero throughout the whole simulations.

Application Scenario
We consider the case of TX duty-cycle optimization in an energy-harvesting IoT sensor end-node communicating with a sink node.In such a situation, the space of application scenarios is quite large.The common features of the application scenario assumed in our study are described as follows: - The control update interval (CUI) is T cui = 30 min.-For the PV cell model, we set the cell area A, conversion efficiency η, and tracking factor TF as 2.5 cm 2 , 10%, and 96.3%, respectively [7].This choice is consistent with an off-the-shelf solar cell that can harvest power from µW to mW per cm 2 , depending on the lighting condition [32].Note that we use the real-life solar irradiance data provided by Oak Ridge National Laboratory [33].- The self-discharge of a supercapacitor whose capacitor size is 1F is considered 20% per day (detailed in Section 3.1).The harvest-use-store scheme is adopted [25,26] to provide high energy efficiency.- The wireless link quality is under the influence of path-loss and shadowing.- The workload follows the Poisson distribution.The average rate doubles after the first six months (where the algorithm is put through the test of fast adaptability/reactivity).More precisely, the system receives the average of 1.0 pkt/min for the first six months, and it impulsively becomes twice (2.0 pkt/min) afterwards.
Figure 1 illustrates this application scenario.For the solar irradiance data, we use three kinds of datasets:

Reinforcement Learning
In reinforcement learning (RL) [34], a decision maker and its surroundings exist.The former is called the agent, while the latter the environment.The agent learns to optimize its actions by interacting with the environment.As opposed to dynamic programming where the agent perfectly knows the environment's behavior as a model, no a priori information of the environment is given in RL, which makes it practically useful in real-life, as the environment is often too complex and transient to accurately build its model.
We denote the agent's action and the environment's state by a t and s t , respectively.With the premise of Markov decision process, the agent decides, based on its policy π(a t |s t ), the action a t that influences the environment and its state changes to s t+1 .As a result, the agent receives from it a reward r t+1 = f (s t , a t , s t+1 ) that represents how good the action a t was at the state s t .The goal of RL is therefore defined as maximizing the expected total rewards from the present to the future.The future discount factor γ is introduced in general to put more weight in the near future.The expected total future-discounted reward, more frequently called the value function V π at state s t , is then defined as: Nonetheless, this value is practically impossible to calculate, since the future is unforeseeable.Hence, we need to accurately estimate this value.
At this point, the RL's goal can be re-defined and is comprised of two purposes: 1.
Determine the estimate of the value function v π under a certain policy; Note that the value function v * under the optimal policy π * is defined as ∀s ∈ S, v * = max π v π (s) where S is the state space.We must seek these two under the following two conditions: (1) The reward for each state-action pair is unknown.(2) The state transition probability is unknown.Getting samples from the interactions with the environment helps the agent capture information about the reward and state transition probability to learn the policy and the value function.In actor-critic RL, the actor is in charge of the first purpose and the critic the second one.Each goal can be achieved by policy gradient theorem and TD(λ) algorithm (refer to [2]), which we opt for in this paper, or by the use of neural networks.

Algorithm: LAC-AB
The LAC-AB algorithm is based on the LAC that was originally devised by Ait Aoudia et al. [2].One difference is that we apply Adam to the LAC to address the reactivity problem caused by fixed learning rate, which we discuss in Section 8.1.Thus far, Adam is typically used for sparse gradient issue in use of neural networks.The decay factors β 1 and β 2 are therefore set as 0.9 and 0.999 (which we call PA (prior art) setting below), respectively.The use of linear functions breaks free from such an issue, enabling us to set the decay factors smaller, such as 0.1-0.7 as generally adopted in the literature of workload change detection [22][23][24], so as to achieve faster adaptation.The reason is two-fold: -Smaller values lead to more weight on recent changes, i.e., faster online adaptation.- The gradient variance can become too large and yet carries an important information for parameter updates that can be lost with larger values of β 1 and β 2 .
Further, with such smaller rates, we can reduce some computations by excluding the bias correction terms (refer to [5]) that are accounted in Adam.Hence, the LAC-AB algorithm is a LAC algorithm using Adam with no initialization Bias correction terms.
Figure 2 illustrates the structure of the LAC-AB and Algorithm 1 shows its whole algorithm concerning our application example.However, it is trivial to transfer this algorithm to other application use cases.The state is composed of the State-of-Buffer (SoB) and State-of-Charge (SoC), which take into account the "quantity" of the in and out of data and energy.The value function is assumed to be linearly proportional to the multiplication of 1 − φ SoB and φ SoC (Line 6), which indicates that the value of the state is higher when less SoB and more SoC are confirmed.Adam is employed with no initialization bias correction terms in actor update (Lines 11-13) with adaptation-aware setups.The linear relationship is also assumed between (mean) action value and the multiplication of φ SoB and φ SoC (Line 14), which means that smaller action values (i.e., less performance) is enough when the SoB level is less, and higher values can be provided when the SoC level is higher.The final action is generated based on the Gaussian distribution (Line 15) to guarantee explorations and to find an optimal action.Note that the algorithm mostly contains multiply and add operations; only two divisions, one squared root operations and a Gaussian random number generator are used.For minimizing SoB and maximizing SoC 4: Less SoB and more SoC are better states (better values) Calculate the eligibility trace z t+1 8: Update the critic parameter / * Actor: Policy gradient theorem using Adam with no initialization bias corrections * / 9: Estimate the first-order moment 11: Estimate the second-order moment 12: Update the actor parameter / * Next TX duty-cycle selection * / 13: Less SoB, smaller action values; More SoC, higher action values Return a t+1 18: end for each

Definition of Convergence
The convergences of TD(λ) algorithm in linear function approximations and policy gradient theorem are proven, but it is tricky and often application-specific to determine when they have converged.For instance, in [35], the authors defined the time of convergence in an episodic task as when the returns of the first 10 consecutive episodes are all within 5% of the average of the final 150 episodes.Meanwhile, other work have compared several methods and/or setups and analyzed the convergence only by the visual qualitative measurements [17,20].
In this study, we analyzed two kinds of convergence: the convergence for the initial state (i.e., for the first six months) and the convergence for a new state (i.e., for the last six months).We call the time of those convergence the time of fine-tuning (ToF) and the time of reactivity (ToR), respectively.Since the optimization process may greatly differ for each simulation due to the Gaussian policy as well as other stochastic factors, such as workload, scavenged energy, and wireless conditions, and the variance of the trace also appears to converge, as shown in Figure 3, the convergence analysis was conducted for the average trace of a concerned variable.Figure 4 shows the idea of how to analyze the convergence using two time-windows.We augment the approach used in [35] and define them as follows: 1.
All the mean values (e.g., actor parameter values ψ t ) taken over all the simulations at the same time points in a x-day sweeping window are all within 5% error band of the average of all the mean values in the last x-day window under almost the same state (e.g., under the same workload scenario in our test study here).

2.
The variances of the mean values taken over all the simulations at the same time points are confirmed to be not different, i.e., the homogeneity of variance is tested and confirmed by means of Levene's test [36], more precisely Brown-Forsythe [37] test, with the confidence interval of y% (note that we cannot say "the same" mathematically).Note that we run the sweeping window from the first day towards the end with step size of a time point, which is equal to T cui = 30 min, until the convergence is admitted.The reason for evaluating the homogeneity of variance is that the sequences of the mean values may be different in terms of variance between the two windows, which can defy the convergence.The use of Brown-Forsythe test is because we cannot expect the mean values to follow normal or symmetric distribution, and it is not as sensitive to violations of the normality assumption as alternative tests such as Hartley's F max test [38].Further, we observe the daily (e.g., weather) and seasonal effects that also heavily impact the optimization (see Section 8.2); therefore, we crafted an artificial one-year trace data by stacking one-day trace data upon one another 365 times.The use of such data trace is considered acceptable because the RL algorithm adopted is on-policy and is incapable of learning features or correlations between the current day and any past days.In other words, the algorithm is agnostic of time-independent information.
We set x = 5 throughout this paper, which corresponds to, for example, 240 data samples in a window in the case of T cui = 30 min, to ensure that the convergence is not merely temporal, and y = 5.Note that this quantitative way of evaluating the convergence is application-specific and applicable to our case.Although this is an attempt to provide the convergence time and its comparison, the performance itself may be sufficient at an earlier point for the practical use.

Simulation Results
The models shown in Section 3 and adaptive decision-making algorithms proposed in Section 6 were all coded in C++ and different simulation studies were conducted.Note that we used the real-life solar irradiance datasets provided by Oak Ridge National Laboratory [33].We now present the results of these simulations in the following sub-sections.

Divergence and Reactivity Problem
Our application scenario involves the TX duty-cycle optimization, which is in essence the same as the one addressed by the LAC algorithm in [2].Their reward function represents the multiplication of SoC and packet rate that is equivalent to the duty-cycle as a performance factor.The evolutions of ψ and TD-errors over a year obtained by simply applying their approach are illustrated in Figure 5.Such a representation of the reward function may entice the agent to greedily increase the duty-cycle for more rewards, which ends up causing power failures.More precisely, a sufficient energy reserve during the daytime helps such an increase, while, at some point, the blown-up duty-cycle cannot quickly be reduced anymore by the time the whole energy runs out due to a constant exploration range.In this case, the upper bound of the TX duty-cycle is way too large for the system as well as the application; in other words, no proper upper bound for braking duty-cycle explosion exists in the existing method.Note that, with the use of either Adam or a much smaller learning rate for the case of five-year dataset, the same divergence was observed.To resolve this issue, we suggest the use of SoB both as a performance and an upper bound index.
RL algorithms are supposedly self-adaptive to environmental changes.Such changes can be characterized by their periodicity and speed.To counteract them, the system can modulate its action(s) and/or the CUI.We fix the CUI as 30 min and consider an impulsive workload change (from 1.0 pkt/min to 2.0 after the first six months) to shed light on the reactivity issue of fixed learning rate.The set of hyperparameters for each algorithm is listed in Table 2.Note that these values are used for all simulation study throughout the paper.The traces of ψ t values and TD-errors averaged over 90 successful runs are illustrated in Figure 6.The power failures (in total 10 out of 10 unsuccessful cases, since one power failure was observed in each of those cases) are highlighted as red crosses at each time point, just showing when they occurred.As the CUI is fixed, the gradient (or its variance) becomes larger after 1 December, i.e., the workload doubles up as the SoB term is included in the reward, which leads to larger actor updates for ψ due to fixed learning rate.TD-errors also fluctuate in a certain range without divergence.This is because the SoB index provides the upper bound for how much performance is necessary.Note that we observed ψ values since the convergence speed is slower than TD(λ) algorithm and the TD-error follows the same trend as ψ values.Hence, in the case of the fixed CUI, the learning rate needs to be adapted at run-time.In the next subsection, we introduce a widely-used adaptive learning method, i.e., Adam, to the LAC algorithm by arranging the decay rates and show its effectiveness for reactivity alongside with fine-tuning adaptation to the initial state.

Effectiveness and Convergence of LAC-AB
As observed in the previous section, the existing LAC method is limited in adaptability due to fixed learning rate.Thus, we introduce Adam to LAC to deal with this problem and call it LAC-A algorithm.In addition, as explained in Section 6, it would be possible to exclude the use of initialization bias correction terms in Adam, which leads to the LAC-AB algorithm.In this section, we use 300 simulation results and compare these approaches with different decay rate setups in terms of convergence, latency, and power failures.Note that the latency in this work is defined as the time from when a packet arrives in the data buffer until when it is successfully transmitted to the sink node.
First, we use the EHD1 dataset and make a comparison among LAC-A with PA setting, LAC-A, and LAC-AB with different decay rates, through which we show more suitable setups for LAC-A/LAC-AB and the improvement in terms of latency, reactivity, and the number of power failures.The hyperparameters for this analysis are listed in Table 2.The transitions of ψ t for the three modes (LAC with PA setting, LAC-A, and LAC-AB) with β 1 = β 2 = 0.4 are depicted in Figure 7.The use of Adam rescales the gradient between before and after the workload change; however, this does not necessarily help avoid the power failures.While the number of failures increased to 28 times in PA setting compared to 10 in the case of LAC (fixed learning rate) (Figure 6), the system yielded no failures with 0.4 for both decay factors.This suggests that too large decay rates may cancel the gradient direction that is calculated as a result of Gaussian policy.Moreover, we observe that smaller decay rates in Adam help achieve fine-tuning of the learning rate, faster reactivity and even less gradient variation, since they allow for faster online tracking of changes in gradients, the rewards, or even the SoB and the SoC.The decay rate 0.4, for example, is relatively small and may permit no initialization bias corrections.Table 3 summarizes the latency and the number of power failures for each case above.The latency is evaluated before and after the workload change.As expected, the latency values of both LAC-A and LAC-AB with β 1 = β 2 = 0.4 are the same in the order of 10 −2 (e.g., 3.52 min), except for the standard deviation of the latency in the first six months, whose error is only 0.03 min (0.3%).As such, for the rest of the paper, we leverage and focus on the LAC-AB algorithm for achieving faster adaptations while reducing the computation and memory footprint.Nonetheless, in the case of the latter two modes, the ψ traces of the last half a year constantly decrease, which makes it harder to judge the convergence, whereas that of the first half remains almost constant.We use the EHD2 and EHD3 datasets where the seasonal and weather changes are removed and the randomness of real-life solar irradiance is still retained.The analysis with the use of those datasets can be supported by the fact that the LAC algorithm is an on-policy RL and no other algorithm for capturing correlations of any two different days is used.We obtained the traces of ψ for EHD2 and EHD3 depicted in Figure 3 as well as that for EHD1.We can claim qualitatively that the ψ of LAC-A using β 1 = β 2 = 0.4 for EHD1 dataset converges after the workload change despite its constant decrease, because the value converges to 2.25 × 10 −2 for EHD2 (corresponding to December) and to 2.0 × 10 −2 for EHD3 (corresponding to June) and the trace in-between can be explained by interpolation.This observation explains that the difference in the ratio of variations of the SoB and the SoC at each control interval gives rise to different optimization process.It can also be seen that the seasonal and weather-induced fluctuations in energy-harvesting may blur the optimization process.Hence, the analysis of ToF and ToR was conducted using EHD2 and EHD3.This way, we can also infer the range of convergence time for whenever the state changes.

Decay Rates Study for LAC-AB
For β 1 and β 2 , various combinations can be made, and, therefore, we need to grasp the tendency of which ones work better.To this end, we used EHD1 to evaluate the number of power failures and EHD3 to assess the convergence speed.We conducted 300 simulations for each set based on β 1 , β 2 ∈ [0.1, 0.7] with a 0.1 step, because no use of bias correction terms implies the use of lower values.With respect to ToF/ToR analysis, we used the average values over 300 simulations, since the randomness in each uncertainty such as Gaussian policy, workload, harvested energy, and wireless link quality still makes the optimization process unclear and yet can be mitigated by taking the average.Afterwards, we applied the ToF and ToR definition in Section 7 to these results.Note that the ToF and ToR can be measured also with EHD2 dataset, but the outcome is quite similar to EHD3's, and, therefore, removed for brevity.The results are depicted in Figure 8.The number of power failures (Figure 8  The analysis on the latency in the two different workload periods was also conducted, as illustrated in Figures 9 and 10.As can be seen, the results show the opposite trends.For the first six months, the latency has a tendency to decline as β 1 increases, ranging from 3.4 to 3.6 min and from 10 to 12 min for the mean and standard deviation, respectively, except for β 1 ∈ [0.5, 0.7] with β 2 ∈ [0.1, 0.2].By contrast, the rising trends are observed during the period of doubled workload, where the mean and standard deviation of latency fall between 6.0 and 6.2 min and between 10.65 and 11 min, respectively.Again, the exceptions occur with β 1 ∈ [0.5, 0.7] with β 2 ∈ [0.1, 0.2].These phenomena are explained by the effectiveness of high adaptability to each situation; higher adaptability by smaller decay rates is more suitable when the state such as SoB varies greatly (e.g., during the last six month), and less when the state is relatively constant (e.g., during the first six month).We ran 300 simulations and obtained ToF and ToR for both EHD2 and EHD3 dataset, since we can expect from the results in Figure 7 that the convergence speed may differ depending on the season.Figure 11 shows the ToF and ToR for the chosen sets of decay rates with the baseline values of those of the LAC algorithm using fixed learning rate.Obviously, the convergence speed improved compared to the case of LAC.With β 1 /β 2 = 0.3/0.1, for instance, the ToF and ToR reduced to 7.15 and 0.02 days for EHD2 (or 1 December dataset) and 7.48 and 7.67 for EHD3 (or 1 June dataset), respectively, compared to 83.40 and 47.52 days for EHD2 and 60.17 and 45.77 days for EHD3 in using fixed learning rate in LAC algorithm.Across all the combinations, the worst ToF and ToR are still 13.06 and 5.67 days for EHD2 and 13.12 and 14.44 for EHD3, respectively.In other words, the fine-tuning and reactivity speed have improved by at least 78.2-84.3% and 68.5-88.1%.With respect to the latency and power failure, Figure 12 depicts these metrics for the chosen decay rate combinations in comparison to the LAC algorithm without Adam.The EHD1 dataset was used to obtain the results consistent with the real-world situations.The differences in the mean values of latency are not outstanding across all cases, but the mean standard deviation for the first six months tends to decrease as the decay rates become larger, especially until β 1 reaches 0.4, while that for the last six months shows the opposite trend.Nonetheless, too large decay rates as well as fixed learning rate are more likely to bring about power failures.With smaller decay rates, the controller reacts more quickly to environmental stochasticity, leading to larger variations in latency, i.e., duty-cycle in case of less drastic changes as in the first six months, and yet with no or a couple of power failures, compared to 38 times in case of using fixed learning rate.
To summarize, we advise setting small decay rates such as β 1 ∈ [0.2, 0.4] and β 2 = 0.1 for power-failure-sensitive applications and larger β 1 ∈ [0.5, 0.7] with relatively smaller β 2 ∈ [0.2, 0.4] for latency-sensitive ones, although these values may vary according to target applications.With any of these setups, the number of power failures can be drastically reduced to zero or a few, and the reactivity speed falls around within a day up to 15 days and the initial convergence is attainable in about 5-13 days for our application use case where the CUI is 30min.To cope with reactivity to new situations, it would be another solution to reset and re-learn from scratch by detecting the environmental change as in [23] that requires the detection algorithm.However, our solution can achieve faster ToR than ToF, and therefore it is simpler and yet effective without such a mechanism.

Conclusions and Future Direction
In this paper, we propose LAC-AB, the integration of the LAC algorithm and Adam with smaller decay rates and no initialization bias correction terms in order to grapple with drastic environmental changes and avoid power failures.This idea is of importance because the recent trend in EH-IoT requires µW-range low-cost systems with high adaptability to the changing environment.The consideration of the SoB would act as an index for necessary and sufficient performance and help achieve scalability to wide applications.The introduction of Adam overcomes the reactivity issue caused by fixed learning rate in LAC algorithms.For quantitative analysis of convergence, we attempted to define the time of convergence for the ToF and ToR based on the mean and variance of the optimized parameter.First, we proved that, along with the effectiveness of SoB as a part of state, the conventional decay rates of β 1 = 0.9 and β 2 = 0.999 are not suitable and no use of initialization bias correction with smaller rates is acceptable.Then, we performed extensive simulations to obtain the best possible sets of β 1 , β 2 .Finally, our simulation results show that our proposal enables improving both the fine-tuning and reactivity, yielding 13.06 and 5.67 days for EHD2 and 13.12 and 14.44 for EHD3 for ToF and ToR, respectively, as the worst case among the best possible combinations of β 1 , β 2 , compared to 83.40 and 47.52 days for EHD2 and 60.17 and 45.77 days for EHD3 in the case of LAC algorithm using fixed learning rate.This corresponds to improvements of fine-tuning and reactivity of at least 78.2-84.3% and 68.5-88.1%,respectively.Importantly, the number of power failures is zero or a few times over 300 simulation cases in the LAC-AB algorithm, compared to 38 times for the LAC algorithm without Adam.Hence, our proposal augments LAC with more adaptability at low cost for lightweight IoT applications.
The proposed algorithm mostly involves multiply and add operations along with only two divisions, one squared root operation and one Gaussian random number generation.Ongoing research is being conducted to turn these computations into only multiply and add operations with the use of fixed-point quantization, and therefore, to make our algorithm more lightweight.Preliminary work suggests that the approximated version of our proposed algorithm exhibits similar behavior and would even allow the design of a dedicated hardware component.
1. EHD1: Non-processed real-life one-year data from 1 June 2018 to 31 May 2019 2. EHD2: Real-life one-year data made by stacking 365 one day (1 December 2018) worth of solar irradiance data 3. EHD3: Real-life one-year data made by stacking 365 one day (1 June 2018) worth of solar irradiance dataIn each simulation study presented below, we specify which dataset was used by the notation, EHD1, EHD2, or EHD3.The uses of EHD2 and EHD3 is justified in Sections 7 and 8.2.

Figure 1 .
Figure 1.Overview of the application scenario.

Figure 2 . 1 : 3 :
Figure 2. Overview of the actor and critic in LAC-AB algorithm.

Figure 5 .
Figure 5. Divergences of ψ (left) and TD-errors (right) over one year with the state-of-the-art LAC method (red crosses indicate the power failure points).

Figure 6 .
Figure 6.Transitions of ψ (left) and TD-errors (right) in case of the LAC algorithm using fixed learning rate with the SoB as a performance upper bound (red crosses represent when power failure(s) occurred).
, left) tends to decrease as β 1 value goes down to [0.1, 0.4] with at most one failure.With β 1 ∈ [0.1, 0.4], both ToF and ToR become faster when using lower β 2 ∈ [0.1, 0.3].This can be explained by the fact that the sudden change in gradients, i.e., SoB and/or SoC will be mitigated by the quick rescaling of variance, i.e., smaller decay rate such as β 2 = 0.1.By contrast, opting for β 2 = 0.1 and even 0.2 severely deteriorates the performance of energy management when using larger β 1 such as 0.6-0.7.For this range of β 1 , we obtain better outcome in both power failure and convergence speed with β 2 ∈ [0.3, 0.4] .All the above results considered, the choices must be made by considering the trade-offs between power failure and convergence speed.

Figure 10 .
Figure 10.Latency analysis of LAC-AB during the last six months for different sets of (β 1 , β 2 ) using EHD1 dataset: (left) mean; and (right) standard deviation.The experiments thus far led us to choose one of the best values of β 2 for each β 1 ∈ [0.1, 0.7] for further investigations:β 2 = 0.1 for β 1 ∈ [0.1, 0.4], β 2 = 0.2 for β 1 ∈ [0.5, 0.6],and β 2 = 0.3 for β 1 = 0.7.We ran 300 simulations and obtained ToF and ToR for both EHD2 and EHD3 dataset, since we can expect from the results in Figure7that the convergence speed may differ depending on the season.Figure11shows the ToF and ToR for the chosen sets of decay rates with the baseline values of those of the LAC algorithm using fixed learning rate.Obviously, the convergence speed improved compared to the case of LAC.With β 1 /β 2 = 0.3/0.1, for instance, the ToF and ToR reduced to 7.15 and 0.02 days for EHD2 (or 1 December dataset) and 7.48 and 7.67 for EHD3 (or 1 June dataset), respectively, compared to 83.40 and 47.52 days for EHD2 and 60.17 and 45.77 days for EHD3 in using fixed learning rate in LAC algorithm.Across all the combinations, the worst ToF and ToR are still 13.06 and 5.67 days for EHD2 and 13.12 and 14.44 for EHD3, respectively.In other words, the fine-tuning and reactivity speed have improved by at least 78.2-84.3% and 68.5-88.1%.

Figure 12 .
Figure 12.Latency and number of failed simulations of LAC-AB algorithm.

Table 1 .
Comparison of each algorithmic representations of the actor-critic method.

Table 3 .
Latency (min) and power failures for the three different algorithms and setups.