
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In a network of low-powered wireless sensors, it is essential to capture as many environmental events as possible while still preserving the battery life of the sensor node. This paper focuses on a real-time learning algorithm to extend the lifetime of a sensor node to sense and transmit environmental events. A common method generally adopted in ad-hoc sensor networks is to periodically put the sensor nodes to sleep. The purpose of the learning algorithm is to couple the sensor’s sleeping behavior to the natural statistics of the environment so that it stays in optimal harmony with changes in the environment: the sensors can sleep when the environment is steady and stay awake when it is turbulent. This paper presents theoretical and experimental validation of a reward based learning algorithm that can be implemented on an embedded sensor. The key contribution of the proposed approach is the design and implementation of a reward function that satisfies a trade-off between the above two mutually contradicting objectives, and a linear critic function that approximates the discounted sum of future rewards in order to perform policy learning.

A sensor network is a network of spatially distributed sensing devices used to monitor environmental conditions (such as temperature, sound, vibration, or pressure).

1. Sleep.

2. Wake up.

3. Read sensory information through the relevant ports, e.g., the Analog-to-Digital Conversion (ADC) port.

4. Communicate the data to a remote node.

5. Go back to sleep.
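The duty cycle above can be expressed as a simple loop. The sketch below is illustrative only; the sleep duration and the `read_adc`/`transmit` helpers are hypothetical placeholders, not the paper's embedded implementation:

```python
import time

SLEEP_PERIOD_S = 0.01  # hypothetical sleep duration (shortened for illustration)

def read_adc():
    """Placeholder for reading the sensor's ADC port."""
    return 0.0

def transmit(value):
    """Placeholder for radioing the reading to a remote node."""
    pass

def duty_cycle(num_cycles):
    """Run the fixed sleep/wake/read/transmit duty cycle."""
    readings = []
    for _ in range(num_cycles):
        time.sleep(SLEEP_PERIOD_S)  # 1. sleep, 2. wake up
        value = read_adc()          # 3. read sensory information
        transmit(value)             # 4. communicate to a remote node
        readings.append(value)      # 5. loop back to sleep
    return readings
```

A deterministic period like this is exactly what the learning algorithm later replaces with an adaptive one.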

Environmental events that are random in nature (such as a tremor) require continuous sensing. If a sensor node sleeps deterministically, it could miss critical events during the sleep period. However, time spent awake by the sensor does not guarantee a random event in the environment, and it drastically drains the on-board power of the sensor. Therefore, balancing the two conflicting objectives of staying awake to increase the probability of capturing critical information and conserving limited on-board energy to prolong the service time is generally a challenging task. Assume a sensor node is required to choose an optimum sleeping strategy that preserves on-board power as long as possible while maximizing the chances of capturing important changes in the environment. Since the desired behavior is not known in advance, the sensor can learn based on rewards and punishments in a reinforcement based learning paradigm [

A Markov decision process (MDP) is a stochastic process defined by the conditional probabilities [
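In our setting, the Markov property means that the next state depends only on the current state and action, not on the earlier history. In standard notation, the defining conditional probability is:

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```

This is what allows the sensor to base its sleep decision on the current state alone.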

When quantified supervisory feedback is unavailable, reward based or reinforcement based machine learning has proven to be an effective method [. The reward function combines terms such as the normalized awake time (W_k/W_max), the sleeping duration relative to its average (Sl_t/Sl_ave), and the magnitude of the change (|T_k − T_{k−1}|) in the environment between two consecutive wake-up instances. However, this reward function alone cannot achieve long term optimality of the sleeping behavior of the sensor. Therefore, we should design a critic function that estimates the total future rewards generated by the above reward function for an agent following a particular policy. The total expected future rewards
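One illustrative form of such a trade-off reward is sketched below. This is an assumption for clarity only: the weights and normalizers are hypothetical, not the paper's exact formula.

```python
def reward(awake_time, max_awake, change, max_change,
           w_info=1.0, w_energy=1.0):
    """Trade-off reward: information captured minus energy spent awake.

    awake_time / max_awake : normalized energy cost of staying awake
    change / max_change    : normalized |T_k - T_{k-1}| observed between
                             consecutive wake-ups (information gained)
    Assumes max_awake > 0 and max_change > 0.
    """
    info = w_info * (change / max_change)
    energy = w_energy * (awake_time / max_awake)
    return info - energy
```

A sensor that sleeps through a large environmental change forfeits the information term, while one that stays awake in a steady environment pays the energy term for nothing.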

An optimum value function can be learnt by updating the internal parameters of the function to minimize the temporal difference error Δ. Since Δ is a known quantity, learning the value function is strictly a supervised learning mechanism. However, it learns how a qualitative reward function behaves with states and actions. It should be noted that the ability to predict is an essential ingredient of intelligence, because it allows us to explore better behavioral policies without waiting for the final outcome. When a human baby learns to do various things like walking, jumping, cycling,
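For a linear critic of the form V(s) = θᵀφ(s), one temporal-difference update step can be sketched as follows. The learning rate and feature vectors are illustrative assumptions:

```python
import numpy as np

def td_update(theta, phi_s, phi_s_next, r, gamma=0.6, alpha=0.1):
    """One TD(0) step for a linear critic V(s) = theta . phi(s).

    delta = r + gamma * V(s') - V(s) is a known quantity once the reward r
    and the next state s' are observed, so the parameter update is
    effectively supervised.
    """
    v_s = theta @ phi_s
    v_next = theta @ phi_s_next
    delta = r + gamma * v_next - v_s          # temporal difference error
    theta = theta + alpha * delta * phi_s     # move V(s) toward the target
    return theta, delta
```

Repeated over many transitions, the critic's estimates converge toward the discounted sum of future rewards under the current policy.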

Assume a critic function has learnt to predict the total expected future rewards at any given state using a temporal difference based learning scheme, and let Δ_dist denote the resulting prediction error. If Δ_dist is positive, the most recent action performed better than the critic predicted and should be reinforced; if Δ_dist is negative, the action should be made less likely. In this way Δ_dist drives the improvement of the sleeping policy.

Let us have a closer look at the actual function that the critic is trying to estimate. Ideally, the total expected future reward is given by
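In standard notation, this discounted sum of future rewards can be written as:

```latex
V(s_t) = E\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t \,\right], \qquad 0 < \gamma < 1
```

where γ is the discounting factor that weights rewards further in the future progressively less.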

The choice of a value for the discounting factor determines how far into the future the critic must look. For example, with a discounting factor of 0.6, the successive powers are 0.6^1 = 0.6, 0.6^2 = 0.36, 0.6^3 = 0.216, 0.6^4 = 0.1296, 0.6^5 = 0.0778, and 0.6^6 = 0.0467. This implies that any future reward value beyond the sixth time step will be discounted by less than 4%. Therefore, we can reasonably take only those future reward values up to the sixth time step, which reduces the complexity of estimating the critic function. However, this assumption depends on the uncertainty of the environment, which is reflected in the value of the discounting factor. If the discounting factor is closer to one, rewards further into the future must be taken into account.
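The effective-horizon argument above can be checked numerically, using γ = 0.6 as in the text:

```python
gamma = 0.6

# Successive powers gamma^1 .. gamma^7, rounded to four decimals.
powers = [round(gamma ** k, 4) for k in range(1, 8)]

# First time step whose discount factor falls below 4%.
horizon = next(k for k in range(1, 100) if gamma ** k < 0.04)
```

The first power below 4% is γ^7 ≈ 0.028, so rewards from the seventh step onward (i.e., beyond the sixth time step) contribute less than 4% each, matching the truncation used for the critic.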

Here we investigate the possibility of developing an autoregressive polynomial function of the state, of the form θ^T φ(s) with parameter vector θ and polynomial regressor φ(s), to estimate the critic.

The following recursive least squares algorithm is used to optimize the polynomial parameter vector
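A standard recursive least squares (RLS) update for the polynomial parameter vector θ can be sketched as below. This is a generic RLS formulation under assumed notation, not necessarily the exact variant used in the paper:

```python
import numpy as np

def rls_step(theta, P, phi, y, lam=1.0):
    """One recursive-least-squares update.

    theta : current parameter vector
    P     : inverse correlation matrix (initialize as large * identity)
    phi   : regressor vector (polynomial terms of the state)
    y     : observed target (here, the measured discounted return)
    lam   : forgetting factor (1.0 = no forgetting)
    """
    phi = phi.reshape(-1, 1)
    k = P @ phi / (lam + phi.T @ P @ phi)     # gain vector
    err = y - float(theta @ phi.ravel())      # a-priori estimation error
    theta = theta + k.ravel() * err           # parameter correction
    P = (P - k @ phi.T @ P) / lam             # covariance update
    return theta, P
```

Because each update touches only a small matrix, the scheme is cheap enough to run on an embedded sensor node.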

The above least squares estimation algorithm was run for polynomial orders from 2 to 7, and we observed how the total estimation error across all data sets varied with the order of the polynomial. Based on the simulation results and experimentation, an order-4 polynomial was found to be best suited, with a minimum average estimation error

With the critic loaded on the sensor node, a tenfold reduction in the number of transmitted packets was achieved, with very few misses in registering events. The events registered (

We further simplify the Markov decision process and extend the real-time learning algorithm to a cluster of sensors that can choose the optimal one out of several possible sleep strategies.

For the i-th sleep strategy, the sensor maintains a discounted sum of rewards with a strategy-specific discounting factor γ_i, with the reward terms normalized by their maximum values.

Our purpose here is to achieve the reward function defined by the following multiple objective unconstrained optimization problem:

We performed simulations for three sensor nodes that can adaptively choose from three different strategies of sleeping duty cycles given by [10%, 50%, 100%]. In order to test the ability of the learning strategy to adapt itself under changing environmental conditions, we use a sinusoidal ambient temperature profile that changes its frequency over time, given by
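A minimal simulation of this setup might look like the following. The temperature-profile parameters, exploration rate, and per-strategy bookkeeping are illustrative assumptions standing in for the paper's exact configuration:

```python
import math
import random

DUTY_CYCLES = [0.10, 0.50, 1.00]  # candidate sleeping strategies

def temperature(t):
    """Sinusoidal ambient temperature whose frequency grows over time."""
    return 25.0 + 5.0 * math.sin(2 * math.pi * (0.01 + 1e-5 * t) * t)

def select_strategy(values, eps=0.1):
    """Mostly pick the strategy with the largest discounted reward sum,
    but explore occasionally to track a changing environment."""
    if random.random() < eps:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

def simulate(steps=500, gamma=0.9, seed=0):
    random.seed(seed)
    values = [0.0] * len(DUTY_CYCLES)  # discounted reward sum per strategy
    prev_temp = temperature(0)
    for t in range(1, steps):
        i = select_strategy(values)
        temp = temperature(t)
        change = abs(temp - prev_temp) / 10.0  # normalized information gain
        energy = DUTY_CYCLES[i]                # normalized awake-time cost
        r = change - energy                    # trade-off reward
        values[i] = gamma * values[i] + r      # update discounted sum
        prev_temp = temp
    return values
```

As the ambient frequency rises, the more-awake strategies accumulate larger information terms, so the cluster's preference shifts with the environment.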

The above intelligent behavior emerged in a cluster of three sensors that used a Markov decision process with a simple reward function combining the two contradicting needs of a typical stand-alone sensor node: to gather as much information as possible and to preserve as much on-board energy as possible. According to this approach, a cluster will have a general tendency to select the sleeping strategy that has accumulated the largest discounted sum of rewards. However, the explorative moves enable the sensors to keep track of the changing level of optimality of the other strategies in a stochastic environment. Therefore, the Markov decision making strategy can automatically adapt to suit the statistics of the environment.

A real-time learning algorithm is developed to extend the lifetime of a sensor node to sense and transmit environmental events. The purpose of our new learning algorithm is to couple the sensor’s sleeping behavior to the natural statistics of the environment, so that it stays in optimal synchronization with changes in the environment by sleeping when the environment is steady and staying awake when it is turbulent. Theoretical and experimental validation of a reward based learning algorithm that can be implemented on an embedded sensor is presented. Our results show that the proposed learning algorithm is significantly more effective at preserving the sensor's on-board energy through the optimal selection of sleeping strategies.

We have presented results for a network of three sensor nodes. However, the method can be extended to a generic

Temporal difference based learning to predict.

Actor-critic based learning: using the ability to predict to improve the behavior (control policy). Here, the policy is the sleeping policy of the sensor node.

How the temporal difference can be used to improve the policy. Here

The structure of the polynomial critic function.

Evaluations of reward and critic.

Implementation of reinforcement learning on sensors in an outdoor environment, by using MTS400 CA embedded board with external antenna.

Adaptive behavior of a cluster of sensor nodes following a Markov decision process in a stochastic environment.

The comparison of performance with and without the critic based adaptive sleeping behavior.

Time | Transmitted Packets (With Critic) | Transmitted Packets (Without Critic)
---|---|---
0 | 0 | 0
50 | 100 | 600
100 | 120 | 1200
150 | 200 | 1850
200 | 260 | 2500
250 | 310 | 3100
300 | 400 | 3750