A Data Compression Hardware Accelerator Enabling Long-Term Biosignal Monitoring Based on Ultra-Low Power IoT Platforms

For highly demanding scenarios such as continuous bio-signal monitoring, wirelessly transmitting excessive volumes of data is one of the most critical challenges, owing to the resource limitations of typical hardware and communication technologies. This paper aims to address these limitations. The main axes of this work are (a) data compression and (b) the presentation of a complete, efficient and practical hardware accelerator design that can be integrated into any Internet of Things (IoT) platform to address critical data compression challenges. On the one hand, the developed algorithm is presented and evaluated in software, exhibiting significant benefits compared to competing approaches. On the other hand, the algorithm is fully implemented in hardware, providing a further proof of concept of implementation feasibility with respect to state-of-the-art hardware design approaches. Finally, system-level performance benefits regarding data transmission delay and energy saving are highlighted, taking into consideration the characteristics of prominent IoT platforms. In conclusion, this paper presents a holistic approach based on data compression that can drastically enhance an IoT platform's performance and efficiently tackle a notorious challenge of highly demanding IoT applications such as real-time bio-signal monitoring.


Introduction
Internet of Things (IoT) short-range, ultra-low power communication technologies comprise one of the most rapidly evolving research areas, attracting significant interest from both academia and industry [1]. Consequently, the respective communication protocols and platforms have emerged as prominent infrastructure upon which future Cyber-Physical Systems (CPS) can be based. A major factor in this growth is the novel communication paradigm they introduce, which enables flexible communication, distributed operation, rapid deployment without any pre-existing infrastructure, mobility support, low-power functionality and many more critical features. Additionally, IoT and CPS have benefited significantly from advancements in hardware and Very Large Scale Integration (VLSI) design, which have led to embedded systems of very low cost, complexity, size and, most importantly, power consumption that can serve as a suitable hardware infrastructure [2][3][4]. Another aspect related to hardware design concerns the advantages anticipated from hardware accelerators. These are highly specialized hardware components integrated on general-purpose IoT platforms and dedicated to performing highly complex operations in the most efficient way. Such an approach offers multifaceted benefits, since node designs can exhibit a significantly higher degree of dynamicity and modularity, tailored to specific applications. At the same time, demanding operations are performed with maximum efficiency, while the main processing unit of the node is not overwhelmed.
A consequence of the aforementioned benefits is that the relevant application domain has expanded significantly from simple data processing and transmission scenarios to increasingly demanding ones. Examples include significant processing workloads (e.g., medical applications), support for complex communication standards (e.g., IPv6 routing protocol support enabling true IoT deployment), and cooperative functionality among nodes in order to achieve a common goal. Aspects such as time-constrained communication and power-aware operation impose stringent requirements. To meet these requirements, researchers must overcome the extremely limited resource availability encountered in all current Wireless Sensor Network (WSN) platforms. This limitation ranges from the low processing power offered by typical low-end micro-controller units, to the limited communication bandwidth offered by protocols like IEEE 802.15.4 or Bluetooth, as well as scarce memory (a few tens of Kbytes) and energy. Concerning energy availability, it must be emphasized that the wireless radio interface is the main power consumption factor. Therefore, a common critical objective of various pragmatic approaches aiming at lifetime extension is the deactivation of this hardware whenever its functionality is not required for extended periods of time [5,6].
In this paper, the authors tackle the aforementioned deficiencies stemming from scarce bandwidth and energy availability, which in conjunction with demanding applications comprise a notorious challenge. As will be analyzed in the following section, real-time analysis of biomedical modalities requires the transmission of excessive volumes of data, which has two challenging side effects. On the one hand, it is easily quantifiable that in many cases the offered bandwidth cannot handle the required data rate. On the other hand, even in cases where the channel capacity is adequate for the data creation rate, the respective scenarios lead to continuous operation of highly power-consuming components, especially the wireless radio interface. Such cases contradict the typical WSN platform paradigm, which effectively dictates that power-hungry hardware transitions to low-power states for substantial periods of time.
Driven by such conditions, this paper proposes a holistic solution that mitigates the respective side effects, based on a highly efficient and resource-conservative data compression hardware accelerator. Specifically, the critical contribution of this work is the design, development and system-level evaluation of the potential benefits of the proposed hardware component, taking into consideration realistic, demanding biomedical datasets as well as real performance characteristics of state-of-the-art IoT platforms. As will be presented, the developed compression algorithm and its hardware design are able to compress the targeted modalities on-the-fly. Additionally, they yield a considerable compression rate in all cases, thus enhancing the transmission capabilities of an IoT platform. Furthermore, the proposed hardware accelerator leads to drastic power conservation due to the data volume reduction, assuming that IoT nodes are able to deactivate the radio interface when not transmitting. To the best of the authors' knowledge, this is the first attempt to offer a practical, efficient and feasible complete IoT compression component that goes beyond proposing an isolated software compression approach. The proposed solution has been implemented in hardware and evaluated taking into account realistic wireless transmission delays and power consumption statistics of prominent IoT wireless communication interfaces.
The rest of the paper is structured as follows. Section 2 provides critical information on the application scenarios, the related literature and the data characteristics of the specific problem. Section 3 presents the design and evaluation of the proposed compression algorithm, which yields the characteristics required to address the targeted problem. Section 4 comprises the cornerstone of this work and presents the actual hardware module design, implementation and evaluation. Section 5 offers a critical system-level performance evaluation of the proposed hardware accelerator, assuming that it has been integrated in prominent real IoT platforms with two different wireless communication interfaces. Finally, Section 6 offers a summarizing discussion on the performance evaluation and the module's implementation capabilities, while Section 7 highlights the main points of this work and outlines future extensions based on it.

Related WSN Compression Approaches
Based on the previous analysis, an elicitation of compression algorithms adequate for epilepsy monitoring and for use in WSNs is conducted. The elicitation process followed a multifaceted approach considering (a) the nature of the application scenario the authors focused on, which is electrocardiogram/electroencephalogram (ECG/EEG) physiological signal monitoring, (b) the representation of the digitized data as time-series and (c) the performance aspects, taking into consideration the use in on-the-fly WSN scenarios. With respect to the first characteristic, the criticality and accuracy required for the monitoring and study of epileptic persons dictates zero tolerance to data corruption due to the compression process. Consequently, only lossless compression algorithms suitable for wireless sensors have been considered [7,8], as opposed to lossy approaches. Regarding the second parameter, we focused on algorithms well known for their effectiveness on time-series datasets [7][8][9][10][11], given that both Electroencephalography (EEG) and Electrocardiography (ECG) result in time-series datasets. Finally, with emphasis on the use of such algorithms in the context of WSNs, our elicitation process focuses on low complexity in order to offer viable solutions for typical WSN nodes. In addition, it aims at minimizing the delay overhead and operating in a time-constrained manner [10]. Taking into account a wide range of adequate compression approaches, the Lossless Entropy Compression (LEC) [8] and Adaptive Lossless Entropy Compression (ALEC) [10] algorithms were finally selected as the starting point for the hardware implementation. LEC is a low-complexity lossless entropy compression algorithm with a very small code footprint that requires very low computational power [12]. Its operation is based on a very small dictionary and it exhibits impressive on-the-fly compression capabilities. Consequently, LEC is quite attractive for WSN deployments. The main steps of the algorithm include:

• Calculation of the differential signal.

• Computation of the difference $d_i$ between the binary representations $r_i$ and $r_{i-1}$ of the current and previous measurements, respectively; encoding is then applied to $d_i$, resulting in the corresponding bit sequence $bs_i$.

• Concatenation of the sequence $bs_i$ to the bit sequence stream generated so far.

The main processing complexity is attributed to the encoding phase of the compression, which transforms each $d_i$ into a bit sequence $bs_i$. During this process, the number $n_i$ of bits needed to encode the value of $d_i$ is computed first. Secondly, the first part of $bs_i$, denoted $s_i$, is generated by using the table that contains the dictionary adopted by the entropy compressor. In our initial evaluation, the table used in the JPEG algorithm has been adopted, because the coefficients used in JPEG have statistical characteristics similar to those of the measurements acquired by the sensing unit. In our implementation, the table has been extended so as to cover the necessary 16-bit resolution of the Analog-to-Digital Converter (ADC) in the sensors. Thirdly, the second part of $bs_i$, denoted $a_i$, is calculated as the $n_i$ low-order bits of $d_i$ [8].
Focusing on its implementation features, LEC can be implemented by maintaining in memory only the column $s_i$ of the aforementioned table. Overall, LEC avoids any computationally intensive operation, which is highly appreciated in resource-scarce WSN platforms. As a result, it exhibits very low execution delay, facilitating real-time operation. However, since its operation is based solely on a static lookup table, it cannot be dynamically configured with respect to the characteristics of a specific signal. This is a significant drawback that negatively affects LEC's compression rate capabilities [12]. ALEC, on the other hand, is based on an adaptive lossless entropy compression approach that also requires low computational power. However, it uses three small dictionaries, the sizes of which are determined by the resolution of the ADC. Adaptive compression schemes allow the compression to adjust dynamically to the data source. The data sequences to be compressed are partitioned into blocks and, for each block, the optimal compression scheme is applied. Effectively, the algorithm is similar to LEC; the main difference is that ALEC uses three Huffman coding tables instead of the single table used for the DC coefficients in the JPEG algorithm. Specifically, ALEC adaptively uses two or three Huffman tables. Compared to LEC, ALEC therefore employs an increased number of lookup tables and, as a result, its compression-rate efficiency is also increased. However, each data block is passed through two lookup tables, which eventually increases the algorithm's processing delay [12].

Application Scenarios and Data Characteristics
A critical objective of this effort concerns the performance enhancement offered by an efficient compression module to state-of-the-art WSN platforms targeting highly demanding applications. Epilepsy monitoring based on an IoT platform represents the main targeted application scenario of this paper. From an engineering point of view, the respective application scenarios are based on acquiring excessive amounts of data (digitized physiological measurements) for extended periods of time, which must be either stored locally or transmitted to an aggregation point. The most demanding case, which is the main objective of this paper, concerns the real-time monitoring of an epileptic person. This is a critical requirement, since epileptic events can occur unexpectedly and unpredictably, while the triggering event is highly specific to each person. For that reason, periodic monitoring leads to myopic and unreliable results and conclusions. In both cases, highly accurate signal monitoring is also required, which is a further critical requirement for effective compression.
WSNs comprise a rapidly evolving research area proposing a new communication paradigm that makes them appealing for such cases. However, in realistic scenarios the respective platforms suffer from critical resource limitations, especially in energy availability, communication bandwidth and processing power, which are at the same time interdependent resource-consuming factors. More specifically, the operation of state-of-the-art WSN platforms is based on small batteries, typically offering energy capacity from 450 mAh (e.g., the Shimmer platform) up to 3300 mAh (e.g., two AA batteries). Consequently, all components must offer low-power characteristics, resulting in processing units with low processing capabilities (provided by 16-bit Micro-Controller Units) and limited available memory (in the area of 10 Kbyte of RAM). Additionally, data transfer is based on low-power wireless interfaces such as IEEE 802.15.4 and Bluetooth, offering physical-layer bandwidth from 250 Kbps up to a few Mbps.
Taking into consideration the aforementioned application scenarios, the two modalities of paramount importance in epileptic seizure study, which result in a significant amount of accumulated data, are EEG and ECG measurements. Typical acquisition devices produce samples represented as 16-bit numbers. Furthermore, a wired EEG setup usually comprises 64 sensors with a sampling frequency of up to 2.5 kHz, while ECG typically requires 4 sensors with an adequate sampling frequency of a few hundred Hertz. Given that the main application scenario targeted is the real-time monitoring of epileptic persons, even rough calculations show that a setup of 64 EEG sensors requires higher wireless bandwidth (not considering packet headers and control data) than prominent communication technologies can provide. Specifically, assuming typical wireless technologies such as IEEE 802.15.4 and Bluetooth, such scenarios pose an overwhelming burden on WSN platforms, which typically offer extremely limited resources [9][13][14][15][16][17][18].
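The rough calculation mentioned above can be written out explicitly. Using only the figures given in the text (64 EEG sensors, 16-bit samples, sampling frequency up to 2.5 kHz) and ignoring packet headers and control data, the raw EEG payload rate is:

```latex
R_{\mathrm{EEG}} = 64 \times 2500\,\mathrm{samples/s} \times 16\,\mathrm{bit/sample}
                 = 2\,560\,000\,\mathrm{bit/s} = 2.56\,\mathrm{Mbit/s},
\qquad
\frac{R_{\mathrm{EEG}}}{250\,\mathrm{kbit/s}} = 10.24
```

That is, the uncompressed EEG stream alone is roughly ten times the 250 Kbps physical-layer rate quoted for IEEE 802.15.4, before any protocol overhead is considered.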
Such workload scenarios demand that power-hungry components (such as the wireless interfaces and/or the processing unit) operate continuously in their maximum power consumption state. Such behavior, however, contradicts the main objective of typical WSNs, which is to keep such components in low-power states for extended periods of time, thereby drastically extending the network lifetime and meeting demanding requirements of many hours or days of monitoring.
Therefore, in such cases reducing the amount of data (i.e., by compressing them) can yield significant and multifaceted benefits to system performance. On the one hand, if a specific amount of data is compressed by x%, the respective data volume reduction can be correlated to an analogous reduction of radio utilization, resulting in significant energy conservation. On the other hand, the x% compression percentage can also be viewed as an analogous reduction of the bandwidth requirement, effectively relieving the limited wireless channel while reducing the resulting data transfer delay.
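The proportional relation between compression percentage and radio utilization can be illustrated with a small first-order model. This is only a sketch under stated assumptions: the function name `airtime_fraction` is hypothetical, protocol overhead, retransmissions and radio wake-up costs are ignored, and the 16 kbit/s ECG payload used in the example assumes 4 sensors, 16-bit samples and a 250 Hz sampling rate (the text only states "a few hundred Hertz").

```python
def airtime_fraction(payload_bps, link_bps, compression_pct=0.0):
    """Fraction of time the radio must transmit to keep up with the stream.

    payload_bps      raw sensor data rate in bit/s
    link_bps         usable link rate in bit/s
    compression_pct  data volume reduction x, in percent
    A result above 1.0 means the link cannot carry the stream at all.
    """
    compressed_bps = payload_bps * (1.0 - compression_pct / 100.0)
    return compressed_bps / link_bps

# Hypothetical ECG stream: 4 sensors x 250 Hz x 16 bit = 16 kbit/s.
ecg_bps = 4 * 250 * 16
print(airtime_fraction(ecg_bps, 250_000))        # uncompressed
print(airtime_fraction(ecg_bps, 250_000, 50.0))  # volume halved
```

In this simplified model, halving the data volume halves the time the radio must stay active, which is the "analogous reduction" referred to above; real savings depend on per-packet overheads that the model leaves out.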

Compression Algorithm Design
In this section, the proposed and implemented lossless entropy compression algorithm is presented, with a focus on low computational power. Its main characteristic is the exploitation of the frequencies of the values observed in the time series in order to build an efficient, dynamically adaptive Huffman table for the encoding process. The main novelty lies in extending the approaches of the previously presented LEC and ALEC algorithms while retaining their strengths (i.e., low complexity, low code footprint, low compression delay). In this way, the proposed algorithm retains very good processing delay performance. At the same time, it significantly enhances the achieved compression rate by increasing adaptability to the data characteristics. Consequently, the compression rate exhibited by the proposed extension is highly competitive compared to respective solutions published in the related literature, while it also offers an optimum trade-off between compression rate and compression processing delay. A detailed analysis of the characteristics of this compression approach, together with a software-level evaluation based on Matlab implementations, can be found in [12].

Rationale of the Proposed Algorithm
The LEC algorithm compresses data in its entropy encoder using a fixed table, which leads to very fast execution. This table is an extension of the table used in the JPEG algorithm, enlarged to the size necessary for the resolution of the ADC in use. Additionally, it is based on the fact that the closer the absolute value of a sample is to zero, the more frequently it is observed in the differential signal.
However, it has been noticed that, although this frequency distribution may be valid for a whole file or stream of data, it is not always accurate for fractions of the file or the data stream. Based on this observation, the ALEC algorithm uses a small number of fixed Huffman code tables that can be used alternatively to produce a smaller code for a packet of data. Furthermore, such fixed tables are not necessarily optimal for the specific data under test in each particular experiment.
Therefore, the proposed scheme (as depicted in Figure 1) introduces a novel approach in which the Huffman code lookup tables are continuously adjusted according to the characteristics of the specific data used. The degree to which the tables are adjusted is also configurable, offering fine-tuning capabilities and enhancing the added value of the approach.

Utilization of Data Statistical Knowledge
Statistical knowledge is usually available when time-series data are measured and transmitted through wireless sensors. A method based on earlier observations can be used but, since data change over time, this knowledge can be of questionable value as far as compression effectiveness is concerned.
Therefore, in the proposed scheme, the previously observed frequency values are exploited to update the Huffman code tables for the values to follow. Initially, the differential signal is produced, since its values are most likely to be compressed more effectively. Following this initial phase, the differential signal is separated into fixed-size packets. For the first packet, since there is no statistical knowledge of the data, the method uses the table from the LEC algorithm. For each following packet, however, the statistical knowledge from previous data is used to create an adaptive Huffman code table on the fly. The alphabet of numbers is divided into groups, the sizes of which increase exponentially. Each new differential signal value observed increments the frequency of the appropriate group. When the processing of a data packet ends, the frequencies of the groups are used to estimate the probability that a value belongs to each group. The groups are sorted in descending order of probability and a binary tree is created.
After the formation of the Huffman code table for the following packet, the current frequency table is multiplied element-wise by a factor between 0 and 1. As this parameter approaches 0, the degree to which history (i.e., frequencies observed in previous packets) is taken into account in the next Huffman code table diminishes. Therefore, if 0 is selected, only the frequencies of the current packet are used; if 1 is selected, the frequencies of every previous packet are equally used in the encoding of the next packets [12].
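The per-packet table update can be sketched as follows. This is an illustrative reconstruction from the description above, not the authors' implementation: the names `group_of`, `build_huffman`, `update_table` and `NUM_GROUPS` are hypothetical, the group of a value is assumed to be the bit length of its magnitude (which yields exponentially growing group sizes), the tie-breaking order inside the tree construction is arbitrary, and a standard heap-based Huffman construction is used in place of the explicit sort.

```python
import heapq
from itertools import count

NUM_GROUPS = 17  # assumed: groups 0..16 cover differences of 16-bit samples

def group_of(d):
    """Group index of a differential value; group g holds 2**(g-1) <= |d| < 2**g."""
    return abs(d).bit_length()

def build_huffman(freqs):
    """Return {group: prefix code} for the given group frequencies."""
    tick = count()  # unique tie-breaker so heap entries never compare tuples
    heap = [(f, next(tick), (g,)) for g, f in enumerate(freqs)]
    heapq.heapify(heap)
    codes = {g: "" for g in range(len(freqs))}
    while len(heap) > 1:
        f0, _, groups0 = heapq.heappop(heap)   # two least frequent subtrees
        f1, _, groups1 = heapq.heappop(heap)
        for g in groups0:
            codes[g] = "0" + codes[g]
        for g in groups1:
            codes[g] = "1" + codes[g]
        heapq.heappush(heap, (f0 + f1, next(tick), groups0 + groups1))
    return codes

def update_table(freqs, packet, forget):
    """Process one packet of differential values.

    Updates freqs in place, returns the code table for the NEXT packet,
    then scales the history by the factor `forget` in [0, 1].
    """
    for d in packet:
        freqs[group_of(d)] += 1
    codes = build_huffman(freqs)
    for g in range(len(freqs)):
        freqs[g] *= forget       # 0 discards history, 1 keeps it in full
    return codes

freqs = [0.0] * NUM_GROUPS
table = update_table(freqs, [0, 1, -1, 0, 0, 2, 0, 1], forget=0.5)
print(table[0], table[1], table[2])  # frequent groups receive shorter codes
```

In this sketch the first packet would still be encoded with the fixed LEC table, and the table returned by `update_table` would be applied to the packet that follows, matching the one-packet lag described in the text.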

Performance Analysis of the Proposed Algorithm
In this section, a brief comparative analysis of the proposed compression algorithm (Real Time Huffman, RT-Huffman) against the LEC and ALEC approaches is presented, highlighting the critical advantages offered. For this comparison, all three algorithms are implemented in the Matlab environment and evaluated with respect to the achieved compression rate and the compression delay, assuming execution on the same personal computer. The evaluation is based on real EEG and ECG signals, which comprise the most challenging modalities in epilepsy monitoring and were extracted either from open databases or from real measurements undertaken in the context of this evaluation effort. A brief description of the sources of the datasets as well as the execution environment follows.

PhysioNet Database
PhysioNet [19] was established in 1999 as the outreach component of the Research Resource for Complex Physiologic Signals cooperative project. From this database, signals from the following two subcategories were extracted and used as evaluation testbeds for the implemented algorithms:

Apnea-ECG Database
This database has been assembled for the PhysioNet/Computers in Cardiology Challenge 2000 [20]. The ecgA04apnea and ecgB05apnea signals used in the evaluation process have been acquired from this database.

CHB-MIT Scalp EEG Database
This database [21], collected at the Children's Hospital Boston Massachusetts Institute of Technology (MIT), consists of EEG recordings from pediatric subjects with intractable seizures [22].

University of Patras, EEG and ECG Signals
The dataset is provided by the Neurophysiology Unit, Laboratory of Physiology, School of Medicine, University of Patras (UoP) [23]. EEG was recorded using 58 Ag-AgCl EEG electrodes according to the extended international 10-20 system.

EkgMove ECG Signals
EkgMove [24] is a psycho-physiological measurement system for research applications. From the various measurement capabilities of the sensor, the ECG signal of a single subject has been used.
Based on this evaluation, Figure 2 presents the compression rate achieved by the three compression algorithms over the dataset of eight different signals. As depicted, the proposed RT-Huffman algorithm offers the highest compression in all cases. Specifically, compared to LEC, RT-Huffman offers an increase in compression rate varying from 1.5% up to 4.3%, while the corresponding increase with respect to ALEC reaches up to 3.5%.
Table 1 presents the measurements concerning the processing delay of the compression algorithms under evaluation. Because the processing delay differs drastically between biosignals, and in order to extract more objective conclusions, Table 1 depicts the absolute delay for the algorithm offering the lowest measurement and the percentage deviation for the other two algorithms.
As extracted from Table 1, LEC places the lowest processing demands upon the processing unit. More importantly, however, in all cases the proposed RT-Huffman proves to be the second least resource-demanding solution, clearly outperforming ALEC and exhibiting a steady and relatively small overhead compared to LEC, and is thus able to meet time-constrained performance demands.

Hardware Implementation Design
The proposed algorithm was implemented at hardware level and a respective system-level performance evaluation was carried out. The compression module processes input samples on the fly with a latency of 4 clock cycles. As depicted in Figure 3, a small 16 × 16 bit Input Buffer stores the incoming 16-bit samples and propagates them to the compression module when instructed by the Local Controller. On system power-up, the samples are propagated through the differential datapath, which is composed of a Subtractor and an Absolute Calculation Unit. The absolute value of all samples is used to update a metric table with statistical information and is also used to produce the compressed output. This is the initialization phase of the system.
When a number of samples equal to the defined block size has been collected, the controller enters the calculation phase and pauses further sample propagation to the rest of the system. The Huffman Micro-Processor Unit calculates and produces the Huffman (S) table, based on the populated metric table, which will be applied to the next block of incoming data. The custom microprocessor functions with encoded operations are designed so as to optimize this phase. The core of the Huffman algorithm is implemented by performing parallel memory accesses on the 33 × 9 bit Tableful Parallel Memory and by unrolling all nested while and for loops into serial operations. A 512 × 36 Instruction Memory drives the micro-processor to execute all Real Time Huffman algorithm calculations, which leads to a substantial processing latency. The worst-case processing latency, due to the iterative nature of the algorithm, is calculated to be 4175 clock cycles per block of 500 samples. Once the Huffman (S) table has been calculated, the controller resumes propagation of samples through the differential datapath and S is applied to the next block of incoming samples in order to produce the compressed output. When a number of samples equal to the defined block size has again been collected, the calculation phase is activated once more, and so on.
Figure 4 depicts the block data flow with respect to the compression of the input data block and the calculation of the S table. It is noted (as indicated by the immediate receipt of the data of the 4th data block following the 3rd data block) that if the interval between the last datum of one block and the first datum of the subsequent block is shorter than the time required to perform the S table calculations, the Ready For Data (RFD) signal goes low; excess data can then be stored in an input buffer until the S table calculations are concluded, after which the compression process resumes.
Huffman Micro-Processor Unit calculates and produces the Huffman (S) table, based on the populated metric table, which will be applied on the next block of incoming data.The custom microprocessor functions with encoded operations are designed so as to optimize this phase.The core of the Huffman algorithm is implemented by performing parallel memory accesses on the 33 × 9 bits Tableful Parallel Memory and by un-rolling all nested while and for loops to serial operations.A 512 × 36 Instruction Memory drives the micro-processor to execute all Real Time Huffman algorithm calculations which leads to a substantial processing latency.The worst-case processing latency due to the iterative nature of the algorithm is calculated to 4175 clock cycles per block of 500 samples.Once the Huffman (S) table has been calculated, the controller resumes propagation of samples through the differential data-path and S is applied to the next block of incoming samples in order to produce the compressed output.When a number of samples equal to the defined block size has been collected, the calculation phase is activated again and so on.In Figure 4, the block data flow with respect to compression of the input data block and the calculations of the S table is depicted.It is noted (indicated with the immediate receipt of the data of the 4th data block following the 3rd data block) that if the intermediate interval between the last datum of one block and the first datum of the subsequent is smaller than the time required to perform the S table calculations, the Ready For Data (RFD) signal will go low; then excessive data can be stored in an input buffer until the S table calculations are concluded, and then the compression process resumes.
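The block-wise scheme described above, in which the table computed from one block is applied to the following block, can be sketched in software as follows. This is only an illustrative model, not the hardware design: the 33-symbol alphabet (differences from -16 to 15 plus a hypothetical escape symbol `ESC` followed by the raw sample), the flat start-up table and the function names are all assumptions made for the example.

```python
import heapq
from collections import Counter

ESC = "ESC"                                # hypothetical escape for large differences
ALPHABET = list(range(-16, 16)) + [ESC]    # 33 symbols (assumed mapping)

def huffman_lengths(counts):
    """Return {symbol: code length} of a Huffman code built from counts."""
    # Each symbol gets count + 1 so that every symbol always receives a code.
    heap = [(counts.get(s, 0) + 1, i, [s]) for i, s in enumerate(ALPHABET)]
    heapq.heapify(heap)
    lengths = dict.fromkeys(ALPHABET, 0)
    tie = len(heap)                        # unique tie-breaker for merged nodes
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                  # merged symbols sink one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths

def symbol_of(diff):
    """Map a first-order difference to a symbol of the assumed alphabet."""
    return diff if -16 <= diff <= 15 else ESC

def compressed_bits(samples, block_size=500, sample_bits=16):
    """Total output bits when the table from block k encodes block k+1."""
    lengths = huffman_lengths({})          # flat start-up table for the first block
    total, prev = 0, 0
    for start in range(0, len(samples), block_size):
        counts = Counter()
        for x in samples[start:start + block_size]:
            sym = symbol_of(x - prev)      # differential data-path
            prev = x
            total += lengths[sym] + (sample_bits if sym == ESC else 0)
            counts[sym] += 1               # populate the metric table
        lengths = huffman_lengths(counts)  # calculation phase between blocks
    return total

signal = [i % 7 for i in range(2000)]      # synthetic, slowly varying test signal
print(compressed_bits(signal), "bits instead of", 16 * len(signal))
```

Because the encoder only ever uses a table derived from data it has already emitted, a decoder can rebuild the same table from the samples it has already decoded, so no table needs to be transmitted in this sketch.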

I/Os Timing Protocols
As seen in Figure 5, the mechanism exhibits a delay of 4 clock cycles between the time an uncompressed datum is processed and the time its compressed equivalent is ready at the output. Based on the lookup table in effect at that point, datum 0xFFD8 corresponds to 0x39A, requiring 10 bits, datum 0x000 corresponds to 0x3A5, also requiring 10 bits, and so on. The same timing is followed until the end of the block, as indicated in Figure 6, where the circled datum is the 500th (i.e., last) datum of the block and corresponds to the circled datum on the data out pin.
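To illustrate how such variable-length codewords reduce the output volume, the following sketch packs the two codewords quoted above into a byte stream. The MSB-first packing order, the zero padding of the final byte and the function name `pack_codes` are assumptions made for the example; the actual output format of the module is not specified here.

```python
def pack_codes(codes):
    """Pack (value, bit_length) pairs MSB-first into bytes, zero-padding the tail."""
    acc, nbits, out = 0, 0, bytearray()
    for value, length in codes:
        acc = (acc << length) | (value & ((1 << length) - 1))
        nbits += length
        while nbits >= 8:                  # emit every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1            # keep only the bits not yet emitted
    if nbits:                              # flush the remaining bits, left-aligned
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

# Lookup-table entries taken from the timing example: datum -> (codeword, bits)
lookup = {0xFFD8: (0x39A, 10), 0x000: (0x3A5, 10)}
packed = pack_codes([lookup[0xFFD8], lookup[0x000]])
print(packed.hex())                        # e6ba50
```

The two 16-bit input samples (32 bits) are represented by 20 code bits, which occupy three bytes once padded.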

Implementation Results
Table 2 presents the processing performance characteristics, namely the processing delay of one block of data and the corresponding throughput, considering a relatively low clock frequency that is adequate for embedded systems.

Calculation of the throughput rate:
A data block of 500 16-bit samples will be processed, in the worst case, every 4 cc (data-path latency) + 500 cc (input samples) + 4175 cc (processing delay) = 4679 cc, where cc denotes clock cycles.
The throughput rate is therefore
$$\text{Throughput rate (bps)} = \frac{16 \times 500\ \text{bits}}{4679 \times T_{clk}} \quad (1)$$
For a 62.5 MHz clock, this yields a worst-case throughput estimate of approximately 106.8 Mbps, which is equal to approximately 13,357 data blocks of 500 × 16-bit samples per second. Based on this throughput capacity, the following comparison with the respective software-based implementation can be made.
Considering 10-minute EEG signals acquired from the MIT database [22], the corresponding file is 307,200 bytes, and the compression delay exhibited by the software-based algorithm was measured to be approximately 20 s on a dual-core i7 PC. Based on the previous calculation, the respective delay of the hardware accelerator is anticipated to be around 20 ms, thus offering a drastic decrease in delay as well as in resource consumption. Furthermore, this evaluation indicates that the presented hardware accelerator comprises a highly efficient solution when multiple channels are acquired concurrently and/or when only low-frequency processing units are available on the sensors.
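The arithmetic of this calculation can be reproduced with a few lines of code. The sketch below only restates the worst-case cycle counts given above; the function names are illustrative, and the 4175-cycle table calculation figure is valid only for the 500-sample block size.

```python
def block_cycles(block_size=500, datapath_latency=4, table_cycles=4175):
    """Worst-case clock cycles needed to process one block."""
    return datapath_latency + block_size + table_cycles

def throughput_bps(f_clk_hz, block_size=500, sample_bits=16):
    """Worst-case throughput in bits per second at clock frequency f_clk_hz."""
    return sample_bits * block_size * f_clk_hz / block_cycles(block_size)

f_clk = 62.5e6
file_bits = 307_200 * 8                          # the 10-minute EEG file

print(block_cycles())                            # 4679
print(throughput_bps(f_clk) / 1e6)               # ~106.86 (Mbps)
print(f_clk / block_cycles())                    # ~13357.6 (blocks per second)
print(file_bits / throughput_bps(f_clk) * 1e3)   # ~23.0 (ms)
```

The resulting delay of roughly 23 ms for the whole file is consistent with the figure of around 20 ms quoted above.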
Additionally, the presented implementation is resource conservative, as indicated in Table 3, which depicts the FPGA resources required by the specific implementation. These measurements indicate that more than one hardware accelerator could coexist on the same FPGA board, offering increased flexibility.

Power Dissipation
For a 1 kbps requirement, a 1 kHz clock is sufficient. However, due to XPower Analyzer limitations [23], power estimates can only be obtained for frequencies down to 1 MHz (for worst-case 100% toggling rates). Specifically, the total power usage of an FPGA device can be broken down into total static power and total dynamic power. Static power is associated with DC current and is consumed while there is no circuit activity, whereas dynamic power is associated with AC current and is consumed while there is input/circuit activity (Table 4).
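The claim that a 1 kHz clock suffices for a 1 kbps stream follows from inverting the worst-case throughput relation (8000 bits per 4679 clock cycles). A minimal check, with an illustrative function name:

```python
CYCLES_PER_BLOCK = 4 + 500 + 4175   # worst-case cycles per 500-sample block
BITS_PER_BLOCK = 16 * 500           # uncompressed bits per block

def min_clock_hz(required_bps):
    """Lowest clock frequency that sustains required_bps in the worst case."""
    return required_bps * CYCLES_PER_BLOCK / BITS_PER_BLOCK

print(min_clock_hz(1_000))          # 584.875
```

Roughly 585 Hz is already enough for 1 kbps, so a 1 kHz clock leaves a comfortable margin.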

System-Level Performance Evaluation
The main objective of this section is to analyze both qualitatively and quantitatively the benefits that the integration of the proposed compression module can offer to a real WSN platform, by taking into account both the performance and the energy conservation.
The degree to which compression reduces the bandwidth required to transmit a specific amount of data depends solely on the algorithm and not on the integration efficiency. However, the system-level transmission delay and energy consumption metrics depend heavily on the hardware design of the compression module, as well as on the integration of the component into a specific WSN platform. To evaluate both compression delay and compression power consumption, we assume a signal sample of X Bytes that can be compressed by Y% and adopt the following rationale.
As far as the transmission delay is concerned, the two equations that enable us to calculate the time interval required to send the respective information without and with the compression module are the following.

Without the compression module integrated:
$$T_{no\_compress} = Tx_{XBytes} \quad (2)$$
where $Tx_{XBytes}$ represents the time interval required to successfully transmit X Bytes; it is measured using a specific platform and a specific WSN communication technology.
With the compression module integrated:
$$T_{compress} = Tc_{XBytes} + Tx_{X \times Y\%Bytes} + Tdc_{XBytes} \quad (3)$$
where $Tc_{XBytes}$ is the time interval required to compress X Bytes, which can be accurately calculated for a specific processing clock frequency of the implementation presented in Section 4. $Tx_{X \times Y\%Bytes}$ is the time required to successfully transmit the compressed amount of data, which can be accurately measured using a specific platform and a specific WSN communication technology. $Tdc_{XBytes}$ is the time interval required to decompress the compressed data back to X Bytes which, without loss of accuracy, can be considered equal to $Tc_{XBytes}$, since the compression and decompression algorithms are symmetrical in the proposed solution.
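The delay model above can be written as two small functions. This is a sketch under stated assumptions: the function names are illustrative and the transmission times in the usage example are hypothetical placeholders, since the real values must be measured on the target platform.

```python
def delay_without(tx_time_raw_s):
    """End-to-end delay when the raw X Bytes are transmitted directly."""
    return tx_time_raw_s

def delay_with(tc_s, tx_time_compressed_s, tdc_s=None):
    """End-to-end delay with compression; decompression time defaults to tc_s."""
    if tdc_s is None:
        tdc_s = tc_s                       # symmetric compression/decompression
    return tc_s + tx_time_compressed_s + tdc_s

# Hypothetical figures: 10 s to send the raw data, 4.4 s for the compressed
# data, and 23 ms to compress (and again to decompress).
t_raw = delay_without(10.0)
t_cmp = delay_with(0.023, 4.4)
print(f"delay reduction: {100 * (1 - t_cmp / t_raw):.1f}%")
```

With these placeholder values the compression and decompression terms are small compared with the transmission terms, so the delay reduction is dominated by the reduction in transmitted volume.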
Moving on to energy consumption, the respective equations are as follows.
Energy consumption without the integrated compression module:
$$Ex_{no\_compress} = P_{tx\_radio} \times Tx_{XBytes} = V_{tx\_radio} \times I_{tx\_radio} \times Tx_{XBytes} \quad (4)$$
where $Ex_{no\_compress}$ is the energy consumed for successfully sending X Bytes of data. $P_{tx\_radio}$ is the transmission power consumption of a specific radio transceiver. $V_{tx\_radio}$ is the supply voltage required by the specific radio transceiver (provided by the respective datasheet). $I_{tx\_radio}$ is the current draw of the specific radio transceiver during transmission (provided by the respective datasheet). $Tx_{XBytes}$ is the time required to successfully transmit X Bytes; it is measured using a specific platform and a specific communication technology.
Energy consumption with the integrated compression module:
$$Ex_{compress} = Ec_{XBytes} + Etx_{X \times Y\%Bytes} = (P_{compr} \times Tc_{XBytes}) + (P_{tx\_radio} \times Tx_{X \times Y\%Bytes}) = (P_{compr} \times Tc_{XBytes}) + (V_{tx\_radio} \times I_{tx\_radio} \times Tx_{X \times Y\%Bytes}) \quad (5)$$
where $Ex_{compress}$ is the energy required to compress X Bytes and successfully transmit the compressed data. $Ec_{XBytes}$ is the energy required for compressing X Bytes. $Etx_{X \times Y\%Bytes}$ is the energy required to successfully transmit the compressed data. $P_{compr}$ is the power consumption (static + dynamic) of the compression module when active. $Tc_{XBytes}$ is the time interval required to compress X Bytes, which can be accurately calculated for a specific processing clock frequency. $V_{tx\_radio}$ is the supply voltage required by the specific radio transceiver (provided by the respective datasheet). $I_{tx\_radio}$ is the current draw of the specific radio transceiver during transmission (provided by the respective datasheet). $Tx_{X \times Y\%Bytes}$ is the time required to successfully transmit X × Y% Bytes; it is measured using a specific platform and a specific WSN communication technology.
It is noted that energy consumption is not considered for the decompression phase, under the assumption that the receiver (i.e., the home gateway) is always plugged into the mains power supply. Furthermore, two assumptions are made in order to extract the results presented in the following section. First, it is assumed that when the compression module is not used it is shut down and does not consume energy; second, it is assumed that when the radio does not transmit it is turned off and thus consumes no energy.
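Under these assumptions, the energy model reduces to the two expressions sketched below. The function names are illustrative, and the transmission times and the compression module power used in the example are hypothetical placeholders; only the 3.3 V and 45 mA radio figures correspond to the Bluetooth transceiver values quoted in this paper.

```python
def energy_without(v_tx, i_tx, tx_time_raw_s):
    """Energy (J) to transmit the raw data: V x I x T."""
    return v_tx * i_tx * tx_time_raw_s

def energy_with(p_compr_w, tc_s, v_tx, i_tx, tx_time_compressed_s):
    """Energy (J) to compress the data and transmit the compressed result."""
    return p_compr_w * tc_s + v_tx * i_tx * tx_time_compressed_s

V_TX, I_TX = 3.3, 0.045        # Bluetooth radio: 3.3 V, 45 mA while transmitting
# Hypothetical: 10 s raw transmission, 4.4 s compressed transmission,
# 100 mW compression module active for 23 ms.
e_raw = energy_without(V_TX, I_TX, 10.0)
e_cmp = energy_with(0.100, 0.023, V_TX, I_TX, 4.4)
print(f"{e_raw:.4f} J vs {e_cmp:.4f} J, saving {100 * (1 - e_cmp / e_raw):.1f}%")
```

Neither decompression energy nor idle consumption of the module or the radio appears in the model, in line with the assumptions stated above.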

Performance Evaluation Setup
In order for the analysis approach presented in the previous section to lead to realistic and useful results, it is important to present the main hardware components utilized (including the different configuration options considered), as well as the actual data, which consist of samples taken from real EEG/ECG signals. These are two of the most demanding modalities in epilepsy monitoring, which represents our main application scenario.
The most important component influencing our evaluation is the implemented compression hardware module. As analyzed in Section 4, the hardware module throughput (and thus the delay required to compress a specific amount of data) depends on the clock frequency, based on Equation (1).
The power consumption also depends on the clock frequency. Therefore, we considered four different clock frequencies; we then measured the power consumption, and based on Equation (5) the following table has been extracted. Table 5 presents the two main performance metrics of the proposed compression hardware implementation. Table 5 also depicts a tradeoff that, as will be shown, comprises a critical factor with respect to the overall performance evaluation. As depicted, as the clock frequency increases, both power consumption and throughput also increase. Consequently, when compressing a specific amount of data, increasing the clock frequency implies that, on the one hand, the component needs more power while active but, on the other hand, it performs its operation much faster, so the time period during which it must be active is drastically reduced. A secondary observation is that the main factor in the power consumption increase is the dynamic power and not the static power, whose influence is rather negligible.
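The tradeoff can be made concrete by computing the energy spent on compression as active power multiplied by active time. In the sketch below the two power figures are purely hypothetical and are not the measured values of Table 5; only the cycle counts come from the throughput calculation in Section 4.

```python
CYCLES_PER_BLOCK = 4 + 500 + 4175   # worst-case cycles per block
BYTES_PER_BLOCK = 500 * 2           # 500 samples of 16 bits

def compression_time_s(n_bytes, f_clk_hz):
    """Worst-case time to compress n_bytes at clock frequency f_clk_hz."""
    return (n_bytes / BYTES_PER_BLOCK) * CYCLES_PER_BLOCK / f_clk_hz

def compression_energy_j(n_bytes, f_clk_hz, p_total_w):
    """Energy spent while the module is active: power x active time."""
    return p_total_w * compression_time_s(n_bytes, f_clk_hz)

# Hypothetical total power (static + dynamic) at two clock frequencies.
hypothetical_power_w = {10e6: 0.05, 40e6: 0.10}
for f_clk, p in hypothetical_power_w.items():
    t = compression_time_s(3_000_000, f_clk)
    e = compression_energy_j(3_000_000, f_clk, p)
    print(f"{f_clk / 1e6:.0f} MHz: {t:.3f} s active, {e * 1e3:.1f} mJ")
```

In this made-up case the faster clock doubles the power but quarters the active time, so the energy per compressed file is halved; whether the real module behaves this way depends on the measured figures.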

Short-Range Wireless Communication Radio Characteristics
The second component influencing the system-level performance is the specific off-the-shelf radio transceiver used to transfer data. In the context of this evaluation, we considered the Shimmer2R platform [25], a prominent commercial solution offering highly efficient Bluetooth Class 2.1 as well as IEEE 802.15.4 interfaces.

Bluetooth Interface Characteristics:
The specific platform uses the RN-42 Bluetooth radio chip, exhibiting the following voltage and current demands [26].
Voltage supply = 3.3 V
Typical current draw during transmission = 45 mA
Additionally, the respective node is programmed using the TinyOS operating system, so additional delay could be added, as is expected to be the case in any off-the-shelf WSN sensor.

IEEE 802.15.4 Interface Characteristics:
The specific platform uses the CC2420 IEEE 802.15.4 radio chip, exhibiting the following voltage and current demands [14].
Voltage supply = 3.3 V
Typical current draw during transmission (max) = 17.4 mA
Additionally, the respective node is programmed using the TinyOS operating system, so additional delay could be added, as is expected to be the case in any off-the-shelf WSN sensor.
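For reference, the transmission power implied by these datasheet figures follows directly from P = V × I. The short sketch below simply evaluates this product for both radios; the variable names are illustrative.

```python
def tx_power_w(v_supply, i_tx_a):
    """Radio power draw while transmitting, P = V x I, in watts."""
    return v_supply * i_tx_a

p_bt = tx_power_w(3.3, 0.045)        # RN-42 Bluetooth
p_154 = tx_power_w(3.3, 0.0174)      # CC2420 IEEE 802.15.4
print(f"Bluetooth: {p_bt * 1e3:.1f} mW, 802.15.4: {p_154 * 1e3:.2f} mW")
print(f"ratio: {p_bt / p_154:.2f}")
```

The Bluetooth radio draws about 148.5 mW while transmitting against about 57.4 mW for the IEEE 802.15.4 radio, i.e., roughly 2.6 times more.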

Dataset Compression Characteristics
The actual signals comprising the evaluation dataset are also a significant parameter influencing the performance of the hardware accelerator. It is assumed that all datasets are parts of signals of equal duration, specifically 10 min. The signals considered are derived from the extended dataset presented in Section 3.3, and the characteristics relevant to the specific evaluation are as follows:
1. ECG_UoP: An ECG signal with a sampling rate of 2.5 kHz, 16 bits per sample and 10 min duration leads to an amount of data equal to 3,000,000 Bytes; the implemented compression algorithm is able to compress them to 1,292,952 Bytes, thus yielding a 56% compression rate.
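The raw data volume and the compression rate quoted for the ECG_UoP signal can be verified as follows; the function names are illustrative.

```python
def raw_size_bytes(sampling_rate_hz, bits_per_sample, duration_s):
    """Uncompressed size of a single-channel recording in bytes."""
    return sampling_rate_hz * duration_s * bits_per_sample // 8

def compression_rate(raw_bytes, compressed_bytes):
    """Fraction of the original volume removed by compression."""
    return 1 - compressed_bytes / raw_bytes

raw = raw_size_bytes(2500, 16, 10 * 60)
print(raw)                                          # 3000000
print(f"{compression_rate(raw, 1_292_952):.1%}")    # 56.9%
```

The computed rate of about 56.9% agrees with the 56% figure quoted above when the latter is truncated to an integer percentage.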

Figure 8 reveals behavioral patterns similar to the ones in the previous cases, strengthening the indications regarding the performance enhancement offered by the compression module. Focusing on the BT-based graph, an interesting point is that, in comparison to the respective measurements considering the ECG UoP signal, a steadily higher performance enhancement (~8%) can be observed in this case. This increase is attributed to the lower sampling frequency of the EkgMove sensor, which leads to fewer samples for the same time interval and thus to significantly faster compression execution by the proposed compression module. This is actually the part that effectively saves network resources in either the delay or the energy consumption domain. Also, in this case the type of signal has a critical role: the more compression-prone the signal is, the higher the performance enhancement will be, which in this case is close to 70% for both delay and energy reduction.
Focusing on the 802.15.4-based graphs, once again the exhibited decreases in delay and energy consumption are both quite emphatic. Both reach up to 70%, which can be achieved even with a moderate FPGA clock frequency of 20 MHz (62.5 MHz being the highest supported by the FPGA development board considered for the implementation).
Another interesting observation is that, even considering the significantly lower power consumption of IEEE 802.15.4 transceivers (compared to their Bluetooth counterpart), the integration of a compression accelerator module would still lead to a significant performance boost and consequently enhance the efficiency of any decision support system.

Discussion
This paper explores the possibility of exploiting data compression in order to effectively tackle the wireless data transmission requirements of epilepsy monitoring by typical WSN platforms. An adequate and objective study of epilepsy requires continuous acquisition of the respective modalities at high rates, thus creating excessive volumes of data. Following a multifaceted research and development approach, in this paper the problem is firstly addressed algorithmically, through the comparison of software-based compression algorithms targeting an optimum trade-off between compression rate and compression delay. In that respect, as analytically presented in [12], the proposed algorithm exhibits highly competitive performance. Secondly, going a step further towards practical solutions, the proposed algorithm is implemented at hardware level in the context of a typical FPGA platform. The implemented hardware module is presented in detail in this paper, while performance-wise it reduced the compression delay by a factor of approximately 1000. At the same time, dynamic power consumption ranges between 15 and 98 mW, representing a viable solution for contemporary WSN platforms. The latter is further emphasized in the third part of this work, where the anticipated effect of integrating such a component with real, prominent WSN platforms is evaluated.
In this respect, two well-known wireless interfaces are considered, characterized by different features and capabilities. Furthermore, the effect of running the proposed module at different clock frequencies is also taken into consideration. In all cases the benefits of using the proposed solution are quite apparent. Specifically, considering Bluetooth wireless transmission, the end-to-end delay reduction ranges between 11% and 65%, while the respective energy reduction ranges between 40% and 66%. The performance enhancement is even higher when considering IEEE 802.15.4 wireless communication, since both transmission delay and energy consumption are reduced by between 5% and 70%.
Such performance and resource conservation enhancements offer convincing arguments that hardware-based data compression comprises a viable solution for increasing network/node lifetime and for optimally managing wireless communication bandwidth.

Conclusions
In order for IoT platforms to comprise a reliable and efficient technological infrastructure for current as well as future CPS applications, novel real-time hardware approaches are required in order to enhance the platforms' capabilities while minimizing resource wastage. This paper addresses the notorious challenge of excessive utilization of wireless interfaces in highly demanding application scenarios such as bio-signal monitoring. To tackle the respective shortcomings of today's platforms, the two main pillars of this paper are (a) the design and implementation of an efficient, real-time compression algorithm, and (b) the design and implementation of a highly efficient and resource-conservative hardware accelerator for the proposed algorithm. The idea behind this approach is to offer on-the-fly hardware compression to any IoT platform, thus offloading the main processing unit, which usually has limited processing capabilities. Consequently, this paper offers a holistic solution to the problem of wirelessly transmitting excessive volumes of data, tackling all aspects from the design and software algorithmic performance, to the design of the actual hardware accelerator and, finally, the overall system performance. With respect to the latter, taking into consideration the power consumption of a real, prominent IoT platform and its respective transmission delay capabilities, the proposed hardware compression accelerator can yield an approximately 70% system-wide reduction in delay and energy consumption. Furthermore, the proposed module comprises a highly feasible and practical solution, since it occupies just 21% of the Spartan3 FPGA's configurable logic block slices and the maximum performance gains are achieved at just 1/3 of the maximum clock frequency tested. Such characteristics advocate the integration of the proposed module into today's WSN platforms.

Figure 4. Compression module block data flow.

Figure 5. Compression module Input/Output block timing. Start of block.

Figure 6. Compression module Input/Output block timing. End of block.

Table 1. Processing delay performance evaluation.

Table 2. Compression module processing delay and throughput rate (assuming x MHz operating frequency).

Table 5. Power consumption and throughput performance of compression hardware component.