A Low-Latency, Low-Power FPGA Implementation of ECG Signal Characterization Using Hermite Polynomials

: Automatic ECG signal characterization is of critical importance in patient monitoring and diagnosis. This process is computationally intensive, and low-power, online (real-time) solutions to this problem are of great interest. In this paper, we present a novel, dedicated hardware implementation of the ECG signal processing chain based on Hermite functions, aiming for real-time processing. Starting from 12-bit ADC samples of the ECG signal, the hardware implements ﬁltering, peak and QRS detection, and least-squares Hermite polynomial ﬁt on heartbeats. This hardware module can be used to compress ECG data or to perform beat classiﬁcation. The hardware implementation has been validated on a Field Programmable Gate Array (FPGA). The implementation is generated using an algorithm-to-hardware compiler tool-chain and the resulting hardware is characterized using a low-cost off-the-shelf FPGA card. The single-beat best-ﬁt computation latency when using six Hermite basis polynomials is under 1 s with a throughput of 3 beats/s and with an average power dissipation around 28 mW, demonstrating true real-time applicability.


Introduction
Cardiovascular disease is the number one cause of death worldwide [1]. An electrocardiogram (ECG) registers the electrical activity of a heart, and it stands as a valuable diagnostic tool. However, in clinical routines, ECG analysis is performed as a visual inspection by a cardiologist, which is a tedious task, further aggravated in the case of long-term ECG. For instance, 24 h of Holter recordings contains around 100,000 heartbeats. Figure 1 depicts the main components of the ECG, with the most important for diagnosis being the waves P, Q, R, S and T. The Q, R and S waves are normally studied together as the QRS complex. The P wave represents the moment when the auricles contract to send blood to the ventricles, and at the end of the PR segment, the ventricle is full. During the QRS complex, the ventricle expels their contents and are fully emptied at the end of the ST segment. The T wave indicates that the heart is at rest.
Developing efficient techniques to automate ECG analysis is instrumental in helping a cardiologist with their diagnosis. The detection of arrhythmias is of special interest [2]. The QRS complexes of heartbeats can be successfully used to identify most arrhythmia types [3][4][5]. The T wave does not contribute to the identification process [6] and the P wave, even though it provides relevant information about arrhythmias, possesses a low signal-to-noise ratio (SNR), so it is not reliable [7,8].
ECG analysis starts with the detection and characterization of the beats [9]. The detection of the QRS complex is carried out with a high accuracy; a 99.7% detection accuracy was reported in [10]. As for the characterization of the beat, among the different methods [6,11,12], the use of a function space based on Hermite polynomials has many advantages [3,10,13]: dimensionality reduction, low noise sensitivity, etc. The ECG samples are fitted with a linear combination of basis functions, and the coefficients of this linear combination are used as features for representing heartbeats. As an example of the resulting dimension reduction, the 144-sample QRS complex obtained at a rate of 360 sps can be reasonably characterized with 6 or 7 parameters [14]. Regarding the average classification error, values as small as 0.34% are reported in [15], thus supporting the development of new classifiers based on Hermite functions as well as hardware implementations able to provide high-quality real-time heartbeat analysis.
One disadvantage of the Hermite representation is that it is computationally demanding. There are some approaches addressing this problem. In [16], graphics processing units (GPU) are used to accelerate the offline processing of Hermite fitting of heartbeats. The use of Field-Programmable Gate Array (FPGA) devices is supported in [17]; in this paper, the results of an FPGA-based implementation aiming at wearable systems are presented. Reconfigurable devices (i.e., FPGA) allow for developing a custom architecture that can be adjusted to the different levels of computation performance and energy efficiency. Moreover, they can be used to prototype a system before being implemented as an application-specific integrated circuit (i.e., ASIC), which can achieve even better computation and electrical consumption performance. The developing times required for both FPGA and ASIC is quite long and complex in comparison with the traditional software approach (i.e., microprocessor-based or GPU-based), and high-level synthesis (HLS) tools have thrived in the last few years [18,19]. In this work, the HLS tool AHIR [20][21][22] has been used. AHIR is an open-source alternative to proprietary products that allows us to generate RTL descriptions from C language with reduced development times.
The central contribution of this paper is the design and implementation of a novel hardware module able to characterize heartbeats in real time by means of Hermite functions. This module can be used as the input to systems to compress the ECG data as well as to classifiers. Despite the interest in producing hardware systems for real-time processing of ECG signals [23][24][25][26], to the best of our knowledge, this is the first time that Hermite function fitting with a complete preprocessing chain is implemented in hardware for ECG processing. The main contributions of this paper are as follows: • Novel hardware implementation of full processing chain for real-time ECG characterization based on Hermite functions; • Introduction to the AHIR HLS tool; • Implementation of the system in a low-cost FPGA-based board; and • On-board power consumption measurements.
The paper is organized as follows: Section 2 elaborates on the Hermite fitting of heartbeats; in Section 3, the AHIR tool is presented; Section 4 describes the system implemented on an FPGA device; the implementation results are in Section 5, and they are analysed in Section 6; and, finally, the conclusions are drawn in Section 7.

Estimation of the QRS Complex with Hermite Polynomials
As mentioned in Section 1, QRS complexes are employed for arrhythmia detection and the use of Hermite functions allows us to reduce the number of dimensions involved in the ECG classification, without sacrificing accuracy [3,10] as well as enabling the transmission of ECG compressed data [27]. Moreover, Hermite fitting representations are robust in the presence of noise.
The MIT-BIH arrhythmia database [28] is used as a benchmark in this work. It contains 48 2-channel ECG recordings, sampled at a frequency of 360 Hz and with a duration of approximately 2000 beats (half an hour). Each beat has been manually annotated by at least two cardiologists, so it can be used to check the outcome of ECG automatic classification. The database includes an extended set of arrhythmias, and it has been extensively used in automatic arrhythmia classification [4,10,29].
Prior to QRS characterization, the ECG signal must be processed to remove the baseline drift and high-frequency noise [30]. The QRS complexes have a length of 70-100 ms; therefore, extracting a window of 200 ms around the peak (i.e., R wave) of the beat ensures that we acquire the complete complex while leaving the T and P waves out. The QRS window is expanded up to 400 ms by means of zero padding the extremes of the 200-ms window since the Hermite functions converge to zero in ±∞. Thus, the QRS beat data used as an input to the Hermite polynomial approximation consists of a 144-sample vector x = {x(t)} that can be estimated with a linear combination of N Hermite basis functions φ n by means of coefficients c n (Equation (1)). In this work, we use N = 6, which provides a good compromise between having a compact representation and having a good accuracy in the representation of the beat [14].
The aim of the Hermite fitting is to find the approximation to the QRS complex {x(t)} with the best minimum-mean-square-error (MMSE). The approximation of x(t) is expressed asx with where H n (t/σ) is the n th Hermite polynomial. The Hermite polynomials can be computed recursively as The parameter σ is a time-scaling factor in the polynomials that adjusts the width of the Hermite functions to the one of the actual QRS complexes. The maximum value of σ for a given order n is studied in [3].
Give σ, the orthogonality of the Hermite basis function allows us to find the optimal coefficient-those that minimize the square error-as In order to find the best fit, the MMSE approximation for each σ is obtained, and the one with the smallest value is kept. As a result, each heartbeat is represented by a set composed of the best σ and the corresponding fit coefficients c = {c n (σ)} (n ∈ [0, N − 1]) and it is possible to use only these parameters to perform the morphological classification of the heartbeats [3,29]. Figure 2 depicts the effect of increasing the number of Hermite function in the beat estimation. Figure 2a shows the original beat (in black) and the estimation with N ∈ {6, 12, 24}. It can be seen that, as long as the value of N is increased, the estimation captures the variations of the heartbeat in more detail. Figure 2b shows the decreasing trend of the minimum square error (MSE) for each estimation.

From Algorithm-to-Hardware Using AHIRV2, a C-2-VHDL Compiler
The AHIRV2 compiler tool-chain [20][21][22] provides a pathway from a C-program to actual synthesizable hardware. The tool-chain takes a description of an algorithm (described in C) and produces a VHDL logic circuit description that is equivalent to the algorithm.
The AHIRV2 compiler starts with a C program and produces VHDL. The clang 2.8 compiler (www.clang.org; accessed on 1 September 2021) acts as the C front-end and is used to emit LLVM byte-code (www.llvm.org), which is then converted to VHDL using the following transformations: 1.
The LLVM byte-code is translated into an internal intermediate format, which is itself a static-single assignment centric control-flow language (named Aa), which allows for the description of parallelism using fork-joined structures as well as arbitrary branching; 2.
The Aa description is translated to a virtual circuit (the model is described in the next subsection). During this translation, the following major optimizations are performed: declared storage objects are partitioned into disjoint memory spaces using pointer reference analysis, and dependency analysis is used to generate appropriate sequencing of operations in order to maximize the parallelism. Inner loops in the Aa code are pipelined so that multiple iterations of a loop can be executed concurrently; 3.
The virtual circuit is then translated to VHDL. At this point, decisions about operator sharing are taken. Concurrency analysis is used to determine if a shared hardware unit needs arbitration. Optimizations related to clock-frequency maximization are also carried out here. The generated VHDL uses a pre-designed library of useful operators ranging from multiplexors and arbiters to pipelined floating point arithmetic units (arbitrary precision arithmetic is supported, and in particular, there is full support for IEEE-754 single precision and double precision add/multiply with all rounding modes).

An Illustration of the Virtual Circuit Generated by the AHIRV2 Compiler
The virtual circuit generated by the AHIRV2 compiler consists of three cooperating components: the control path, the data path and the storage system [21,22].
To illustrate the model, we consider a simple example.
The AHIRV2 tool-chain transforms this program to produce a virtual circuit, which is depicted in Figure 3. The virtual circuit is then translated to synthesizable VHDL. The virtual circuit consists of three components independent parts, namely the data path, the storage subsystem and the control path.

The Data Path
The data path is a directed hyper-graph with nodes being operations and arcs being nets (shown as ovals). Each net has at most one operation that drives it. Furthermore, most operations have a split protocol handshake with the control path: two pairs of request/acknowledge associations ( * sr/ * sa for sampling the inputs and * cr/ * ca for updating the outputs). The operation samples its inputs upon receiving the sr request symbol and acknowledges the completion of this action by emitting the sa acknowledge symbol. After receiving the cr symbol, the operation updates its output net using the newly computed value. The sequencing is as follows:

sr -> sa -> cr -> ca
Note that an operation can be re-triggered while an earlier edition of the operation is in progress (this is important if the operation is implemented using a pipelined operator).
Some data-path operations (such as the multiplexor shown on the top and the decision operation shown at the bottom left in Figure 3) follow a simpler protocol. The multiplexor has a pair of requests and a single acknowledge, with the condition that at most one of the requests can be received at any time instant. The input corresponding to the request is then sampled and stored in the output net of the multiplexor. The decision operation has a single request and two acknowledges. Upon receipt of the request symbol, the decision operation checks its input net and emits one of the two acknowledges depending on whether the input is zero or nonzero.  In Figure 3, the following data-path operations are instantiated: Note that the data path only shows the operations and their interconnection. When the data path is implemented as hardware, multiple operations may be mapped to a single operator depending on cost/performance trade-offs. In this case, multiplexing logic is introduced in the hardware. These decisions and manipulations are performed in the compiler stage, which is responsible for transforming the virtual circuit to VHDL.

Storage Subsystem
The load and store operations in the data path are associated with memory subsystems. In general, there can be multiple disjoint memory subsystems inferred by the compiler. In this particular case, the arrays a[] and b[] are mapped to disjoint memories, due to which the two loads are allowed to proceed in parallel (the relaxed consistency model is enforced). In order to maintain the relaxed consistency model, the memory subsystems are designed to use a time-stamping scheme, which guarantees first-come-first-served access to the same memory location.

Control Path
The control path in the virtual circuit encodes all of the sequencing that is necessary for correct operation of the assembly. The control path (shown on the left in Figure 3) is modelled as a Petri-net with a unique entry point and a unique exit point. The Petri-net is constructed using a set of production rules, which guarantee liveness and safeness [21]. Transitions in the Petri-net are associated with output symbols to the data-path (these can be described by the regular expressions * sr and * cr) and input symbols from the data path (these are of the form * sa and * ca). The * sr symbols instruct an element in the data path to sample its inputs and the * cr symbols instruct an element in the data path to update its outputs (all outputs of data path elements are registered). The * sa and * ca symbols are acknowledgements from the data path, which indicate that the corresponding requests have been served.
The following classes of dependencies are encoded in the control Petri-net: where there is a WAR dependency through c, then the cr request to B can be issued only after the sa acknowledge from A has been received; • Load-Store ordering: If P, Q are load/store operations to the same memory subsystem, and if at least one of P, Q is a store, and if P is supposed to happen before Q, then the sr request to Q must be emitted only after the sa acknowledge from Q has been received. The memory subsystem itself guarantees that requests finish in the same order that they were initiated. This takes care of WAR, RAW and WAW memory dependencies.
The control path in Figure 3 shows the sequencing generated by these rules. When pipelining an inner loop, the execution of an operation in a particular iteration is enabled as soon as its dependencies on results from previous iterations are satisfied.

Implementation of the System
The analysis of an ECG signal received from a sensor goes through the following steps:

1.
Initial signal filtering to remove noise and drift; 2.
ECG beat recognition and identification of the QRS complex; 3.
ECG beat feature extraction: this can be performed in various ways. We look at the use of Hermite polynomials for the same; 4.
ECG classification: based on the beat features, classify the beat as normal or anomalous. This last step is not part of the current work.
We have implemented a signal chain that integrates the first three steps in the list above. Our main contribution is that we have built a custom hardware implementation of the entire signal flow up to Hermite classification, and demonstrated that sophisticated low power, real time ECG analysis is possible in hardware and that high level algorithm to hardware design techniques offer a practical pathway to such realizations.
The incoming ECG signal is assumed to be generated by an 11-bit ADC with a sampling rate of 360 Hz. For all experiments described in this report, we used 11-bit sampled data from the MIT arrhythmia reference database [28]. The initial signal processing such as the band-pass filter characteristics and the algorithm for QRS detection have been well studied in the literature [30]. The use of Hermite polynomials to extract features from the ECG signal has also been studied extensively [3,10,29].
The entire signal chain is illustrated in Figure 4. In our implementation, the signal chain is divided into two stages. The first stage (the front-end) is responsible for the signal filtering and the QRS peak detection. The second stage takes the identified beats and calculates a best Hermite-polynomial fit for the identified beat. We illustrate this division in Figure 5. All the elements of the signal chain are explained in Sections 4.1 and 4.2. Section 4.3 elaborates on the final system architecture included the signal chain as well as the control block and communications interfaces.

Algorithmic Description of the First Stage
The first stage is responsible for the filtering and QRS peak detection, and the sequence followed is shown in Listing 1.

The Band-Pass Filter
The bandpass filter used is a 99-tap FIR filter with 16-bit taps. The pass-band is set between 6 Hz and 28 Hz. The stop-band attenuation is chosen to be −40 dB. We acknowledge the use of an online filter design tool (http://t-filter.engineerjs.com) [31].

The QRS Detection Algorithm
The QRS detection algorithm is implemented in three stages:

1.
The band-pass filter outputs are sent through a derivative filter. This acts as a high pass filter that identifies the regions of rapid change (including the QRS complex); 2.
The output of the derivative filter is rectified and integrated using a moving average filter with 32 taps. The strong peaks of the sequence generated by this moving average filter are expected be in correspondence with the peaks of the QRS complex; 3. The output of the moving average filter is analysed by a threshold crossing state machine that attempts to identify the center peaks of the QRS complex.
The threshold crossing state machine is illustrated in Figure 6. For the sake of brevity, we do not present the entire C code of the finite state machine. However, a summary of the C code is shown in Listing 3. The algorithm gives the position of the QRS peak, and the heartbeat for further analysis consists of 144 samples centered at this peak.

The Second Stage: Calculation of Hermite Polynomial Fits
The first stage in the signal chain provides a QRS peak and a detected heartbeat (post band-pass filtering). Suppose is the detected beat. The Hermite polynomial basis set consists of the first six Hermite polynomials and a scale factor σ. The value of σ ranges between a minimum value of 1/120 and 1/90 and is discretized into 10 values. Denote the Hermite polynomial with order N and scale-factor σ as h N (σ) = {h N (σ, k)} 143 0 . We calculate the dot products as N varies from 1 to 6 and σ u varies as described above. The dot products are computed using single precision IEEE floating point arithmetic. The Hermite polynomial values are precomputed and stored in the hardware as tables.
The best fit is determined by the value of the scale factor σ u , which minimizes the mean square error This value of σ and the corresponding coefficients c σ,j are the features of the beat extracted by the Hermite fit. These values are used for further characterization of the beat as normal or anomalous [10,29].
The algorithm used for the second stage is shown in Listing 4. // report the best fit. sendBestFitToOutput(); } }

System Architecture
The system architecture follows the two stage approach described at the beginning of the section. The architecture is depicted in Figure 7.
A UART is used to configure the system by downloading the pre-calculated Hermite polynomials, the filter coefficients, and some configuration parameters. In this case, there are sixty distinct Hermite polynomials, each with 144 samples, with each sample being coded in single precision IEEE floating point format (4 bytes per sample).
After the initial configuration, ECG samples are streamed to the hardware, and fit coefficients are extracted for every detected beat. The peak throughput and total latency in the signal chain are characterized.

Results
The Xilinx Artix 7 series FPGA xc7a35tcpg236 (Xilinx, San Jose, CA, USA) was used as the platform for the hardware implementation. In particular, we used the BASYS-3 FPGA board from Digilent (Pullman, WA, USA) [32]. For synthesis, we used the Xilinx Vivado 2019.4 tools. The block diagram of the test setup is shown in Figure 8. In this setup, the host computer first uses the UART to download the Hermite polynomial tables and the filter coefficients to the system. After this is performed, ADC samples are streamed to the FPGA over the UART at a baud rate of 115,200. The post Hermite fits and QRS peak locations are monitored by an application on the host computer. It must be stressed that AHIR allows for simulation of the system by using benchmarks written in C. During the simulation, it is possible to select if the simulation is using the compiled C files or if the hardware functions are simulated by means of an HDL simulator (i.e., GHDL). In both cases, the input vectors are read from files and the output vectors are stored also in files, so it is possible to check the correctness of the hardware implementation. For the overall system, we present the The summary of resource utilization is shown in Table 1. For these particular FPGA devices, the limiting factors are the look-up tables (LUT). Thus, devices with more logic resources are required if the order of the polynomial is to be increased. To measure the latency in the entire signal chain, we timed the difference between the entry of the first byte of an ECG sample and the exit of the last byte of the Hermite characterization for the corresponding beat. For the throughput, we observed the maximum rate at which beat data could be supplied to the system. For a clock of 50 MHz, the latency and throughputs obtained were 0.82 s and 3 beats/s.
To characterize the power consumption, we observed the difference between the idle current drawn by the FPGA when it was quiescent (unprogrammed) and the current drawn by the FPGA during full speed (maximum throughput) operation. We use the power measurement setup presented in Figure 9.
Basys 3 board features a jumper JP2 that is used as power source select and is located at the entrance of the power supply. It selects whether power supply comes from the USB cable or External power supply. In this work, we use USB power supply of 5 V. We add a shunt resistance over this jumper and use differential probe to measure the voltage over the shunt. Since the resistance is in series with the power supply, we are able to obtain the current that goes to the board from the power supply. By knowing the input voltage and input current, we obtain the power consumed by the board. The resistance value is chosen to ensure the correct functionality of the power supply regulators located on the Basys 3 board, as explained next. Voltage regulator circuits create the required 3.3 V, 1.8 V and 1 V from the main power supply [32]. A power supply of 1 V is used for an FPGA core; 1.8 V is used for an auxiliary FPGA supply and RAM memory; and 3.3 V is used for IO pins, USB connection, clocks, Flash, etc. Based on typical and maximum current values for each of these supplies, listed in [32], we compute an approximate value for the shunt resistance. According to our estimates, the peak current values for the design should not exceed 80mA, and current demand on the other two supplies should not be extreme either. As a result, when maximum typical current values for the 1.8 V and 3.3 V (150 mA and 1.5 A, respectively) and 80 mA for the 1 V supply are assumed, an approximate value for the resistance is 0.52 Ω. We use 0.47 Ω for our measurements as a value that is close to the estimated one.
Since we are interested in the current consumed by the design only, we first measure the current when the FPGA is programmed and the application is running, i.e., data are sent and received. The measured current is 170.96 mA on average. Then, we subtract the current measured when the FPGA is programmed, but without any data traffic, that becomes 165.36 mA. By subtracting this current, we eliminate the current consumed by other parts of the board as well as the FPGA static current. Consequently, the proposed design consumes 5.6 mA on average. When this current is multiplied by the 5 V input voltage, it results in 28 mW of approximated FPGA dynamic power.
The results are summarized in Table 2. The obtained latency and throughput fits real-time requirements, and the power consumption is low.

Discussion
The hardware implementation of automatic ECG analysis systems is essential for ambulant monitorization of patients, and there are several examples in the literature for both ASIC [23,24] and FPGA [25,26] implementations. However, to the best of our knowledge, there are no hardware implementations of ECG signal processors that apply the Hermite fit for beat compression or classifications. For example, the work in [26] describes the implementation of another technique called Empirical Mode Decomposition applied to ECG signals in a Spartan 3E FPGA but does not report power, performance and area metrics. As for the detection performance, the overall accuracy reported is 94.8%, while with Hermite functions, it is possible to achieve 96.66%. The work in [25] is a HW/SW co-design where the QRS complex extraction is implemented in an FPGA and is based on geometrical properties of a two-dimensional phase-space portrait of the ECG signal, while the beat classification is performed by Open Source ECG analysis software. The data are read from and written to the on-board DDR memory, while the data proposed in this work are sent and received by UART, corresponding to a more realistic case, since it could be easily replaced by an ADC interface. Additionally, the pre-processing and pre-partition are performed on the software in [25], so a fair comparison with this work would be difficult to achieve. The authors reported a premature ventricular detection of 92.36%, while with Hermite functions, it is possible to reach 96.86%.
Preliminary results of the proposed design were presented in [17]. Only the Hermite fit process was tackled in our previous work, so the pre-processing chain was neglected. A peak power consumption of 3 W was reported in contrast with the averaged power of 28 mW achieved in the current design. This new version of the circuit can be used to feed a hardware block to perform data compression or classification in real-time with a low power consumption.
The reported performance metrics are promising. The latency is close to a second, which is suitable given that heart rates are commonly between 1 and 2 beats/s; thus, the results of the first beat characterization appear after 1 or 2 beats. The throughput is around 3 beats/s, which covers heart rates of 180 beats/min, which is an extreme situation for a person. Finally, the power consumption is around 30 mW, which is a low value for an FPGA.
Summarizing, the results yield that the system is capable of real-time and lowpower processing.

Conclusions
In this paper, we presented the design of an FPGA-based system able to perform real-time ECG characterization through Hermite polynomials. The AHIR HLS tool was used to perform the development and testing. The system was successfully implemented on a low-cost board with a latency of less than 1 s, a throughput of 3 beats/s and a power consumption around 28 mW. Hence, we demonstrated that complex low power, real-time ECG analysis is possible through high-level synthesis.
The current design can be easily modified and extended due to the flexibility provided by the AHIR set of tools. On one hand, the number of polynomials used in the estimation (i.e., N) can be increased to improve the accuracy of the estimations. Moreover, a clustering block to help in the classification process can be added [10]. In any case, it is clear that a bigger FPGA device is necessary. Additionally, the throughput can be increased to consider higher heart rates, which involves increasing parallelism and, therefore, increasing the resources demand. All of these new ideas can be easily designed and tested with the HLS approach provided by AHIR.
Author Contributions: M.P.D., G.C., D.G.M. and A.O., design of the signal processing algorithms. M.P.D., G.C. and R.J., conceptualization, implementation and testing of the research. All authors developed the methodology. M.P.D., G.C. and R.J. discussed the basic structure of the manuscript, drafted its main parts, and reviewed and edited the draft. All authors have read and agreed to the published version of the manuscript.
Funding: This research has been partially funded by the Spanish Ministry of Science, Innovation and Universities through project RTI2018-095324-B-I00.

Conflicts of Interest:
The authors declare no conflict of interest.