Heartbeats Do Not Make Good Pseudo-Random Number Generators: An Analysis of the Randomness of Inter-Pulse Intervals

The proliferation of wearable and implantable medical devices has given rise to an interest in developing security schemes suitable for these systems and the environment in which they operate. One area that has received much attention lately is the use of (human) biological signals as the basis for biometric authentication, identification and the generation of cryptographic keys. The heart signal (e.g., as recorded in an electrocardiogram) has been used by several researchers in the last few years. Specifically, the so-called Inter-Pulse Intervals (IPIs), i.e., the times elapsed between consecutive heartbeats, have been repeatedly pointed out as a potentially good source of entropy and are at the core of various recent authentication protocols. In this work, we report the results of a large-scale statistical study to determine whether this assumption holds. For this, we have analyzed 19 public datasets of heart signals from the Physionet repository, spanning electrocardiograms from 1353 subjects sampled at different frequencies and with lengths that vary between a few minutes and several hours. We believe this is the largest dataset on this topic analyzed in the literature. We have then applied a standard battery of randomness tests to the extracted IPIs. Under the algorithms described in this paper and after analyzing these 19 public ECG datasets, our results raise doubts about the use of IPI values as a good source of randomness for cryptographic purposes. This has repercussions both for the security of some of the protocols proposed up to now and for the design of future IPI-based schemes.


Introduction
eHealth is a relatively novel term that is commonly used to refer to healthcare services delivered through (or making extensive use of) technology and telecommunications systems. eHealth can be seen as a special subset of the Internet of Things (IoT), where "things" are essentially sensors that are constantly gathering information about the medical condition of a subject. Additionally, when these sensors are placed in, on or around the human body to monitor the vital signs of the bearer anywhere and anytime, they are said to form a Body Area Network (BAN), also known as a Body Sensor Network (BSN). BAN devices can communicate with a central device (also known as a hub, commonly implemented by a smartphone) with Internet connectivity, and in the near future, all these devices will be able to interact directly with each other.
Information gathered by a BAN, which may contain highly sensitive data privacy-wise, is usually shared with other devices in the network and can also be sent to public servers in order to be accessible by different people such as medical staff or the user's personal trainer, or simply for private purposes. Implantable Medical Devices (IMDs) such as pacemakers, insulin pumps or cochlear implants have traditionally been thought of as the only devices in charge of measuring biological information. However, there are many other gadgets, such as smartphones, wristbands or even smartwatches, that can be used to sense some vital signs of the bearer without interfering with her/his daily life.
The security of this network and of the sensitive data it gathers has been identified by the research community [1][2][3] as one of the most challenging tasks to be solved before deployment in real scenarios. As an example, imagine someone who is equipped with sensors whose information is shared wirelessly; it could be easy for an attacker to sniff the communication channel, listen to the transmitted packets and learn something about the bearer. Therefore, new cryptographic protocols are needed not only to protect the user's identity, but also to protect the integrity of the patient's medical data [4,5].
Biometrics refers to identification and authentication methods that, using biological signals, can identify or validate the identity of a person. In the last few years, several works have focused on biometric authentication and identification [6][7][8]. This kind of authentication system has great potential because each biological trait must be universal, collectable, unobtrusive, permanent, unique and difficult to circumvent [9]. Biometric signals can be classified into physiological and behavioral signals [10]. Examples of physiological signals include face features, the fingerprint, the iris, the Electrocardiogram (ECG), the Electromyogram (EMG) or the Galvanic Skin Response (GSR). Behavioral traits have also been proposed, such as the voice, the signature, keystroke dynamics or lip movements, among others.
Biometrics have also been used to generate personal cryptographic keys [11] by using biological signals as Pseudorandom Number Generators (PRNGs). In order to check whether a given sequence of numbers can be considered random, there are some well-known tests like Shannon's entropy, the Monte Carlo test or the frequency test, among others. However, instead of using isolated tests, it is more common to rely on public suites when evaluating the randomness property [12]: ENT (available at http://www.fourmilab.ch/random/), the Statistical Test Suite published by the National Institute of Standards and Technology (NIST STS), DieHard (available at http://stat.fsu.edu/pub/diehard/) or TestU01 (available at http://simul.iro.umontreal.ca/testu01/tu01.html). It is important to remark that ENT was initially designed for general purposes, whereas the other suites are focused on guaranteeing security properties.

Overview of Our Results
In the last few years, entropy analysis has been shown to be an effective mechanism to assist doctors in medical problems [13]. For instance, the analysis of brain images can help the detection of some brain diseases [14,15]. Another good example is the detection of cardiac problems through the analysis of ECG records [16][17][18]. In addition, and outside of the medical context, recent works have demonstrated that ECG signals can also be used as a source of entropy for security purposes [8,[19][20][21]. In particular, this is done by calculating the Inter-Pulse Interval (IPI), which is the time interval between two consecutive R-peaks of the ECG. If an arbitrary R-peak occurs at time t_R(i), then the IPI can be computed as the time difference between t_R(i) and t_R(i-1), i.e., IPI(i) = t_R(i) - t_R(i-1), as can be seen in Figure 1. We refer the reader to Section 2.2 for more details about the components of an ECG signal and to Section 3.2 for the IPI extraction algorithm.
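The IPI computation above amounts to a first-order difference of the R-peak timestamps. A minimal sketch (our own illustration, not the authors' code; the timestamps are hypothetical):

```python
# Computing IPIs as IPI(i) = t_R(i) - t_R(i-1) from R-peak timestamps.

def ipis_from_r_peaks(r_times):
    """Return the inter-pulse intervals between consecutive R-peak times."""
    return [t1 - t0 for t0, t1 in zip(r_times, r_times[1:])]

r_times = [0.00, 0.81, 1.63, 2.42]   # example R-peak times in seconds
print(ipis_from_r_peaks(r_times))    # three IPIs for four peaks
```

Note that n detected R-peaks always yield n - 1 IPIs, which is why the paper later needs 33 consecutive R-peaks to obtain 32 IPIs.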
Nowadays, apart from the common medical electrodes that record the ECG signal, there exists a myriad of devices equipped with dedicated sensors to measure the heart signal. For instance, measuring the heart rate can determine the efficiency of a workout or even the calories that someone has burned. To do so, the exercise machines used in gyms normally have some metallic areas located on the support bars that pick up the small electrical signals passing through the skin. There are also wearable devices with Photoplethysmographic (PPG) sensors that detect changes in blood volume to obtain the heart beats, i.e., the device illuminates the skin with a light source like an LED and measures the changes in light absorption. Nowadays, PPG monitors can be found in most wristbands and smartwatches. Other mechanisms, like chest bands, are commonly used by athletes when training or even in competitions to check their heart rates.
Figure 1. A typical electrocardiogram (ECG) signal and its main features: peaks (P, Q, R, S, T, U), waves, segments and intervals. IPI, Inter-Pulse Interval.
Many authors have claimed that the Least Significant Bits (LSBs) of the IPI contain a high degree of entropy [6,8,19,[22][23][24][25][26][27][28][29][30]. In addition, most of these authors use public databases to support this entropy claim, concluding that the resulting bits can be considered random numbers and can be part of key generation in authentication protocols.
Recent IPI-based authentication, identification and key generation protocols (e.g., [25,[28][29][30]) suffer from two main weaknesses. First, they only use measures of entropy to determine whether the generated cryptographic material (keys and other intermediate values, such as nonces) is random or not. Second, the datasets used in these works are rather small and, therefore, possibly not significant enough. Additionally, such datasets contain ECG signals obtained both from healthy subjects and from others who suffer some heart-related pathology, and it is unclear whether this has some influence on the overall quality (i.e., randomness) of the derived bits. Some of these observations have already been raised in [12], in which the authors pointed out the need to perform a sounder assessment of the quality of the generated keys using larger datasets and additional randomness tests. Nevertheless, the code that the authors used to run these experiments is not available.
In this work, we overcome these weaknesses by performing an analysis of the randomness of 19 different public databases containing heart signals. Our contributions can be summarized as follows:
• We have downloaded 19 public databases with information about heart signals from different people. All datasets are taken from the Physionet repository (https://physionet.org/physiobank/database/#ecg) [31], which contains heart signals from both healthy volunteers and people with cardiac conditions. We then extracted the last four bits of the IPI of each person per database, thus creating a bit stream whose quality can be tested. In doing so, we attempt to address the gap detected in [12].
• We analyze all files independently to check whether the ECG can be considered a good random number generator. To do so, two randomness suites (ENT, general purpose, and NIST STS, security-oriented) have been run over all previously generated files. To the best of our knowledge, this is the first work that discusses how the ECG signal should be used in cryptographic protocols as a source of random numbers. Our scripts are made public (https://github.com/aylara/Random_ECG) to facilitate the replication of our results by other researchers.
• Contrary to prior proposals, we show that although the ECG signal contains some degree of randomness, its use in cryptographic applications is questionable. Some databases obtained reasonable results on either ENT or NIST STS. However, none of the tested databases obtained good results on both at the same time, with the mitdb database as the only partial exception.
The rest of the paper is organized as follows: Section 2 provides some background on biometric authentication according to ECG and the basic description of some random tests. Section 3 describes the evaluation of our implementations and a discussion of the results. This paper ends with some conclusions in Section 4.

Background
In this section, we provide some background on related work: biometric authentication, IPI-based authentication and key derivation protocols, as well as randomness tests.

Biometric Authentication
Biometric protocols provide security services such as the authentication and identification of a given person among a large set of people. Figure 2 illustrates the standard pipeline of a biometric system, from the signal acquisition and preprocessing to the final decision-making process to identify/authenticate the subject. At the core of the system there is a pattern-matching process between a freshly-acquired template built from the subject's signal and a previously-stored template. The matching process is usually done by defining an acceptance threshold and calculating the Hamming distance between both templates to decide whether the subject is or is not authenticated. The signal is usually acquired by sensors that can be located in, on or around the human body. Examples of well-known biometric signals include the iris [32], the fingerprint and face [33,34], the voice [35] and the ECG [36].
Biometric approaches have been combined with traditional cryptographic primitives in several ways, including the replacement of matching algorithms by secure versions [37,38], using biometric templates in Secure Multiparty Computation (SMC), homomorphic encryption schemes [34,39,40] or with elliptic curves [41,42]. Apart from cryptographic proposals, the use of biometric signals to generate cryptographic keys has been widely studied in the literature (see, e.g., [6,[22][23][24][25][26][27][28][29][30]43]). In most of these works, the authors obtain a biological signal from different sensors or devices, such as the Electroencephalogram (EEG), the PPG, the ECG or accelerometers, and check whether the signals can be considered random or not. To do so, the common practice is to extract some feature(s) from the signal and then run several randomness tests to validate the hypothesis.
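The threshold-based matching step described above can be sketched in a few lines. This is our own illustration (the function names, bit strings and threshold are hypothetical, not taken from any of the cited schemes):

```python
# Threshold matching: authenticate iff the Hamming distance between the
# fresh and the stored binary templates is at most the acceptance threshold.

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def authenticate(fresh, stored, threshold):
    return hamming_distance(fresh, stored) <= threshold

stored = "10110010"
print(authenticate("10110110", stored, threshold=2))  # one bit differs: accepted
print(authenticate("01001101", stored, threshold=2))  # all bits differ: rejected
```

The threshold trades off false rejections (a strict threshold rejects noisy but genuine readings) against false acceptances (a loose threshold accepts impostors).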
Particularly, the use of IPIs as a source of random numbers has gained special attention in cryptographic applications; for instance, in [44][45][46] to generate a private key, in [6] as part of an authentication protocol, in [47][48][49] as an alternative to classical key establishment protocols, or in [50] as part of a proximity detection protocol. It is worth noting the importance that IPIs have in all the aforementioned scenarios and why random number generation is crucial. Figure 1 shows a typical ECG trace. The signal contains six different peaks, known by the letters P, Q, R, S, T and U. Heartbeats are commonly measured as the time distance between two consecutive R-peaks, known as the Inter-Pulse Interval (IPI), and several works published over the last decade have noted that the sequence of IPIs contains some entropy. To obtain such random bits, each IPI must first be quantized, i.e., represented in binary using some coding scheme. Most works omit the details of the particular coding scheme used, which is quite unfortunate since this is a critical component for the entropy (or lack thereof) of the resulting binary sequence. One notable exception is the work of Rostami et al. [6], which will be described in more detail in Section 3.2 since it is the coding scheme used in this paper.

IPI-Based Security Protocols
The majority of the proposed works in this area, e.g., [25,[27][28][29][30]51], conclude that the last four bits of each IPI can be used as a random number because of their high entropy. Thus, if an authentication protocol requires a 128-bit key, it would be necessary to acquire 32 IPIs (i.e., at least 33 consecutive R-peaks). Considering that a regular heart beats at 50-100 beats per minute (bpm), the key generation process would take between 20 and 40 s. To prove that the extracted bits have a certain level of randomness, most works use either the common Shannon or Rényi entropies [29], which are not enough to claim the randomness of a sequence of beats. Additionally, in [6,26,27,47,51,52], the authors make similar claims about the randomness of the IPIs by running the NIST STS battery of randomness tests, whereas in [8], the authors rely on the ENT suite. Table 1 summarizes the datasets used by the existing works in this area; the last column shows the number of executed tests, where, for instance, NIST STS (5/15) means that the authors ran five tests out of the 15 that the NIST STS suite comprises. Note that [52] is the only work where the authors ran all the tests of which NIST STS is composed. We were not able to find the reasons for running only a subset of tests in the rest of the works that use NIST STS.
[28] mitdb (no info is given): Shannon's entropy
[30] mitdb (no info is given): Shannon's entropy
[8] mitdb (no info is given): ENT
[29] mitdb (no info is given): Rényi's entropy
[47] PhysioNet: NIST STS (9/15)
[26] 84 subjects from a private dataset and the European ST-T: NIST STS (5/15)
[27] 18 subjects from MIT-BIH and 79 from the European ST-T: NIST STS (10/15)
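The 20-40 s figure quoted above follows from simple arithmetic, which can be sketched as follows (function and variable names are ours):

```python
# Back-of-the-envelope check of the key-generation time: a 128-bit key at
# 4 bits per IPI needs 32 IPIs, and at 50-100 beats per minute one IPI
# takes between 60/100 and 60/50 seconds.

def key_generation_time(key_bits, bits_per_ipi, bpm):
    """Seconds needed to harvest key_bits at the given heart rate."""
    ipis_needed = key_bits / bits_per_ipi
    return ipis_needed * 60.0 / bpm

print(key_generation_time(128, 4, 100))  # fastest heart rate: 19.2 s
print(key_generation_time(128, 4, 50))   # slowest heart rate: 38.4 s
```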

Randomness Tests
One key aspect of all IPI-based protocols is the assumption that some bits (four, typically) of each IPI are highly entropic. This condition is necessary, but not sufficient to guarantee the security of the protocol. In other words, high entropy does not necessarily imply randomness. Therefore, more sophisticated tests should be also applied to ensure that the values are indistinguishable from a random sequence.
In this paper, we have used the ENT [54] and NIST STS [12] suites to evaluate how good the generated random numbers are. In particular, ENT is a suite composed of the following statistical tests: entropy, optimum compression, chi square, arithmetic mean, Monte Carlo value for π and serial correlation coefficient. ENT reports the overall randomness results after running the aforementioned tests. In contrast, NIST STS is a suite made of fifteen statistical tests: frequency monobit and block frequency tests, runs, longest run of ones in a block, binary matrix rank, the discrete Fourier transform (spectral) test, overlapping and non-overlapping template matching, Maurer's universal statistical test, linear complexity, serial, approximate entropy, cumulative sums, random excursions and random excursions variant tests. For each test, NIST STS reports a p-value that indicates whether the given sequence has passed it or not.
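As an illustration of how one of the ENT tests works, the Monte Carlo value for π can be sketched in a few lines. This is our own simplified version (ENT derives 24-bit coordinates from groups of six bytes, whereas here each coordinate is a single byte):

```python
# Monte Carlo pi estimation: consecutive byte pairs are interpreted as
# (x, y) points in the unit square; the fraction landing inside the
# inscribed quarter circle estimates pi/4.
import random

def monte_carlo_pi(data):
    inside = total = 0
    for i in range(0, len(data) - 1, 2):
        x = data[i] / 255.0
        y = data[i + 1] / 255.0
        total += 1
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / total

random.seed(0)                                          # deterministic demo
stream = bytes(random.randrange(256) for _ in range(100_000))
print(monte_carlo_pi(stream))   # close to pi for a uniform stream
```

A strongly biased byte stream pushes the estimate away from π, which is how ENT turns this geometric experiment into a randomness indicator.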
For completeness, we refer the reader to Appendix A, where we provide a brief description of each one of the tests that form part of both the ENT and NIST STS suites.

The Randomness of IPI Sequences
This section describes our experiments to analyze the randomness of the IPI values and a discussion of the obtained results.

Dataset
For consistency with previous research in this area, we first downloaded the mitdb, ptbdb and mghdb Physionet databases from [31] (the software package to access the data repository can be found at https://physionet.org/physiotools/wfdb.shtml), and we tried to replicate the experimental setting used by Rostami et al. in [6] and by Xu et al. in [47]. The results, however, were impossible to reproduce due to the lack of information provided in the original papers. The downloaded databases contain the information of several subjects, and we do not know how the original experiments were run, e.g.: (i) by acquiring the last 4 LSBs of the ECG of each one of the subjects and then running (a subset of) the NIST STS tests per person; (ii) by generating one single file with the information of all subjects belonging to the same database and using this file as the input of some of the NIST STS tests; or (iii) by generating one single file with the information of all subjects of all databases and then running (a subset of) the NIST STS tests on it.
Due to the fact that only one single value was given in [6] regarding the final results of the NIST STS, and also that at a certain point the authors claim that they used an aggregate of different databases for the error generation, we assume that the authors used Approach (iii): they created one single file with the four LSBs of the IPIs of different people belonging to different datasets. Nevertheless, we consider that this is not a realistic experiment because of the heterogeneity of the databases (see Table 2), as was also pointed out in [12]. The authors in [47], in turn, neither provide the achieved results of the NIST STS, nor say which database(s) they used for testing.
For these reasons, we have substantially extended this setting with 16 additional datasets of ECGs also present in the Physionet repository. All these datasets contain ECG records obtained from a variety of real subjects, in many cases with different heart-related pathologies. Table 2 shows the main features of the 19 datasets used in this work. Furthermore, we have computed the median number of extracted IPIs per file (person) per database. Records that are too short yield too few IPIs to support any meaningful randomness claim: for instance, it is easy to argue that the heart signals acquired from people equipped with Holters (cdb) cannot be used to prove that the heart signal is random enough. Similar cases occur with the iafdb, ptbdb and twadb databases, with medians of only 37, 68 and 87 IPIs, respectively.
In order to avoid the aforementioned problems and to allow other researchers to reproduce the results, we have split up the results into their corresponding databases. After that, we extracted the four LSBs of each subject and ran the randomness tests (NIST STS and ENT suites) on each individual file (corresponding to each subject of each database) to evaluate how good the generated random numbers are. Finally, the results are grouped per pathology (database), and we report the percentage of files (persons) that successfully passed the randomness tests.

IPI Extraction
Previous works in this area found out that the four LSBs of each IPI are highly entropic [6,47]. We replicated this process as follows. We first used a MATLAB script available at the Physionet repository (https://physionet.org/physiotools/software-index.shtml) to obtain the ECG signal for each record (person) in each one of the 19 datasets. We next applied the following steps:

1. Get the sampling frequency for each signal, which is available in an associated description record.
2. Run the Pan-Tompkins QRS detection algorithm [74] over the ECG signal to extract the R-peaks.
3. Get the timestamp of each R-peak and calculate the difference between each pair of consecutive R-peaks to obtain the sequence of raw IPI values.
4. Apply a dynamic quantization algorithm to each IPI to decrease the measurement errors. This process consists of generating discrete values from the continuous ECG signal.
5. Apply a Gray code to the resulting quantized IPI values to increase the error margin of the physiological parameters.
6. Extract the four LSBs from each coded IPI value.
Each sequence of extracted bits per record of each dataset is stored in a separate file for subsequent analysis.
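Steps 4-6 above can be sketched in Python. This is our own simplification: the paper applies a dynamic quantization algorithm, whereas this sketch uses a fixed quantization step, and the IPI values are hypothetical.

```python
# Quantize an IPI (step 4, simplified to a fixed step), Gray-encode it
# (step 5) and keep the four least significant bits (step 6).

def gray_code(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def ipi_to_4_lsbs(ipi_seconds, step=0.004):
    quantized = int(round(ipi_seconds / step))   # fixed-step quantizer
    coded = gray_code(quantized)                 # Gray coding
    return coded & 0b1111                        # four LSBs

ipis = [0.812, 0.798, 0.845, 0.803]              # hypothetical IPIs (s)
bits = "".join(format(ipi_to_4_lsbs(x), "04b") for x in ipis)
print(bits)   # 16-bit stream, 4 bits per IPI
```

Gray coding is used because consecutive codewords differ in a single bit, so a one-step quantization error between two devices measuring the same heartbeat flips only one bit of the derived value.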
Additionally, we conducted one more experiment in MATLAB on a MacPro laptop with 4 GB of RAM to estimate how long the signal should be to extract a stream of x bits. To do so, we computed the average number of IPIs and the average signal length over the nineteen databases. The results, shown in Figure 3, indicate that the relation between time and the number of extracted bits is linear; for instance, after almost 4 h, we will have approximately 60,000 bits that can be used as random numbers. These results are also consistent with the hypothesis that only a few tens of seconds are enough to extract a valid cryptographic key: to generate a 128-bit key, a device should wait between 20 and 50 s. It is also remarkable that, depending on the scenario, this time constraint might not be acceptable; e.g., a person who is suffering from a heart attack cannot wait a minute to authenticate her/his pacemaker with the caregiver device.
Figure 3. Average beatstream length (bits) as a function of signal duration, together with a linear fit (polyfit) and its error.

Measuring Randomness
In this section, we discuss the results of applying both the NIST STS and ENT test suites to the datasets discussed above.

ENT
As described in Appendix A.1, the ENT suite is comprised of six tests of randomness. Table 3 shows the optimum value for each one of them. Along with this, we also provide two additional values for each test: (i) a threshold, which constitutes a more affordable value for each test, since the optimal output is quite restrictive and most sequences would fail the tests otherwise; and (ii) the test result obtained for an input sequence consisting of a simple counter from zero to 2^14. The purpose of this experiment is just to demonstrate that the result of a single test cannot be used alone to claim evidence of randomness; see, e.g., the output achieved by the counting sequence for the entropy, the arithmetic mean, the serial correlation coefficient and the optimum compression tests. The results obtained after applying the six ENT tests to each one of the files (persons) (with the IPIs of their ECG signals in our 19 datasets) can be seen in Table 4. Each cell in the table provides the percentage of persons who pass the test using the threshold shown in Table 3. For instance, in the case of the mitdb database, we have generated 46 files, belonging to the 46 persons involved in this database, with a median of 1113 IPIs per file. The results for this database are that all persons pass both the entropy and optimum compression tests (100%), but none of them pass the chi square test (0%); 45 out of 46 pass the arithmetic mean and the serial correlation tests (97.83%); and 22 out of 46 pass the Monte Carlo value for π (47.83%).
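The counter experiment above is easy to reproduce. The following sketch (our own code, computing byte-level Shannon entropy the way ENT reports it) shows that a plain counter scores high entropy despite being perfectly predictable:

```python
# A counter from 0 to 2**14, serialized as 16-bit values, yields a high
# byte-level Shannon entropy (about 7.55 of the 8 bits/byte optimum)
# even though the sequence contains no randomness at all.
from collections import Counter
from math import log2

def byte_entropy(data):
    """Shannon entropy in bits per byte of a byte string."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * log2(c / n) for c in counts.values())

stream = b"".join(i.to_bytes(2, "big") for i in range(2**14))
print(byte_entropy(stream))
```

This is precisely why entropy alone, as used in several of the works surveyed in Table 1, cannot substantiate a randomness claim.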
Overall, the first noticeable observation is that the results are quite good across all datasets in the entropy, optimum compression and serial correlation tests, whereas for the chi square test, the results are catastrophic. The situation is similar for the Monte Carlo π test, where all databases fail except szdb, slpdb, edb and shareedb, which achieve 71.43%, 74.47%, 60% and 55.52%, respectively. The arithmetic mean test, in contrast, achieves good results in general, although vfdb and cudb fail it with 17% and 44.44%, respectively. Looking at the results from a dataset perspective, we were not able to identify any correlation between the test results and the information available to us (number of samples, sampling frequency, signal length, IPIs per file or characteristics of the subjects). See also the discussion provided later in Section 3.4 for an additional analysis of this point.

NIST STS
In Appendix A.2, a description of all fifteen tests that comprise this suite can be found. As a common feature, all NIST STS tests are parameterized by a variable n, which denotes the length in bits of the processed bitstream. Additionally, some of the tests can also detect local non-randomness: the frequency test within a block, overlapping and non-overlapping template matching, Maurer's "universal statistical" test, and the linear complexity, serial and approximate entropy tests. These tests are also parameterized by a second variable, denoted as m or M. Those tests that use the m parameter are mainly focused on the detection of m-bit patterns in the stream, whereas those that use the M parameter check the distribution of the specific feature across n/M blocks of equal size (M bits). Table 5 shows the minimum requirements in terms of length.
Furthermore, if we take into account the values in Table 2 regarding the length (median) of our datasets, we cannot run the original NIST STS with a sufficient confidence level. The Physionet datasets are irregular in their size, with several of them being too small to be used with the original tests. In order to circumvent the length constraints of the original NIST STS, we have used a variant [75] of the original software package. Table 6 provides the success rate obtained for the 15 NIST STS tests for the files (subjects) of each dataset. In this case, we used the pass criteria included in each test, which are based on an analysis of the yielded p-values; in other words, p-values of less than 0.01 are considered rejections. Overall, the results are similar to those obtained for ENT, although the success rate is generally higher in most cases. Furthermore, there are substantial differences across datasets. For instance, iafdb, ptbdb and twadb obtain success rates higher than 80% in 12, 12 and 11 out of the 15 tests, respectively. In contrast, the performance of many datasets is considerably poor, with more than 50% of their records not passing a majority of the tests: see, for example, the cases of apnea-ecg and cudb (more than 50% of the records fail nine out of 15 tests); svdb (more than 50% of the records fail most of the 15 tests); edb, slpdb, szdb and vfdb (more than 50% of the records fail 11 out of 15 tests); mghdb (more than 50% of the records fail 12 out of 15 tests); and lspdb (more than 50% of the records fail 13 out of 15 tests). In the case of slpdb and szdb, the results are very deficient, with all signals in both datasets failing nine out of the 15 tests (i.e., a 0% success rate). In terms of performance against individual tests, the results are rather diverse, with a few exceptions. The case of the linear complexity test stands out, as most datasets exhibit an extremely poor result.
This suggests the existence of patterns that can be modeled by linear prediction functions, which undoubtedly implies predictability. Similarly, most datasets perform badly in the monobit and block frequency tests, which reveals a non-negligible imbalance of zeroes and ones, both globally (monobit frequency) and within M-bit blocks (block frequency).
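For intuition on what the monobit frequency test measures, here is a minimal sketch following its definition in NIST SP 800-22 (our own implementation; the example sequences are hypothetical):

```python
# NIST STS frequency (monobit) test: map bits to +/-1, sum them, and
# compute p = erfc(|S| / sqrt(n) / sqrt(2)). The sequence fails when the
# proportion of ones deviates too far from 1/2 (p-value below 0.01).
from math import erfc, sqrt

def monobit_p_value(bits):
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    return erfc(abs(s) / sqrt(n) / sqrt(2))

balanced = "10" * 512                  # exactly half ones
biased = "1" * 700 + "0" * 324         # strong imbalance of ones
print(monobit_p_value(balanced) >= 0.01)   # passes
print(monobit_p_value(biased) >= 0.01)     # fails
```

Note that the perfectly alternating sequence above passes this test despite being trivially predictable; it would fail the runs test, which is exactly why the suite combines fifteen complementary tests.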

Discussion
Table 7 summarizes all tested databases along with the typology of each dataset, in order to look for relations between them. Notice that if we analyze the results on average, all databases achieve reasonable results in the ENT suite, whereas eight out of 19 pass the NIST STS tests. Nevertheless, this impression is misleading: as Table 4 shows, no subject passes the chi square test, which is crucial because this test checks whether the sequence is random or not [54]. Moreover, the Monte Carlo test achieves 36.45% on average, i.e., only the edb, shareedb, slpdb and szdb databases pass it.
It has been previously pointed out (see Table 2) that many authors only use the mitdb dataset, which does pass most of the tests of both suites, but not all of them. Thus, it is not realistic to claim that the ECG can be considered random based only on entropy results. We have shown that a simple counter achieves similar results, and it is well known that a counter cannot be used as a random generator (Table 3).
On the one hand, we have run all six ENT tests for all databases with different numbers of samples per signal; however, here we focus on the mitdb database because it is the one most commonly used in the literature. Figure 4a shows that when the length of the bit file is less than 6000 bits, the probability of success is less than 0.5 on average, whereas when the length is greater than 6000 bits, the probability ranges between 0.1 and 0.8. These results corroborate those previously obtained in Table 4, where the chi square test achieves 0% success, whereas the optimum compression test achieves nearly 100% success. On the other hand, we have run all 15 NIST STS tests for all databases with different numbers of IPIs with respect to the median; as in the ENT experiment, we focus on mitdb rather than on the rest of the databases. Contrary to the results obtained in Option (a) in [6], Figure 4b shows that as the length of the bit file increases, the results get worse; even when the length is greater than 7000 bits, the probability of success is close to 0.5.
After analyzing Table 7 carefully, where the average of the results can be seen, we extract the following information:
• When the median number of IPIs is higher than 1800, the databases achieve extremely poor results in the NIST STS (two passed tests out of 15 in the worst case). Examples of such databases are vfdb, szdb, slpdb, mghdb, edb, apnea-ecg and shareedb.
• When the median number of IPIs is between 415 and 1800, the databases are on the borderline of passing (at least) half of the NIST STS tests. Examples of such databases are svdb, cudb, stdb, qtdb, mitdb and nstdb.
• When the median number of IPIs is between 37 and 415, the databases achieve extremely good results in the NIST STS (14 passed tests out of 15 in the best case). Examples of such databases are cdb, twadb, ptbdb, iafdb and cebsdb. There is one exception to this rule: aami-ec13, which has a median of 48.5 IPIs and only passes five out of 15 tests (33.3%), similar to the results of svdb.
We have tested 19 public databases from the Physionet repository. Using such databases has recently become a common practice in security proposals, and mitdb has been the usual starting point for authentication- and security-based protocols. According to the results presented in this work, we can claim that mitdb is not the best database for this purpose, but cebsdb is. However, other tests such as Diehard were impossible to run with these databases because of the length of the signals; Diehard needs binary files that usually range from 10 to 12 million bytes.

Conclusions
In this work, we have addressed random number generation based on heart signals; in particular, ECG records. Some authors have claimed that the four LSBs of the IPI values have a certain entropy level. While we confirm that these bits exhibit some degree of entropy, we have also shown that ECG records, and consequently the IPI values derived from them, should not be considered a good source of randomness on the basis of that value alone. We have used both the ENT and NIST STS test suites to evaluate the randomness of 19 public and well-known ECG databases, and the results indicate that IPI values are not as random as supposed. The database that achieves the best results is cebsdb (records from healthy volunteers) rather than mitdb (arrhythmia records), which is the database most commonly used in the literature. The use of cebsdb seems more appropriate since its subjects do not suffer from any medical condition, so no defect (or bias) is a priori expected in the signals; in addition, the size of the database is more suitable.
The results obtained through our in-depth analysis clearly point to two conclusions: (1) a short burst of bits derived from an ECG record may seem random; but (2) large files derived from long ECG records should not be used for security purposes (e.g., key generation algorithms). These conclusions should be taken with caution, since they are conditioned on: (1) the IPI extraction algorithm described in Section 3.2; and (2) the 19 public databases studied. Finally, we highlight that all the scripts necessary to reproduce our experiments are publicly available (https://github.com/aylara/Random_ECG).
As future work, we plan to extend this analysis to other biological signals like PPG or EEG.

Appendix A.2. NIST STS
The NIST STS test suite [12] is a set of fifteen statistical tests to evaluate random and pseudo-random number generators used in cryptographic applications. These tests are often used as a first step in spotting low-quality generators, but they are by no means a substitute for cryptanalysis. In other words, successfully passing all tests does not guarantee that the generator is strong enough.
All tests take as input a sequence of (binary) numbers and return a p-value that is then used to assess whether the sequence passes each test. In the following, we briefly describe each test in turn.

Frequency (monobit) test: This is one of the simplest tests; it checks whether the input sequence has a balanced number of ones and zeroes (i.e., whether the distribution of bits is uniform).
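A minimal Python sketch of this test, following the statistic and p-value formula in [12] (the bit-string input representation is our own choice for illustration):

```python
import math

def monobit_test(bits: str) -> float:
    """NIST frequency (monobit) test: p-value for the hypothesis
    that ones and zeroes are equally likely."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)  # map 0 -> -1, 1 -> +1
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

# A perfectly balanced sequence passes with p-value 1.0, while a
# constant sequence fails decisively.
print(monobit_test("01" * 500))  # 1.0
print(monobit_test("1" * 1000))  # ~0.0
```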
Frequency test within a block: This test is an extension of the frequency monobit test, which can be considered as a particular case with the block size M equal to one. For values M > 1, this test checks if the frequency of ones in an M-bit block is approximately M/2.
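The per-block chi-square statistic of this test can be sketched as follows (the conversion of the statistic into a p-value uses the incomplete gamma function, which is omitted here since it is not in the Python standard library):

```python
def block_frequency_statistic(bits: str, M: int) -> float:
    """Chi-square statistic of the NIST block-frequency test: how far
    the proportion of ones in each M-bit block deviates from 1/2."""
    N = len(bits) // M                     # number of complete blocks
    chi2 = 0.0
    for i in range(N):
        block = bits[i * M:(i + 1) * M]
        pi = block.count("1") / M          # proportion of ones in block
        chi2 += (pi - 0.5) ** 2
    return 4 * M * chi2

# Alternating bits give perfectly balanced blocks, hence a statistic
# of 0; an all-ones sequence maximizes the per-block deviation.
print(block_frequency_statistic("01" * 500, 10))  # 0.0
print(block_frequency_statistic("1" * 100, 10))   # 100.0
```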
The runs test: This test measures whether the number of runs of ones and zeroes of various lengths is as would be expected for a truly random sequence [12]. A run is a maximal subsequence of consecutive bits with the same value. The test returns a single p-value based on the total number of runs observed.
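Following the formula in [12], this test can be sketched in a few lines of Python:

```python
import math

def runs_test(bits: str) -> float:
    """NIST runs test: p-value for the total number of runs (maximal
    blocks of identical consecutive bits) against the expectation for
    a random sequence with the same proportion of ones."""
    n = len(bits)
    pi = bits.count("1") / n
    # (NIST first checks that pi is close to 1/2; omitted here.)
    # Number of runs: one more than the number of adjacent bit changes.
    v = 1 + sum(bits[i] != bits[i + 1] for i in range(n - 1))
    num = abs(v - 2 * n * pi * (1 - pi))
    den = 2 * math.sqrt(2 * n) * pi * (1 - pi)
    return math.erfc(num / den)

# "0011" repeated has exactly the expected number of runs, whereas a
# strictly alternating sequence has far too many runs and fails.
print(runs_test("0011" * 250))  # 1.0
print(runs_test("01" * 500))    # ~0.0
```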
Longest-run-of-ones in a block: This test checks the length of the longest run of ones in a previously-defined block with length M and compares it with the expected value for a truly random sequence.
The binary matrix rank test: This test builds m × n binary matrices over GF(2) from the input sequence (each row of a matrix is a substring of the sequence) and checks the distribution of their ranks, which reveals linear dependence among the fixed-length substrings of the sequence.
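The rank computation at the heart of this test can be sketched compactly (a Gaussian-elimination-over-GF(2) sketch of our own, with each matrix row packed into a Python integer):

```python
def gf2_rank(rows):
    """Rank over GF(2) of a binary matrix whose rows are packed into
    Python integers (one bit per column)."""
    basis = []                       # independent rows kept in reduced form
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)  # clear b's leading bit if it is set
        if row:                      # non-zero after elimination:
            basis.append(row)        # the row is linearly independent
    return len(basis)

# The 3x3 identity has full rank; in the second matrix the third row
# is the XOR of the first two, so the rank drops to 2.
print(gf2_rank([0b100, 0b010, 0b001]))  # 3
print(gf2_rank([0b110, 0b011, 0b101]))  # 2
```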
Discrete Fourier transform (spectral) test: This test computes the discrete Fourier transform of the (±1-converted) sequence and examines the peak heights, which might reveal periodic patterns in the original sequence. The test uses a threshold t = √(log(1/0.05) n), where n is the length of the sequence. For a truly random sequence, about 95% of the peaks should fall below this threshold; a significant deviation from that proportion means the sequence should not be considered random.
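A sketch of this test (a naive O(n²) DFT is used here for clarity; NIST's reference implementation uses an FFT):

```python
import cmath
import math

def spectral_test(bits: str) -> float:
    """NIST discrete Fourier transform (spectral) test: p-value for
    the proportion of DFT peaks below the 95% threshold."""
    n = len(bits)
    x = [1.0 if b == "1" else -1.0 for b in bits]   # map 0 -> -1, 1 -> +1
    # Moduli of the first n/2 DFT coefficients (naive O(n^2) DFT).
    mods = [abs(sum(x[j] * cmath.exp(-2j * math.pi * k * j / n)
                    for j in range(n)))
            for k in range(n // 2)]
    t = math.sqrt(math.log(1 / 0.05) * n)   # 95% peak-height threshold
    n0 = 0.95 * n / 2                       # expected peaks below t
    n1 = sum(m < t for m in mods)           # observed peaks below t
    d = (n1 - n0) / math.sqrt(n * 0.95 * 0.05 / 4)
    return math.erfc(abs(d) / math.sqrt(2))

# A strictly alternating sequence concentrates all its energy in the
# excluded top frequency, so the retained band deviates sharply from
# the 95% expectation and the test fails.
print(spectral_test("01" * 500))  # close to 0
```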
Non-overlapping template matching test: In this test, the sequence is split into M substrings of length l, and the test counts the occurrences of a given template in each substring. When the template is found, the search window jumps to the bit immediately after the match; otherwise, it slides forward by one bit.
The overlapping template matching test: This test is identical to the non-overlapping template matching test, but using overlapping substrings (i.e., using a sliding window that advances 1 bit at a time).
Maurer's "universal statistical" test: The purpose of this test is to detect if the sequence can be significantly compressed without loss of information. One of the main drawbacks of this test is that it requires a substantially long sequence for the result to be relevant.
The linear complexity test: This test computes the linear complexity of the input sequence, i.e., the length of the shortest LFSR that generates it. If this value is too low, the sequence is not considered random enough.
The serial test: The focus of this test is to calculate the frequency of all possible overlapping M-bit patterns in the whole sequence. That is, each M-block should have the same probability of appearing as any other M-bit pattern.
The approximate entropy test: This test is focused on the frequency of all possible overlapping m-bit patterns in a sequence. In short, this test compares the frequency of two adjacent lengths (m and m + 1) to the expected result for a random sequence.
The cumulative sums test: In this test, zeroes are converted to negative ones, and ones remain the same. The test is based on the maximum distance from zero of the random walk defined by the cumulative sums of the adjusted sequence. For a random sequence, the excursions of this walk should stay close to zero.
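The z statistic of this test is straightforward to compute (a sketch; the conversion of z into a p-value via sums of the standard normal CDF, as specified in [12], is omitted):

```python
def cusum_statistic(bits: str) -> int:
    """Maximum absolute excursion of the +/-1 random walk defined by
    the bit sequence (the z statistic of the NIST cumulative sums
    test; NIST converts z to a p-value via normal CDF sums)."""
    s, z = 0, 0
    for b in bits:
        s += 1 if b == "1" else -1   # zeroes count as -1
        z = max(z, abs(s))
    return z

print(cusum_statistic("1011010111"))  # 4 (NIST SP 800-22 worked example)
print(cusum_statistic("01" * 500))    # 1: the walk never strays far
```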
The random excursions test: This test calculates the number of cycles having exactly K visits in a cumulative sum random walk, which is derived from partial sums.