Revisiting Multiple Ring Oscillator-Based True Random Generators to Achieve Compact Implementations on FPGAs for Cryptographic Applications

The generation of random numbers is crucial for practical implementations of cryptographic algorithms. In this sense, hardware security modules (HSMs) include true random number generators (TRNGs) implemented in hardware to achieve good random number generation. In the case of cryptographic algorithms implemented on FPGAs, the hardware implementation of RNGs is limited to the programmable cells in the device. Among the different proposals to obtain sources of entropy and process them to implement TRNGs, those based on ring oscillators (ROs), operating in parallel and combined with XOR gates, present good statistical properties at the cost of high area requirements. In this paper, these TRNGs are revisited, showing a method for area optimization independently of the FPGA technology used. Experimental results show that three ring oscillators requiring only three LUTs are enough to build a TRNG on Artix 7 devices from Xilinx with a throughput of 33.3 Kbps, which passes NIST tests. A throughput of 50 Kbps can be achieved with four ring oscillators, also requiring three LUTs in Artix 7 devices, while 100 Kbps can be achieved using a structure with four ring oscillators requiring seven LUTs.


Introduction
Random number generation is crucial for the security and applicability of cryptographic algorithms and protocols. In this sense, the generation of good random numbers has been a recurrent issue from the beginning of the development of secret- and public-key cryptosystems, and it has become critical nowadays due to the increasing computing power available to any attacker. From the initial development of computers, they have been a useful tool for generating such numbers, first for scientific and statistical uses [1], and later for cryptographic purposes [2]. Nevertheless, true random numbers cannot be generated by means of programming alone, and entropy sources are required, initially obtained from peripherals [3] and later from external hardware generators [4]. Currently, computers make use of true random number generators (TRNGs) combined with deterministic random bit generators (DRBGs) included in external chips such as Trusted Platform Modules (TPMs) [5,6]. In the case of embedded systems implemented on reconfigurable logic, such as FPGAs or systems-on-chips, the usual solution is similar, combining a TRNG followed by a pseudorandom number generator (PRNG) (or a DRBG) in order to obtain a good trade-off between randomness, area resources, and power consumption [7,8]. In this sense, several designs of TRNGs [9][10][11][12][13] and PRNGs [14][15][16] to be implemented on FPGAs have been proposed. Among these proposals, those based on the use of ring oscillators (ROs) as sources of entropy combined with XOR gates to generate the final random bitstream seem to present the best statistical properties, although at the cost of high area requirements [10].
In this work, we analyze and propose some variants of these TRNG implementations that are suitable for cryptographic applications on FPGAs, and we provide a method to optimize the area and throughput of the implementation independently of the programmable device technology/manufacturer. The rest of the article is organized as follows: Section 2 reviews previous work in the literature regarding TRNGs implemented on FPGAs, Section 3 provides the study of bitstream generation with a basic sampled ring oscillator, and Section 4 revisits multiple XORed ring oscillators in order to achieve ultracompact TRNGs. Finally, Section 5 compares these structures to other work in the literature, and the conclusions are provided in Section 6.

Previous Work
The implementation of TRNGs requires sources of entropy as generators of randomness, along with a distillation process to avoid weaknesses in these sources. Taking into account the structure of FPGAs [17], the sources of entropy are usually based on ROs. In this sense, a simple RO is built using one NOT gate, connected as in Figure 1a. This feedback structure oscillates at a frequency that varies significantly depending on the process variations of the transistors in the cells where the ROs are placed when implementing the circuit [18]. In [19], a simplified model to estimate the delay of an RO is proposed:

$d_{RO} = d_{AVG} + d_{PV} + d_{NOISE}$

where $d_{AVG}$ is the average delay of the RO, $d_{PV}$ is the delay component due to process variations, and $d_{NOISE}$ is the delay component due to the noise generated in the logic element. Note that $d_{NOISE}$ depends on multiple factors, such as temperature, humidity, circuit activity, or side effects from the activity of neighboring cells. Therefore, this dynamic delay is the main source of the entropy originating from the RO. The values generated at the output of the RO need to be sampled to generate a random bitstream. The simplest method to perform this sampling is to place a flip-flop (FF) at the output, as shown in Figure 1b, which corresponds to the so-called SRO-FF structure. In this case, the clock input of the FF acts as a sampling signal at frequency $f_{samp}$, with $f_{RO} \gg f_{samp}$. If the clock signal is decorrelated from the output of the RO and $d_{NOISE}$ presents enough variability (the other delays will remain relatively invariant, as the SRO-FF is implemented in only one logic element, LE), a random bitstream will be obtained. However, the variability of $d_{NOISE}$ may not be enough to avoid pattern repetitions due to the periodicity of the clock and RO output signals. A proposal to overcome this issue is presented in [20], where a second RO (RO2) is used to generate the sampling signal.
The output signal of RO2 (rclock) feeds a frequency divider in order to maintain the relation $f_{RO} \gg f_{samp}$. This structure is known as ERO-TRNG [10], and it should be noted that it limits the throughput of the RNG, as the output of the frequency divider is used as the clock signal for the system processing the bitstream. Another alternative consists of the structure shown in Figure 2, where the sampling clock signal is generated from two ROs [21], but it has strict requirements on the maximum delay between the two RO feed signals, thus making it necessary to manually place and route them [10]. The use of several ROs acting in parallel and feeding an XOR gate is proposed in [22] and reproduced in Figure 3. This is an interesting proposal, because the use of several ROs in parallel introduces more variability, as we will also have different $d_{AVG}$ and $d_{PV}$ values for each RO. On the other hand, in [23], it is reported that a problem can arise if the XOR gate cannot change its output at the same rate at which the ROs are changing their states. In this same work, the introduction of a flip-flop between each RO and the input of the XOR gate, thus controlling the input rate to the XOR gate, is proposed to overcome this issue. Another solution was recently proposed in [13], consisting of introducing a latch in the feedback path of each RO. Although these solutions introduce some limitations in the throughput, the additional variability provided by the parallelism of the structure enables compact implementations of TRNGs, as will be studied in Section 4.
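The sampling mechanism described above can be illustrated with a small behavioral sketch. It is not the circuit model used in the paper: it assumes that each RO toggle takes the static delay (average plus process-variation terms) plus independent Gaussian noise, so that the jitter accumulated over one sampling period grows with the square root of the number of toggles. The function name `sample_ro` and the delay values (1 ns per toggle, 10 ps of noise) are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_ro(n_bits, f_samp, d_avg, d_pv, sigma_noise, seed=1):
    """Return n_bits samples (0/1) of a free-running RO read by a flip-flop.

    Hypothetical model: the RO output toggles every d_avg + d_pv seconds on
    average, and each toggle adds independent Gaussian noise with standard
    deviation sigma_noise (playing the role of the noise delay term).
    """
    rng = random.Random(seed)
    d_static = d_avg + d_pv                      # static delay per toggle
    toggles = (1.0 / f_samp) / d_static          # mean toggles per sampling period
    # Jitter accumulated over one sampling period, expressed in toggles.
    jitter = sigma_noise * math.sqrt(toggles) / d_static
    phase = 2.0 * rng.random()                   # position inside one RO period
    bits = []
    for _ in range(n_bits):
        phase = (phase + toggles + rng.gauss(0.0, jitter)) % 2.0
        bits.append(1 if phase >= 1.0 else 0)    # second half-period reads '1'
    return bits


# Illustrative values only (assumptions, not measurements from the paper).
noisy = sample_ro(20000, f_samp=50e3, d_avg=1e-9, d_pv=0.0, sigma_noise=10e-12)
quiet = sample_ro(100, f_samp=50e3, d_avg=1e-9, d_pv=0.0, sigma_noise=0.0)
print(sum(noisy) / len(noisy), set(quiet))
```

With noise, the sampled phase spreads over the whole RO period and the fraction of '1's approaches one half; without noise, the sampled value repeats, which is the pattern-repetition problem mentioned in the text.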

Study of a Basic Sampled RO
As a previous step to building compact TRNGs with multiple ring oscillators, we carry out a detailed study of a basic RO sampled with a D flip-flop. As in the rest of the studies presented in this work, the sampled outputs of the entropy source are sent to a microprocessor-based platform, named Dracon [24]. This platform is in charge of collecting and sending the generated numbers to a personal computer (PC) to be statistically analyzed, as shown in Figure 4. Additionally, as the developed TRNGs are intended to be part of a more complex system integrated into an FPGA device, we add an xenable input for switching off the RNG when it is not used, in order to reduce the power consumption. Indeed, the high power consumption of ROs is shown in [25], and ROs have even been used to generate power noise or to extract information from the inside of FPGAs [26]. The resulting structure is presented in Figure 5, and it has a single source of entropy: an RO with two elements (the AND and NOT gates), a so-called SARO(1) (single-ANDed ring oscillator with 1 inverter). The complete structure is then named SARO(1)-FF. For our implementations, we used an Artix 7 device from Xilinx [27], which includes six-input LUTs as basic logic elements, and the Vivado 2020.2 software. The exact device was an Artix 7 XC7A35T-1CPG236C on a Cmod A7-35T board from Digilent Inc., powered at 5 V from a laboratory power supply. All the experiments were carried out at a temperature of 26 °C and a relative humidity of 33%. Since the objective was to achieve a compact TRNG with a reasonable throughput, a sampling frequency of $f_{samp}$ = 50 kHz was considered. Indeed, 50 Kbps suffices to generate the required random numbers in secure IoT applications or other cryptographic implementations over low-cost FPGA devices, and it implies a period long enough to accumulate the required jitter with a reduced number of inverting elements in the RO [10].
In order to validate our results, we generated a set of bitstreams and analyzed their statistical properties for use in cryptography by means of the SP 800-22 suite [28], developed by the National Institute of Standards and Technology (NIST). This suite includes a set of tests to analyze the randomness of bitstreams. The purpose, parameters, and interpretation of these tests are described in [29]. Among the parameter constraints, the block length of the linear complexity test must be in the range of 500 ≤ $M_L$ ≤ 5000, and the length of the entire sequence should be n ≥ $10^6$. Additionally, in [29], it is established that the p-value must be >0.01 to accept the hypothesis of randomness, and a minimum of "55 bitstreams should be processed to derive statistically meaningful results for the uniformity of p-values". Taking all of the above into account, we generated 125 bitstreams of 1,500,000 bits each, with the following set of values for the required parameters: n = 1.5 × $10^6$, M = 30,000, $m_{NO}$ = 9, $m_O$ = 9, $m_e$ = 10, $m_s$ = 16, $M_L$ = 500. Table 1 presents the results obtained for SARO(1)-FF, showing that it does not pass NIST tests when operating at a sampling frequency of $f_{samp}$ = 100 kHz. The limited variability introduced by only one RO and the high sampling frequency used produce a significant difference between the number of '1's and '0's generated, thus not passing the frequency-based tests. Moreover, a high sampling frequency generates long series of repeated '0's and '1's, thus not passing tests such as "Runs". If we decrease the sampling frequency, it is possible to improve the test performance, thus obtaining the results in Table 2 for $f_{samp}$ = 50 kHz. The situation is thus improved, but the asymmetry of the signal generated by a single RO makes it impossible to pass any frequency-based test. Moreover, the behavior of each RO depends on the process parameters of each transistor, which are different for each LUT used for its implementation and, of course, for each device.
In this sense, we performed experiments with different placements in the same device, as well as using different devices, obtaining significant deviations in the probability of generating '0's and '1's by a SARO. These deviations sometimes imply P(0) > P(1), and others P(1) > P(0). The maximum deviation we measured for SARO(1)-FF at 50 kHz was $\Delta P(0) = |P(0) - 1/2|$ = 0.0034, i.e., 0.34%, as shown in Table 3. Note that since P(0) + P(1) = 1, $\Delta P(1) = \Delta P(0)$. This deviation can be alleviated by using two SAROs in parallel and combining the two outputs by means of an XOR gate [22], as will be studied in the next section. Regarding the number of inverting elements k in a SARO(k)-FF, we observed that SARO(0)-FF (just a NAND gate) provides bad results in terms of frequency tests due to the lack of stabilization of high and low levels at the NAND output. If the length is increased, the results in Table 3 show better values for $\Delta P(0)$, at the cost of a lower throughput (the oscillation frequency of the resulting RO is lower). In column k, the type of gate used for enabling or disabling each RO (AND or NAND), as well as the number of NOT gates in the ROs, are specified in parentheses. Note that SARO(k)-FF requires k + 1 LUTs and 1 FF to be implemented in an Artix-7 device, as pointed out in the LUTs+FF column in Table 3.
Table 3. Probability deviation for SARO(k)-FF ring oscillators at $f_{samp}$ = 50 kHz (k is the number of inverting elements).
On the other hand, at a sampling frequency of $f_{samp}$ = 50 kHz, SARO(2)-FF and SARO(3)-FF do not present advantages with respect to SARO(1)-FF; thus, SARO(1)-FF is the basic RO that we will consider for building multiple-SARO structures in order to obtain compact designs passing NIST tests.
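The deviation ∆P(0) and the outcome of the most basic frequency test can be computed directly from a captured bitstream. The following sketch implements the frequency (monobit) test as defined in NIST SP 800-22, where the p-value is erfc(|S_n| / sqrt(2n)) and S_n is the sum of the bits mapped to ±1. The function name `monobit_deviation` is hypothetical, and the sketch is an illustration, not the test software used in the paper.

```python
import math


def monobit_deviation(bits):
    """Return (deviation of P(0) from 1/2, monobit p-value) for a 0/1 sequence."""
    n = len(bits)
    ones = sum(bits)
    s_n = 2 * ones - n                       # sum of bits mapped to +1 / -1
    p_value = math.erfc(abs(s_n) / math.sqrt(2.0 * n))
    p_zero = (n - ones) / n
    return abs(p_zero - 0.5), p_value


# Ten-bit example sequence used in the SP 800-22 description of the test.
deviation, p_value = monobit_deviation([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])
print(deviation, p_value)
```

A sequence is accepted by this test when the p-value exceeds 0.01, which is the acceptance criterion quoted above; real evaluations need sequences of at least 10^6 bits rather than the short example shown here.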

Multiple XORed Ring Oscillators
As outlined in the previous section, asymmetry in the probabilities of obtaining a '0' or a '1' at the output of a SARO(k)-FF can be compensated using an XOR gate. Indeed, if we consider a two-input XOR gate and let $P(i_0 = 0)$ and $P(i_1 = 0)$ be the probabilities of having a '0' at inputs $i_0$ and $i_1$ of the XOR gate, respectively, while $P(i_0 = 1)$ and $P(i_1 = 1)$ are the probabilities of having a '1' at those same inputs, and the two inputs are independent, the probability of obtaining a '0' at the output is

$P(o = 0) = P(i_0 = 0)P(i_1 = 0) + P(i_0 = 1)P(i_1 = 1)$

As an example, if $P(i_0 = 0) = P(i_1 = 0) = 0.6$, we will have $P(o = 0) = 0.52$, which reduces the tendency of the two SAROs feeding a 2-input XOR gate to generate more '0's than '1's. The expected deviation is then $\Delta P_2(o = 0) = 0.02$, i.e., 2%, which is still excessive for passing any frequency test. In the case of an N-input XOR gate, and assuming that $P(i_0 = 0) = P(i_1 = 0) = \ldots = P(i_{N-1} = 0) \equiv P(0)$ and, consequently, $P(i_0 = 1) = P(i_1 = 1) = \ldots = P(i_{N-1} = 1) \equiv P(1)$, the output is '0' whenever an even number of inputs are '1', and the probability $P_N(o = 0)$ can be computed as

$P_N(o = 0) = \sum_{j\ \mathrm{even}} \binom{N}{j} P(1)^j P(0)^{N-j} = \frac{1}{2} + \frac{1}{2}\left(P(0) - P(1)\right)^N$

If we represent the deviation $\Delta P_N(o = 0) = |P_N(o = 0) - 1/2|$ as a function of N for P(0) = 0.60, the graph in Figure 6a is obtained. Figure 6b presents the same graph, but in logarithmic scale, showing a clear linear relationship. As a consequence, it can be written as:

$\log \Delta P_N(o = 0) = a + bN$

where a and b can be determined by linear regression or analytically. Indeed, since $P(0) - P(1) = 2P(0) - 1$,

$\Delta P_N(o = 0) = \frac{1}{2}\left(2\,\Delta P(0)\right)^N$

where $\Delta P(0) = |P(0) - 1/2|$, so that $a = \log(1/2)$ and $b = \log(2\,\Delta P(0))$.
Figure 6. $\Delta P_N(0)$ as a function of N for P(0) = 0.6 in linear (a) and logarithmic (b) scales.
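The bias reduction obtained with the XOR gate can be checked numerically. The sketch below folds the two-input formula over an arbitrary list of independent inputs and compares the result with the closed form 0.5·(2∆P(0))^N, which holds when all inputs are independent and share the same P(0); this closed form is stated here as a derived identity under that assumption. The function names are hypothetical.

```python
def p_zero_xor(p_zero_inputs):
    """P(output = 0) for an XOR gate fed by independent inputs.

    p_zero_inputs holds P(input = 0) for each input; the two-input formula
    is applied repeatedly, one input at a time.
    """
    p_out_zero = 1.0                          # XOR of no inputs is 0
    for p0 in p_zero_inputs:
        p_out_zero = p_out_zero * p0 + (1.0 - p_out_zero) * (1.0 - p0)
    return p_out_zero


def deviation_closed_form(dp0, n_inputs):
    """Deviation from 1/2 at the XOR output for n_inputs identical inputs."""
    return 0.5 * (2.0 * dp0) ** n_inputs


# Example from the text: two inputs with P(0) = 0.6.
print(p_zero_xor([0.6, 0.6]), deviation_closed_form(0.1, 2))
for n in range(1, 7):
    recursive = abs(p_zero_xor([0.6] * n) - 0.5)
    print(n, recursive, deviation_closed_form(0.1, n))
```

The recursive version also accepts a different P(0) for each input, which is closer to the situation on a real device, where each RO shows its own deviation.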
The exponential dependency of $\Delta P_N(o = 0)$ on N implies that an XOR gate with a high number of inputs (and therefore, a high number of SAROs) is not required to compensate for the generation probabilities of the '0's and '1's. The structure for building and testing such TRNGs for different values of N is shown in Figure 7. This structure is basically the one presented in [10], where it is called MURO, and which, in turn, is based on [22]. Only one difference is introduced: sampling is performed by a well-defined clock source instead of using an RO for this task, as in [23] (however, this structure does not include the enable signal). Although it reduces variability at the output of the structure, it enables its behavior to be studied in terms of the sampling frequency. Using this structure, which we have named Multiple XORed SARO(k)-FF with N ROs (MX-N-SARO(k)), we performed the generation of bitstreams for different values of N. In our experiments, the maximum deviation measured with a single SARO at 50 kHz was $\Delta P(0)$ = 3.34 × $10^{-3}$, and in this case, theoretically, from Equations (4) and (6) for N = 2, it would be $\Delta P_2(o = 0)$ = 2.31 × $10^{-5}$. However, the maximum measured deviation was $\Delta P_2(o = 0)$ = 3.34 × $10^{-3}$ ($E(\Delta P_2(o = 0))$ = 2.2 × $10^{-4}$) at 50 kHz. This indicates that a single SARO(1)-FF can present a deviation of around 3.5% in Artix 7 devices. Therefore, we consider $\Delta P(0)$ = 0.1 for our estimations. Figure 8 shows deviations for several values of $\Delta P(0)$, where it can be noted that considering a maximum deviation in a SARO(1)-FF of $\Delta P(0)$ ≤ 0.1, minimum values of N of 4 or 5 are required in order to obtain acceptable statistical results. It is also interesting to note that an increase of 5% in $\Delta P(0)$ implies a significant increase in the number of inputs of the XOR gate required for achieving deviations below $10^{-3}$. Indeed, Figure 9 shows an exponential dependence of N on $\Delta P(0)$ for maintaining a maximum deviation $\Delta P_N(o = 0)$ ≤ $10^{-3}$.
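As a baseline for comparing measured and theoretical deviations, the ideal case of fully independent inputs can be simulated. The following Monte Carlo sketch draws independent biased bits, XORs them, and estimates the deviation at the output. It is a software illustration under the independence assumption, with a hypothetical function name, and it does not model the placement or coupling effects of a real device.

```python
import random


def xor_stream_deviation(p_zero, n_inputs, n_bits, seed=7):
    """Estimate |P(output = 0) - 1/2| for an XOR of independent biased bits."""
    rng = random.Random(seed)
    zeros = 0
    for _bit in range(n_bits):
        out = 0
        for _inp in range(n_inputs):
            # Each input is '0' with probability p_zero, otherwise '1'.
            out ^= 0 if rng.random() < p_zero else 1
        if out == 0:
            zeros += 1
    return abs(zeros / n_bits - 0.5)


# Two inputs with P(0) = 0.6: the ideal deviation is 0.02.
measured = xor_stream_deviation(0.6, n_inputs=2, n_bits=200000)
print(measured)
```

The estimate carries a statistical error of roughly 0.5/sqrt(n_bits), so deviations of the order of 10^-3 can only be resolved with bitstreams of about 10^6 bits or more, in line with the sequence lengths used for the NIST tests.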
As $\Delta P(0)$ depends on the characteristic delay of LUTs, and this delay depends on the FPGA technology used, different types of FPGA can lead to quite different minimum N values for passing NIST tests at a given sampling frequency. As an example, Spartan 6 (45 nm technology) grade-3 devices from Xilinx report a delay of 0.21 ns from An-Dn LUT inputs to A-D outputs [30], while Artix 7 (28 nm technology) grade-3 devices report a delay of 0.10 ns between the same points [31]. This directly affects the time required for generating variability in these delays at a given sampling frequency, and consequently, $\Delta P(0)$ at this frequency. In order to pass NIST tests for the Artix 7 device used in this work, if we introduce the experimental values obtained for P(0) in Equations (4)-(6), we can see that N = 3 implies a deviation of $\Delta P_3(0)$ = 5.4 × $10^{-3}$ at $f_{samp}$ = 50 kHz, which is at the limit of passing NIST frequency-based tests. In the case of N = 4, we obtain $\Delta P_4(0)$ = 1.1 × $10^{-3}$ at $f_{samp}$ = 50 kHz, while theoretically, it should be around 0.8 × $10^{-3}$. Table 4 shows the NIST test results for this sampling frequency and parameters. As has been presented, increasing the value of N improves frequency-based tests, but at the cost of an area increase. In order to achieve more compact implementations, we have explored decreasing N and k. To achieve an implementation passing NIST tests with N = 3, the sampling frequency needs to be decreased. Table 5 shows different implementations and sampling frequencies for MX-N-SARO(k). Note that it is possible to build a TRNG with N = 3 using SARO(1)-FF at a 33 kHz sampling frequency, and that N = 4 enables TRNGs with a throughput of 100 Kbps (MX-4-SARO(1)) and a compact implementation requiring only three LUTs (MX-4-SARO(0)) in Artix 7 devices. The area results regarding MX-4-SARO(0) from Table 5 require a detailed explanation.
Indeed, since a SARO(0)-FF includes a NAND gate, MX-4-SARO(0) is expected to require at least five LUTs: one LUT per two-input NAND gate and one LUT for the four-input XOR gate. Nevertheless, in seven-series devices from Xilinx, each six-input LUT has two independent outputs, named O5 and O6 [32], making it possible to implement the four SARO(0)-FFs in only two LUTs, as well as the four-input XOR in an additional LUT. Figure 10 shows the mapping of MX-4-SARO(0) requiring three LUTs and five FFs. Similarly, MX-4-SARO(1) can be implemented using seven LUTs. Note that MX-3-SARO(1) fits in one slice, while MX-4-SARO(0) and MX-4-SARO(1) require two slices due to the five FFs to be placed. From the results above, and taking into account Equations (4) and (6), it is possible to formulate a procedure to design and implement a TRNG based on the MX-N-SARO(k) structure:

1.
Implement a SARO(1)-FF ring oscillator operating at the target sampling frequency corresponding to the desired throughput, following the scheme in Figure 5.

2.
Capture a bitstream with a statistically significant size (n ≥ 10 6 ), and analyze the frequency of '0's and '1's. The frequency test from the NIST suite can be used for this purpose. From this analysis, estimate the deviation probability of '0's, ∆P(0).

3.
From ∆P(0), estimate the minimum number of ring oscillators N for which the expected deviation at the XOR output, ∆P N (o = 0), is low enough to pass frequency-based tests (a bound of 10 −3 was used in Figure 9), using Equations (4) and (6).

4.
Implement MX-N-SARO(1) and perform NIST tests following the recommendations described in [29] and summarized in Section 3.

5.
In case NIST tests are not passed, increment N and go to 4.
This procedure enables compact TRNGs to be implemented on different FPGA technologies, optimizing the number of tries to achieve a low-cost design passing NIST tests.
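The estimation part of the procedure above can be sketched in a few lines. The code below estimates ∆P(0) from a captured bitstream and returns the smallest N for which the ideal deviation 0.5·(2∆P(0))^N falls below a target. The target of 10^-3 is taken from the discussion of Figure 9, the ideal formula assumes independent ROs with identical bias, and the function names are hypothetical. The result is only a starting value for N, since the measured deviations reported in the text are larger than the ideal ones and the NIST tests remain the final acceptance criterion.

```python
def estimate_dp0(bits):
    """Deviation of the measured probability of '0' from 1/2."""
    return abs(bits.count(0) / len(bits) - 0.5)


def min_xor_inputs(dp0, target=1e-3, n_max=64):
    """Smallest N with 0.5 * (2 * dp0) ** N <= target, or None if above n_max."""
    if not 0.0 <= dp0 < 0.5:
        raise ValueError("dp0 must be in [0, 0.5)")
    for n in range(1, n_max + 1):
        if 0.5 * (2.0 * dp0) ** n <= target:
            return n
    return None


# Toy capture with 60% zeros, i.e., a deviation of 0.1 (illustrative only;
# a real capture should contain at least 10**6 bits).
captured = [0] * 600 + [1] * 400
dp0 = estimate_dp0(captured)
print(dp0, min_xor_inputs(dp0))
```

For a deviation of 0.1, the sketch returns N = 4, which agrees with the minimum values of 4 or 5 discussed for the Artix 7 device.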

Comparison to Other TRNGs for FPGAs
As commented in Section 2, there are several proposals of TRNGs in the literature, mainly oriented to achieving high throughput figures. In the case of systems with restrictions on area and/or performance, as is the case for Internet of Things (IoT) devices implemented on low-cost FPGAs, including cryptographic operations, the generation of 50 Kbps random streams is enough for the majority of applications. In this sense, our designs provide very compact TRNGs while ensuring randomness of the generated bitstreams. Table 6 presents a comparison of MX-N-SARO(k) to other compact implementations in the literature. In all cases, although they show contained area requirements, they are oriented to high-performance systems, where a large number of random numbers are required to be available continuously.
In this sense, the design in [33] proposes the use of a set of multiple XORed ROs, followed by postprocessing based on a von Neumann corrector to improve entropy and statistical figures. As a result, it can achieve a throughput of 6 Mbps on a Spartan 3A device from Xilinx Inc., at the cost of 528 LUT4s (Spartan 3A devices include 4-input LUTs). In the case of [34], a different approach is used, based on self-timed rings (STRs) and controlling dephases with the digital clock management (DCM) features included in Xilinx's FPGAs. The results show a high throughput, 100 Mbps, with contained area requirements of 56 LUTs and 16 FFs on a Virtex 6 device, but in any case, this is far from the area figures presented in this work. Regarding [35], this is also a high-throughput oriented TRNG, based on the use of DCMs to generate metastability, and bit-flipping postprocessing, which enables the improvement of throughput and area with respect to [34], requiring around 38 LUTs on Zynq devices. The work in [23] proposes two designs based on multiple XORed ROs, one with 25 ROs and the other one with 50 ROs, both implemented on Cyclone II devices from Intel. These designs can operate with a throughput of 100 Mbps, at the cost of higher area requirements. In addition to the designs presented in Table 6, the proposal in [13] reports area requirements of four LUTs, similar to our proposal, but the required control logic is not included in those results, and it does not pass NIST tests in all cases [35]. For these reasons, it has not been included in Table 6. Regarding power consumption, the reduced number of ROs required by our proposal enables a contained power consumption of only 25 mW in the case of MX-3-SARO(1), and around 40 mW in the case of our designs with four ROs. These figures are clearly better than those reported by the other works in Table 6, except in the case of [34].
Nevertheless, it should be noted that our results were obtained through a detailed analysis of the synthesized circuits using an N6705C DC Power Analyzer from Keysight, while the result in [34] is an estimation provided by the ISE design tools from Xilinx. This estimation may not be reliable due to the difficulty that software tools have in estimating power consumption in feedback structures, as is the case of ROs.

Conclusions
In this work, the design of true random number generators for FPGAs based on multiple XORed ring oscillators has been revisited. Traditionally, in this type of design, a large number of parallel ring oscillators is used to achieve enough entropy and to pass the NIST SP 800-22 test suite, thus resulting in high area requirements. Our proposal shows that it is possible to pass NIST tests with a low number of ring oscillators when the sampling frequency is reduced, thus enabling the implementation of ultracompact TRNGs on low-cost FPGAs. Concretely, a design with three ring oscillators, requiring only three LUTs in Xilinx's Artix 7 devices and providing a random bitstream at 33 Kbps, was implemented. With four ring oscillators of the MX-4-SARO(0) type, which also require three LUTs on Artix 7 devices, it is possible to achieve a throughput of 50 Kbps, while MX-4-SARO(1) achieves 100 Kbps, requiring only seven LUTs. Moreover, a procedure to migrate the designs to other FPGA technologies, reducing the number of designs to test, was proposed. Finally, it should be noted that although the throughput presented by our designs is lower than that of other proposals in the literature, the area requirements are minimal, thus enabling their implementation on low-cost FPGAs for secure IoT devices, including cryptographic algorithms.