1. Introduction
Nowadays, an increasing number of electronic devices are interconnected through the Internet to exchange information with people or with other devices, enabling automation and data-driven decision-making in diverse domains such as healthcare, transportation, and industrial systems. This pervasive interconnectivity, known as the Internet of Things (IoT), has led to exponential growth in the volume and sensitivity of transmitted data. Consequently, ensuring the protection, reliability, confidentiality, and integrity of this information has become a fundamental challenge for both researchers and industry practitioners [
1,
2].
The cryptographic strength of many widely used public-key schemes relies on the computational infeasibility of solving certain number-theoretic problems within a practical time frame. These cryptographic primitives underpin secure communication protocols, such as TLS and SSL, that are widely used today. However, the reliability and robustness of such algorithms critically depend on the quality of the random numbers used for key generation, nonce creation, and initialization vectors. Weak or biased randomness can lead to predictable keys or reproducible cryptographic material, compromising overall system security [
3,
4]. Therefore, assessing the statistical quality of random number generators (RNGs) is essential across domains such as cryptography, simulation, and embedded-systems verification.
RNGs underpin stochastic modelling and Monte-Carlo methods in scientific computing, where any bias or structure in the output can compromise research validity [
5]. In cryptographic applications, weak entropy sources jeopardize key generation, nonces, and initialization vectors, weakening even mathematically robust algorithms [
6]. In embedded and IoT devices, hardware and resource limitations often expose RNGs to non-ideal conditions, making rigorous statistical evaluation crucial for reliable operation [
7]. Critical ecosystems such as the IoT therefore depend on high-quality RNGs and on their continuous evaluation, as proposed in this work.
A true random number generator (TRNG) relies on the inherent unpredictability of physical noise sources to produce non-deterministic bit sequences. Different implementations use a variety of entropy sources, including electronic noise [
8], hybrid architectures combining memristors with microcontrollers [
9], random telegraphic noise (RTN) [
10], among others. These approaches show that TRNGs can be realized through diverse physical phenomena, offering robust and flexible solutions for security-critical applications. To ensure their reliability, TRNGs must produce sequences indistinguishable from ideal randomness, while pseudorandom number generators (PRNGs) emulate such behavior deterministically [
11]. Rigorous evaluation of both types ensures compliance with international randomness standards.
The National Institute of Standards and Technology (NIST), part of the U.S. Department of Commerce, developed a comprehensive suite of 15 statistical tests known as NIST SP 800-22 [
12], designed to evaluate whether a binary sequence exhibits properties consistent with randomness. Alongside the specification document, NIST released an original implementation in the C programming language. Despite its robustness, this reference implementation, developed more than two decades ago, poses challenges when compiled or executed in modern environments. Its reliance on legacy tools, rigid output structure, and limited configurability often hinder integration into automated workflows. Moreover, the post-processing of test results requires additional utilities to interpret and summarize data stored in static directory structures.
Several authors have proposed enhancements to the original algorithms and implementations [
13,
14], including versions rewritten in modern programming languages such as Python or C++ [
15,
16]. Among these, the implementation known as
Fast NIST [
17] stands out for significantly reducing the execution time of computationally intensive tests. However, it still inherits certain limitations from the original software, particularly regarding the fixed organization of results and the lack of a graphical interface.
Recent efforts have produced modern reimplementations of the NIST test suite in languages such as Python, providing enhanced usability and integration with existing data analysis pipelines. For example, ref. [
18] compared three open-source Python-based implementations, highlighting the challenges of adherence to the original test specifications and the impact on randomness assessments. Optimizations for computational efficiency have also been proposed, such as in [
19], which introduces fast software implementations of the Serial Test and Approximate Entropy Test by replacing bit-level operations with byte-level operations, achieving speedups exceeding 2× over the original methods. Combining test parameters further enhanced performance with speedups above 4× compared to individual test implementations. Beyond standalone software, several web-based platforms have been developed to perform statistical randomness testing directly through web browsers, eliminating the need for local installation or configuration. However, the review of available implementations [
20,
21,
22] reveals that many of these systems either limit their functionality to a subset of the NIST SP 800-22 tests, offer incomplete result reporting, or are no longer actively maintained. Although such platforms improve accessibility and user experience, they often lack transparency regarding parameter configuration, reproducibility of results, or full adherence to NIST specifications, issues that still constrain their suitability for research and validation workflows.
In contrast, the testing platform presented in this work provides a fully accessible and user-friendly framework that can be used from any computer with an Internet connection, without requiring installation or specialized software or hardware. The system supports multiple simultaneous users through isolated session handling, ensuring independent and conflict-free execution of experiments. All tests included in the NIST SP 800-22 suite are implemented with complete compatibility in output format and statistical interpretation, enabling reproducibility and adherence to established standards. Furthermore, the platform exposes a dedicated API that allows both users and IoT devices to submit binary sequences, execute the full set of tests programmatically, and retrieve structured results, thereby broadening its applicability for automated validation workflows and large-scale experimentation.
In April 2022, NIST announced its decision to revise Special Publication 800-22 Rev. 1a, following multiple rounds of public feedback and evaluation [
23]. To date, however, this revision has not materialized into a public draft.
The reviewed literature and practical implementations indicate the importance of refining and adapting the original NIST SP 800-22 suite for modern use cases. Interoperability of test results, adherence to specifications, scalability, and integration with contemporary software ecosystems remain active areas of advancement. Online implementations particularly represent a growing trend towards democratizing access to cryptographic randomness evaluation tools.
To address these issues, this work presents a web-based framework that automates the execution and analysis of randomness tests under the NIST SP 800-22 specification. The proposed system provides an intuitive user interface for uploading data files, selecting specific tests, and obtaining detailed results almost instantly. Its backend adopts a Model–View–Controller (MVC) architecture that allows parallel execution of independent tests, thereby reducing total processing time.
Beyond NIST SP 800-22, the application also incorporates an entropy estimation module based on the ent utility, which is widely available in most GNU/Linux distributions. Integrating this external tool shows that new tests can be added to the server efficiently in the future, allowing researchers to quickly and easily assess the quality of random numbers without installing additional software or performing manual result processing.
2. Materials and Methods
The proposed system consists of two main components: a Python-based backend and a frontend for user interaction. The general architecture is shown in
Figure 1.
The platform integrates the compiled
Fast NIST executable—slightly modified to accept an output directory as a runtime parameter—within a Python-based backend that acts as a wrapper for the optimized implementation presented in [
17]. The backend exposes a set of endpoints for data upload, test execution, and result retrieval, which can be accessed either from the web-based application or directly by IoT devices through HTTPS requests. Upon receiving test parameters, it executes the corresponding NIST SP 800-22 tests using the compiled executable, collects and organizes the results, and performs a preliminary statistical evaluation to determine whether each test passes or fails, thereby streamlining the overall workflow and ensuring compatibility with the original software output format.
The frontend, developed with HTML5, CSS3, and Vanilla JavaScript, provides a lightweight and intuitive interface. Users can upload binary sequence files and configure parameters such as the number of sequences, the bit length, and, for the tests that require it, the block size. Progress and results are displayed almost immediately. All requests to the backend are managed via Representational State Transfer (REST) endpoints, enabling simultaneous test execution and significantly improving system performance.
To isolate execution contexts when multiple users use the tool concurrently, session handling was implemented using Redis, an in-memory key-value database. The backend’s static content is served through the Nginx web server with SSL encryption via Let’s Encrypt certificates. The endpoints were implemented in Python 3.13 using the Flask framework, and all components were deployed on a server with a 16-core Intel Xeon processor @ 2.8 GHz, 128 GB RAM, and a 500 GB HDD, running GNU/Linux Ubuntu Server 24.04 LTS. The domain name was registered and managed through the No-IP service. Access to the tool requires a computer with Internet connectivity and a browser compatible with the mentioned standards. Our online NIST testing tool is publicly accessible at
https://mindub.ddns.net/ (accessed on 4 March 2026).
Figure 2 shows a screenshot of the graphical user interface.
To evaluate our NIST testing tool, a representative sample of random numbers was generated using an ESP32-C3 RISC-V microcontroller developed by Espressif
®, a low-cost device featuring integrated Wi-Fi and Bluetooth, widely employed in prototyping, research projects, and commercial IoT applications. A custom program was implemented leveraging the Espressif IoT Development Framework (ESP-IDF), following the workflow depicted in
Figure 3, to produce sufficiently long binary sequences in accordance with the NIST SP 800-22 specifications (i.e., a minimum of 55 sequences of 1,000,000 bits each). For the experiments reported in this study, a dataset of 2 GB was generated utilizing the hardware-based TRNG of the ESP32-C3.
From this dataset, several experiments were conducted, each consisting of 1000 sequences of 1,000,000 bits. The generated sequences were analyzed using the proposed web-based tool. To validate the correctness and reliability of the tool, the same sequences were also evaluated using the original NIST software, and the outcomes were found to be consistent, confirming that the tool faithfully reproduces the results of the original implementation while providing enhanced usability, automated processing, and real-time reporting.
To demonstrate the practical applicability of the proposed framework, a prototype emulation of an IoT ecosystem was implemented, comprising multiple interconnected devices. Each emulated device periodically generates local random bit sequences, which are subsequently validated in real time by the framework before being used as seeds for cryptographic operations during data transmission.
The emulation was conducted using a Python-based script that simulates the validation requests of N IoT devices. Each device is represented by an independent thread that sends requests to the backend at randomized intervals, emulating asynchronous operation in a real-world network. Validation results are logged and visualized in real time, enabling assessment of how the dynamic quality of local entropy sources impacts the security of communications within a multi-node, concurrent environment.
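The thread-per-device emulation described above can be sketched as follows. This is a minimal illustration, not the script used in the study: `submit_for_validation` is a hypothetical stand-in for the HTTPS request to the backend, and the payload size, cycle count, and waiting times are arbitrary placeholder values.

```python
import random
import threading
import time


def submit_for_validation(device_id, payload):
    """Hypothetical stand-in for the HTTPS validation request to the backend."""
    return {"device": device_id, "bytes": len(payload)}


def device_loop(device_id, cycles, max_wait, log, lock):
    """Emulate one IoT device: wait a random interval, then submit data."""
    rng = random.Random(device_id)  # per-device generator for timing and payload
    for _ in range(cycles):
        time.sleep(rng.uniform(0.0, max_wait))  # randomized request interval
        payload = bytes(rng.getrandbits(8) for _ in range(16))  # placeholder data
        result = submit_for_validation(device_id, payload)
        with lock:  # the shared log is written by several threads
            log.append(result)


def run_emulation(n_devices, cycles=2, max_wait=0.01):
    """Start one thread per emulated device and collect all results."""
    log, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=device_loop, args=(i, cycles, max_wait, log, lock))
        for i in range(n_devices)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Calling `run_emulation(4)` returns one log entry per device and cycle; in a real client the stub would be replaced by an HTTPS call to the backend endpoints.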
The system ensures simultaneous evaluation of multiple devices without interference, maintaining data integrity and consistency of results. In addition, the framework allows execution of supplementary entropy tests via the GNU/Linux
ent utility through the same interface, and its modular design facilitates the integration of new randomness tests in future expansions. The topology of the emulated IoT ecosystem is depicted in
Figure 4.
To further evaluate the framework’s capability to distinguish between high- and low-quality entropy sources, the IoT emulation included a set of four devices, each drawing its sequences either from a software-based pseudo-random number generator (PRNG) or from the GNU/Linux /dev/urandom interface, which serves here as a TRNG-like source. Each device continuously transmitted generated sequences to the backend for real-time NIST SP 800-22 testing. For testing purposes, the number of sequences was limited to 55, while the length of 1,000,000 bits per sequence was maintained in accordance with the NIST specification. The results demonstrate that the framework can detect statistical deficiencies in PRNG-generated sequences in comparison to true random sequences. This highlights the system’s potential for immediate identification of inadequate entropy sources in cryptographic applications, providing actionable insights for developers and security engineers in IoT deployments.
3. Results
Table 1 presents the results of applying the NIST SP 800-22 suite to 1000 sequences of 1,000,000 bits from the 2 GB dataset generated by the ESP32-C3 as described above.
The table includes the Chi-square p-values ($\chi^2$), the Kolmogorov–Smirnov (KS) p-values, the number of sequences $n$, the number of sequences that passed the test $n_{\mathrm{pass}}$, the observed success proportion (proportion), the minimum pass rate (passrate), and the global result for each test. The Chi-square statistic quantifies the deviation between the observed and expected frequency distributions, serving as an indicator of how well the random data conforms to a uniform distribution [
24]. The KS test measures the maximum absolute difference between the empirical cumulative distribution function (ECDF) of the observed data and the cumulative distribution function (CDF) of the reference theoretical distribution, which in the context of randomness testing is typically the uniform distribution. This statistic quantifies the largest deviation between the empirical and ideal behavior expected from a perfectly random sequence. Owing to its non-parametric nature, the KS test is particularly valuable because it does not assume any specific form of the underlying distribution, making it sensitive to a wide range of deviations from uniformity [
25,
26].
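As an illustration of the two uniformity checks, the sketch below computes a ten-bin Chi-square statistic with its p-value and the KS statistic for a list of p-values. It is an assumption-laden sketch rather than the tool's code: the function names are hypothetical, the incomplete gamma function is evaluated with a closed form valid only for half-integer arguments (enough for nine degrees of freedom), and the KS p-value uses the asymptotic Kolmogorov series with a small-sample correction, so it is only approximate.

```python
import math


def igamc_half_integer(a, x):
    """Regularized upper incomplete gamma Q(a, x) for a = k + 1/2, k >= 0."""
    k = a - 0.5
    if k < 0 or k != int(k):
        raise ValueError("this sketch only supports half-integer a")
    # Q(1/2, x) = erfc(sqrt(x)); Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    q = math.erfc(math.sqrt(x))
    for j in range(int(k)):
        q += x ** (j + 0.5) * math.exp(-x) / math.gamma(j + 1.5)
    return q


def chi_square_uniformity(p_values, bins=10):
    """Chi-square statistic and p-value for p-values binned over [0, 1]."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    expected = len(p_values) / bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return chi2, igamc_half_integer((bins - 1) / 2, chi2 / 2)


def ks_uniformity(p_values, terms=100):
    """KS statistic against U(0, 1) and an approximate asymptotic p-value."""
    xs = sorted(p_values)
    n = len(xs)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    series = sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                 for j in range(1, terms + 1))
    return d, min(1.0, max(0.0, 2 * series))
```

A perfectly even set of p-values gives a Chi-square statistic of zero and a p-value of one, whereas p-values concentrated in a single bin drive the Chi-square p-value towards zero.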
According to NIST SP 800-22, each statistical test is applied to a set of $n$ independent binary sequences produced by the generator under evaluation. Every execution of a given test on a single sequence yields a p-value; that sequence is considered to have passed the test when its p-value satisfies $p\text{-value} \geq \alpha$, where $\alpha$ is the significance level (the probability of a Type I error). NIST recommends $\alpha = 0.01$ for the suite, i.e., a 1% nominal probability of incorrectly rejecting a truly random sequence. After processing all $n$ sequences, the observed success proportion is compared to the expected value and a goodness-of-fit analysis is performed on the collection of p-values. In this second stage the Kolmogorov–Smirnov (KS) test (or a chi-square test on binned p-values) is used to assess whether the distribution of the p-values is consistent with the uniform distribution on $[0, 1]$. The KS p-value is not used as a per-sequence decision rule against $\alpha$; rather it provides a global measure of uniformity: a large KS p-value indicates that the ensemble of p-values behaves as expected for randomness, while a small KS p-value signals systematic deviation from uniformity and therefore potential non-randomness in the generator. This two-tier procedure (per-sequence acceptance with $p\text{-value} \geq \alpha$, followed by global uniformity testing) combines sensitivity to individual failures with a test for collective anomalies in the p-value distribution [25, 27]. The observed pass proportion is then given by Equation (1):
$$\text{proportion} = \frac{n_{\mathrm{pass}}}{n} \tag{1}$$
where $n_{\mathrm{pass}}$ is the number of sequences that passed the test.
In order to determine whether the generator passes a given statistical test, NIST defines a minimum acceptable pass rate that accounts for statistical variability due to the finite sample size. This value is given by Equation (2):
$$\text{passrate} = (1 - \alpha) - 3\sqrt{\frac{\alpha\,(1 - \alpha)}{n}} \tag{2}$$
where the constant $3$ adopted by NIST corresponds approximately to the $99.73\%$ confidence interval of the normal distribution. Therefore, a test is considered successful if the observed pass proportion satisfies Equation (3):
$$\text{proportion} \geq \text{passrate} \tag{3}$$
This expression ensures that minor statistical fluctuations due to limited sample size do not incorrectly classify a test as failed. In other words, Equation (2) defines a statistically justified lower bound on the expected pass rate with a confidence level of approximately 99.7%.
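To make the threshold concrete, the following sketch evaluates the minimum pass rate for the two sample sizes used in this study, assuming the standard SP 800-22 expression $(1 - \alpha) - 3\sqrt{\alpha(1 - \alpha)/n}$ with $\alpha = 0.01$. The function names are illustrative and not part of the tool.

```python
import math


def min_pass_rate(n, alpha=0.01, k=3.0):
    """Lower bound on the acceptable proportion of passing sequences."""
    return (1 - alpha) - k * math.sqrt(alpha * (1 - alpha) / n)


def min_passing_sequences(n, alpha=0.01):
    """Smallest number of passing sequences whose proportion meets the bound."""
    return math.ceil(min_pass_rate(n, alpha) * n)


for n in (55, 1000):
    print(n, round(min_pass_rate(n), 4), min_passing_sequences(n))
```

For $n = 1000$ the bound is about 0.9806, so at least 981 sequences must pass; for $n = 55$ it is about 0.9498, so at least 53 of the 55 sequences must pass. The smaller sample tolerates proportionally more failures because its statistical uncertainty is larger.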
Within the NIST SP 800-22 statistical test suite, certain tests produce more than one p-value per analyzed sequence. This occurs because these tests evaluate the same statistical property under multiple conditions or subpatterns. For instance, the Cumulative Sums test performs two evaluations (forward and backward), while the Random Excursions and Random Excursions Variant tests generate several p-values, one for each state of the random walk that they examine, and are only applied to sequences with a sufficient number of cycles. Similarly, the Serial test computes two p-values, derived from the statistics $\nabla\psi^2_m$ and $\nabla^2\psi^2_m$, which compare the frequencies of overlapping patterns of length $m$, $m-1$, and $m-2$, and the Non-overlapping Template Matching test can generate up to 148 p-values, one for each analyzed template.
4. Discussion
In the original NIST implementation, each of these p-values is treated as an independent statistical observation, and the overall test result is obtained by analyzing the combined distribution of all p-values. Specifically, the uniformity of the p-values is evaluated using the chi-square goodness-of-fit test, and the proportion of sequences passing the significance criterion ($p\text{-value} \geq \alpha$) is computed over the entire consolidated set.
Following this approach, in the present work all p-values generated by a multi-output test were aggregated into a single column vector, which was subsequently used to calculate the chi-square and Kolmogorov–Smirnov statistics, the pass proportion, and the overall pass rate. This ensures that the final result of each NIST test corresponds to a single aggregated metric, thus maintaining consistency with the methodology applied to single-output tests.
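The aggregation step can be illustrated with a short sketch. The data layout (one list of p-values per sequence) and the function names are assumptions made for this example; the p-values themselves are made up.

```python
def aggregate_p_values(per_sequence):
    """Flatten per-sequence lists of p-values into one consolidated list."""
    return [p for sequence in per_sequence for p in sequence]


def pass_proportion(p_values, alpha=0.01):
    """Return (number passing, total, proportion) for p-value >= alpha."""
    passed = sum(1 for p in p_values if p >= alpha)
    return passed, len(p_values), passed / len(p_values)


# Illustrative values: three sequences, each with forward and backward
# Cumulative Sums p-values.
cusum = [[0.50, 0.30], [0.004, 0.70], [0.20, 0.90]]
pooled = aggregate_p_values(cusum)
print(pass_proportion(pooled))
```

Here the six pooled p-values yield five passes, so the proportion is computed over six observations rather than over three sequences.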
Regarding performance,
Table 2 presents the execution times obtained for the NIST SP 800-22 tests using three different implementations: the original NIST implementation, the optimized
Fast NIST version, and the proposed system described in this work. The test was performed using 55 independent sequences, each containing 1,000,000 bits, extracted from a 2 GB dataset of random binary sequences generated by the ESP32-C3 microcontroller. This setup ensures reproducibility and provides a representative sample for evaluating the statistical properties of the generated data. The results demonstrate a significant reduction in execution time compared to the official implementation, particularly in computationally intensive tests such as
Linear Complexity and
Non-overlapping Template. The optimized Fast NIST offers a substantial improvement over the original NIST implementation. The architecture proposed in this work builds on this improvement and achieves additional performance gains through asynchronous task management, concurrent test execution, and efficient server-side processing.
It should be noted that in
Table 2, the total execution time reported for this work (1750.60 ms) is not directly comparable to the totals in the NIST and Fast NIST columns, where tests are executed sequentially. In our framework, the reported time is not the arithmetic sum of individual test durations because multiple tests are executed in parallel across independent threads. Instead, the reported value, retrieved from the web application’s Activity Log, represents the total wall-clock time from the initiation of the first task to the completion of the last one. Furthermore, these results may exhibit subtle variations depending on the server’s instantaneous workload and CPU availability at the time of execution, as concurrent requests and resource scheduling in a multi-user environment can influence the final response time and computational overhead.
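The difference between wall-clock time and the sum of per-test durations can be reproduced with a small sketch. It assumes a thread pool and uses `time.sleep` as a stand-in for an external test executable; the test names and durations are arbitrary, and none of this is taken from the actual backend code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_test(name, duration):
    """Stand-in for one statistical test; sleeps instead of computing."""
    start = time.perf_counter()
    time.sleep(duration)
    return name, time.perf_counter() - start


def run_parallel(tests):
    """Run all tests concurrently; return per-test results and wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(tests)) as pool:
        results = list(pool.map(lambda t: fake_test(*t), tests))
    return results, time.perf_counter() - start


results, wall = run_parallel([("frequency", 0.05), ("runs", 0.05), ("serial", 0.05)])
total = sum(duration for _, duration in results)
print(f"sum of durations: {total:.3f} s, wall-clock: {wall:.3f} s")
```

With three concurrent tasks of equal length, the wall-clock time stays close to the duration of a single task, while the sum of the individual durations is roughly three times larger.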
The observed performance gain in the Non-overlapping Template test is primarily due to the multi-core task distribution and the asynchronous execution model. These results reflect an optimal resource allocation by the server’s task manager during the case study.
Additionally, the developed tool executes an entropy test using the
ent utility package available for GNU/Linux systems. This test complements the NIST SP 800-22 suite by measuring the average amount of information produced by a random source. The
ent utility estimates the randomness of a file through several statistical indicators to evaluate the quality of random sequences: the entropy (measured in bits per byte) represents the average information content of each byte, with a perfectly random sequence approaching 8.0 bits per byte; lower values indicate redundancy or predictability. The
optimum compression percentage estimates how much a file could be compressed without information loss, with values near 0% corresponding to highly random data. The
chi-square test compares the distribution of byte values (0–255) against a uniform distribution, outputting a percentage representing how often a truly random sequence would exceed the observed chi-square value. The
arithmetic mean value of data bytes reflects the average byte value, expected to be close to 127.5 in random sequences, with deviations revealing potential bias. The
Monte Carlo estimate of $\pi$ interprets successive groups of six bytes (three per coordinate) as points in a square and uses the fraction that falls inside the inscribed circle to estimate $\pi$, with estimates closer to the true value indicating higher uniformity. Finally, the
serial correlation coefficient measures the relationship between consecutive bytes; values near 0 indicate no correlation, whereas values approaching +1 or –1 reveal strong correlation or anti-correlation, exposing non-random structures.
Table 3 shows the results obtained for the 2 GB dataset file generated as indicated above.
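For readers who wish to reproduce these indicators without the ent utility, the sketch below approximates four of them in pure Python. It is not the ent source code: the function names are hypothetical, the serial correlation is computed as a lag-1 autocorrelation with wrap-around, and the Monte Carlo estimate assumes six bytes per point (three per coordinate), so small numerical differences from ent are possible.

```python
import math
from collections import Counter


def byte_entropy(data):
    """Shannon entropy in bits per byte (8.0 for a uniform byte distribution)."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def arithmetic_mean(data):
    """Average byte value (127.5 expected for uniform random bytes)."""
    return sum(data) / len(data)


def serial_correlation(data):
    """Lag-1 autocorrelation of the byte sequence, treated as circular."""
    n = len(data)
    s1 = sum(data)
    s2 = sum(b * b for b in data)
    s12 = sum(data[i] * data[(i + 1) % n] for i in range(n))
    denom = n * s2 - s1 * s1
    return None if denom == 0 else (n * s12 - s1 * s1) / denom  # None: all bytes equal


def monte_carlo_pi(data):
    """Estimate pi from the share of 24-bit (x, y) points inside a quarter circle."""
    if len(data) < 6:
        raise ValueError("at least six bytes are required")
    radius = 256 ** 3 - 1
    inside = tries = 0
    for i in range(0, len(data) - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big")
        y = int.from_bytes(data[i + 3:i + 6], "big")
        tries += 1
        if x * x + y * y <= radius * radius:
            inside += 1
    return 4.0 * inside / tries
```

A strongly structured input, such as a repeating ramp of byte values, still reaches 8.0 bits per byte of entropy but shows a serial correlation close to +1, which illustrates why the indicators are read together rather than in isolation.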
To emulate the behavior of an IoT ecosystem using the framework, a Python script was developed to simulate multiple IoT devices performing randomness-validation requests to the backend via HTTPS. For experimental purposes, each simulated device randomly selects, at every execution cycle, between two internal sources: (i) a TRNG-like source based on the GNU/Linux
/dev/urandom interface, and (ii) a PRNG implemented using Python’s
random module, which relies on the Mersenne Twister algorithm. Since Mersenne Twister is deterministic and not intended for cryptographic use, it provides a suitable contrast against the higher-entropy output typically obtained from
/dev/urandom. Each emulated client generates in memory 55 sequences of 1,000,000 bits (the minimum recommended by NIST) and periodically submits them for validation using independent threads and randomized time intervals, effectively reproducing the asynchronous and concurrent nature of real IoT communication. For each validation request, five representative NIST statistical tests were executed, as summarized in
Table 4. It is important to emphasize that the use of a PRNG does not necessarily imply that statistical tests will always fail. When the number of sequences is small or when only a limited subset of tests is applied, certain deterministic patterns may remain undetected, allowing the PRNG to temporarily produce outputs that satisfy the NIST SP 800-22 criteria. This behavior is expected, as the statistical power of the tests increases with both sample size and test diversity. Consequently, passing a small set of tests should not be interpreted as evidence of cryptographic strength, but rather as a reminder that deterministic generators may mimic randomness under restricted testing conditions.
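The two sources and the caveat about passing results can be illustrated with a short sketch. It is a simplified assumption of how a client might draw its data, not the emulation script itself: `generate_sequence` and `monobit_p_value` are hypothetical names, `os.urandom` is used as the portable counterpart of reading /dev/urandom, and only the Frequency (monobit) test is reproduced.

```python
import math
import os
import random


def generate_sequence(source, n_bits, seed=None):
    """Return n_bits of data as bytes from the OS source or Mersenne Twister."""
    n_bytes = n_bits // 8
    if source == "trng":
        return os.urandom(n_bytes)  # kernel-provided randomness
    if source == "prng":
        prng = random.Random(seed)  # deterministic Mersenne Twister
        return prng.getrandbits(n_bytes * 8).to_bytes(n_bytes, "big")
    raise ValueError("source must be 'trng' or 'prng'")


def monobit_p_value(data):
    """Frequency (monobit) test: p-value for the balance of ones and zeros."""
    n = 8 * len(data)
    ones = sum(bin(byte).count("1") for byte in data)
    s_obs = abs(2 * ones - n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


for source in ("trng", "prng"):
    sequence = generate_sequence(source, 1_000_000, seed=1)
    print(source, round(monobit_p_value(sequence), 4))
```

Output from the Mersenne Twister is well balanced, so it will usually pass this single test, which is consistent with the observation that a small subset of tests cannot by itself expose a deterministic generator.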
For each client request, the backend processes the data and returns the results in real time, including key statistical metrics and the execution time of each test. The execution time reflects the computational performance and responsiveness of the system under varying workloads. Representative results obtained from four emulated IoT clients in one experiment are summarized in
Table 5, demonstrating the system’s scalability, responsiveness, and robustness when handling concurrent, high-load scenarios. As can be observed in the table, all the statistical tests applied to the true random number generator (TRNG) passed successfully, achieving proportions above the minimum pass rate threshold. However, for the pseudo-random number generator (PRNG), the
Frequency (IoT Device Number 3) and
Longest Run of Ones in a Block (IoT Device Number 2) tests reported a
FAILURE, indicating that, for the corresponding device, the generated sequence did not exhibit the expected uniformity. This is because a pseudo-random number generator (PRNG) relies on deterministic algorithms, which means its output inevitably contains structural patterns that can cause certain NIST SP 800-22 tests to fail, especially when the internal state or seeding process is not ideal. In contrast, a true random number generator (TRNG) extracts randomness from a physical source of entropy, such as thermal noise or quantum effects, which naturally produces sequences with no algorithmic structure. As a result, TRNG outputs tend to pass the NIST statistical tests more consistently, since they more closely resemble the behavior of an ideal random process. This highlights the importance of studying and developing high-quality entropy sources.
5. Conclusions
This work presented the development of an automated framework for executing and analyzing the randomness tests defined in the NIST SP 800-22 standard, designed to evaluate entropy sources used in security applications within the Internet of Things (IoT). The proposed solution integrates a Python-based backend and a lightweight, browser-accessible frontend, enabling remote, concurrent, and reproducible execution of statistical tests on binary sequences generated by different devices or systems.
The system’s modular architecture, based on REST endpoints and Redis-managed sessions, allows the simultaneous management of multiple users or devices while ensuring data isolation and integrity of results. Deployment on a GNU/Linux environment with Flask and Nginx provides stability, security, and scalability, allowing the framework to be adapted easily to both experimental and production contexts.
Experimental validation was conducted using 2 GB of data generated by the hardware TRNG of the ESP32-C3, from which 1000 sequences of 1,000,000 bits were evaluated. All fifteen statistical tests defined in NIST SP 800-22 were successfully passed, with observed proportions exceeding the minimum required pass rates and global p-value distributions consistent with theoretical expectations. Cross-validation against the official NIST implementation and Fast NIST confirmed statistical equivalence of results, demonstrating the correctness of the proposed framework.
In terms of performance, the proposed architecture reduced total execution time to approximately 1.75 s for 55 sequences of 1 Mbit, outperforming both the original NIST implementation (∼49.3 s) and Fast NIST (∼2.27 s). This improvement is achieved through asynchronous execution and efficient task parallelization, making the framework suitable for large-scale and near real-time evaluations. Complementary entropy analysis using the ent tool yielded an estimated entropy of 7.999999 bits per byte with negligible serial correlation, further supporting the statistical quality of the evaluated source.
Validation in a simulated IoT environment demonstrated the framework’s ability to manage concurrent requests from heterogeneous devices and to detect statistical deficiencies in deterministic generators when compared with true entropy sources. These results confirm the robustness, scalability, and diagnostic capability of the system for distributed and resource-constrained environments.
Finally, the tool has been made publicly accessible through an open and easy-to-use framework that promotes transparency, reproducibility, and broader adoption of statistical testing practices. Its modular architecture makes it easily extensible, enabling future integration of additional statistical test suites such as Dieharder and deployment in real IoT testbeds for continuous entropy monitoring in emerging IoT and cryptographic applications.