# Evaluation of Pseudo-Random Number Generation on GPU Cards


## Abstract


## 1. Introduction

## 2. Methodology

#### 2.1. GPU Architecture

#### 2.2. PRNG Parameters

#### 2.3. Experimental Setting

The execution time is measured using the `clock_t`, `cudaEvent_t`, `nvprof`, and `nsys` profiling tools. All calculations are made in single-precision floating-point format.
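As an illustration of the timing methodology, the following CPU-side sketch measures the wall-clock time of a PRN generation loop with `std::chrono`; this is our stand-in for the `clock_t`/`cudaEvent_t` measurements used on the GPU, and the function name and loop body are illustrative, not the paper's code:

```cpp
#include <chrono>
#include <cstddef>
#include <random>

// Hedged CPU-side analogue of the timing setup: generate n single-precision
// uniform PRNs and return the elapsed wall-clock time in milliseconds.
// On the GPU, cudaEvent_t pairs play the role of the two clock reads.
double time_generation_ms(std::size_t n) {
    std::mt19937 gen(12345);  // stand-in for the GPU PRNG
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    volatile float sink = 0.0f;  // keeps the loop from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) sink = sink + uni(gen);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```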

We consider four implementations: (1) the GPU implementation (`G_imp`), where the uniform PRNs are generated and then transformed into non-uniform distributions (i.e., the Rayleigh, Beta, and Gamma distributions [59,60,61]) using the AR method on the GPU only; (2) the CPU implementation (`C_imp`), where all calculations take place on a single core of the CPU; (3) the hybrid implementation (`H_imp`), in which the GPU generates uniform PRNs and then transfers them to the RAM, where the AR method is applied sequentially on a single core of the CPU to obtain the non-uniform distributions; and (4) the GPU implementation with memory copy from the device to the host (`G_imp_mcpy`), which accounts for the time spent on transferring data from the GPU to the CPU at the end of the calculation.
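The AR step can be sketched as follows, with `std::mt19937` standing in for the GPU generator. The target here is the Beta(2,2) density scaled to a peak of 1, which is consistent with the ≈0.67 acceptance area for the Beta distribution quoted in Figure 1; all names are ours, not the paper's:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Normalized Beta(2,2) density: 6x(1-x) divided by its peak value 1.5,
// so that f(x) <= 1. The area under this curve (2/3) is the expected
// acceptance rate of the AR method for this distribution.
double beta22_normalized(double x) { return 4.0 * x * (1.0 - x); }

// Acceptance-rejection: draw a candidate x ~ U(0,1) and a test variate
// u ~ U(0,1); keep x whenever u <= f(x).
std::vector<double> ar_sample(std::size_t n_candidates, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<double> accepted;
    for (std::size_t i = 0; i < n_candidates; ++i) {
        double x = uni(gen);  // candidate point
        double u = uni(gen);  // acceptance test variate
        if (u <= beta22_normalized(x)) accepted.push_back(x);
    }
    return accepted;  // roughly 2/3 of the candidates survive
}
```

In `G_imp` both the generation and this selection loop run on the GPU; in `H_imp` only the uniform draws come from the GPU and the loop above runs on a single CPU core.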

There are two ways to realize the `G_imp` implementation: using the host API and the device API. For the former, the library functions are called on the CPU, but the computations are performed on the GPU. The PRNs can then be stored in the device’s global memory or copied to the CPU for subsequent use. The device API allows us to call device functions that set up the PRNG parameters, such as the seed, state, sequence, etc., inside a GPU kernel. The generated PRNs can be used immediately in the GPU kernel without being stored in global memory. To minimize data movement within the GPU memory, we implement all PRNG steps (such as the seed setup, state update, etc.) in the device API using a single kernel. In Section 3.3, we present a detailed comparison of these two API implementations.

## 3. Results

#### 3.1. Comparison of GPU and CPU Implementations

For small N, the `C_imp` implementation is faster than `G_imp` and `H_imp` by up to two orders of magnitude. This difference in speed is caused by the low occupancy of the GPU of $\sim 3\%$ (the GPU occupancy is defined as the ratio of the number of active warps to the maximum number of warps that can run on the GPU). For small N, in contrast to the CPU cores, the GPU cores are not fully utilized. Since a single GPU core is slower than a single CPU core, a partially utilized GPU leads to slower execution compared to a single CPU core. The `H_imp` implementation, in which the PRNs are generated on the GPU but AR-selected on the CPU, performs similarly to the `G_imp` implementation. This is again caused by the low occupancy of the GPU, which becomes a bottleneck for this implementation.
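The occupancy argument can be made concrete with a toy estimate. The warp size of 32 is a CUDA constant; the maximum-resident-warp count is a device property, here passed in as a parameter, and the function is our illustration rather than an actual CUDA query:

```cpp
#include <algorithm>
#include <cstdint>

// Toy occupancy estimate: N work items are packed into warps of 32
// threads; occupancy is the fraction of the GPU's maximum resident warps
// that those warps fill (capped at 1). Small N leaves most of the GPU
// idle, which is why C_imp wins at low loads.
double occupancy_estimate(std::int64_t n, std::int64_t max_resident_warps) {
    std::int64_t warps_needed = (n + 31) / 32;  // ceil(n / 32)
    return std::min(1.0,
                    static_cast<double>(warps_needed) /
                    static_cast<double>(max_resident_warps));
}
```

With a few thousand resident warps available on a modern card, $N \sim 10^{3}$ fills well under 1% of them, while $N \gtrsim 10^{6}$ saturates the device.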

At large N, the `H_imp` implementation performs faster than `C_imp` by a factor of roughly seven, but slower than the `G_imp` implementation by a factor of $\simeq 46$.

The execution times of the `G_imp_mcpy` and `G_imp` implementations are similar to each other at $N \lesssim 10^{6}$, but the two diverge at larger N. This happens because the transfer of data from the GPU to the CPU becomes the bottleneck for the `G_imp_mcpy` implementation for $N \gtrsim 10^{6}$, whereas at lower loads the time spent on data transfer is negligible compared to the other components of the computation (e.g., seed setup, PRNG state update, and API function calls).

The execution time per candidate point for the `C_imp` implementation is constant and independent of N, which shows that the CPU core is fully utilized for any N (according to the `htop` utility). In the `G_imp` implementation, due to the availability of many cores, the lowest execution time is reached at a much larger N of $\sim 10^{8}$. For `G_imp_mcpy`, the execution time per candidate point reaches its minimum at $N \sim 10^{7}$ due to the overhead caused by the GPU-to-CPU data transfer. For `H_imp`, this execution time decreases with N until $N \sim 10^{5}$; at $N \gtrsim 10^{5}$, it becomes independent of N. Since for this implementation the AR algorithm is performed on the CPU, and since saturation is reached at a much higher N for the GPU implementation `G_imp`, this independence of N suggests that the CPU becomes the bottleneck of the computation for $N \gtrsim 10^{5}$ in the `H_imp` implementation.

The GPU power consumption is measured using `nvidia-smi`. For the CPU, since it is fully loaded for any value of N according to the data from the `htop` utility, we use the maximum wattage of 10 W for a single AMD Threadripper 3990X core. The product of the power and the execution time yields the energy usage for various N. From Figure 2, we can conclude that for $N \gtrsim 2 \times 10^{5}$ the `G_imp` implementation is preferable in terms of speed and energy efficiency, while for smaller N, the GPU offers an advantage neither in terms of energy usage nor execution time.
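The energy estimate used here is simply power × time. A minimal sketch follows; the 10 W per-core CPU figure is from the text, while the 150 W GPU draw in the test values is a placeholder standing in for an `nvidia-smi` reading, not a measured value:

```cpp
// Energy (J) = average power draw (W) x execution time (s).
// The CPU core power (10 W) comes from the text; GPU power would be read
// from nvidia-smi during the run.
double energy_joules(double power_watts, double time_seconds) {
    return power_watts * time_seconds;
}
```

For example, a GPU drawing 150 W but finishing in 0.01 s uses 1.5 J, versus 10 J for a 10 W CPU core that needs a full second, which is the sense in which `G_imp` becomes energy-efficient once it is fast enough.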

#### 3.2. Operation-Wise Load Analysis

Figure 3a shows the execution times of the different computational parts for the `G_imp` implementation, and Figure 3b shows the corresponding percentage fraction with respect to the total computation time. The contribution of the AR algorithm is small for $N \lesssim 10^{5}$, but then grows gradually with N and reaches $\sim 50\%$ for $N \sim 10^{9}$. The rest of the execution time goes to API function calls, the fraction of which is $\sim 40\%$ for $N \lesssim 10^{2}$, but drops below $\sim 20\%$ for $N \gtrsim 10^{3}$. However, as we will see below, these findings depend on the PRNG parameters, including the seed setup and the state size. The `G_imp_mcpy` implementation, shown in Figure 3c,d, exhibits a qualitatively similar breakdown of the total cost for $N \lesssim 10^{6}$. However, the contribution of the data transfer from the GPU to the CPU becomes dominant for $N > 10^{6}$ (e.g., at $N \sim 10^{8}$, it reaches $\sim 80\%$). Hence, the relative contributions of the rest of the computations, including the AR algorithm, become smaller compared to those in the `G_imp` implementation.

#### 3.3. Device API and Host API Implementations

As described in Section 2.3, there are two ways to realize the `G_imp` implementation: using the host API and the device API. Figure 4a shows the execution time for the Beta distribution in the host and device API implementations for the MRG32k3a PRNG. The device API is faster than the host API implementation by about an order of magnitude for $N \lesssim 10^{6}$. This is attributed to the PRNG seed setup time (shown as the dashed blue line), which is a bottleneck for the host API implementation for $N \lesssim 10^{6}$. The host API implementation of the MRG32k3a generator by default initializes $32{,}768$ threads for any number of generated PRNs. Since each thread has its own state, the host API implementation has the same seed setup time of $\sim 6.1$ ms for any N. It is possible to speed up the PRNG seed setup by varying the ordering parameter, i.e., the parameter that determines how the PRNs are ordered in device memory [65]. For example, using the `CURAND_ORDERING_PSEUDO_LEGACY` option for MRG32k3a can speed up the seed setup by a factor of $\sim 6$ compared to the default option. This is possible because fewer (4096) threads are initialized for this option. In the device API, for any N, we can choose a suitable number of initialized threads. As a result, the PRNG seed setup time in the device API implementation does not stay the same for all N; instead, it increases with N from $\sim 0.116$ ms at $N = 10$ to $\sim 5.36$ ms at $N = 10^{9}$ (red dashed line in Figure 4a). Acceleration of the seed setup by changing the ordering option has also been discussed in [15,62].
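The thread-count behavior described above can be summarized in a small sketch. The function names are ours; the 32,768 and 4096 state counts are the MRG32k3a defaults quoted in the text, and the one-state-per-PRN cap for the device API is an illustrative simplification:

```cpp
#include <algorithm>
#include <cstdint>

// Host API: cuRAND initializes a fixed pool of generator states for
// MRG32k3a regardless of how many PRNs are requested, so the seed setup
// cost is flat in N: 32,768 states by default, or 4096 with the
// CURAND_ORDERING_PSEUDO_LEGACY ordering.
std::int64_t host_api_states(std::int64_t /*n*/, bool legacy_ordering) {
    return legacy_ordering ? 4096 : 32768;
}

// Device API: the caller chooses the thread count, so small requests can
// initialize only as many states as threads actually launched
// (illustrative cap of one state per generated PRN).
std::int64_t device_api_states(std::int64_t n, std::int64_t max_threads) {
    return std::min(n, max_threads);
}
```

This is why the host API pays the same $\sim 6.1$ ms setup cost even for $N = 10$, while the device API setup time starts near zero and only grows as N approaches the full thread count.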

#### 3.4. Comparison of Different PRNGs

#### 3.5. Comparison of Different GPU Cards

#### 3.6. Dependence on Distributions

Figure 10 shows the execution times of the different distributions in the `G_imp` implementation as a function of the number of “accepted” PRNs (${N}_{a}$). Since the uniform distribution does not require the application of the AR algorithm, it is the fastest to compute. For example, for ${N}_{a} = 10^{6}$, it is faster than the Gamma, Rayleigh, and Beta distributions by factors of 2.78, 1.90, and 1.60, respectively. The different execution times of the non-uniform distributions are a direct result of the different acceptance rates of each distribution (cf. Section 2.3). The execution times of the different distributions diverge further with growing ${N}_{a}$. The reason for this behavior is that at smaller ${N}_{a}$, the contribution of the state setup and API function calls to the overall execution time dominates over the contributions of the AR algorithm and the PRN state update algorithm. At large ${N}_{a}$, the contributions of the latter two become more pronounced, leading to stronger differences. For example, at ${N}_{a} = 10^{8}$, the uniform distribution is faster than the Gamma, Rayleigh, and Beta distributions by factors of $10.65$, $4.25$, and $2.49$, respectively.
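Since each distribution accepts a different fraction of candidates (the areas in Figure 1: Beta ≈ 0.67, Rayleigh ≈ 0.41, Gamma ≈ 0.36), the expected number of candidate PRNs needed for a given ${N}_{a}$ can be estimated with a back-of-the-envelope calculation; the helper below is our illustration, not the paper's code:

```cpp
#include <cmath>
#include <cstdint>

// Expected number of uniform candidates required to retain n_accepted
// samples under acceptance-rejection with acceptance probability p:
// each candidate survives with probability p, so on average n_accepted / p
// candidates must be drawn.
std::int64_t candidates_needed(std::int64_t n_accepted, double p) {
    return static_cast<std::int64_t>(std::ceil(n_accepted / p));
}
```

For ${N}_{a} = 10^{6}$, this gives roughly $1.5 \times 10^{6}$ candidates for the Beta distribution but about $2.8 \times 10^{6}$ for the Gamma distribution, which is one reason the Gamma distribution is the slowest of the three.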

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Acknowledgments

## Conflicts of Interest

## References

- Xanthis, C.; Venetis, I.; Chalkias, A.; Aletras, A. MRISIMUL: A GPU-based parallel approach to MRI simulations. IEEE Trans. Med. Imaging
**2014**, 33, 607–617. [Google Scholar] [CrossRef] [PubMed] - Yudanov, D.; Shaaban, M.; Melton, R.; Reznik, L. GPU-based simulation of spiking neural networks with real-time performance & high accuracy. In Proceedings of the International Joint Conference on Neural Networks, Barcelona, Spain, 18–23 July 2010; pp. 1–8. [Google Scholar] [CrossRef]
- Dolan, R.; DeSouza, G. GPU-based simulation of cellular neural networks for image processing. In Proceedings of the International Joint Conference on Neural Networks, Atlanta, GA, USA, 14–19 June 2009; pp. 730–735. [Google Scholar] [CrossRef] [Green Version]
- Heimlich, A.; Mol, A.; Pereira, C. GPU-based Monte Carlo simulation in neutron transport and finite differences heat equation evaluation. Prog. Nucl. Energy
**2011**, 53, 229–239. [Google Scholar] [CrossRef] - Liang, Y.; Xing, X.; Li, Y. A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions. J. Comput. Phys.
**2017**, 338, 252–268. [Google Scholar] [CrossRef] [Green Version] - Wang, L.; Spurzem, R.; Aarseth, S.; Giersz, M.; Askar, A.; Berczik, P.; Naab, T.; Schadow, R.; Kouwenhoven, M. The DRAGON simulations: Globular cluster evolution with a million stars. Mon. Not. R. Astron. Soc.
**2016**, 458, 1450–1465. [Google Scholar] [CrossRef] [Green Version] - Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd ed.; Cambridge University Press: Cambridge, MA, USA, 2007. [Google Scholar]
- Hastings, W. Monte carlo sampling methods using Markov chains and their applications. Biometrika
**1970**, 57, 97–109. [Google Scholar] [CrossRef] - Kroese, D.; Brereton, T.; Taimre, T.; Botev, Z. Why the Monte Carlo method is so important today. Wiley Interdiscip. Rev. Comput. Stat.
**2014**, 6, 386–392. [Google Scholar] [CrossRef] - Abdikamalov, E.; Burrows, A.; Ott, C.; Löffler, F.; O’Connor, E.; Dolence, J.; Schnetter, E. A new monte carlo method for time-dependent neutrino radiation transport. Astrophys. J.
**2012**, 755, 111. [Google Scholar] [CrossRef] [Green Version] - Richers, S.; Kasen, D.; O’Connor, E.; Fernández, R.; Ott, C. Monte Carlo Neutrino Transport Through Remnant Disks from Neutron Star Mergers. Astrophys. J.
**2015**, 813, 38. [Google Scholar] [CrossRef] [Green Version] - Murchikova, E.; Abdikamalov, E.; Urbatsch, T. Analytic closures for M1 neutrino transport. Mon. Not. R. Astron. Soc.
**2017**, 469, 1725–1737. [Google Scholar] [CrossRef] [Green Version] - Foucart, F.; Duez, M.; Hebert, F.; Kidder, L.; Pfeiffer, H.; Scheel, M. Monte-Carlo Neutrino Transport in Neutron Star Merger Simulations. Astrophys. J. Lett.
**2020**, 902, L27. [Google Scholar] [CrossRef] - Richers, S. Rank-3 moment closures in general relativistic neutrino transport. Phys. Rev. D
**2020**, 102, 083017. [Google Scholar] [CrossRef] - Fatica, M.; Phillips, E. Pricing American options with least squares Monte Carlo on GPUs. In Proceedings of the WHPCF 2013: 6th Workshop on High Performance Computational Finance—Held in Conjunction with SC 2013: The International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 17–22 November 2013; pp. 1–6. [Google Scholar] [CrossRef]
- Karl, A.T.; Eubank, R.; Milovanovic, J.; Reiser, M.; Young, D. Using RngStreams for parallel random number generation in C++ and R. Comput. Stat.
**2014**, 29, 1301–1320. [Google Scholar] [CrossRef] [Green Version] - Entacher, K.; Uhl, A.; Wegenkittl, S. Parallel random number generation: Long-range correlations among multiple processors. In International Conference of the Austrian Center for Parallel Computation; Springer: Berlin/Heidelberg, Germany, 1999; pp. 107–116. [Google Scholar]
- Entacher, K. On the CRAY-system random number generator. Simulation
**1999**, 72, 163–169. [Google Scholar] [CrossRef] - Coddington, P.D. Random number generators for parallel computers. Northeast. Parallel Archit. Cent.
**1997**, 2. Available online: https://surface.syr.edu/cgi/viewcontent.cgi?article=1012&context=npac (accessed on 3 November 2021). - De Matteis, A.; Pagnutti, S. Parallelization of random number generators and long-range correlations. Numer. Math.
**1988**, 53, 595–608. [Google Scholar] [CrossRef] - l’Ecuyer, P. Random number generation with multiple streams for sequential and parallel computing. In Proceedings of the 2015 Winter Simulation Conference (WSC), Huntington Beach, CA, USA, 6–9 December 2015; pp. 31–44. [Google Scholar]
- Manssen, M.; Weigel, M.; Hartmann, A.K. Random number generators for massively parallel simulations on GPU. Eur. Phys. J. Spec. Top.
**2012**, 210, 53–71. [Google Scholar] [CrossRef] [Green Version] - Kirk, D.; Wen-Mei, W.H. Programming Massively Parallel Processors: A Hands-On Approach; Morgan Kaufmann: Burlington, MA, USA, 2016. [Google Scholar]
- L’Ecuyer, P.; Oreshkin, B.; Simard, R. Random Numbers for Parallel Computers: Requirements and Methods. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.434.9223&rep=rep1&type=pdf (accessed on 3 November 2021).
- Wadden, J.; Brunelle, N.; Wang, K.; El-Hadedy, M.; Robins, G.; Stan, M.; Skadron, K. Generating efficient and high-quality pseudo-random behavior on Automata Processors. In Proceedings of the 2016 IEEE 34th International Conference on Computer Design (ICCD), Scottsdale, AZ, USA, 2–5 October 2016; pp. 622–629. [Google Scholar]
- Ciglarič, T.; Češnovar, R.; Štrumbelj, E. An OpenCL library for parallel random number generators. J. Supercomput.
**2019**, 75, 3866–3881. [Google Scholar] [CrossRef] - Demchik, V. Pseudorandom numbers generation for Monte Carlo simulations on GPUs: OpenCL approach. In Numerical Computations with GPUs; Springer: Berlin/Heidelberg, Germany, 2014; pp. 245–271. [Google Scholar]
- Kim, Y.; Hwang, G. Efficient Parallel CUDA Random Number Generator on NVIDIA GPUs. J. KIISE
**2015**, 42, 1467–1473. [Google Scholar] [CrossRef] - Mohanty, S.; Mohanty, A.; Carminati, F. Efficient pseudo-random number generation for monte-carlo simulations using graphic processors. J. Phys.
**2012**, 368, 012024. [Google Scholar] [CrossRef] [Green Version] - Barash, L.Y.; Shchur, L.N. PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs. Comput. Phys. Commun.
**2014**, 185, 1343–1353. [Google Scholar] [CrossRef] [Green Version] - Bradley, T.; du Toit, J.; Tong, R.; Giles, M.; Woodhams, P. Parallelization techniques for random number generators. In GPU Computing Gems Emerald Edition; Elsevier: Amsterdam, The Netherlands, 2011; pp. 231–246. [Google Scholar]
- Sussman, M.; Crutchfield, W.; Papakipos, M. Pseudorandom number generation on the GPU. In Proceedings of the SIGGRAPH/Eurographics Workshop on Graphics Hardware, Vienna, Austria, 3–4 September 2006; pp. 87–94. [Google Scholar] [CrossRef]
- Abeywardana, N. Efficient Random Number Generation for Fermi Class GPUs. Available online: https://www.proquest.com/openview/e4cd0bc00b2dd0572824fe304b5851e4/1?pq-origsite=gscholar&cbl=18750 (accessed on 3 November 2021).
- Howes, L.; Thomas, D. Efficient random number generation and application using CUDA. GPU Gems
**2007**, 3, 805–830. [Google Scholar] - Preis, T.; Virnau, P.; Paul, W.; Schneider, J. GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model. J. Comput. Phys.
**2009**, 228, 4468–4477. [Google Scholar] [CrossRef] - Thomas, D.B.; Howes, L.; Luk, W. A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2009; pp. 63–72. [Google Scholar]
- Anker, M. Pseudo Random Number Generators on Graphics Processing Units, with Applications in Finance. Mémoire de maîtrise à l’Université d’Edinburgh. 2013. Available online: https://static.epcc.ed.ac.uk/dissertations/hpc-msc/2012-2013/Pseudo (accessed on 3 November 2021).
- Jia, X.; Gu, X.; Sempau, J.; Choi, D.; Majumdar, A.; Jiang, S. Development of a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport. Phys. Med. Biol.
**2010**, 55, 3077–3086. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Alerstam, E.; Svensson, T.; Andersson-Engels, S. Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration. J. Biomed. Opt.
**2008**, 13, 060504. [Google Scholar] [CrossRef] [Green Version] - Bert, J.; Perez-Ponce, H.; Bitar, Z.; Jan, S.; Boursier, Y.; Vintache, D.; Bonissent, A.; Morel, C.; Brasse, D.; Visvikis, D. Geant4-based Monte Carlo simulations on GPU for medical applications. Phys. Med. Biol.
**2013**, 58, 5593–5611. [Google Scholar] [CrossRef] [PubMed] - Okada, S.; Murakami, K.; Incerti, S.; Amako, K.; Sasaki, T. MPEXS-DNA, a new GPU-based Monte Carlo simulator for track structures and radiation chemistry at subcellular scale. Med. Phys.
**2019**, 46, 1483–1500. [Google Scholar] [CrossRef] [Green Version] - Spiechowicz, J.; Kostur, M.; Machura, L. GPU accelerated Monte Carlo simulation of Brownian motors dynamics with CUDA. Comput. Phys. Commun.
**2015**, 191, 140–149. [Google Scholar] [CrossRef] [Green Version] - Ayubian, S.; Alawneh, S.; Thijssen, J. GPU-based monte-carlo simulation for a sea ice load application. In Proceedings of the Summer Computer Simulation Conference, Montreal, QC, Canada, 24–27 July 2016; pp. 1–8. [Google Scholar]
- Langdon, W.B. PRNG Random Numbers on GPU; Technical Report; University of Essex: Colchester, UK, 2007. [Google Scholar]
- Passerat-Palmbach, J.; Mazel, C.; Hill, D.R. Pseudo-random number generation on GP-GPU. In Proceedings of the 2011 IEEE Workshop on Principles of Advanced and Distributed Simulation, Nice, France, 14–17 June 2011; pp. 1–8. [Google Scholar]
- Fog, A. Pseudo-random number generators for vector processors and multicore processors. J. Mod. Appl. Stat. Methods
**2015**, 14, 23. [Google Scholar] [CrossRef] [Green Version] - Beliakov, G.; Johnstone, M.; Creighton, D.; Wilkin, T. An efficient implementation of Bailey and Borwein’s algorithm for parallel random number generation on graphics processing units. Computing
**2013**, 95, 309–326. [Google Scholar] [CrossRef] - Gong, C.; Liu, J.; Chi, L.; Hu, Q.; Deng, L.; Gong, Z. Accelerating Pseudo-Random Number Generator for MCNP on GPU. AIP Conf. Proc.
**2010**, 1281, 1335–1337. [Google Scholar] - Gao, S.; Peterson, G.D. GASPRNG: GPU accelerated scalable parallel random number generator library. Comput. Phys. Commun.
**2013**, 184, 1241–1249. [Google Scholar] [CrossRef] [Green Version] - Monfared, S.K.; Hajihassani, O.; Kiarostami, M.S.; Zanjani, S.M.; Rahmati, D.; Gorgin, S. BSRNG: A High Throughput Parallel BitSliced Approach for Random Number Generators. In Proceedings of the 49th International Conference on Parallel Processing-ICPP, Workshops, Edmonton, AB, Canada, 17–20 August 2020; pp. 1–10. [Google Scholar]
- Pang, W.M.; Wong, T.T.; Heng, P.A. Generating massive high-quality random numbers using GPU. In Proceedings of the 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–6 June 2008; pp. 841–847. [Google Scholar]
- Yang, B.; Hu, Q.; Liu, J.; Gong, C. GPU optimized Pseudo Random Number Generator for MCNP. In Proceedings of the IEEE Conference Anthology, Shanghai, China, 1–8 January 2013; pp. 1–6. [Google Scholar]
- Nandapalan, N.; Brent, R.P.; Murray, L.M.; Rendell, A.P. High-performance pseudo-random number generation on graphics processing units. In International Conference on Parallel Processing and Applied Mathematics; Springer: Berlin/Heidelberg, Germany, 2011; pp. 609–618. [Google Scholar]
- Kargaran, H.; Minuchehr, A.; Zolfaghari, A. The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran. AIP Adv.
**2016**, 6, 045101. [Google Scholar] [CrossRef] [Green Version] - Riesinger, C.; Neckel, T.; Rupp, F.; Hinojosa, A.P.; Bungartz, H.J. Gpu optimization of pseudo random number generators for random ordinary differential equations. Procedia Comput. Sci.
**2014**, 29, 172–183. [Google Scholar] [CrossRef] [Green Version] - Jun, S.; Canal, P.; Apostolakis, J.; Gheata, A.; Moneta, L. Vectorization of random number generation and reproducibility of concurrent particle transport simulation. J. Phys.
**2020**, 1525, 012054. [Google Scholar] [CrossRef] - Amadio, G.; Canal, P.; Piparo, D.; Wenzel, S. Speeding up software with VecCore. J. Phys. Conf. Ser.
**2018**, 1085, 032034. [Google Scholar] [CrossRef] - Gregg, C.; Hazelwood, K. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In Proceedings of the (IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA, 10–12 April 2011; pp. 134–144. [Google Scholar]
- Hoffman, D.; Karst, O.J. The theory of the Rayleigh distribution and some of its applications. J. Ship Res.
**1975**, 19, 172–191. [Google Scholar] [CrossRef] - Theodoridis, S. Chapter 2—Probability and Stochastic Processes. In Machine Learning, 2nd ed.; Theodoridis, S., Ed.; Academic Press: Cambridge, MA, USA, 2020; pp. 19–65. [Google Scholar] [CrossRef]
- Papoulis, A. Probability, Random Variables and Stochastic Processes. IEEE Trans. Acoust. Speech Signal Process.
**1985**, 33, 1637. [Google Scholar] [CrossRef] - Fatica, M.; Ruetsch, G. CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming; Elsevier Inc.: Amsterdam, The Netherlands, 2013; pp. 1–323. [Google Scholar] [CrossRef]
- Nvidia, C. CUDA C Programming Guide, Version 11.2; NVIDIA Corp.: 2020. Available online: https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (accessed on 3 November 2021).
- Nvidia, C. CUDA C Best Practices Guide; NVIDIA Corp.: 2020. Available online: https://www.clear.rice.edu/comp422/resources/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf (accessed on 3 November 2021).
- Nvidia, C. Toolkit 11.0 CURAND Guide. 2020. Available online: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html (accessed on 3 November 2021).
- Marsaglia, G. Xorshift RNGs. J. Stat. Softw.
**2003**, 8, 1–6. [Google Scholar] [CrossRef] - Saito, M.; Matsumoto, M. Variants of Mersenne twister suitable for graphic processors. ACM Trans. Math. Softw.
**2013**, 39, 1–20. [Google Scholar] [CrossRef] [Green Version] - L’ecuyer, P. Good parameters and implementations for combined multiple recursive random number generators. Oper. Res.
**1999**, 47, 159–164. [Google Scholar] [CrossRef] [Green Version] - Salmon, J.K.; Moraes, M.A.; Dror, R.O.; Shaw, D.E. Parallel random numbers: As easy as 1, 2, 3. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, WA, USA, 12–18 November 2011; pp. 1–12. [Google Scholar]
- Matsumoto, M.; Nishimura, T. Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul. (TOMACS)
**1998**, 8, 3–30. [Google Scholar] [CrossRef] [Green Version] - Fog, A. Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs; Technical Report; Copenhagen University College of Engineering: Ballerup, Denmark, 2021. [Google Scholar]

**Figure 1.**The probability density functions of the Beta, Gamma, and Rayleigh distributions (normalized to $f\left(x\right)\le 1$). The areas under the curves of Beta, Gamma, and Rayleigh distributions are 0.67, 0.36, and 0.41, respectively.

**Figure 2.** Execution time of the computation (**a**), execution time per candidate point (**b**), and an approximate energy consumption (**c**) for different implementations as a function of N for the Beta distribution. The solid black line represents the CPU implementation (`C_imp`); the dashed red line is for the GPU implementation (`G_imp`), while the solid red line is for the `G_imp_mcpy` implementation; the solid blue line is for the hybrid implementation (`H_imp`).

**Figure 3.** Execution time and percentage ratio of the different computational parts as a function of N for the `G_imp` (**a**,**b**) and `G_imp_mcpy` (**c**,**d**) implementations. The solid line represents the total computation time, the dashed line is the PRNG seed setup time, the dotted line is the AR algorithm time, the dashed-dotted line is the PRNG state update, and the dotted line with triangle markers is for the API functions. The green dotted line with star markers shows the memory copy time from the device to the host.

**Figure 4.** The execution time (**a**) and the GPU global memory usage (**b**) as a function of N using the host and device API implementations. In (**a**), the solid lines represent the total execution time, the dashed lines show the seed setup time, and the dashed-dotted lines show the PRNG state update time; the dotted lines display the time spent applying the AR algorithm.

**Figure 5.** Execution time for different PRNGs as a function of N using the host API (**a**) and the device API (**b**). Note that since the MT19937 implementation does not support the device API [65], it is omitted from (**b**).

**Figure 6.** Percentage ratio of the different computational parts as a function of N for various PRNGs: (**a**) MTGP32; (**b**) PHILOX4_32_10; (**c**) MT19937 (only supported by the host API); (**d**) XORWOW. The dashed lines represent the PRNG seed setup time, the dashed-dotted lines are the PRNG state update time, and the dotted lines with triangle markers are for the API functions.

**Figure 7.** Normalized execution time versus the GPU occupancy for different PRNGs using the device API implementation: (**a**) MRG32k3a; (**b**) PHILOX4_32_10; (**c**) XORWOW; (**d**) MTGP32. The time is normalized with respect to the fastest time that can be achieved for a given N on the RTX3090 card with the device API implementation. The various lines correspond to different numbers of generated PRNs with the uniform distribution. The GPU occupancy is the ratio of active warps on a streaming multiprocessor to the maximum possible warps that can run simultaneously.

**Figure 8.**GPU occupancy for different PRNGs to generate various numbers N of PRNs implemented in device API.

**Figure 9.**Execution time to generate PRNs as a function of N for the Beta distribution and MRG32k3a generator for different GPU cards.

**Figure 10.**Execution time to obtain PRNs with different distributions using the device API implementation and the MRG32k3a generator.

| | XORWOW | MTGP32 | MRG32k3a | PHILOX4_32_10 | MT19937 |
|---|---|---|---|---|---|
| Algorithm | Linear feedback shift register [66] | Twisted generalized feedback shift register [67] | Combined multiple recursive [68] | Counter-based random number generation [69] | Twisted generalized feedback shift register [70] |
| Period | $2^{192}-1$ | $2^{11214}$ | $2^{191}$ | $2^{128}$ | $2^{19937}-1$ |
| Sub-sequence length | $2^{67}$ | − | $2^{67}$ | $2^{64}$ | $2^{1000}$ |
| State size | 48 bytes | 4120 bytes | 48 bytes | 64 bytes | 2500 bytes |
| Parallelization method | Sequence splitting | Parameterization | Sequence splitting | Sequence splitting, parameterization | Sequence splitting |

| | GTX1080 | GTX1080Ti | RTX3080 | RTX3090 |
|---|---|---|---|---|
| SMs | 20 | 28 | 68 | 82 |
| CUDA cores | 2560 | 3584 | 8704 | 10,496 |
| Max clock rate | 1.73 GHz | 1.58 GHz | 1.8 GHz | 1.7 GHz |
| Global memory | 8 GB | 11 GB | 10 GB | 24 GB |
| Theoretical performance | 8.873 TFLOPS (FP32) | 11.34 TFLOPS (FP32) | 29.77 TFLOPS (FP32) | 35.58 TFLOPS (FP32) |
| Bandwidth | 320.3 GB/s | 484.4 GB/s | 760.3 GB/s | 936.2 GB/s |

**Table 3.**Default fiducial parameters of our experimental setup. The values of each of these parameters are varied to study the dependence of performance on these parameters.

| Parameter | Value |
|---|---|
| Implementation | `G_imp` |
| API | device API |
| GPU | RTX 3090 |
| PRNG | MRG32k3a |
| Distribution | Beta distribution |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Askar, T.; Shukirgaliyev, B.; Lukac, M.; Abdikamalov, E.
Evaluation of Pseudo-Random Number Generation on GPU Cards. *Computation* **2021**, *9*, 142.
https://doi.org/10.3390/computation9120142
