Evaluation of GPU-Accelerated Edge Platforms for Stochastic Simulations: Performance and Energy Efficiency Analysis
Abstract
1. Introduction
- Extensive Benchmarking of SSA on Modern Edge GPUs: The SSA algorithm is implemented and optimized on the latest Jetson devices, including the Volta-based Xavier NX [12,13] and the Ampere-based Orin Nano and Orin NX [14,15]. Unlike prior studies that evaluated the older-generation Jetson Nano [5,6], our work extends the analysis to newer architectures, quantifying the impact of increased CUDA cores, improved memory bandwidth, and higher GPU clock speeds on SSA performance.
- Energy and Cost Efficiency Analysis: Beyond measuring execution time, energy efficiency (ms/W) and cost-effectiveness (ms/USD) are evaluated in detail. Instead of relying on hardware-centric metrics like GFLOPS/W (giga floating point operations per second per Watt), these practical measures are adopted to capture algorithm-level efficiency and deployment relevance. This extends prior studies by incorporating realistic and application-driven metrics, enabling a more holistic assessment of GPU-accelerated edge devices for compute-intensive scientific workloads.
- Comparison of Edge GPUs with High-End Desktop GPUs: The performance of edge GPUs is evaluated in a broader context by comparing them with the RTX 3080, a high-performance desktop GPU. This allows us to assess the trade-offs between computational efficiency, power consumption, and cost-effectiveness, highlighting both the advantages and inherent limitations of edge GPUs for scientific computing workloads.
2. Background
2.1. NVIDIA Jetson Edge Devices
2.2. Stochastic Simulation Algorithm
- Calculate Propensities: Compute the propensity functions $a_i(\mathbf{x})$ for all $M$ reactions and the total propensity $a_0(\mathbf{x})$, defined as $a_0(\mathbf{x}) = \sum_{i=1}^{M} a_i(\mathbf{x})$.
- Generate Reaction Time: Draw a random number $r_1$ from a uniform distribution $U(0,1)$ and compute the time interval until the next reaction using $\tau = \frac{1}{a_0(\mathbf{x})} \ln\!\left(\frac{1}{r_1}\right)$.
- Select Reaction: Draw a second random number $r_2$ from $U(0,1)$ and determine the index $j$ of the next reaction such that $\sum_{i=1}^{j-1} a_i(\mathbf{x}) < r_2\, a_0(\mathbf{x}) \le \sum_{i=1}^{j} a_i(\mathbf{x})$.
- Update System State: Update the state vector $\mathbf{x}$ and the simulation time $t$ according to $\mathbf{x} \leftarrow \mathbf{x} + \boldsymbol{\nu}_j$ and $t \leftarrow t + \tau$, where $\boldsymbol{\nu}_j$ is the stoichiometric change vector of reaction $j$.
- Repeat or Terminate: If the simulation time $t$ exceeds the final time $T$, the simulation terminates; otherwise, the algorithm returns to Step 1 to continue the simulation. A minimal sequential sketch of this loop is given after this list.
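The five steps above constitute Gillespie's direct method. For concreteness, the sketch below runs one trajectory sequentially; it assumes a hypothetical three-reaction network with illustrative rate constants and cyclic stoichiometry (S1→S2→S3→S1), and none of its identifiers are taken from the paper's code.

```cuda
// Minimal sequential sketch of Gillespie's direct method (one trajectory).
// The three-reaction network, rate constants, and stoichiometry are
// illustrative placeholders, not the model used in the paper.
#include <cmath>
#include <random>

void ssa_trajectory(double c1, double c2, double c3,
                    long &s1, long &s2, long &s3, double t_final)
{
    std::mt19937_64 gen(12345);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    double t = 0.0;

    while (t < t_final) {
        // Step 1: propensities a_i(x) and total propensity a0
        double a[3] = { c1 * s1, c2 * s2, c3 * s3 };
        double a0 = a[0] + a[1] + a[2];
        if (a0 <= 0.0) break;                        // no reaction can fire

        // Step 2: time to next reaction, tau = (1/a0) * ln(1/r1)
        double tau = -std::log(1.0 - U(gen)) / a0;   // 1 - r avoids log(0)

        // Step 3: select reaction j whose cumulative propensity first
        // exceeds r2 * a0
        double thresh = U(gen) * a0, sum = 0.0;
        int j = 0;
        while (j < 2 && (sum += a[j]) <= thresh) ++j;

        // Step 4: apply the stoichiometric change of reaction j
        if (j == 0)      { --s1; ++s2; }
        else if (j == 1) { --s2; ++s3; }
        else             { --s3; ++s1; }

        // Step 5: advance the simulation clock
        t += tau;
    }
}
```

Section 4 maps exactly this loop onto the GPU by assigning one such trajectory to each CUDA thread.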
2.3. Theoretical Performance Model of GPU Architectures
2.3.1. Theoretical Compute Performance
- Xavier NX: ≈ 844.8 GFLOPS
- Orin Nano: ≈ 1269.8 GFLOPS
- Orin NX: ≈ 1556.5 GFLOPS
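These figures follow from the standard peak-throughput estimate for CUDA GPUs, assuming two FP32 operations per core per cycle (one fused multiply-add) and the core counts and GPU clocks listed in Table 1; the sketch below reproduces the three values.

```latex
\mathrm{Peak}_{\mathrm{FP32}} = N_{\mathrm{cores}} \times 2\,\tfrac{\mathrm{FLOPs}}{\mathrm{cycle}} \times f_{\mathrm{GPU}}
\quad\Longrightarrow\quad
\begin{aligned}
\text{Xavier NX:}\;& 384  \times 2 \times 1.10~\mathrm{GHz} \approx 844.8~\mathrm{GFLOPS},\\
\text{Orin Nano:}\;& 1024 \times 2 \times 0.62~\mathrm{GHz} \approx 1269.8~\mathrm{GFLOPS},\\
\text{Orin NX:}\;  & 1024 \times 2 \times 0.76~\mathrm{GHz} \approx 1556.5~\mathrm{GFLOPS}.
\end{aligned}
```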
2.3.2. Performance Implications
3. Related Work
3.1. GPU-Accelerated Edge Devices for Scientific Computing
3.2. Parallel Programming Models and Benchmarks for Edge Systems
4. Stochastic Simulation Algorithm on GPUs
- The system state, including species populations, reaction rate constants, and random number generator states, is initialized (line 10).
- Each thread independently computes reaction propensities based on the current system state (lines 19–21).
- A reaction is selected by drawing a uniform random number and scanning the cumulative propensities until the running sum exceeds the threshold; the loop stops one position past the target, and the subsequent decrement (rxn--) yields the correct channel index. Species populations are updated accordingly (lines 24–28).
- The simulation time is updated based on a randomly sampled time step (lines 31–32).
- The simulation continues until the predefined final time is reached (line 35).
Listing 1. Simplified CUDA kernel for SSA (conceptual excerpt; full code in the repository).
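As a complement to Listing 1 (whose full code lives in the repository), the following self-contained sketch illustrates the per-thread structure described above. It assumes one cuRAND Philox state per thread and the same hypothetical three-reaction network used earlier; identifiers such as ssa_kernel, rng, and c1..c3 are illustrative, and the line numbers cited in the bullets refer to Listing 1, not to this sketch.

```cuda
#include <curand_kernel.h>

// Illustrative per-thread SSA kernel: each thread advances one independent
// trajectory of a hypothetical three-reaction network to time t_final.
__global__ void ssa_kernel(curandStatePhilox4_32_10_t *rng,
                           int *x1, int *x2, int *x3,
                           float c1, float c2, float c3,
                           float t_final, int n_traj)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_traj) return;

    curandStatePhilox4_32_10_t state = rng[tid];   // per-thread RNG state
    int s1 = x1[tid], s2 = x2[tid], s3 = x3[tid];  // species populations
    float t = 0.0f;

    while (t < t_final) {
        // Propensities and total propensity
        float a[3] = { c1 * s1, c2 * s2, c3 * s3 };
        float a0 = a[0] + a[1] + a[2];
        if (a0 <= 0.0f) break;

        // Reaction time and reaction selection (direct method)
        float tau = -logf(curand_uniform(&state)) / a0;
        float thresh = curand_uniform(&state) * a0;
        int rxn = 0;
        float sum = 0.0f;
        while (rxn < 3 && sum <= thresh) { sum += a[rxn]; rxn++; }
        rxn--;  // loop stops one position past the target channel

        // State update (illustrative stoichiometry: S1 -> S2 -> S3 -> S1)
        if      (rxn == 0) { s1--; s2++; }
        else if (rxn == 1) { s2--; s3++; }
        else               { s3--; s1++; }

        t += tau;  // advance simulation time
    }

    x1[tid] = s1; x2[tid] = s2; x3[tid] = s3;
    rng[tid] = state;  // write back RNG state
}
```

A launch such as ssa_kernel<<<(n_traj + 255) / 256, 256>>>(...) would then realize the one-trajectory-per-thread mapping, with the RNG states initialized beforehand (for example with curand_init in a separate kernel).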
5. Experimental Results and Evaluation
5.1. Execution Time Analysis
5.2. Cost-Performance and Power-Performance Evaluation
5.3. Discussion on Performance Trends
- Impact of GPU Architecture on Execution Efficiency: The performance differences among Xavier NX, Orin Nano, and Orin NX highlight how architectural features—such as core count, clock frequency, and memory bandwidth—affect SSA execution efficiency. The transition from the Volta to the Ampere architecture delivers substantial gains in computational throughput. For instance, at the largest problem size ($2^{20}$ trajectories), Orin NX achieves a 1.58× speedup over Xavier NX and a 1.20× speedup over Orin Nano. While memory bandwidth remains a contributing factor—59.7 GB/s for Xavier NX versus 102.4 GB/s for Orin NX—the higher execution efficiency is primarily attributable to increased CUDA core counts and improved compute performance. Notably, unlike earlier studies that required an MPI-based multi-device setup to achieve reasonable performance on the Jetson Nano [5], the Orin NX achieves significantly higher throughput as a standalone device. This shift underscores the practical implications of architectural advancements in modern edge GPUs for scientific computing.
- Cost–Performance Trade-offs: The cost-performance results (Figure 3) reveal that the Orin NX provides the most consistently favorable execution time per dollar across all problem sizes, with values ranging from 7.5 ms/USD at $2^{14}$ to 105.3 ms/USD at $2^{20}$. Although the Orin Nano is more affordable ($299), its cost-efficiency degrades significantly for larger workloads—reaching 169.5 ms/USD at $2^{20}$. Interestingly, it achieves a minimum of 32.9 ms/USD at $2^{15}$, temporarily outperforming Xavier NX at smaller scales. However, its instability at higher problem sizes limits its suitability for demanding simulations. The Xavier NX, despite sharing the Orin NX's $399 price point, consistently exhibits poorer cost-performance, particularly at large problem sizes (166.4 ms/USD at $2^{20}$), making it a less attractive option for computationally intensive scientific workloads.
- Power Efficiency and Suitability for Edge Applications: The power-performance results (Figure 4) show that the Orin NX consistently delivers the most favorable execution time per watt (ms/W), starting at 150.2 ms/W for $2^{14}$ trajectories and reaching 2102.7 ms/W at $2^{20}$. This demonstrates its strong suitability for power-constrained scientific environments. The Orin Nano also performs reasonably well at small to mid-range workloads, achieving 654.9 ms/W at $2^{15}$, but its efficiency deteriorates significantly as the workload increases, with values rising to 3379.2 ms/W at $2^{20}$. While minor variations exist—such as Xavier NX briefly outperforming Orin Nano in mid-range power efficiency—these do not alter the overall conclusion: the Orin NX offers the best balance of performance and energy efficiency among Jetson devices. Prior work requiring multi-device setups can now be matched or exceeded by a single Orin NX, simplifying edge deployment for scientific workloads. A worked example of how the ms/USD and ms/W figures are obtained follows this list.
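As a sketch, the figures above can be reproduced by dividing the execution times in Table 3 by the device price and by its TDP from Table 1 (treating TDP as the power denominator is an assumption, but it reproduces the reported values). For the Orin NX at $2^{20}$ trajectories:

```latex
\frac{T_{\mathrm{exec}}}{\mathrm{price}} = \frac{42{,}053.7~\mathrm{ms}}{399~\mathrm{USD}} \approx 105.4~\mathrm{ms/USD},
\qquad
\frac{T_{\mathrm{exec}}}{\mathrm{TDP}} = \frac{42{,}053.7~\mathrm{ms}}{20~\mathrm{W}} \approx 2102.7~\mathrm{ms/W}.
```

The small difference from the 105.3 ms/USD quoted above stems from rounding of the underlying execution time.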
Throughput in Operations per Second (GFLOPS ≡ FP-Only GOPS)
- Propensity calculation (Step 1): three multiplications;
- Total propensity and reaction selection (Steps 2–3): two additions and approximately four fused operations (e.g., comparisons and accumulations);
- Time-step computation (Step 4): one logarithm, one division, one negation;
- State update (Step 5): three additions;
- Random number generation: two calls to curand_uniform, each estimated at two floating point (FP) operations. Summing these contributions yields the 19 FLOPs per iteration used throughout the evaluation; the resulting effective-throughput estimate is sketched below.
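Combining this 19-FLOP-per-iteration estimate with the fixed iteration count and the trajectory count from Table 2 gives the effective throughput figures reported in Table 4. As a worked check for the Xavier NX at $2^{20}$ trajectories (66,402.5 ms):

```latex
\mathrm{GFLOPS}_{\mathrm{eff}}
  = \frac{N_{\mathrm{traj}} \times N_{\mathrm{iter}} \times 19~\mathrm{FLOPs}}{T_{\mathrm{exec}} \times 10^{9}}
  = \frac{2^{20} \times 2{,}381{,}417 \times 19}{66.4025~\mathrm{s} \times 10^{9}}
  \approx 714.5~\mathrm{GFLOPS}.
```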
5.4. Limitations and Generality
6. Conclusions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646.
- Varghese, B.; Wang, N.; Barbhuiya, S.; Kilpatrick, P.; Nikolopoulos, D.S. Challenges and Opportunities in Edge Computing. In Proceedings of the 2016 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, USA, 18–20 November 2016; pp. 20–26.
- Faqir-Rhazoui, Y.; García, C. SYCL in the Edge: Performance and Energy Evaluation for Heterogeneous Acceleration. J. Supercomput. 2024, 80, 14203–14223.
- Antonini, M.; Vu, T.H.; Min, C.; Montanari, A.; Mathur, A.; Kawsar, F. Resource Characterisation of Personal-Scale Sensing Models on Edge Accelerators. In Proceedings of the First International Workshop on Challenges in Artificial Intelligence and Machine Learning for Internet of Things, New York, NY, USA, 10–13 November 2019; pp. 49–55.
- Lim, S.; Kang, P. Implementing Scientific Simulations on GPU-accelerated Edge Devices. In Proceedings of the 34th International Conference on Information Networking (ICOIN), Barcelona, Spain, 7–10 January 2020; pp. 756–760.
- Kang, P.; Lim, S. A Taste of Scientific Computing on the GPU-Accelerated Edge Device. IEEE Access 2020, 8, 208337–208347.
- Gillespie, D.T. Exact Stochastic Simulation of Coupled Chemical Reactions. J. Phys. Chem. 1977, 81, 2340–2361.
- Chen, J.; Ran, X. Deep Learning with Edge Computing: A Review. Proc. IEEE 2019, 107, 1655–1674.
- Li, E.; Zeng, L.; Zhou, Z.; Chen, X. Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing. IEEE Trans. Wirel. Commun. 2020, 19, 447–457.
- Bailey, D.; Barszcz, E.; Barton, J.; Browning, D.; Carter, R.; Dagum, L.; Fatoohi, R.; Frederickson, P.; Lasinski, T.; Schreiber, R.; et al. The NAS Parallel Benchmarks. Int. J. High Perform. Comput. Appl. 1991, 5, 63–73.
- Nickolls, J.; Buck, I.; Garland, M.; Skadron, K. Scalable Parallel Programming with CUDA. Queue 2008, 6, 40–53.
- NVIDIA. NVIDIA Tesla V100 GPU Architecture v1.0. 2017. Available online: http://www.nvidia.com/content/gated-pdfs/Volta-Architecture-Whitepaper-v1.0.pdf (accessed on 15 March 2025).
- NVIDIA. Jetson Xavier NX Series. 2020. Available online: https://developer.nvidia.com/embedded/jetson-xavier-nx (accessed on 15 March 2025).
- NVIDIA. NVIDIA Ampere Architecture In-Depth. 2020. Available online: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth (accessed on 15 March 2025).
- NVIDIA. Jetson Orin Series. 2022. Available online: https://developer.nvidia.com/embedded/jetson-orin (accessed on 15 March 2025).
- Gillespie, D.T. A General Method for Numerically Simulating the Stochastic Time Evolution of Coupled Chemical Reactions. J. Comput. Phys. 1976, 22, 403–434.
- Komarov, I.; D’Souza, R.M. Accelerating the Gillespie Exact Stochastic Simulation Algorithm using Hybrid Parallel Execution on Graphics Processing Units. PLoS ONE 2012, 7, e46693.
- Chen, Y.; Xie, Y.; Song, L.; Chen, F.; Tang, T. A Survey of Accelerator Architectures for Deep Neural Networks. Engineering 2020, 6, 264–274.
- Chang, Z.; Liu, S.; Xiong, X.; Cai, Z.; Tu, G. A Survey of Recent Advances in Edge-Computing-Powered Artificial Intelligence of Things. IEEE Internet Things J. 2021, 8, 13849–13875.
- Hua, H.; Li, Y.; Wang, T.; Dong, N.; Li, W.; Cao, J. Edge Computing with Artificial Intelligence: A Machine Learning Perspective. ACM Comput. Surv. 2023, 55, 184.
- Kong, X.; Wu, Y.; Wang, H.; Xia, F. Edge Computing for Internet of Everything: A Survey. IEEE Internet Things J. 2022, 9, 23472–23485.
- Varghese, B.; Wang, N.; Bermbach, D.; Hong, C.H.; Lara, E.D.; Shi, W.; Stewart, C. A Survey on Edge Performance Benchmarking. ACM Comput. Surv. 2021, 54, 66.
- Kang, P. Programming for High-Performance Computing on Edge Accelerators. Mathematics 2023, 11, 1055.
- Stone, J.E.; Gohara, D.; Shi, G. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Comput. Sci. Eng. 2010, 12, 66–73.
- Cecilia, J.M.; Cano, J.C.; Morales-Garcia, J.; Llanes, A.; Imbernon, B. Evaluation of Clustering Algorithms on GPU-Based Edge Computing Platforms. Sensors 2020, 20, 6335.
- Slepoy, A.; Thompson, A.P.; Plimpton, S.J. A Constant-time Kinetic Monte Carlo Algorithm for Simulation of Large Biochemical Reaction Networks. J. Chem. Phys. 2008, 128, 205101.
- Dematté, L.; Prandi, D. GPU Computing for Systems Biology. Briefings Bioinform. 2010, 11, 323–333.
- Google. Coral Dev Board. 2019. Available online: https://coral.ai/products/dev-board (accessed on 15 March 2025).
- Intel. Intel® Movidius™ Neural Compute Stick. Available online: https://www.intel.com/content/www/us/en/products/sku/125743/intel-movidius-neural-compute-stick/specifications.html (accessed on 15 March 2025).
- Intel. Intel® Distribution of OpenVINO™ Toolkit. Available online: https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html (accessed on 15 March 2025).
- AMD. AMD Ryzen™ Embedded Family. Available online: https://www.amd.com/en/products/embedded/ryzen.html (accessed on 15 March 2025).
- Nam, D. A Performance Comparison of Parallel Programming Models on Edge Devices. IEMEK J. Embed. Syst. Appl. 2023, 18, 165–172.
- Dagum, L.; Menon, R. OpenMP: An Industry Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng. 1998, 5, 46–55.
- OpenACC. Available online: http://www.openacc.org (accessed on 15 March 2025).
- Alpay, A.; Soproni, B.; Wünsche, H.; Heuveline, V. Exploring the Possibility of a hipSYCL-based Implementation of oneAPI. In Proceedings of the 10th International Workshop on OpenCL, Bristol, UK, 10–12 May 2022.
- Hoffmann, R.B.; Griebler, D.; da Rosa Righi, R.; Fernandes, L.G. Benchmarking Parallel Programming for Single-Board Computers. Future Gener. Comput. Syst. 2024, 161, 119–134.
- Pheatt, C. Intel® Threading Building Blocks. J. Comput. Sci. Coll. 2008, 23, 298.
- Choi, J.; You, H.; Kim, C.; Young Yeom, H.; Kim, Y. Comparing Unified, Pinned, and Host/Device Memory Allocations for Memory-intensive Workloads on Tegra SoC. Concurr. Comput. Pract. Exp. 2021, 33, e6018.
- Seo, S.; Jo, G.; Lee, J. Performance Characterization of the NAS Parallel Benchmarks in OpenCL. In Proceedings of the 2011 IEEE International Symposium on Workload Characterization (IISWC), Austin, TX, USA, 6–8 November 2011; pp. 137–148.
- Araujo, G.; Griebler, D.; Rockenbach, D.A.; Danelutto, M.; Fernandes, L.G. NAS Parallel Benchmarks with CUDA and Beyond. Softw. Pract. Exp. 2021, 53, 53–80.
- NVIDIA. cuRAND. Available online: https://docs.nvidia.com/cuda/curand/index.html (accessed on 15 March 2025).
- NVIDIA. CUDA Toolkit Documentation. 2019. Available online: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#zero-copy (accessed on 15 March 2025).
- Cao, Y.; Gillespie, D.T.; Petzold, L.R. The Slow-Scale Stochastic Simulation Algorithm. J. Chem. Phys. 2005, 122, 014116.
- Erban, R.; Chapman, S.J. Reactive Boundary Conditions for Stochastic Simulations of Reaction–Diffusion Processes. Phys. Biol. 2007, 4, 16.
- Macal, C.M.; North, M.J. Tutorial on Agent-Based Modelling and Simulation. J. Simul. 2010, 4, 151–162.
- Railsback, S.F.; Grimm, V. Agent-Based and Individual-Based Modeling: A Practical Introduction; Princeton University Press: Princeton, NJ, USA, 2011.
| | Xavier NX | Orin Nano | Orin NX | RTX 3080 |
|---|---|---|---|---|
| GPU Architecture | Volta | Ampere | Ampere | Ampere |
| GPU Parallelism | 6 SMs, 384 CUDA cores, 48 Tensor Cores @ 1.10 GHz, 21 TOPS | 8 SMs, 1024 CUDA cores, 32 Tensor Cores @ 0.62 GHz, 40 TOPS | 8 SMs, 1024 CUDA cores, 32 Tensor Cores @ 0.76 GHz, 70 TOPS | 68 SMs, 8704 CUDA cores @ 1.44 GHz, 238 TOPS |
| CPU | ARM v8 @ 1.9 GHz | ARM Cortex-A78AE @ 1.5 GHz | ARM Cortex-A78AE @ 2.0 GHz | AMD Ryzen 7 5700X @ 3.4 GHz |
| DLA | 2x NVDLA @ 1.1 GHz | - | 1x NVDLA v2.0 @ 0.61 GHz | - |
| Memory | 8 GB LPDDR4x, 59.7 GB/s | 8 GB LPDDR5, 68 GB/s | 8 GB LPDDR5, 102.4 GB/s | 12 GB GDDR6X, 760 GB/s |
| Storage | Samsung PM991a SSD, 128 GB | Samsung PM991a SSD, 128 GB | Samsung PM991a SSD, 128 GB | Samsung 980 PRO SSD, 1 TB |
| Execution Model | CUDA 11.4 on JetPack 5.1.3 | CUDA 12.2 on JetPack 6.0.0 | CUDA 12.2 on JetPack 6.0.0 | CUDA 12.2 |
| OS | Ubuntu 20.04 | Ubuntu 22.04 | Ubuntu 22.04 | Ubuntu 22.04 LTS |
| TDP | 15 W | 15 W | 20 W | 350 W |
| Price (USD) | 399 | 299 | 399 | 799 |
| (A) Simulation Parameters | |
|---|---|
| Reaction network | |
| Reaction rates | |
| Initial species counts | |
| Simulation duration | 1000 time units |
| (B) Execution Parameters | |
| Number of trajectories | $2^{14}$ to $2^{20}$ |
| Iterations per trajectory † | 2,381,417 (average, fixed) |
| FLOPs per iteration | 19 (see Section “Throughput in Operations per Second (GFLOPS ≡ FP-Only GOPS)”) |
| Measurement method | On-device timing (cudaEvent_t); see the sketch after this table |
| Kernel type | CUDA global kernel |
| Evaluation metrics | ms/W, ms/USD, est. throughput (GFLOPS ≡ FP-only GOPS) |
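The on-device timing entry refers to CUDA event timers placed around the kernel launch. The following minimal sketch shows that measurement pattern; the dummy kernel and launch configuration are placeholders, not the repository's actual SSA kernel.

```cuda
// Minimal sketch of on-device kernel timing with cudaEvent_t.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() { /* placeholder for the SSA kernel */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_kernel<<<64, 256>>>();          // placeholder launch configuration
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);           // wait for the kernel to finish
    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);
    printf("kernel time: %.3f ms\n", elapsed_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```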
| Size ($n$, for $2^n$ trajectories) | Xavier NX (ms) | Orin Nano (ms) | Orin NX (ms) | RTX 3080 (ms) |
|---|---|---|---|---|
| 14 | 6730.1 | 6080.1 | 3004.5 | 1043.3 |
| 15 | 21,570.5 | 9823.6 | 4840.4 | 1742.0 |
| 16 | 22,958.0 | 28,701.1 | 24,280.5 | 3059.0 |
| 17 | 23,673.1 | 33,826.4 | 21,385.2 | 3989.9 |
| 18 | 21,971.0 | 41,177.1 | 27,332.9 | 4218.3 |
| 19 | 40,718.1 | 35,946.9 | 34,381.2 | 13,619.6 |
| 20 | 66,402.5 | 50,687.9 | 42,053.7 | 17,148.9 |
| Device | Execution Time (ms) | Effective GFLOPS | Utilization (%) |
|---|---|---|---|
| Xavier NX | 66,402.5 | 714.50 | 84.6 |
| Orin Nano | 50,687.9 | 936.02 | 73.7 |
| Orin NX | 42,053.7 | 1128.20 | 72.5 |
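The utilization column is simply the ratio of the measured effective throughput to the theoretical FP32 peak estimated in Section 2.3.1; as a back-of-the-envelope check for the Xavier NX:

```latex
\mathrm{Utilization} = \frac{\mathrm{GFLOPS}_{\mathrm{eff}}}{\mathrm{Peak}_{\mathrm{FP32}}}
  = \frac{714.50}{844.8} \approx 84.6\%.
```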
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).