Hardware-Assisted Low-Latency NPU Virtualization Method for Multi-Sensor AI Systems
Abstract
1. Introduction
2. Simulation Environment and Methodology
2.1. NPU Virtualization Operation Flow
2.2. Experimental Setup
2.2.1. Neural Processing Unit (NPU) Architecture
2.2.2. SPM and DRAM Configuration
2.2.3. Deep-Learning Models Used
2.3. NPU Virtualization System
2.3.1. Hypervisor Design and Implementation
2.3.2. Data Prefetching Algorithms via Hardware Scheduler
Algorithm 1: Data Prefetching for Hardware-Assisted NPU Virtualization
3. Results
3.1. Memory Access Cycles Under Different Burst Sizes and SA Counts
3.1.1. Effect of the Hardware Scheduler by Changing Burst Size
3.1.2. Effect of Hardware Scheduler by Changing a Limited Number of SA Resources
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Parameter | Value |
---|---|
Data flow type | Output stationary |
Systolic height | 128 |
Systolic width | 128 |
Tile ifmap size (bytes) | 786,432 |
Tile filter size (bytes) | 786,432 |
Tile ofmap size (bytes) | 786,432 |
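As a quick sanity check on the NPU parameters in the table above, the following sketch derives the processing-element count and the tile footprint. The variable names are illustrative and are not taken from the simulator configuration. Treating one ifmap, one filter, and one ofmap tile as resident at the same time is an assumption made only for this example.

```python
# Values copied from the NPU configuration table above.
SYSTOLIC_HEIGHT = 128
SYSTOLIC_WIDTH = 128
TILE_IFMAP_BYTES = 786_432
TILE_FILTER_BYTES = 786_432
TILE_OFMAP_BYTES = 786_432

KIB = 1024

# One processing element (PE) per grid position of the systolic array.
num_pes = SYSTOLIC_HEIGHT * SYSTOLIC_WIDTH

# Combined footprint of one ifmap, one filter, and one ofmap tile.
# Assumption for illustration: all three tiles are held at the same time.
tile_set_bytes = TILE_IFMAP_BYTES + TILE_FILTER_BYTES + TILE_OFMAP_BYTES

print(f"Processing elements: {num_pes}")
print(f"Single tile: {TILE_IFMAP_BYTES // KIB} KiB")
print(f"One tile set: {tile_set_bytes // KIB} KiB")
```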
Parameter | Value |
---|---|
TLB associativity | 8 |
TLB entries | 2048 |
NPU clock speed (GHz) | 2 |
DRAM clock speed (GHz) | 2 |
SPM size (bytes) | 37,748,736 |
SPM latency | 1 |
Data block size (bytes) | 64 |
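The memory-side parameters above can be related to each other with a few lines of arithmetic. The sketch below is illustrative only: the names are hypothetical, the TLB set count assumes a conventional set-associative organization, and the tile size is the 786,432-byte value from the NPU configuration table.

```python
# Values copied from the configuration tables above.
SPM_BYTES = 37_748_736
DATA_BLOCK_BYTES = 64
TLB_ENTRIES = 2048
TLB_ASSOC = 8
NPU_CLOCK_GHZ = 2
TILE_BYTES = 786_432  # tile size from the NPU configuration table

MIB = 1024 * 1024

spm_mib = SPM_BYTES / MIB                    # SPM capacity in MiB
spm_blocks = SPM_BYTES // DATA_BLOCK_BYTES   # number of 64-byte data blocks
tiles_in_spm = SPM_BYTES // TILE_BYTES       # whole tiles that fit in the SPM

# Assumption: a standard set-associative TLB, so sets = entries / ways.
tlb_sets = TLB_ENTRIES // TLB_ASSOC

# Clock period in nanoseconds for a clock given in GHz.
npu_cycle_ns = 1.0 / NPU_CLOCK_GHZ

print(f"SPM capacity: {spm_mib:.0f} MiB ({spm_blocks} blocks)")
print(f"Tiles that fit in SPM: {tiles_in_spm}")
print(f"TLB sets: {tlb_sets}")
print(f"NPU cycle time: {npu_cycle_ns} ns")
```

Under these figures the SPM capacity is an exact multiple of the tile size, which may be convenient when reasoning about how many tiles can be staged at once.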
Parameter | Value |
---|---|
Channels | 8 |
Bus Width (bit) | 128 |
Bank Groups | 4 |
Banks per Group | 4 |
Rows per Bank | 32,768 |
Columns per Row | 64 |
Device Width (bit) | 128 |
Burst Length (BL) | 4 |
tCK (ns) | 1 |
CL (CAS Latency) | 14 |
tRCD (Row-to-Column Delay) | 14 |
tRP (Row Precharge Time) | 14 |
tRAS (Row Active Time) | 34 |
tRFC (Refresh Cycle Time) | 260 |
tWR (Write Recovery Time) | 16 |
VDD (V) | 1.2 |
IDD0 (Active Power) | 65 mA |
IDD4R (Read Power) | 390 mA |
Channel Size (MB) | 1024 |
Row Buffer Policy | Open Page |
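A first-order reading of the DRAM parameters above is sketched here. Several points are assumptions rather than statements from the source: the timing values (CL, tRCD, tRP) are taken to be in clock cycles, one column is taken to be one device-width word, and the latency model ignores queuing, refresh, and data-transfer time. All names are hypothetical.

```python
# Values copied from the DRAM configuration table above.
CHANNELS = 8
BUS_WIDTH_BITS = 128
COLUMNS_PER_ROW = 64
DEVICE_WIDTH_BITS = 128
BURST_LENGTH = 4
T_CK_NS = 1
CL = 14      # assumed to be in clock cycles
T_RCD = 14   # assumed to be in clock cycles
T_RP = 14    # assumed to be in clock cycles
CHANNEL_SIZE_MB = 1024

# Bytes moved by one burst on one channel.
bytes_per_burst = BUS_WIDTH_BITS * BURST_LENGTH // 8

# Row (page) size, assuming one column holds one device-width word.
row_bytes = COLUMNS_PER_ROW * DEVICE_WIDTH_BITS // 8

# Total capacity across all channels.
total_capacity_mb = CHANNELS * CHANNEL_SIZE_MB

# Simplified open-page latencies:
#   row hit      -> column access only (CL)
#   row conflict -> precharge + activate + column access (tRP + tRCD + CL)
row_hit_cycles = CL
row_conflict_cycles = T_RP + T_RCD + CL

print(f"Bytes per burst: {bytes_per_burst}")
print(f"Row size: {row_bytes} bytes")
print(f"Total capacity: {total_capacity_mb} MB")
print(f"Row hit: {row_hit_cycles} cycles ({row_hit_cycles * T_CK_NS} ns)")
print(f"Row conflict: {row_conflict_cycles} cycles ({row_conflict_cycles * T_CK_NS} ns)")
```

With a 128-bit bus and a burst length of 4, one burst carries 64 bytes, which matches the 64-byte data block size listed in the SPM configuration.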
Model | Before (cycles) | After (cycles) | Reduction (%) |
---|---|---|---|
Alexnet | 253,764 | 221,712 | 12.63% |
Resnet-50 | 73,282 | 62,847 | 14.24% |
NCF | 93,940 | 59,763 | 36.38% |
Yolo-tiny | 25,996 | 21,558 | 17.07% |
DLRM | 18,656 | 4324 | 76.82% |
Model | Before (cycles) | After (cycles) | Reduction (%) |
---|---|---|---|
Alexnet | 239,652 | 206,201 | 13.96% |
Resnet-50 | 71,220 | 60,198 | 15.48% |
NCF | 91,020 | 57,837 | 36.46% |
Yolo-tiny | 24,046 | 20,211 | 15.95% |
DLRM | 16,521 | 2762 | 83.28% |
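The Reduction column in the table directly above is consistent with the relative difference (before − after) / before × 100. The sketch below recomputes it from the listed values, which are assumed to be memory access cycles as the Section 3.1 heading suggests. The function name `reduction_percent` is introduced here for illustration.

```python
def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# (before, after) pairs copied from the table directly above.
rows = {
    "Alexnet": (239_652, 206_201),
    "Resnet-50": (71_220, 60_198),
    "NCF": (91_020, 57_837),
    "Yolo-tiny": (24_046, 20_211),
    "DLRM": (16_521, 2_762),
}

for model, (before, after) in rows.items():
    saved = before - after
    print(f"{model:10s} saved {saved:6d}  reduction {reduction_percent(before, after):6.2f}%")
```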
Model | 30,000 SA Before | 30,000 SA After | Reduction (%) |
---|---|---|---|
Alexnet | 253,764 | 221,712 | 12.63% |
Resnet-50 | 73,282 | 62,847 | 14.24% |
NCF | 93,940 | 59,763 | 36.38% |
Yolo-tiny | 25,996 | 21,558 | 17.07% |
DLRM | 18,656 | 4324 | 76.82% |
Model | 40,000 SA Before | 40,000 SA After | Reduction (%) |
---|---|---|---|
Alexnet | 195,700 | 179,400 | 8.33% |
Resnet-50 | 40,473 | 35,650 | 11.92% |
NCF | 86,524 | 57,811 | 33.19% |
Yolo-tiny | 22,326 | 19,802 | 11.31% |
DLRM | 14,254 | 3816 | 73.22% |
Model | 50,000 SA Before | 50,000 SA After | Reduction (%) |
---|---|---|---|
Alexnet | 145,632 | 141,027 | 3.16% |
Resnet-50 | 27,889 | 25,620 | 8.14% |
NCF | 82,242 | 56,864 | 30.86% |
Yolo-tiny | 19,124 | 17,256 | 9.77% |
DLRM | 11,412 | 3224 | 71.7% |
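To see how the benefit changes with the SA count, the three tables above can be combined and the reduction recomputed per model. This is a minimal sketch over the tabulated values only. The dictionary layout and names are hypothetical, and recomputed percentages may differ from the printed column in the last digit, since the printed values may have been truncated.

```python
def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# model -> {SA count: (before, after)}, copied from the three tables above.
cycles = {
    "Alexnet": {30_000: (253_764, 221_712), 40_000: (195_700, 179_400), 50_000: (145_632, 141_027)},
    "Resnet-50": {30_000: (73_282, 62_847), 40_000: (40_473, 35_650), 50_000: (27_889, 25_620)},
    "NCF": {30_000: (93_940, 59_763), 40_000: (86_524, 57_811), 50_000: (82_242, 56_864)},
    "Yolo-tiny": {30_000: (25_996, 21_558), 40_000: (22_326, 19_802), 50_000: (19_124, 17_256)},
    "DLRM": {30_000: (18_656, 4_324), 40_000: (14_254, 3_816), 50_000: (11_412, 3_224)},
}

# Reduction per model, ordered by increasing SA count.
trend = {
    model: [reduction_percent(*by_sa[sa]) for sa in sorted(by_sa)]
    for model, by_sa in cycles.items()
}

for model, values in trend.items():
    formatted = " -> ".join(f"{v:.2f}%" for v in values)
    print(f"{model:10s} {formatted}")
```

For every model in these tables the recomputed reduction shrinks as the SA count grows from 30,000 to 50,000, while the ordering between models stays the same, with DLRM and NCF showing the largest reductions.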
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jean, J.-H.; Kim, D.-S. Hardware-Assisted Low-Latency NPU Virtualization Method for Multi-Sensor AI Systems. Sensors 2024, 24, 8012. https://doi.org/10.3390/s24248012