Article

Performance and Efficiency Gains of NPU-Based Servers over GPUs for AI Model Inference †

Department of Industrial and Information Systems Engineering, Soongsil University, Seoul 06978, Republic of Korea
* Author to whom correspondence should be addressed.
† This paper is a substantially extended version of a preliminary abstract presented at the 19th International Conference on Innovative Computing, Information and Control (ICICIC 2025), Kitakyushu, Japan, 29 August 2025.
Systems 2025, 13(9), 797; https://doi.org/10.3390/systems13090797
Submission received: 29 July 2025 / Revised: 3 September 2025 / Accepted: 5 September 2025 / Published: 11 September 2025
(This article belongs to the Special Issue Data-Driven Analysis of Industrial Systems Using AI)

Abstract

The exponential growth of AI applications has intensified the demand for inference hardware capable of delivering low-latency, high-throughput, and energy-efficient performance. This study presents a systematic, empirical comparison of GPU- and NPU-based server platforms across key AI inference domains: text-to-text, text-to-image, multimodal understanding, and object detection. We configure representative models (Llama-family models for text generation, Stable Diffusion variants for image synthesis, LLaVA-NeXT for multimodal tasks, and the YOLO11 series for object detection) on a dual NVIDIA A100 GPU server and an eight-chip RBLN-CA12 NPU server. Performance metrics, including latency, throughput, power consumption, and energy efficiency, are measured under realistic workloads. Results demonstrate that NPUs match or exceed GPU throughput in many inference scenarios while consuming 35–70% less power. Moreover, optimization with the vLLM library on NPUs nearly doubles tokens per second and yields a 92% increase in power efficiency. Our findings validate the potential of NPU-based inference architectures to reduce operational costs and energy footprints, offering a viable alternative to the prevailing GPU-dominated paradigm.
Keywords: AI inference; Neural Processing Unit (NPU); Graphics Processing Unit (GPU); performance benchmarking; energy efficiency; heterogeneous computing; vLLM optimization
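
The abstract reports throughput in tokens per second and power efficiency gains from vLLM-based serving. As a rough illustration of how such a throughput figure can be obtained, the sketch below times vLLM's standard offline generation API and computes tokens per second. The model name, prompt set, and sampling settings are illustrative placeholders, not the configuration benchmarked in the paper, and running on the RBLN-CA12 NPU server would additionally require a vendor-supplied vLLM backend.

```python
# Illustrative sketch only: measures offline tokens-per-second with vLLM's
# standard Python API. Model, prompts, and sampling settings are placeholders,
# not the paper's benchmark configuration.
import time
from vllm import LLM, SamplingParams

prompts = [
    "Explain the difference between GPUs and NPUs in one paragraph.",
    "Summarize the benefits of continuous batching for LLM serving.",
] * 16  # small synthetic batch to exercise batched decoding

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# Any HuggingFace-hosted model supported by vLLM can be substituted here.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Generated {generated} tokens in {elapsed:.1f} s "
      f"-> {generated / elapsed:.1f} tokens/s")

# Dividing tokens/s by average board power (e.g., sampled via NVML on the GPU
# or the vendor's monitoring tool on the NPU) gives an energy-efficiency
# figure comparable to the tokens-per-watt metric discussed in the paper.
```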
