Search Results (24)

Search Parameters:
Keywords = heterogeneous CPU-GPU acceleration

12 pages, 1880 KB  
Article
Feasibility of Implementing Motion-Compensated Magnetic Resonance Imaging Reconstruction on Graphics Processing Units Using Compute Unified Device Architecture
by Mohamed Aziz Zeroual, Natalia Dudysheva, Vincent Gras, Franck Mauconduit, Karyna Isaieva, Pierre-André Vuissoz and Freddy Odille
Appl. Sci. 2025, 15(11), 5840; https://doi.org/10.3390/app15115840 - 22 May 2025
Viewed by 900
Abstract
Motion correction in magnetic resonance imaging (MRI) has become increasingly complex due to the high computational demands of iterative reconstruction algorithms and the heterogeneity of emerging computing platforms. However, the clinical applicability of these methods requires fast processing to ensure rapid and accurate diagnostics. Graphics processing units (GPUs) have demonstrated substantial performance gains in various reconstruction tasks. In this work, we present a GPU implementation of the reconstruction kernel for the generalized reconstruction by inversion of coupled systems (GRICS), an iterative joint optimization approach that enables 3D high-resolution image reconstruction with motion correction. Three implementations were compared: (i) a C++ CPU version, (ii) a Matlab–GPU version (with minimal code modifications allowing data storage in GPU memory), and (iii) a native GPU version using CUDA. Six distinct datasets, including various motion types, were tested. The results showed that the Matlab–GPU approach achieved speedups ranging from 1.2× to 2.0× compared to the CPU implementation, whereas the native CUDA version attained speedups of 9.7× to 13.9×. Across all datasets, the normalized root mean square error (NRMSE) remained on the order of 10⁻⁶ to 10⁻⁴, indicating that the CUDA-accelerated method preserved image quality. Furthermore, a roofline analysis was conducted to quantify the kernel’s performance on one of the evaluated datasets. The kernel achieved 250 GFLOP/s, representing a 15.6× improvement over the performance of the Matlab–GPU version. These results confirm that GPU-based implementations of GRICS can drastically reduce reconstruction times while maintaining diagnostic fidelity, paving the way for more efficient clinical motion-compensated MRI workflows. Full article
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
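For orientation, the roofline analysis cited in this abstract bounds a kernel's attainable throughput by its arithmetic intensity; in its standard form (the device peak rate P_peak and memory bandwidth B are hardware parameters, not values taken from the paper):

    P_\text{attainable} = \min\left(P_\text{peak},\; I \cdot B\right), \qquad I = \frac{\text{floating-point operations}}{\text{bytes moved to and from memory}}

On this model a kernel is memory-bandwidth-bound whenever its intensity satisfies I · B < P_peak, and compute-bound otherwise.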

17 pages, 4496 KB  
Article
Accelerated Method for Simulating the Solidification Microstructure of Continuous Casting Billets on GPUs
by Jingjing Wang, Xiaoyu Liu, Yuxin Li and Ruina Mao
Materials 2025, 18(9), 1955; https://doi.org/10.3390/ma18091955 - 25 Apr 2025
Cited by 1 | Viewed by 682
Abstract
Microstructure simulations of continuous casting billets are vital for understanding solidification mechanisms and optimizing process parameters. However, the commonly used CA (Cellular Automaton) model is limited by grid anisotropy, which affects the accuracy of dendrite morphology simulations. While the DCSA (Decentered Square Algorithm) reduces anisotropy, its high computational cost due to the use of fine grids and dynamic liquid/solid interface tracking hinders large-scale applications. To address this, we propose a high-performance CA-DCSA method on GPUs (Graphic Processing Units). The CA-DCSA algorithm is first refactored and implemented on a CPU–GPU heterogeneous architecture for efficient acceleration. Subsequently, key optimizations, including memory access management and warp divergence reduction, are proposed to enhance GPU utilization. Finally, simulated results are validated through industrial experiments, with relative errors of 2.5% (equiaxed crystal ratio) and 2.3% (average secondary dendrite arm spacing) in 65# steel, and 2.1% and 0.7% in 60# steel. The maximum temperature difference in 65# steel is 1.8 °C. Compared to the serial implementation, the GPU-accelerated method achieves a 1430× higher speed using two GPUs. This work has provided a powerful tool for detailed microstructure observation and process parameter optimization in continuous casting billets. Full article
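One common way to reduce warp divergence in cellular-automaton kernels such as the one described above is to replace per-cell branches with predicated arithmetic; the sketch below is illustrative only (the field names and the simple growth rule are assumptions, not the paper's data layout):

    // Illustrative CUDA kernel: branchless solid-fraction update over a CA grid.
    // 'active' is 1 for interface cells and 0 elsewhere, so multiplying by it
    // replaces a divergent if/else inside the warp with uniform arithmetic.
    __global__ void updateSolidFraction(const int *active, const float *dfs,
                                        float *fs, int nCells)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nCells) return;                  // bounds check only
        float f = fs[i] + active[i] * dfs[i];     // predicated growth increment
        fs[i] = fminf(f, 1.0f);                   // clamp at fully solid
    }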

16 pages, 12134 KB  
Article
Intelligent Dynamic Multi-Dimensional Heterogeneous Resource Scheduling Optimization Strategy Based on Kubernetes
by Jialin Cai, Hui Zeng, Feifei Liu and Junming Chen
Mathematics 2025, 13(8), 1342; https://doi.org/10.3390/math13081342 - 19 Apr 2025
Cited by 3 | Viewed by 1919
Abstract
In this paper, we tackle the challenge of optimizing resource utilization and demand-driven allocation in dynamic, multi-dimensional heterogeneous environments. Traditional containerized task scheduling systems, like Kubernetes, typically rely on default schedulers that primarily focus on CPU and memory, overlooking the multi-dimensional nature of heterogeneous resources such as GPUs, network I/O, and disk I/O. This results in suboptimal scheduling and underutilization of resources. To address this, we propose a dynamic scheduling method for heterogeneous resources using an enhanced Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) algorithm that adjusts weights in real time and applies nonlinear normalization. Leveraging parallel computing, approximation, incremental computation, local updates, and hardware acceleration, the method minimizes overhead and ensures efficiency. Experimental results showed that, under low-load conditions, our method reduced task response times by 31–36%, increased throughput by 20–50%, and boosted resource utilization by over 20% compared to both the default Kubernetes scheduler and the Kubernetes Container Scheduling Strategy (KCSS) algorithm. These improvements were tested across diverse workloads, utilizing CPU, memory, GPU, and I/O resources, in a large-scale cluster environment, demonstrating the method’s robustness. These enhancements optimize cluster performance and resource efficiency, offering valuable insights for task scheduling in containerized cloud platforms. Full article
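For context, the TOPSIS ranking referenced above scores each candidate node by its distances to an ideal and an anti-ideal resource vector; in the standard formulation (the paper's real-time weight adjustment and nonlinear normalization modify the weights and normalized ratings and are not reproduced here):

    v_{ij} = w_j\, r_{ij}, \qquad D_i^{\pm} = \sqrt{\sum_j \left(v_{ij} - v_j^{\pm}\right)^2}, \qquad C_i = \frac{D_i^-}{D_i^+ + D_i^-}

where r_ij is the normalized rating of node i on criterion j (CPU, memory, GPU, network I/O, disk I/O), v_j^+ and v_j^- are the best and worst weighted values of criterion j, and nodes with larger closeness C_i are preferred.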

35 pages, 6933 KB  
Article
Matrix-Based ACO for Solving Parametric Problems Using Heterogeneous Reconfigurable Computers and SIMD Accelerators
by Vladimir Sudakov and Yuri Titov
Mathematics 2025, 13(8), 1284; https://doi.org/10.3390/math13081284 - 14 Apr 2025
Viewed by 980
Abstract
This paper presents a new matrix representation of ant colony optimization (ACO) for solving parametric problems. This representation allows us to perform calculations using matrix processors and single-instruction multiple-data (SIMD) calculators. To solve the problem of stagnation of the method without a priori information about the system, a new probabilistic formula for choosing the parameter value is proposed, based on the additive convolution of the number of pheromone weights and the number of visits to the vertex. The method can be performed as parallel calculations, which accelerates the process of determining the solution. However, the high speed of determining the solution should be correlated with the high speed of calculating the objective function, which can be difficult when using complex analytical and simulation models. Software has been developed in Python 3.12 and C/C++ 20 to study the proposed changes to the method. With parallel calculations, it is possible to separate the matrix modification of the method into SIMD and multiple-instruction multiple-data (MIMD) components and perform calculations on the appropriate equipment. According to the results of this research, when solving the problem of optimizing benchmark functions of various dimensions, it was possible to accelerate the method by more than 12 times on matrix SIMD central processing unit (CPU) accelerators. When calculating on the graphics processing unit (GPU), the acceleration was about six times due to the difficulties of implementing a pseudo-random number stream. The developed modifications were used to determine the optimal values of the SARIMA parameters when forecasting the volume of transportation by airlines of the Russian Federation. Mathematical dependencies of the acceleration factors on the algorithm parameters and the number of components were also determined, which allows us to estimate the possibilities of accelerating the algorithm by using a reconfigurable heterogeneous computer. Full article
(This article belongs to the Special Issue Optimization Algorithms, Distributed Computing and Intelligence)
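For reference, the classical ACO transition rule that the matrix formulation above recasts as dense, SIMD-friendly linear-algebra work selects the next value with probability proportional to pheromone and heuristic weight (textbook multiplicative form; the paper's additive convolution with visit counts, introduced to avoid stagnation, is not reproduced here):

    p_{ij} = \frac{\tau_{ij}^{\alpha}\, \eta_{ij}^{\beta}}{\sum_{k \in \mathcal{N}_i} \tau_{ik}^{\alpha}\, \eta_{ik}^{\beta}}

where τ_ij is the pheromone level, η_ij the heuristic desirability, and N_i the set of admissible choices at step i.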

20 pages, 732 KB  
Article
VCONV: A Convolutional Neural Network Accelerator for FPGAs
by Srikanth Neelam and A. Amalin Prince
Electronics 2025, 14(4), 657; https://doi.org/10.3390/electronics14040657 - 8 Feb 2025
Cited by 2 | Viewed by 3206
Abstract
Field Programmable Gate Arrays (FPGAs), with their wide portfolio of configurable resources such as Look-Up Tables (LUTs), Block Random Access Memory (BRAM), and Digital Signal Processing (DSP) blocks, are the best option for custom hardware designs. Their low power consumption and cost-effectiveness give them an advantage over Graphics Processing Units (GPUs) and Central Processing Units (CPUs) in providing efficient accelerator solutions for compute-intensive Convolutional Neural Network (CNN) models. CNN accelerators are dedicated hardware modules capable of performing compute operations such as convolution, activation, normalization, and pooling with minimal intervention from a host. Designing accelerators for deeper CNN models requires FPGAs with vast resources, which impacts their advantages in terms of power and price. In this paper, we propose the VCONV Intellectual Property (IP), an efficient and scalable CNN accelerator architecture for applications where power and cost are constraints. VCONV, with its configurable design, can be deployed across multiple smaller FPGAs instead of a single large FPGA to provide better control over cost and parallel processing. VCONV can be deployed across heterogeneous FPGAs, depending on the performance requirements of each layer. The IP’s performance can be evaluated using embedded monitors to ensure that the accelerator is configured to achieve the best performance. VCONV can be configured for data type format, convolution engine (CE) and convolution unit (CU) configurations, as well as the sequence of operations based on the CNN model and layer. VCONV can be interfaced through the Advanced Peripheral Bus (APB) for configuration and the Advanced eXtensible Interface (AXI) stream for data transfers. The IP was implemented and validated on the Avnet Zedboard and tested on the first layer of AlexNet, VGG16, and ResNet18 with multiple CE configurations, demonstrating 100% performance from MAC units with no idle time. We also synthesized multiple VCONV instances required for AlexNet, achieving the lowest BRAM utilization of just 1.64 Mb and deriving a performance of 56 GOPs. Full article
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 3rd Edition)
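As a rough yardstick for the throughput figure quoted above, the operation count of a single convolutional layer follows directly from its dimensions (generic relation, not taken from the paper):

    \text{Ops} \approx 2 \cdot K^2 \cdot C_\text{in} \cdot C_\text{out} \cdot H_\text{out} \cdot W_\text{out}, \qquad \text{GOP/s} = \frac{\text{Ops}}{10^9 \cdot t}

where K is the (square) kernel size, C_in and C_out the input and output channel counts, H_out × W_out the output feature-map size, t the execution time in seconds, and the factor 2 counts one multiply and one add per MAC.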

27 pages, 12035 KB  
Article
Numerical Study on Hydrodynamic Performance and Vortex Dynamics of Multiple Cylinders Under Forced Vibration at Low Reynolds Number
by Fulong Shi, Chuanzhong Ou, Jianjian Xin, Wenjie Li, Qiu Jin, Yu Tian and Wen Zhang
J. Mar. Sci. Eng. 2025, 13(2), 214; https://doi.org/10.3390/jmse13020214 - 23 Jan 2025
Cited by 3 | Viewed by 1601
Abstract
Flow around clustered cylinders is widely encountered in engineering applications such as wind energy systems, pipeline transport, and marine engineering. To investigate the hydrodynamic performance and vortex dynamics of multiple cylinders under forced vibration at low Reynolds numbers, with a focus on understanding the interference characteristics in various configurations, this study is based on a self-developed radial basis function iso-surface ghost cell computing platform, which improves the implicit iso-surface interface representation method to track the moving boundaries of multiple cylinders, and employs a self-constructed CPU/GPU heterogeneous parallel acceleration technique for efficient numerical simulations. This study systematically investigates the interference characteristics of multiple cylinder configurations across various parameter domains, including spacing ratios, geometric arrangements, and oscillation modes. A quantitative analysis of key parameters, such as aerodynamic coefficients, dimensionless frequency characteristics, and vorticity field evolution, is performed. This study reveals that, for a dual-cylinder system, there exists a critical gap ratio between X/D = 2.5 and 3, which leads to an increase in the lift and drag coefficients of both cylinders, a reduction in the vortex shedding periodicity, and a disruption of the wake structure. For a three-cylinder system, the lift and drag coefficients of the two upstream cylinders decrease with increasing spacing. On the other hand, this increased spacing results in a rise in the drag of the downstream cylinder. In the case of a four-cylinder system, the drag coefficients of the cylinders located on either side of the flow direction are relatively high. A significant increase in the lift coefficient occurs when the spacing ratio is less than 2.0, while the drag coefficient of the downstream cylinder is minimized. The findings establish a comprehensive theoretical framework for the optimal configuration design and structural optimization of multicylinder systems, while also providing practical guidelines for engineering applications. Full article
(This article belongs to the Section Ocean Engineering)

14 pages, 2657 KB  
Article
Accelerating Batched Power Flow on Heterogeneous CPU-GPU Platform
by Jiao Hao, Zongbao Zhang, Zonglin He, Zhengyuan Liu, Zhengdong Tan and Yankan Song
Energies 2024, 17(24), 6269; https://doi.org/10.3390/en17246269 - 12 Dec 2024
Viewed by 1316
Abstract
As the scale of China’s interconnected power grid continues to expand, traditional serial computing methods are no longer sufficient for the rapid analysis and computation of electrical networks with tens of thousands of nodes due to their small scale and low efficiency. To enhance the capability of online grid analysis, this paper introduces an accelerating batched power flow calculation method based on a heterogeneous CPU-GPU platform. This method, based on the fast decoupled method, combined with the tremendous parallel computing capability of GPUs with the multi-threaded parallel processing of CPUs, efficiently resolves the exceeding bus type conversion issues in GPU-batched power flow calculation and improves the accuracy of the power flow calculations. Then, a binary-based power flow data exchange format was introduced, which utilizes a single binary file for data exchange. This format significantly minimizes I/O time overhead and reduces file size, further enhancing the method’s efficiency. Case studies on real-world power grids demonstrate its high accuracy and reliability. Compared to the traditional single-threaded power flow calculation method, this method dramatically reduces time consumption in batch power flow calculations. It proves the significant advantages of dealing with large-scale power flow calculations. Full article
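As background, the fast decoupled method named above reduces each power flow iteration to two linear solves with constant coefficient matrices, which is what makes batching many cases on a GPU attractive (standard textbook form; the paper's handling of bus-type conversion is not shown):

    \frac{\Delta P}{V} = B' \, \Delta\theta, \qquad \frac{\Delta Q}{V} = B'' \, \Delta V

Because B′ and B″ depend only on network topology and branch susceptances, they can be factorized once and the factors reused across iterations and across all cases in a batch.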

20 pages, 305 KB  
Article
Revisiting Database Indexing for Parallel and Accelerated Computing: A Comprehensive Study and Novel Approaches
by Maryam Abbasi, Marco V. Bernardo, Paulo Váz, José Silva and Pedro Martins
Information 2024, 15(8), 429; https://doi.org/10.3390/info15080429 - 24 Jul 2024
Cited by 1 | Viewed by 4615
Abstract
While the importance of indexing strategies for optimizing query performance in database systems is widely acknowledged, the impact of rapidly evolving hardware architectures on indexing techniques has been an underexplored area. As modern computing systems increasingly leverage parallel processing capabilities, multi-core CPUs, and specialized hardware accelerators, traditional indexing approaches may not fully capitalize on these advancements. This comprehensive experimental study investigates the effects of hardware-conscious indexing strategies tailored for contemporary and emerging hardware platforms. Through rigorous experimentation on a real-world database environment using the industry-standard TPC-H benchmark, this research evaluates the performance implications of indexing techniques specifically designed to exploit parallelism, vectorization, and hardware-accelerated operations. By examining approaches such as cache-conscious B-Tree variants, SIMD-optimized hash indexes, and GPU-accelerated spatial indexing, the study provides valuable insights into the potential performance gains and trade-offs associated with these hardware-aware indexing methods. The findings reveal that hardware-conscious indexing strategies can significantly outperform their traditional counterparts, particularly in data-intensive workloads and large-scale database deployments. Our experiments show improvements ranging from 32.4% to 48.6% in query execution time, depending on the specific technique and hardware configuration. However, the study also highlights the complexity of implementing and tuning these techniques, as they often require intricate code optimizations and a deep understanding of the underlying hardware architecture. Additionally, this research explores the potential of machine learning-based indexing approaches, including reinforcement learning for index selection and neural network-based index advisors. While these techniques show promise, with performance improvements of up to 48.6% in certain scenarios, their effectiveness varies across different query types and data distributions. By offering a comprehensive analysis and practical recommendations, this research contributes to the ongoing pursuit of database performance optimization in the era of heterogeneous computing. The findings inform database administrators, developers, and system architects on effective indexing practices tailored for modern hardware, while also paving the way for future research into adaptive indexing techniques that can dynamically leverage hardware capabilities based on workload characteristics and resource availability. Full article
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)
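A GPU-accelerated index probe of the kind evaluated above typically answers a whole batch of lookups in a single kernel launch; the sketch below is illustrative only (the array and kernel names are assumptions, not code from the study):

    // Illustrative CUDA kernel: one thread resolves one lookup against a sorted
    // key array, so thousands of index probes complete in a single launch.
    __global__ void batchedLowerBound(const int *keys, int nKeys,
                                      const int *queries, int *results,
                                      int nQueries)
    {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= nQueries) return;
        int lo = 0, hi = nKeys;              // search the half-open range [lo, hi)
        int target = queries[q];
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (keys[mid] < target) lo = mid + 1;
            else                    hi = mid;
        }
        results[q] = lo;                     // position of first key >= target
    }
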
16 pages, 3939 KB  
Article
Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence
by Yong Liang, Junwen Tan, Zhisong Xie, Zetao Chen, Daoqian Lin and Zhenhao Yang
Sensors 2024, 24(1), 240; https://doi.org/10.3390/s24010240 - 31 Dec 2023
Cited by 6 | Viewed by 5088
Abstract
In recent years, edge intelligence (EI) has emerged, combining edge computing with AI, and specifically deep learning, to run AI algorithms directly on edge devices. In practical applications, EI faces challenges related to computational power, power consumption, size, and cost, with the primary challenge being the trade-off between computational power and power consumption. This has rendered traditional computing platforms unsustainable, making heterogeneous parallel computing platforms a crucial pathway for implementing EI. In our research, we leveraged the Xilinx Zynq 7000 heterogeneous computing platform, employed high-level synthesis (HLS) for design, and implemented two different accelerators for LeNet-5 using loop unrolling and pipelining optimization techniques. The experimental results show that when running at a clock speed of 100 MHz, the PIPELINE accelerator, compared to the UNROLL accelerator, experiences an 8.09% increase in power consumption but speeds up by 14.972 times, making the PIPELINE accelerator superior in performance. Compared to the CPU, the PIPELINE accelerator reduces power consumption by 91.37% and speeds up by 70.387 times, while compared to the GPU, it reduces power consumption by 93.35%. This study provides two different optimization schemes for edge intelligence applications through design and experimentation and demonstrates the impact of different quantization methods on FPGA resource consumption. These experimental results can provide a reference for practical applications, thereby providing a reference hardware acceleration scheme for edge intelligence applications. Full article
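For the UNROLL-versus-PIPELINE comparison above, the benefit of loop pipelining in HLS is usually reasoned about with the standard latency estimate (generic relation, not a figure from the paper):

    L_\text{total} \approx L_\text{iter} + (N - 1)\cdot II

where N is the loop trip count, L_iter the latency of one iteration, and II the initiation interval; a fully pipelined loop with II = 1 approaches one result per clock cycle, whereas unrolling instead replicates the loop body in hardware so several iterations are processed in the same cycle at a higher resource cost.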

22 pages, 6594 KB  
Article
Massively Parallel Monte Carlo Sampling for Xinanjiang Hydrological Model Parameter Optimization Using CPU-GPU Computer Cluster
by Guangyuan Kan, Chenliang Li, Depeng Zuo, Xiaodi Fu and Ke Liang
Water 2023, 15(15), 2810; https://doi.org/10.3390/w15152810 - 3 Aug 2023
Cited by 2 | Viewed by 2337
Abstract
The Monte Carlo sampling (MCS) method is a simple and practical way for hydrological model parameter optimization. The MCS procedure is used to generate a large number of data points. Therefore, its computational efficiency is a key issue when applied to large-scale problems. The MCS method is an internally concurrent algorithm that can be parallelized. It has the potential to execute on massively parallel hardware systems such as multi-node computer clusters equipped with multiple CPUs and GPUs, which are known as heterogeneous hardware systems. To take advantage of this, we parallelize the algorithm and implement it on a multi-node computer cluster that hosts multiple INTEL multi-core CPUs and NVIDIA many-core GPUs by using C++ programming language combined with the MPI, OpenMP, and CUDA parallel programming libraries. The parallel parameter optimization method is coupled with the Xinanjiang hydrological model to test the acceleration efficiency when tackling real-world applications that have a very high computational burden. Numerical experiments indicate, on the one hand, that the computational efficiency of the massively parallel parameter optimization method is significantly improved compared to single-core CPU code, and the multi-GPU code achieves the fastest speed. On the other hand, the scalability property of the proposed method is also satisfactory. In addition, the correctness of the proposed method is also tested using sensitivity and uncertainty analysis of the model parameters. Study results indicate good acceleration efficiency and reliable correctness of the proposed parallel optimization methods, which demonstrates excellent prospects in practical applications. Full article
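The sampling stage described above maps naturally onto one GPU thread per sampled parameter set; the sketch below is illustrative only (the objective function is a placeholder, not the Xinanjiang model, and all names are assumptions):

    // Illustrative CUDA kernel: each thread scores one Monte Carlo parameter
    // sample. evalObjective stands in for the hydrological model's
    // goodness-of-fit computation, which dominates the real workload.
    __device__ float evalObjective(const float *params, int nParams)
    {
        float s = 0.0f;
        for (int k = 0; k < nParams; ++k)
            s += params[k] * params[k];      // placeholder objective
        return s;
    }

    __global__ void scoreSamples(const float *samples, float *scores,
                                 int nSamples, int nParams)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nSamples) return;
        scores[i] = evalObjective(samples + (size_t)i * nParams, nParams);
    }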

15 pages, 1158 KB  
Article
A Flexible and General-Purpose Platform for Heterogeneous Computing
by Jose Juan Garcia-Hernandez, Miguel Morales-Sandoval and Erick Elizondo-Rodríguez
Computation 2023, 11(5), 97; https://doi.org/10.3390/computation11050097 - 11 May 2023
Cited by 4 | Viewed by 2927
Abstract
In the big data era, processing large amounts of data imposes several challenges, mainly in terms of performance. Complex operations in data science, such as deep learning, large-scale simulations, and visualization applications, can consume a significant amount of computing time. Heterogeneous computing is an attractive alternative for algorithm acceleration, using not one but several different kinds of computing devices (CPUs, GPUs, or FPGAs) simultaneously. Accelerating an algorithm for a specific device under a specific framework, i.e., CUDA/GPU, provides a solution with the highest possible performance at the cost of a loss in generality and requires an experienced programmer. On the contrary, heterogeneous computing allows one to hide the details pertaining to the simultaneous use of different technologies in order to accelerate computation. However, effective heterogeneous computing implementation still requires mastering the underlying design flow. Aiming to fill this gap, in this paper we present a heterogeneous computing platform (HCP). Regarding its main features, this platform allows non-experts in heterogeneous computing to deploy, run, and evaluate high-computational-demand algorithms following a semi-automatic design flow. Given the implementation of an algorithm in C with minimal format requirements, the platform automatically generates the parallel code using a code analyzer, which is adapted to target a set of available computing devices. Thus, while an experienced heterogeneous computing programmer is not required, the process can run over the available computing devices on the platform as it is not an ad hoc solution for a specific computing device. The proposed HCP relies on the OpenCL specification for interoperability and generality. The platform was validated and evaluated in terms of generality and efficiency through a set of experiments using the algorithms of the Polybench/C suite (version 3.2) as the input. Different configurations for the platform were used, considering CPUs only, GPUs only, and a combination of both. The results revealed that the proposed HCP was able to achieve accelerations of up to 270× for specific classes of algorithms, i.e., parallel-friendly algorithms, while its use required almost no expertise in either OpenCL or heterogeneous computing from the programmer/end-user. Full article
(This article belongs to the Section Computational Engineering)

29 pages, 1605 KB  
Article
Energy-Efficient Parallel Computing: Challenges to Scaling
by Alexey Lastovetsky and Ravi Reddy Manumachu
Information 2023, 14(4), 248; https://doi.org/10.3390/info14040248 - 20 Apr 2023
Cited by 11 | Viewed by 5493
Abstract
The energy consumption of Information and Communications Technology (ICT) presents a new grand technological challenge. The two main approaches to tackle the challenge include the development of energy-efficient hardware and software. The development of energy-efficient software employing application-level energy optimization techniques has become an important category owing to the paradigm shift in the composition of digital platforms from single-core processors to heterogeneous platforms integrating multicore CPUs and graphics processing units (GPUs). In this work, we present an overview of application-level bi-objective optimization methods for energy and performance that address two fundamental challenges, non-linearity and heterogeneity, inherent in modern high-performance computing (HPC) platforms. Applying the methods requires energy profiles of the application’s computational kernels executing on the different compute devices of the HPC platform. Therefore, we summarize the research innovations in the three mainstream component-level energy measurement methods and present their accuracy and performance tradeoffs. Finally, scaling the optimization methods for energy and performance is crucial to achieving energy efficiency objectives and meeting quality-of-service requirements in modern HPC platforms and cloud computing infrastructures. We introduce the building blocks needed to achieve this scaling and conclude with the challenges to scaling. Briefly, two significant challenges are described, namely fast optimization methods and accurate component-level energy runtime measurements, especially for components running on accelerators. Full article
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)

15 pages, 2573 KB  
Article
F-LSTM: FPGA-Based Heterogeneous Computing Framework for Deploying LSTM-Based Algorithms
by Bushun Liang, Siye Wang, Yeqin Huang, Yiling Liu and Linpeng Ma
Electronics 2023, 12(5), 1139; https://doi.org/10.3390/electronics12051139 - 26 Feb 2023
Cited by 16 | Viewed by 5015
Abstract
Long Short-Term Memory (LSTM) networks have been widely used to solve sequence modeling problems. For researchers, using LSTM networks as the core and combining it with pre-processing and post-processing to build complete algorithms is a general solution for solving sequence problems. As an ideal hardware platform for LSTM network inference, Field Programmable Gate Array (FPGA) with low power consumption and low latency characteristics can accelerate the execution of algorithms. However, implementing LSTM networks on FPGA requires specialized hardware and software knowledge and optimization skills, which is a challenge for researchers. To reduce the difficulty of deploying LSTM networks on FPGAs, we propose F-LSTM, an FPGA-based framework for heterogeneous computing. With the help of F-LSTM, researchers can quickly deploy LSTM-based algorithms to heterogeneous computing platforms. FPGA in the platform will automatically take up the computation of the LSTM network in the algorithm. At the same time, the CPU will perform the pre-processing and post-processing in the algorithm. To better design the algorithm, compress the model, and deploy the algorithm, we also propose a framework based on F-LSTM. The framework also integrates Pytorch to increase usability. Experimental results on sentiment analysis tasks show that deploying algorithms to the F-LSTM hardware platform can achieve a 1.8× performance improvement and a 5.4× energy efficiency improvement compared to GPU. Experimental results also validate the need to build heterogeneous computing systems. In conclusion, our work reduces the difficulty of deploying LSTM on FPGAs while guaranteeing algorithm performance compared to traditional work. Full article
(This article belongs to the Special Issue FPGA-Based Accelerators of Deep Learning and Neuromorphic Computing)
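The per-timestep computation that the FPGA takes over in the framework above is the standard LSTM cell (W, U, and b are the trained weights and biases; the CPU-side pre- and post-processing are not shown):

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)

    \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t)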

13 pages, 3015 KB  
Article
GPU-Based Cellular Automata Model for Multi-Orient Dendrite Growth and the Application on Binary Alloy
by Jingjing Wang, Hongji Meng, Jian Yang and Zhi Xie
Crystals 2023, 13(1), 105; https://doi.org/10.3390/cryst13010105 - 6 Jan 2023
Cited by 4 | Viewed by 2929
Abstract
To simulate dendrite growth with different orientations more efficiently, a high-performance cellular automata (CA) model based on a heterogeneous central processing unit (CPU) + graphics processing unit (GPU) architecture has been proposed in this paper. Firstly, the decentered square algorithm (DCSA) is used to simulate the morphology of dendrites with different orientations. Secondly, parallel algorithms are proposed to take full advantage of many cores by maximizing computational parallelism. Thirdly, in order to further improve the calculation efficiency, the task scheduling scheme using multi-stream is designed to solve the waiting problem among independent tasks, improving task parallelism. Then, the present model was validated by comparing its steady dendrite tip velocity with the Lipton–Glicksman–Kurz (LGK) analytical model, which shows great agreement. Finally, it is applied to simulate the dendrite growth of the binary alloy, which proves that the present model can not only simulate the clear dendrite morphology with different orientations and secondary arms, but also show a good agreement with the in situ experiment. In addition, compared with the traditional CPU model, the speedup of this model is up to 158×, which provides a great acceleration. Full article
(This article belongs to the Special Issue Intermetallic Compound (Volume II))
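The multi-stream task scheduling mentioned above overlaps independent kernels by issuing them on separate CUDA streams instead of serializing them on the default stream; the sketch below is illustrative only (the two kernels are placeholders, not the paper's growth and capture steps):

    #include <cuda_runtime.h>

    // Two illustrative, independent kernels touching disjoint buffers, so they
    // may execute concurrently when launched on different streams.
    __global__ void taskA(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;
    }
    __global__ void taskB(float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) b[i] *= 0.5f;
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMemset(a, 0, n * sizeof(float));
        cudaMemset(b, 0, n * sizeof(float));

        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

        taskA<<<(n + 255) / 256, 256, 0, s[0]>>>(a, n);   // issued on stream 0
        taskB<<<(n + 255) / 256, 256, 0, s[1]>>>(b, n);   // issued on stream 1

        for (int i = 0; i < 2; ++i) {
            cudaStreamSynchronize(s[i]);                  // wait for each stream
            cudaStreamDestroy(s[i]);
        }
        cudaFree(a);
        cudaFree(b);
        return 0;
    }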

12 pages, 4992 KB  
Article
Heterogeneous CPU-GPU Accelerated Subgridding in the FDTD Modelling of Microwave Breakdown
by Jian Feng, Kaihong Song, Ming Fang, Wei Chen, Guoda Xie, Zhixiang Huang and Xianliang Wu
Electronics 2022, 11(22), 3725; https://doi.org/10.3390/electronics11223725 - 14 Nov 2022
Cited by 1 | Viewed by 2033
Abstract
Microwave breakdown is crucial to the transmission of high-power microwave (HPM) devices, where a growing number of studies have analyzed the complex interactions between electromagnetic waves and the evolving plasma from theoretical and analytical perspectives. In this paper, we propose a finite-difference time-domain (FDTD) scheme to numerically solve Maxwell’s equation, coupled with a fluid plasma equation for simulating the plasma formation during HPM air breakdown. A subgridding method is adopted to obtain accurate results with lower computational resources. Moreover, the three-dimensional subgridding Maxwell–plasma algorithm is efficiently accelerated by utilizing heterogeneous computing technique based on graphics processing units (GPUs) and multiple central processing units (CPUs), which can be applied as an efficient method for the investigation of the HPM air breakdown phenomena. Full article
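The finite-difference time-domain scheme named above leapfrogs electric and magnetic field updates on a staggered Yee grid; in its simplest one-dimensional form (shown only to fix ideas; the paper's three-dimensional subgridded scheme coupled to the plasma fluid equation is considerably more involved):

    H_y^{\,n+1/2}(i+\tfrac{1}{2}) = H_y^{\,n-1/2}(i+\tfrac{1}{2}) + \frac{\Delta t}{\mu\, \Delta x}\left[E_z^{\,n}(i+1) - E_z^{\,n}(i)\right]

    E_z^{\,n+1}(i) = E_z^{\,n}(i) + \frac{\Delta t}{\varepsilon\, \Delta x}\left[H_y^{\,n+1/2}(i+\tfrac{1}{2}) - H_y^{\,n+1/2}(i-\tfrac{1}{2})\right]

Subgridding locally refines Δx (and typically Δt) only where fine structure must be resolved, which keeps the global cell count, and hence the GPU workload, manageable.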
