MDPI - Publisher of Open Access Journals

9 pages, 705 KB

Open AccessArticle

Energy Efficiency of a New Parallel PIC Code for Numerical Simulation of Plasma Dynamics in Open Trap

by Igor Chernykh, Igor Kulikov, Vitaly Vshivkov, Ekaterina Genrikh, Dmitry Weins, Galina Dudnikova, Ivan Chernoshtanov and Marina Boronina

Mathematics 2022, 10(19), 3684; https://doi.org/10.3390/math10193684 - 8 Oct 2022

Cited by 5 | Viewed by 2274

Abstract

The generation of energy-efficient parallel scientific codes became very important in the time of carbon footprint reduction. In this paper, we briefly present our latest particle-in-cell code with the results of a numerical simulation of plasma dynamics in an open trap. This code [...] Read more.

The generation of energy-efficient parallel scientific codes became very important in the time of carbon footprint reduction. In this paper, we briefly present our latest particle-in-cell code with the results of a numerical simulation of plasma dynamics in an open trap. This code can be auto-vectorized by the Fortran compiler for Intel Xeon processors with AVX-512 instructions such as Intel Xeon Phi and the highest series of all generations of Intel Xeon Scalable processors. Efficient use of processor architecture is the main feature of an energy-efficient solution. We present a step-by-step methodology of energy consumption calculation using Intel hardware features and Intel VTune software. We also give an estimated value of carbon footprint with the impact of high-performance water cooled hardware. The Power Usage Effectiveness (PUE) in the case of high-performance water cooled hardware is equal to 1.03–1.05, and is up to 1.3 in the case of air-cooled systems. This means that power consumption of liquid cooled systems is lower than that air-cooled ones by up to 25%. All these factors play an important role in the carbon footprint reduction problem. Full article

(This article belongs to the Special Issue Parallel Computing and Applications)

► Show Figures

Figure 1

17 pages, 4675 KB

Open AccessArticle

Performance Evaluation of Massively Parallel Systems Using SPEC OMP Suite

by Dheya Mustafa

Computers 2022, 11(5), 75; https://doi.org/10.3390/computers11050075 - 5 May 2022

Cited by 1 | Viewed by 4839

Abstract

Performance analysis plays an essential role in achieving a scalable performance of applications on massively parallel supercomputers equipped with thousands of processors. This paper is an empirical investigation to study, in depth, the performance of two of the most common High-Performance Computing architectures [...] Read more.

Performance analysis plays an essential role in achieving a scalable performance of applications on massively parallel supercomputers equipped with thousands of processors. This paper is an empirical investigation to study, in depth, the performance of two of the most common High-Performance Computing architectures in the world. IBM has developed three generations of Blue Gene supercomputers—Blue Gene/L, P, and Q—that use, at a large scale, low-power processors to achieve high performance. Better CPU core efficiency has been empowered by a higher level of integration to gain more parallelism per processing element. On the other hand, the Intel Xeon Phi coprocessor armed with 61 on-chip x86 cores, provides high theoretical peak performance, as well as software development flexibility with existing high-level programming tools. We present an extensive evaluation study of the performance peaks and scalability of these two modern architectures using SPEC OMP benchmarks. Full article

(This article belongs to the Topic Innovation of Applied System)

► Show Figures

Graphical abstract

17 pages, 6553 KB

Open AccessArticle

Enabling Large-Scale Simulations of Quantum Transport with Manycore Computing

by Yosang Jeong and Hoon Ryu

Electronics 2021, 10(3), 253; https://doi.org/10.3390/electronics10030253 - 22 Jan 2021

Viewed by 2382

Abstract

The non-equilibrium Green’s function (NEGF) is being utilized in the field of nanoscience to predict transport behaviors of electronic devices. This work explores how much performance improvement can be driven for quantum transport simulations with the aid of manycore computing, where the core [...] Read more.

The non-equilibrium Green’s function (NEGF) is being utilized in the field of nanoscience to predict transport behaviors of electronic devices. This work explores how much performance improvement can be driven for quantum transport simulations with the aid of manycore computing, where the core numerical operation involves a recursive process of matrix multiplication. Major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present technical details on how they are applied to optimize the performance of simulations in computing hardware, including Intel Xeon Phi Knights Landing (KNL) systems and NVIDIA general purpose graphic processing unit (GPU) devices. With a target structure of a silicon nanowire that consists of 100,000 atoms and is described with an atomistic tight-binding model, the effects of optimization techniques on the performance of simulations are rigorously tested in a KNL node equipped with two Quadro GV100 GPU devices, and we observe that computation is accelerated by a factor of up to ∼20 against the unoptimized case. The feasibility of handling large-scale workloads in a huge computing environment is also examined with nanowire simulations in a wide energy range, where good scalability is procured up to 2048 KNL nodes. Full article

(This article belongs to the Special Issue High-Performance Computer Architectures and Applications)

► Show Figures

Figure 1

24 pages, 6911 KB

Open AccessArticle

Empirical Performance Analysis of Collective Communication for Distributed Deep Learning in a Many-Core CPU Environment

by Junghoon Woo, Hyeonseong Choi and Jaehwan Lee

Appl. Sci. 2020, 10(19), 6717; https://doi.org/10.3390/app10196717 - 25 Sep 2020

Cited by 4 | Viewed by 4715

Abstract

To accommodate lots of training data and complex training models, “distributed” deep learning training has become employed more and more frequently. However, communication bottlenecks between distributed systems lead to poor performance of distributed deep learning training. In this study, we proposed a new [...] Read more.

To accommodate lots of training data and complex training models, “distributed” deep learning training has become employed more and more frequently. However, communication bottlenecks between distributed systems lead to poor performance of distributed deep learning training. In this study, we proposed a new collective communication method in a Python environment by utilizing Multi-Channel Dynamic Random Access Memory (MCDRAM) in Intel Xeon Phi Knights Landing processors. Major deep learning software platforms, such as TensorFlow and PyTorch, offer Python as a main development language, so we developed an efficient communication library by adapting Memkind library, which is a C-based library to utilize high-performance memory MCDRAM. For performance evaluation, we tested the popular collective communication methods in distributed deep learning, such as Broadcast, Gather, and AllReduce. We conducted experiments to analyze the effect of high-performance memory and processor location on communication performance. In addition, we analyze performance in a Docker environment for further relevance given the recent major trend of Cloud computing. By extensive experiments in our testbed, we confirmed that the communication in our proposed method showed performance improvement by up to 487%. Full article

(This article belongs to the Special Issue Operating System Issues in Emerging Systems and Applications)

► Show Figures

Figure 1

14 pages, 1364 KB

Open AccessArticle

Performance and Energy Assessment of a Lattice Boltzmann Method Based Application on the Skylake Processor

by Ivan Girotto, Sebastiano Fabio Schifano, Enrico Calore, Gianluca Di Staso and Federico Toschi

Computation 2020, 8(2), 44; https://doi.org/10.3390/computation8020044 - 8 May 2020

Cited by 1 | Viewed by 4049

Abstract

This paper presents the performance analysis for both the computing performance and the energy efficiency of a Lattice Boltzmann Method (LBM) based application, used to simulate three-dimensional multicomponent turbulent systems on massively parallel architectures for high-performance computing. Extending results reported in previous works, [...] Read more.

This paper presents the performance analysis for both the computing performance and the energy efficiency of a Lattice Boltzmann Method (LBM) based application, used to simulate three-dimensional multicomponent turbulent systems on massively parallel architectures for high-performance computing. Extending results reported in previous works, the analysis is meant to demonstrate the impact of using optimized data layouts designed for LBM based applications on high-end computer platforms. A particular focus is given to the Intel Skylake processor and to compare the target architecture with other models of the Intel processor family. We introduce the main motivations of the presented work as well as the relevance of its scientific application. We analyse the measured performances of the implemented data layouts on the Skylake processor while scaling the number of threads per socket. We compare the results obtained on several CPU generations of the Intel processor family and we make an analysis of energy efficiency on the Skylake processor compared with the Intel Xeon Phi processor, finally adding our interpretation of the presented results. Full article

(This article belongs to the Special Issue Energy-Efficient Computing on Parallel Architectures)

► Show Figures

Figure 1

42 pages, 1564 KB

Open AccessArticle

A Comparative Study of Methods for Measurement of Energy of Computing

by Muhammad Fahad, Arsalan Shahid, Ravi Reddy Manumachu and Alexey Lastovetsky

Energies 2019, 12(11), 2204; https://doi.org/10.3390/en12112204 - 10 Jun 2019

Cited by 70 | Viewed by 12624

Abstract

Energy of computing is a serious environmental concern and mitigating it is an important technological challenge. Accurate measurement of energy consumption during an application execution is key to application-level energy minimization techniques. There are three popular approaches to providing it: (a) System-level physical [...] Read more.

Energy of computing is a serious environmental concern and mitigating it is an important technological challenge. Accurate measurement of energy consumption during an application execution is key to application-level energy minimization techniques. There are three popular approaches to providing it: (a) System-level physical measurements using external power meters; (b) Measurements using on-chip power sensors and (c) Energy predictive models. In this work, we present a comprehensive study comparing the accuracy of state-of-the-art on-chip power sensors and energy predictive models against system-level physical measurements using external power meters, which we consider to be the ground truth. We show that the average error of the dynamic energy profiles obtained using on-chip power sensors can be as high as 73% and the maximum reaches 300% for two scientific applications, matrix-matrix multiplication and 2D fast Fourier transform for a wide range of problem sizes. The applications are executed on three modern Intel multicore CPUs, two Nvidia GPUs and an Intel Xeon Phi accelerator. The average error of the energy predictive models employing performance monitoring counters (PMCs) as predictor variables can be as high as 32% and the maximum reaches 100% for a diverse set of seventeen benchmarks executed on two Intel multicore CPUs (one Haswell and the other Skylake). We also demonstrate that using inaccurate energy measurements provided by on-chip sensors for dynamic energy optimization can result in significant energy losses up to 84%. We show that, owing to the nature of the deviations of the energy measurements provided by on-chip sensors from the ground truth, calibration can not improve the accuracy of the on-chip sensors to an extent that can allow them to be used in optimization of applications for dynamic energy. Finally, we present the lessons learned, our recommendations for the use of on-chip sensors and energy predictive models and future directions. Full article

(This article belongs to the Special Issue Model Coupling and Energy Systems)

► Show Figures

Figure 1

15 pages, 1832 KB

Open AccessArticle

Software and DVFS Tuning for Performance and Energy-Efficiency on Intel KNL Processors

by Enrico Calore, Alessandro Gabbana, Sebastiano Fabio Schifano and Raffaele Tripiccione

J. Low Power Electron. Appl. 2018, 8(2), 18; https://doi.org/10.3390/jlpea8020018 - 11 Jun 2018

Cited by 9 | Viewed by 8829

Abstract

Energy consumption of processors and memories is quickly becoming a limiting factor in the deployment of large computing systems. For this reason, it is important to understand the energy performance of these processors and to study strategies allowing their use in the most [...] Read more.

Energy consumption of processors and memories is quickly becoming a limiting factor in the deployment of large computing systems. For this reason, it is important to understand the energy performance of these processors and to study strategies allowing their use in the most efficient way. In this work, we focus on the computing and energy performance of the Knights Landing Xeon Phi, the latest Intel many-core architecture processor for HPC applications. We consider the 64-core Xeon Phi 7230 and profile its performance and energy efficiency using both its on-chip MCDRAM and the off-chip DDR4 memory as the main storage for application data. As a benchmark application, we use a lattice Boltzmann code heavily optimized for this architecture and implemented using several different arrangements of the application data in memory (data-layouts, in short). We also assess the dependence of energy consumption on data-layouts, memory configurations (DDR4 or MCDRAM) and the number of threads per core. We finally consider possible trade-offs between computing performance and energy efficiency, tuning the clock frequency of the processor using the Dynamic Voltage and Frequency Scaling (DVFS) technique. Full article

(This article belongs to the Special Issue Energy Aware Scientific Computing on Low Power and Heterogeneous Architectures)

► Show Figures

Figure 1

21 pages, 1178 KB

Open AccessArticle

Energy Efficiency Effects of Vectorization in Data Reuse Transformations for Many-Core Processors—A Case Study †

by Abdullah Al Hasib, Lasse Natvig, Per Gunnar Kjeldsberg and Juan M. Cebrián

J. Low Power Electron. Appl. 2017, 7(1), 5; https://doi.org/10.3390/jlpea7010005 - 22 Feb 2017

Cited by 6 | Viewed by 9687

Abstract

Thread-level and data-level parallel architectures have become the design of choice in many of today’s energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data [...] Read more.

Thread-level and data-level parallel architectures have become the design of choice in many of today’s energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data reuse exploration aims at reducing the pressure on the memory subsystem by exploiting the temporal locality in data accesses. In this paper, we investigate the effects on performance and energy from a data reuse methodology combined with parallelization and vectorization in multi- and many-core processors. As a test case, a full-search motion estimation kernel is evaluated on Intel^® Core^TM i7-4700K (Haswell) and i7-2600K (Sandy Bridge) multi-core processors, as well as on an Intel^® Xeon Phi^TM many-core processor (Knights Landing) with Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) and Advanced Vector Extensions (AVX) instruction sets. Results using a single-threaded execution on the Haswell and Sandy Bridge systems show that performance and EDP (Energy Delay Product) can be improved through data reuse transformations on the scalar code by a factor of ≈3× and ≈6×, respectively. Compared to scalar code without data reuse optimization, the SSE/AVX2 version achieves ≈10×/17× better performance and ≈92×/307× better EDP, respectively. These results can be improved by 10% to 15% using data reuse techniques. Finally, the most optimized version using data reuse and AVX512 achieves a speedup of ≈35× and an EDP improvement of ≈1192× on the Xeon Phi system. While single-threaded execution serves as a common reference point for all architectures to analyze the effects of data reuse on both scalar and vector codes, scalability with thread count is also discussed in the paper. Full article

(This article belongs to the Special Issue Emerging Network-on-Chip Architectures for Low Power Embedded Systems)

► Show Figures

Figure 1

22 pages, 5689 KB

Open AccessArticle

OpenCL Implementation of a Parallel Universal Kriging Algorithm for Massive Spatial Data Interpolation on Heterogeneous Systems

by Fang Huang, Shuanshuan Bu, Jian Tao and Xicheng Tan

ISPRS Int. J. Geo-Inf. 2016, 5(6), 96; https://doi.org/10.3390/ijgi5060096 - 17 Jun 2016

Cited by 18 | Viewed by 7650

Abstract

In some digital Earth engineering applications, spatial interpolation algorithms are required to process and analyze large amounts of data. Due to its powerful computing capacity, heterogeneous computing has been used in many applications for data processing in various fields. In this study, we [...] Read more.

In some digital Earth engineering applications, spatial interpolation algorithms are required to process and analyze large amounts of data. Due to its powerful computing capacity, heterogeneous computing has been used in many applications for data processing in various fields. In this study, we explore the design and implementation of a parallel universal kriging spatial interpolation algorithm using the OpenCL programming model on heterogeneous computing platforms for massive Geo-spatial data processing. This study focuses primarily on transforming the hotspots in serial algorithms, i.e., the universal kriging interpolation function, into the corresponding kernel function in OpenCL. We also employ parallelization and optimization techniques in our implementation to improve the code performance. Finally, based on the results of experiments performed on two different high performance heterogeneous platforms, i.e., an NVIDIA graphics processing unit system and an Intel Xeon Phi system (MIC), we show that the parallel universal kriging algorithm can achieve the highest speedup of up to 40× with a single computing device and the highest speedup of up to 80× with multiple devices. Full article

► Show Figures

Figure 1

Search Results (9)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Saved Queries

Search Filter Reset All

Years

Feature Papers

Subjects

Journals

Article Types

Countries / Regions

Search Results (9)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI