# Software and DVFS Tuning for Performance and Energy-Efficiency on Intel KNL Processors


## Abstract


## 1. Introduction

## 2. Lattice Boltzmann Methods

#### Experimental Methodology

## 3. The Knights Landing Architecture

## 4. Measuring the Energy Consumption of the KNL

## 5. Energy Optimization of Data Structures

## 6. Energy Efficiency Optimization Using DVFS

#### 6.1. Function Benchmarks

#### 6.2. Full Application Results

## 7. Conclusions and Future Works

- Applications previously developed for ordinary x86 multi-core CPUs can easily be ported to and run on KNL processors. However, performance depends strongly on the level of vectorization and core parallelism that applications are able to exploit;
- For LB (and many other) applications, appropriate data layouts play a relevant role in enabling vectorization and efficient use of the memory sub-system, improving both computing and energy efficiency;
- If application data fit within the MCDRAM, the performance of the KNL is very competitive with that of recent GPUs in terms of both computing and energy efficiency; unfortunately, if this is not the case, computing performance is strongly reduced;
- Given the reduction in machine balance when using the DDR4 instead of the MCDRAM, functions that are commonly compute-bound on most architectures may become memory-bound in this configuration;
- To simulate large lattices that do not fit in the MCDRAM, it is therefore important to be able to split them across several KNLs, so that every sub-lattice fits in the MCDRAM, as is commonly done when running LB applications on multiple GPUs [23];
- If it is not possible to split the data domain across several processors, the resulting performance degradation can be partially compensated for by energy savings of up to $20\%$, obtained by using DVFS to reduce the cores’ frequency;
- As on GPU devices [12], the time needed to change the frequency of all the cores on the KNL makes a function-by-function selection of core frequencies not viable for LB applications.
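The role of data layouts in the second and fourth points above comes down to the index arithmetic of each layout. The following sketch illustrates it for a D2Q37-like array of per-site populations; it is an illustrative example, not the authors' code, and the function names and the choice of `VL = 8` (doubles per AVX-512 vector) are our assumptions:

```c
#include <stddef.h>

#define NPOP 37  /* populations per lattice site (D2Q37) */
#define VL    8  /* assumed vector length: 8 doubles per AVX-512 register */

/* AoS: the NPOP populations of one site are contiguous, so a sweep over
   sites for a fixed population reads with stride NPOP, defeating
   vectorization. */
size_t idx_aos(size_t site, size_t pop) {
    return site * NPOP + pop;
}

/* SoA: all sites of one population are contiguous, giving unit-stride,
   vectorizable accesses, but needs the total number of sites. */
size_t idx_soa(size_t nsites, size_t site, size_t pop) {
    return pop * nsites + site;
}

/* CSoA: sites are grouped in clusters of VL, so each population of a
   cluster occupies exactly one aligned vector; the lane index selects
   the site within that vector. */
size_t idx_csoa(size_t site, size_t pop) {
    size_t cluster = site / VL;
    size_t lane    = site % VL;
    return (cluster * NPOP + pop) * VL + lane;
}
```

The CAoSoA layout used in the paper combines the same clustering idea with an outer array-of-structures grouping, keeping vector-aligned accesses while improving locality across populations.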

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Ge, R.; Feng, X.; Song, S.; Chang, H.C.; Li, D.; Cameron, K.W. PowerPack: Energy profiling and analysis of high-performance systems and applications. IEEE Trans. Parallel Distrib. Syst. **2010**, 21, 658–671.
2. Attig, N.; Gibbon, P.; Lippert, T. Trends in supercomputing: The European path to exascale. Comput. Phys. Commun. **2011**, 182, 2041–2046.
3. Calore, E.; Gabbana, A.; Schifano, S.F.; Tripiccione, R. Early experience on using Knights Landing processors for Lattice Boltzmann applications. In Proceedings of the 12th International Parallel Processing and Applied Mathematics Conference, Lublin, Poland, 10–13 September 2017; Volume 1077, pp. 1–12.
4. Bernard, C.; Christ, N.; Gottlieb, S.; Jansen, K.; Kenway, R.; Lippert, T.; Lüscher, M.; Mackenzie, P.; Niedermayer, F.; Sharpe, S.; et al. Panel discussion on the cost of dynamical quark simulations. Nucl. Phys. B Proc. Suppl. **2002**, 106, 199–205.
5. Bilardi, G.; Pietracaprina, A.; Pucci, G.; Schifano, F.; Tripiccione, R. The Potential of on-Chip Multiprocessing for QCD Machines; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2005; Volume 3769, pp. 386–397.
6. Bonati, C.; Calore, E.; Coscetti, S.; D’Elia, M.; Mesiti, M.; Negro, F.; Schifano, S.F.; Tripiccione, R. Development of scientific software for HPC architectures using OpenACC: The case of LQCD. In Proceedings of the 2015 International Workshop on Software Engineering for High Performance Computing in Science (SE4HPCS), Florence, Italy, 18 May 2015; pp. 9–15.
7. Bonati, C.; Coscetti, S.; D’Elia, M.; Mesiti, M.; Negro, F.; Calore, E.; Schifano, S.F.; Silvi, G.; Tripiccione, R. Design and optimization of a portable LQCD Monte Carlo code using OpenACC. Int. J. Mod. Phys. C **2017**, 28.
8. Bonati, C.; Calore, E.; D’Elia, M.; Mesiti, M.; Negro, F.; Sanfilippo, F.; Schifano, S.; Silvi, G.; Tripiccione, R. Portable multi-node LQCD Monte Carlo simulations using OpenACC. Int. J. Mod. Phys. C **2018**, 29.
9. Peng, I.B.; Gioiosa, R.; Kestor, G.; Cicotti, P.; Laure, E.; Markidis, S. Exploring the Performance Benefit of Hybrid Memory System on HPC Environments. In Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Lake Buena Vista, FL, USA, 29 May–2 June 2017; pp. 683–692.
10. Allen, T.; Daley, C.S.; Doerfler, D.; Austin, B.; Wright, N.J. Performance and Energy Usage of Workloads on KNL and Haswell Architectures. In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation; Jarvis, S., Wright, S., Hammond, S., Eds.; Springer: New York, NY, USA, 2018; pp. 236–249.
11. Calore, E.; Gabbana, A.; Schifano, S.F.; Tripiccione, R. Energy-efficiency evaluation of Intel KNL for HPC workloads. In Parallel Computing is Everywhere; Advances in Parallel Computing; IOS: Amsterdam, The Netherlands, 2018; Volume 32, pp. 733–742.
12. Calore, E.; Gabbana, A.; Schifano, S.F.; Tripiccione, R. Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications. Concurr. Comput. Pract. Exp. **2017**, 29, 1–19.
13. Succi, S. The Lattice-Boltzmann Equation; Oxford University Press: Oxford, UK, 2001.
14. Biferale, L.; Mantovani, F.; Sbragaglia, M.; Scagliarini, A.; Toschi, F.; Tripiccione, R. Second-order closure in stratified turbulence: Simulations and modeling of bulk and entrainment regions. Phys. Rev. E **2011**, 84, 016305.
15. Biferale, L.; Mantovani, F.; Pivanti, M.; Sbragaglia, M.; Scagliarini, A.; Schifano, S.F.; Toschi, F.; Tripiccione, R. Lattice Boltzmann fluid-dynamics on the QPACE supercomputer. Procedia Comput. Sci. **2010**, 1, 1075–1082.
16. Sbragaglia, M.; Benzi, R.; Biferale, L.; Chen, H.; Shan, X.; Succi, S. Lattice Boltzmann method with self-consistent thermo-hydrodynamic equilibria. J. Fluid Mech. **2009**, 628, 299–309.
17. Scagliarini, A.; Biferale, L.; Sbragaglia, M.; Sugiyama, K.; Toschi, F. Lattice Boltzmann methods for thermal flows: Continuum limit and applications to compressible Rayleigh–Taylor systems. Phys. Fluids **2010**, 22, 055101.
18. Biferale, L.; Mantovani, F.; Sbragaglia, M.; Scagliarini, A.; Toschi, F.; Tripiccione, R. Reactive Rayleigh-Taylor systems: Front propagation and non-stationarity. EPL **2011**, 94, 54004.
19. Biferale, L.; Mantovani, F.; Pivanti, M.; Pozzati, F.; Sbragaglia, M.; Scagliarini, A.; Schifano, S.F.; Toschi, F.; Tripiccione, R. A Multi-GPU Implementation of a D2Q37 Lattice Boltzmann Code. In Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, Torun, Poland, 11–14 September 2011; Revised Selected Papers, Part I; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; pp. 640–650.
20. Calore, E.; Schifano, S.F.; Tripiccione, R. On Portability, Performance and Scalability of an MPI OpenCL Lattice Boltzmann Code. In Euro-Par 2014: Parallel Processing Workshops, Porto, Portugal, 25–26 August 2014; Revised Selected Papers, Part II; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; pp. 438–449.
21. Calore, E.; Schifano, S.F.; Tripiccione, R. Energy-Performance Tradeoffs for HPC Applications on Low Power Processors; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9523, pp. 737–748.
22. Calore, E.; Gabbana, A.; Kraus, J.; Schifano, S.F.; Tripiccione, R. Performance and portability of accelerated lattice Boltzmann applications with OpenACC. Concurr. Comput. Pract. Exp. **2016**, 28, 3485–3502.
23. Calore, E.; Gabbana, A.; Kraus, J.; Pellegrini, E.; Schifano, S.F.; Tripiccione, R. Massively parallel lattice-Boltzmann codes on large GPU clusters. Parallel Comput. **2016**, 58, 1–24.
24. Mantovani, F.; Pivanti, M.; Schifano, S.F.; Tripiccione, R. Performance issues on many-core processors: A D2Q37 Lattice Boltzmann scheme as a test-case. Comput. Fluids **2013**, 88, 743–752.
25. Crimi, G.; Mantovani, F.; Pivanti, M.; Schifano, S.F.; Tripiccione, R. Early Experience on Porting and Running a Lattice Boltzmann Code on the Xeon-Phi Co-Processor. Procedia Comput. Sci. **2013**, 18, 551–560.
26. Calore, E.; Demo, N.; Schifano, S.F.; Tripiccione, R. Experience on Vectorizing Lattice Boltzmann Kernels for Multi- and Many-Core Architectures. In Proceedings of the 11th International Conference on Parallel Processing and Applied Mathematics, Krakow, Poland, 6–9 September 2015; Revised Selected Papers, Part I; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; pp. 53–62.
27. McCalpin, J.D. STREAM: Sustainable Memory Bandwidth in High Performance Computers; University of Virginia: Charlottesville, VA, USA; A Continually Updated Technical Report. Available online: http://www.cs.virginia.edu/stream/ (accessed on 3 June 2018).
28. Colfax. Clustering Modes in Knights Landing Processors. Available online: https://colfaxresearch.com/knl-numa/ (accessed on 3 June 2018).
29. Colfax. MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors: Developers Guide. Available online: https://colfaxresearch.com/knl-mcdram/ (accessed on 3 June 2018).
30. Sodani, A.; Gramunt, R.; Corbal, J.; Kim, H.S.; Vinod, K.; Chinthamani, S.; Hutsell, S.; Agarwal, R.; Liu, Y.C. Knights Landing: Second-generation Intel Xeon Phi product. IEEE Micro **2016**, 36, 34–46.
31. Dongarra, J.; London, K.; Moore, S.; Mucci, P.; Terpstra, D. Using PAPI for hardware performance monitoring on Linux systems. In Proceedings of the Conference on Linux Clusters: The HPC Revolution, Champaign, IL, USA, 25–27 June 2001; Volume 5.
32. Weaver, V.; Johnson, M.; Kasichayanula, K.; Ralph, J.; Luszczek, P.; Terpstra, D.; Moore, S. Measuring Energy and Power with PAPI. In Proceedings of the 41st International Conference on Parallel Processing Workshops (ICPPW), Pittsburgh, PA, USA, 10–13 September 2012; pp. 262–268.
33. Hackenberg, D.; Schone, R.; Ilsche, T.; Molka, D.; Schuchart, J.; Geyer, R. An Energy Efficiency Feature Survey of the Intel Haswell Processor. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), Hyderabad, India, 25–29 May 2015; pp. 896–904.
34. Desrochers, S.; Paradis, C.; Weaver, V.M. A Validation of DRAM RAPL Power Measurements. In Proceedings of the Second International Symposium on Memory Systems, Alexandria, VA, USA, 3–6 October 2016; pp. 455–470.
35. Calore, E.; Gabbana, A.; Schifano, S.F.; Tripiccione, R. Optimization of lattice Boltzmann simulations on heterogeneous computers. Int. J. High Perform. Comput. Appl. **2017**.
36. Etinski, M.; Corbalán, J.; Labarta, J.; Valero, M. Understanding the future of energy-performance trade-off via DVFS in HPC environments. J. Parallel Distrib. Comput. **2012**, 72, 579–590.
37. Lawson, G.; Sosonkina, M.; Shen, Y. Performance and Energy Evaluation of CoMD on Intel Xeon Phi Co-processors. In Proceedings of the 2014 Hardware-Software Co-Design for High Performance Computing, New Orleans, LA, USA, 17 November 2014; pp. 49–54.
38. Lawson, G.; Sundriyal, V.; Sosonkina, M.; Shen, Y. Runtime Power Limiting of Parallel Applications on Intel Xeon Phi Processors. In Proceedings of the 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), Salt Lake City, UT, USA, 14 November 2016; pp. 39–45.
39. Haidar, A.; Jagode, H.; YarKhan, A.; Vaccaro, P.; Tomov, S.; Dongarra, J. Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi. In Proceedings of the 2017 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 12–14 September 2017; pp. 1–7.
40. Williams, S.; Waterman, A.; Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM **2009**, 52, 65–76.
41. McCalpin, J.D. Memory Bandwidth and Machine Balance in Current High Performance Computers. In Proceedings of the IEEE Technical Committee on Computer Architecture (TCCA) Newsletter, Santa Margherita Ligure, Italy, 22–24 June 1995; pp. 19–25.
42. Doerfler, D.; Deslippe, J.; Williams, S.; Oliker, L.; Cook, B.; Kurth, T.; Lobet, M.; Malas, T.; Vay, J.L.; Vincenti, H. Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor. In High Performance Computing; Taufer, M., Mohr, B., Kunkel, J.M., Eds.; Kluwer Academic/Plenum Press: Dordrecht, The Netherlands, 2016; pp. 339–353.
43. Valero-Lara, P.; Igual, F.D.; Prieto-Matías, M.; Pinelli, A.; Favier, J. Accelerating fluid–solid simulations (Lattice-Boltzmann & Immersed-Boundary) on heterogeneous architectures. J. Comput. Sci. **2015**, 10, 249–261.
44. Valero-Lara, P. Reducing memory requirements for large size LBM simulations on GPUs. Concurr. Comput. Pract. Exp. **2017**, 29, e4221.
45. Mantovani, F.; Calore, E. Multi-Node Advanced Performance and Power Analysis with Paraver. In Parallel Computing is Everywhere; Advances in Parallel Computing; Springer: Berlin, Germany, 2018; Volume 32, pp. 723–732.

**Figure 1.** Tests run using three different memory configurations (DDR4 Flat, MCDRAM Flat and Cache) and different numbers of threads (64, 128, 192 and 256). We show three metrics: time-to-solution ($T_s$), average power drain ($P_{avg}$) and energy-to-solution ($E_s$); values are averaged over 1000 iterations. (**a**) The propagate function: $T_s$ in nanoseconds per site (**top**), $P_{avg}$ in watts (**middle**) and $E_s$ in microjoules per site (**bottom**); (**b**) the collide function: $T_s$ in nanoseconds per site (**top**), $P_{avg}$ in watts (**middle**) and $E_s$ in microjoules per site (**bottom**). AoS, Array of Structures; SoA, Structure of Arrays; CSoA, Clustered Structure of Arrays; CAoSoA, Clustered Array of Structures of Arrays.

**Figure 2.** Energy consumption in nanojoules per lattice site, using different memory data layouts and different numbers of threads. Each bar represents the package energy (**bottom**) plus the DRAM energy (**top**); when using the MCDRAM, the latter is just the DRAM idle energy consumption. (**a**) The propagate function: Flat configuration, using the DDR4 system memory (**top**) and the MCDRAM (**bottom**); notice the different scales on the y-axes. (**b**) The collide function: Flat configuration, using the DDR4 system memory (**top**) and the MCDRAM (**bottom**).

**Figure 3.** $E_S$ (energy-to-solution) versus $T_S$ (time-to-solution) for the propagate and collide functions, adopting the CAoSoA data layout and using 256 threads. KNL cores’ frequencies are shown in GHz as labels. Each data point is the average over 100 iterations; the measurement was repeated three times, and the results agree within 1.5% in $E_S$. (**a**) The propagate function using the DDR4 (red) and MCDRAM (blue) memories; (**b**) the collide function using the DDR4 (red) and MCDRAM (blue) memories.

**Figure 4.** $E_S$ (energy-to-solution) versus $T_S$ (time-to-solution) for the whole simulation, adopting the CAoSoA data layout and using 256 threads. KNL cores’ frequencies for the two functions are shown in GHz: the label on the left for propagate and the label on the right for collide. (**a**) Full simulation storing the lattice in the MCDRAM memory; (**b**) full simulation storing the lattice in the DDR4 memory.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Calore, E.; Gabbana, A.; Schifano, S.F.; Tripiccione, R.
Software and DVFS Tuning for Performance and Energy-Efficiency on Intel KNL Processors. *J. Low Power Electron. Appl.* **2018**, *8*, 18.
https://doi.org/10.3390/jlpea8020018

