Comparative Performance Evaluation of Modern Heterogeneous High-Performance Computing Systems CPUs

Abstract: The study presents a comparison of computing systems based on IBM POWER8, IBM POWER9, and Intel Xeon Platinum 8160 processors running parallel applications. Memory subsystem bandwidth was studied, parallel programming technologies were compared, and the operating modes and capabilities of simultaneous multithreading technology were analyzed. Performance analysis for the studied computing systems running parallel applications based on the OpenMP and MPI technologies was carried out by using the NAS Parallel Benchmarks. An assessment of the results obtained during experimental calculations led to the conclusion that IBM POWER8 and Intel Xeon Platinum 8160 systems have almost the same maximum memory bandwidth, but require a different number of threads for efficient utilization. The IBM POWER9 system has the highest maximum bandwidth, which can be attributed to the large number of memory channels per socket. Based on the results of numerical experiments, recommendations are given on how hardware of a similar grade can be utilized to solve various scientific problems, including recommendations on optimal processor architecture choice for leveraging the operation of high-performance hybrid computing platforms.


Introduction
The advent of specialized algorithms and data-processing software that use the capabilities of graphics coprocessors (GPUs) has led to an increase in the demand for hybrid high-performance computing systems. The effectiveness of calculations using the GPU and related development tools (for example, CUDA) is the subject of a large number of works, which have shown substantial performance gains in solving scientific problems in various fields of knowledge. As an example, we can single out image-processing tasks [1], for which completely new capabilities for data analysis and interpretation have been found.
Most of these studies evaluate only the GPU in detail [2], paying very little attention to central processors. This issue is extremely important, since supercomputer centers that provide scientists with access to their computing resources need to use the capabilities of modern GPU-based computing systems as efficiently as possible [3]. Although the choice of CPU manufacturers is seemingly limited, each of them (Intel (Santa Clara, CA, USA), IBM (Armonk, NY, USA), etc.) is trying to develop a relatively independent ecosystem, including various sets of system libraries and compilers. Their performance can vary significantly across different classes of tasks and affect the overall performance of a hybrid computing system. In this regard, assessing how the CPU performs typical operations is an urgent task. It matters because the central processors in hybrid systems, in addition to dispatching functions, also perform a large number of operations related directly to numerical calculations. The results of such studies, depending on the purpose of the computing system, will make it possible to choose the most efficient architectures, as well as give recommendations on their use in calculations in various fields of knowledge.
To solve the abovementioned objectives, a study of modern IBM POWER series processors was conducted. The first phase [4] includes the performance analysis of a hybrid computing cluster based on IBM POWER8 processors. The obtained results show that these systems proved to be effective for solving machine learning and deep learning problems, as well as for calculations in various software packages designed to fully utilize CPU resources. These systems demonstrate high performance due to architectural solutions of POWER processors (in particular, simultaneous multithreading technology (SMT) [5]), as well as the efficient utilization of GPUs assisting with calculations, which becomes possible with modern special-purpose libraries (for example, IBM ESSL) that boost application execution speed without modifying the application's source code.
This paper presents the second part of the results of this study, which contains benchmark data for modern IBM POWER series processors, including a comparison with Intel Xeon Platinum 8160 processors benchmarks. Based on a comprehensive benchmarking of these computing systems running parallel applications, the objective of this study is to substantiate recommendations on the choice of processor architectures to leverage the operation of hybrid computing platforms.

The Description of Computing Platforms and Utilized System Software
The following three hybrid systems were chosen for analysis: IBM Power Systems S822LC 8335-GTB, IBM Power Systems AC922 8335-GTG, and Huawei FusionServer G5500 Server G560 V5.
The IBM Power Systems S822LC 8335-GTB server (hereinafter referred to as IBM POWER8 system) is based on two superscalar 10-core IBM POWER8 processors [6] with a maximum operating frequency of 4.023 GHz and SMT technology support (up to 8 threads per core), having two NVIDIA Tesla P100 GPU accelerators. One of the distinctive features of this system is that RAM is not directly connected to a CPU's four-channel memory controllers, but via specialized Centaur chips [7], which ensure data-transfer planning and control (one chip per memory channel). Each memory controller channel running at 9.6 GHz can read 2 bytes and write 1 byte at a time. Thus, the theoretical total memory bandwidth for the considered server is (2 B + 1 B) × 9.6 GHz × 4 × 2 = 230.4 GB/s [7]. Detailed system specifications are given in Reference [4].
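The peak-bandwidth arithmetic for the IBM POWER8 system can be reproduced with a short calculation. The sketch below is only an illustration of the formula quoted above; the function name and parameter names are hypothetical and do not come from any vendor tool.

```python
def peak_memory_bandwidth_gbs(read_bytes, write_bytes, channel_ghz,
                              channels_per_socket, sockets):
    """Theoretical peak bandwidth in GB/s for a Centaur-style memory link.

    Each channel moves (read_bytes + write_bytes) per cycle at channel_ghz,
    and the total is summed over all channels of all sockets.
    """
    bytes_per_cycle = read_bytes + write_bytes
    return bytes_per_cycle * channel_ghz * channels_per_socket * sockets


# Values taken from the text: 2 B read + 1 B write, 9.6 GHz,
# 4 channels per socket, 2 sockets.
power8_peak = peak_memory_bandwidth_gbs(2, 1, 9.6, 4, 2)
print(f"IBM POWER8 system peak: {power8_peak:.1f} GB/s")  # prints 230.4 GB/s
```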
The latest in the POWER series, IBM Power Systems AC922 8335-GTG (hereinafter referred to as IBM POWER9 system), is based on a hybrid architecture incorporating two superscalar IBM POWER9 processors [8] with a maximum operating frequency of 3.5 GHz and four NVIDIA Tesla V100 GPU accelerators. The server uses SO (scale-out) processor variations, with the main difference from the SU (scale-up) variation being the direct connection of RAM to its memory controller, without a Centaur buffer chip. This processor has 20 cores supporting SMT4 technology (80 hardware threads per socket). There are 512 KB of shared L2 cache and 10 MB of shared L3 cache for every two cores. Peak memory bandwidth for this computing system totals 340 GB/s [9].
The Huawei FusionServer G5500 Server G560 V5 computing system (hereinafter referred to as Intel Xeon Platinum 8160 system) is based on Intel processors. It incorporates two superscalar Intel Xeon Platinum 8160 processors with a maximum operating frequency of 2.1 GHz, which are based on the Skylake microarchitecture [10], and eight NVIDIA Tesla V100 GPU accelerators. The processors support SMT (Hyper-Threading) technology, allowing up to 2 hardware threads to be executed on each core. Thus, a 24-core CPU can simultaneously run 48 hardware threads. Peak memory bandwidth for the G5500 Server G560 V5 system is 256 GB/s [11].
Summary information with detailed specifications of the studied computing systems CPUs is presented in Table 1. The following compilers and MPI libraries were used for the study (Table 2), providing the optimal performance level for calculations on these types of computing systems, according to previous benchmarks [12].

Benchmark Conditions and Means
Hardware performance issues related to the execution of parallel applications (see Section 1) were considered in the study. In particular, the bandwidth of memory subsystems was evaluated, and the parallel computing performance of the studied computing systems was analyzed. To study these issues, the widely used benchmarks listed in Table 3 were run. To evaluate the real memory bandwidth, the STREAM benchmark was used. It examines the steady-state bandwidth for read and write operations that are performed in conjunction with arithmetic operations. The benchmark contains four computational kernels (Copy, Scale, Add, and Triad), which are described in Table 4.

Table 4. STREAM kernels. Columns: Name, Operation, Bytes per Iteration, FLOPs per Iteration.
The computational kernels work with arrays of double-precision numbers (8 B each), whose sizes are set when the benchmark is compiled. To measure memory bandwidth correctly, the arrays should be at least 4 times larger than the sum of all last-level caches. Therefore, arrays of 8.4 × 10^7 elements were used for benchmarking the considered computing systems.
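For orientation, the four kernels can be sketched as follows. This is a minimal pure-Python illustration that follows the operation order of the public STREAM reference code, not the benchmark itself; the per-iteration byte and FLOP counts are the conventional STREAM figures for 8-byte doubles and are stated here as an assumption, since they are not taken from Table 4. The names `stream_kernels`, `KERNEL_COST`, and `bandwidth_gbs` are hypothetical.

```python
from array import array


def stream_kernels(n, q=3.0):
    """Run Copy, Scale, Add, and Triad once on n-element double arrays."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    for i in range(n):          # Copy:  c = a
        c[i] = a[i]
    for i in range(n):          # Scale: b = q * c
        b[i] = q * c[i]
    for i in range(n):          # Add:   c = a + b
        c[i] = a[i] + b[i]
    for i in range(n):          # Triad: a = b + q * c
        a[i] = b[i] + q * c[i]
    return a, b, c


# Conventional (bytes moved, FLOPs) per loop iteration for 8-byte doubles.
KERNEL_COST = {"Copy": (16, 0), "Scale": (16, 1), "Add": (24, 1), "Triad": (24, 2)}


def bandwidth_gbs(kernel, n, seconds):
    """Bandwidth in GB/s implied by running `kernel` over n elements."""
    bytes_per_iter, _ = KERNEL_COST[kernel]
    return bytes_per_iter * n / seconds / 1e9


a, b, c = stream_kernels(1000)
array_mb = 84_000_000 * 8 / 1e6   # one array of 8.4e7 doubles, in MB
```

With 8.4 × 10^7 elements, each array occupies 672 MB, so the three arrays together take roughly 2 GB.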
The benchmark results are compared with the theoretical memory bandwidth values listed in Table 5.
Performance analysis for the studied computing systems using parallel computing with OpenMP and MPI technologies was carried out by using the NAS Parallel Benchmarks (NPB). The NPB consists of several individual benchmark problems: kernels and pseudo-applications. Kernels and pseudo-applications can perform calculations in the following problem classes: S, W, A, B, C, and D. The sizes of the main arrays, the amount of data, and the number of iterations of the main program loops increase as the problem class increases. The following NPB kernels and pseudo-applications were used for research: EP (embarrassingly parallel), LU (lower-upper decomposition), MG (multigrid), CG (conjugate gradient), FT (fast Fourier transform), and IS (integer sort).
To compile all benchmarks, we used the compilers presented in Table 2, with the optimization flags listed in Table 6. When compiling MPI versions of the NPB, OpenMP-related flags ("-qsmp=omp" for the IBM POWER8 and POWER9 systems, "-qopenmp" for the Intel Xeon Platinum 8160 system) were omitted.
To eliminate the possibility of erroneous results caused by other users' activity, all numerical calculations were performed in a single-user mode. Dynamic CPU frequency scaling was disabled. Furthermore, CPUs were set to their maximum frequency, acceptable at a simultaneous maximum load of all computing cores (see Table 1).
Benchmark results and their analysis are given in the corresponding sections of this paper.
Table 6. Optimization flags.

Memory Bandwidth Benchmark
STREAM benchmark calculations were performed with different numbers of computational threads, which were distributed uniformly over the sockets (half of the total number of threads on each). For the IBM POWER8 processor, ST mode (Single Thread, one thread per core) was used when the number of threads did not exceed 20. For 40, 80, and 160 threads, the SMT2, SMT4, and SMT8 modes (two, four, and eight threads per core) were used, respectively. IBM POWER9 and Intel Xeon Platinum 8160 processors have more cores, so ST mode was used for them when the number of threads did not exceed 40. SMT2 and SMT4 modes were used for 80 and 160 threads, respectively.
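The relation between thread count and SMT mode described above can be expressed as a small helper. This is a sketch under the assumption that the mode is simply the number of threads per core rounded up to a power of two; `smt_mode` and `threads_per_socket` are hypothetical names, and the example covers only the two IBM systems (20 and 40 cores in total).

```python
import math


def smt_mode(threads, cores):
    """Return the SMT mode needed to run `threads` on `cores` physical cores."""
    per_core = math.ceil(threads / cores)
    # Round the threads-per-core count up to the next power of two.
    level = 1 << (per_core - 1).bit_length()
    return "ST" if level == 1 else f"SMT{level}"


def threads_per_socket(threads, sockets=2):
    """Uniform distribution: the same number of threads on each socket."""
    return threads // sockets


power8_modes = [smt_mode(t, 20) for t in (20, 40, 80, 160)]
power9_modes = [smt_mode(t, 40) for t in (40, 80, 160)]
print(power8_modes)  # ['ST', 'SMT2', 'SMT4', 'SMT8']
print(power9_modes)  # ['ST', 'SMT2', 'SMT4']
```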
The test results for the Copy, Scale, Add, and Triad kernels are shown in Figures 1 and 2.
Electronics 2020, 9, x FOR PEER REVIEW
Benchmark results show that the optimal mode for maximizing memory access speed is running one process per core (ST mode). In this case, the maximum memory bandwidth for IBM-processor-based systems is achieved on the Triad kernel, and for the Intel-based system, it is achieved on the Copy kernel. The IBM-POWER8-processor-based system demonstrates 182.5 GB/s (79.2% of the peak value), which is achieved by running the benchmark on 20 threads. Similarly, the IBM POWER9 system demonstrates 267.1 GB/s (78.6% of the peak value) on 40 threads, while the Intel Xeon Platinum 8160 system demonstrates 177.8 GB/s (69.5% of the peak value) on 40 threads. Increasing the number of threads to 48 for the Intel Xeon Platinum 8160 processor leads to a negligible drop in memory bandwidth, down to 176.6 GB/s.
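The quoted percentages follow directly from the measured and peak bandwidths given in this paper. The sketch below recomputes them; the dictionary and function names are hypothetical, and all numbers are copied from the text.

```python
# Peak and measured STREAM bandwidths in GB/s, as stated in the text.
PEAK_GBS = {"IBM POWER8": 230.4, "IBM POWER9": 340.0, "Intel Xeon Platinum 8160": 256.0}
MEASURED_GBS = {"IBM POWER8": 182.5, "IBM POWER9": 267.1, "Intel Xeon Platinum 8160": 177.8}


def efficiency_percent(measured, peak):
    """Share of the theoretical peak bandwidth actually achieved, in percent."""
    return 100.0 * measured / peak


efficiency = {
    name: round(efficiency_percent(MEASURED_GBS[name], PEAK_GBS[name]), 1)
    for name in PEAK_GBS
}
for name, value in efficiency.items():
    print(f"{name}: {value}% of peak")
```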
Analysis of the obtained data shows that the biggest bandwidth improvement for IBM POWER8 processors is achieved with a sequential increase in the number of threads up to the number of available memory channels. A further increase in the number of threads up to the number of available cores leads to a bandwidth increase of 6-7.5%. On the other hand, the increase of bandwidth for IBM POWER9 and Intel Xeon Platinum 8160 is relatively uniform when computational cores are sequentially added in ST mode. All considered systems demonstrate a decrease in memory bandwidth in SMT mode, due to a growing number of cache conflicts.

Computing Systems' Parallel Benchmarks
A study of individual CPU cores running in various SMT modes during parallel computing using OpenMP and MPI technologies was carried out with EP, LU, MG, FT, IS (problem class C), and CG (problem classes C and D) benchmarks from the NPB suite.
Calculations were performed on 16 CPU cores of each system. In these runs, ST, SMT2, SMT4, and SMT8 modes were used for the IBM POWER8 system; ST, SMT2, and SMT4 modes for the IBM POWER9 system; and ST and SMT2 modes for the Intel Xeon Platinum 8160 system. PAMI (Parallel Active Messaging Interface) [15] was used for Spectrum MPI collective communication, and shm (shared memory) [16] was used for Intel MPI. Below, the experimental results are presented in order of the level of load on the communication network [17]. Performance values were averaged over 10 benchmark runs.
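Averaging over repeated runs can be sketched as below. The run values are invented purely for illustration and are not measurements from this study; the use of the sample standard deviation (`statistics.stdev`) is also an assumption, since the paper does not state which estimator was used.

```python
import statistics


def summarize_runs(mops_per_run):
    """Return (mean, sample standard deviation) of per-run Mop/s values."""
    return statistics.mean(mops_per_run), statistics.stdev(mops_per_run)


# Hypothetical Mop/s results of 10 runs of one benchmark configuration.
example_runs = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 100.0]
mean_mops, std_mops = summarize_runs(example_runs)
print(f"{mean_mops:.1f} +/- {std_mops:.2f} Mop/s")
```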
The EP benchmark was used to evaluate the performance of floating-point calculations in the absence of noticeable inter-processor communication. This benchmark includes the generation of pseudorandom normally distributed numbers. The benchmark is CPU-bound. Figure 3 shows the performance achieved per CPU core; error bars represent the standard deviation over 10 runs. The total number of computing threads (MPI processes) is indicated in parentheses. Hereinafter, performance is given in millions of operations per second (Mop/s). The results in the figure show that, despite an almost twofold difference in operating frequency, IBM POWER8 and Intel Xeon Platinum 8160 processor cores demonstrate a similar performance in ST mode. The performance of IBM POWER9 processor cores is 9-36% lower than that of IBM POWER8 in the same mode, although the difference in operating frequency between them is 14%.
It should be noted that SMT technology significantly improves the EP benchmark results on all computing systems. SMT8 mode provides a 2.1-2.3× performance boost for the IBM POWER8 processor compared to ST mode. SMT4 mode is optimal for IBM POWER9 and doubles its performance. The Intel Xeon Platinum 8160 processor reaches the maximum performance in SMT2 mode, with a performance increase of 35%.
The EP benchmark results showed that OpenMP technology provides a performance comparable to MPI on IBM POWER9 and Intel Xeon Platinum 8160 computing systems. MPI technology increased the performance of the IBM POWER8 computing system by 8-15%, with the greatest gain being observed in SMT4 and SMT8 modes.
The LU benchmark performs LU decomposition. Its performance is limited by the CPU speed. Figure 4 shows the benchmark results. It can be noted that the per-core performance of the considered IBM CPUs is the same, despite the difference in operating frequency. However, Intel Xeon Platinum 8160 processor cores have, on average, a 17% lower performance. SMT2 proves to be the optimal SMT mode for the Intel Xeon Platinum 8160 processor, while for IBM processors utilizing MPI technology, SMT4 is optimal. With OpenMP technology, SMT2 mode is preferred for IBM POWER8 processors, while SMT4 mode is preferred for IBM POWER9.
The MG benchmark uses a multigrid algorithm to find an approximate solution of a 3D Poisson equation with periodic boundary conditions. It features structured long-distance data communication [14]. Figure 5 shows the results of the experiments. They show that SMT does not lead to a performance boost for any of the POWER CPUs. However, SMT does boost the Intel Xeon Platinum 8160's performance in the OpenMP implementation by 69%.
The MPI version of the test has a 20% greater performance than its OpenMP counterpart when running on IBM POWER8. On the other computing systems, no difference between them is observed.
The CG benchmark solves random sparse linear systems by using the conjugate gradient method. It features irregular long-distance communication in which read operations prevail over writes [17]. The benchmark is memory-bound. Communications in the MPI implementation are made via non-blocking point-to-point operations. Figure 6 shows that IBM POWER8 and Intel Xeon Platinum 8160 processor cores show an identical maximum performance level for Problem Class C. IBM POWER9 processor cores have a 37% lower performance. The identical performance of processor cores in Problem Class C can be explained by the small size of the main program arrays, which are loaded in the processor cache, significantly reducing data-transfer overheads. This is confirmed by the fact that additional benchmarks using Problem Class D (Figure 7), which has a larger array size, showed a drop in performance for all computing systems. At the same time, performance levels began to correlate with the frequency of the CPUs in the MPI version of the test.
Note that for Class C of the CG benchmark, performance increases as threads are added up to the maximum permissible number. Problem Class D shows a reduced performance speedup with SMT technology. For IBM processors, OpenMP is the optimal parallel programming technology in Problem Class C and MPI in Problem Class D of this benchmark; for the Intel processor, MPI is optimal in Class C and OpenMP in Class D.
The FT benchmark computes a discrete 3D fast Fourier transform. It features high-rate long-distance communication [14]. Benchmark performance is limited by the memory-access speed. Collective communication in the MPI benchmark version was carried out by using the following collective operations: MPI_Reduce, MPI_Barrier, MPI_Bcast, and MPI_Alltoall. The results given in Figure 8 show that OpenMP technology boosts performance by up to 48% when compared to MPI technology. However, using SMT technology does not lead to any significant performance improvements for IBM processors. Using the SMT2 mode improves performance by 14% for the Intel processor. The performance of the analyzed computing systems in this benchmark correlates well with the operating frequency of their CPUs.
The IS benchmark performs parallel sorting of a massive array of integers (see Figure 9). It is used to evaluate the performance of integer calculations in the presence of intensive interthread interaction [14]. The benchmark is memory-bound. Collective communication in the MPI benchmark version is done by means of MPI_Alltoall and MPI_Allreduce operations. Experimental results show that MPI technology provides lower performance than OpenMP, and the optimal operating mode for the IBM POWER8 and Intel Xeon Platinum 8160 processors is SMT2. The IBM POWER9 processor demonstrates the highest performance at a single thread per core. The maximum performance of processor cores correlates well with their operating frequency.
To evaluate the maximum performance of the studied computing systems in parallel calculations, benchmarks similar to those presented above were run on the IBM POWER9 and Intel Xeon Platinum 8160 systems, using 32 processor cores. This is because the number of computational threads started by most NPB benchmarks should be a power of two. After that, the maximum achieved performance values were taken from the obtained results, which are shown in Figure 10, together with the maximum values for the IBM POWER8 system as shown in Figures 3-9, multiplied by the number of threads.
Thus, 80% of the cores of computing systems based on IBM processors and 67% of the cores of the Intel computing system CPUs were involved to obtain these results.
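The core counts and percentages above are consistent with taking the largest power of two that fits within each system's total core count. The sketch below shows that reading; the helper names are hypothetical, and the total core counts (two sockets of 10, 20, and 24 cores) come from the system descriptions in this paper.

```python
def largest_power_of_two_at_most(cores):
    """Largest power of two that does not exceed `cores` (cores >= 1)."""
    return 1 << (cores.bit_length() - 1)


def core_share_percent(used, total):
    """Share of the available cores that were used, rounded to whole percent."""
    return round(100 * used / total)


TOTAL_CORES = {"IBM POWER8": 20, "IBM POWER9": 40, "Intel Xeon Platinum 8160": 48}

used_cores = {name: largest_power_of_two_at_most(n) for name, n in TOTAL_CORES.items()}
share = {name: core_share_percent(used_cores[name], n) for name, n in TOTAL_CORES.items()}
print(used_cores)  # {'IBM POWER8': 16, 'IBM POWER9': 32, 'Intel Xeon Platinum 8160': 32}
print(share)       # {'IBM POWER8': 80, 'IBM POWER9': 80, 'Intel Xeon Platinum 8160': 67}
```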
The results given in Figure 10 show that the IBM POWER9 processor has the highest performance in most benchmarks. The Intel Xeon Platinum 8160 processor exceeds IBM POWER9 in performance only in the CG benchmark. At the same time, the POWER8 processor exceeds Intel Xeon Platinum 8160 in performance in the MG and IS benchmarks.

Discussion
An assessment of the results obtained during experimental calculations shows that the IBM POWER8 and Xeon Platinum 8160 processors have almost the same maximum memory bandwidth, but require different numbers of threads for its efficient utilization. The IBM POWER9 system has the highest maximum bandwidth, which can be attributed to the large number of memory channels per socket.
Despite the lower memory frequency, the IBM POWER8 system shows the highest real bandwidth among all of the studied CPUs for the number of threads below 10. This is due to Centaur chips, which improve memory-access efficiency. Thus, computing systems with a similar architecture allow for improvements in the performance of applications with a low degree of parallelism and high memory-bandwidth requirements.
SMT technology improves the utilization of CPU cores when executing non-optimized applications, whose performance is limited by the speed of computing operations (e.g., EP benchmark). If the application performance is limited by memory bandwidth, then performance degradation may occur due to increased cache conflicts when using SMT technology. The best per-core performance among the tested processors is shown by the IBM POWER8 processor due to its higher operating frequency and support for up to eight hardware threads per core. In most benchmarks, except for CG, the IBM POWER9 processor has a better overall performance compared to Intel Xeon Platinum 8160.
The Intel Xeon Platinum 8160 processor had the worst performance when running non-optimized applications. Although its peak performance is almost three times that of the IBM POWER9 processor, which is attributable to 512-bit vector instructions (AVX-512), the actual performance of the Intel Xeon Platinum 8160 in the NPB was similar (IBM POWER8 and POWER9 processors have 128-bit vector instructions, VSX-2 and VSX-3, respectively). This is because software must be specifically optimized to vectorize its calculations before such vector execution units can be fully utilized.
The conducted experiments allowed us to partially evaluate the effectiveness of the cache subsystem organization of the studied CPUs. Thus, the larger amount of individual L2 cache per core of the Intel Xeon Platinum 8160 CPUs allows for more efficient execution of applications that are sensitive to data locality, such as the LU benchmark [18]. Furthermore, its high associativity reduces cache misses and increases the performance of applications with irregular memory access (for example, the CG benchmark). On the other hand, IBM POWER CPUs with large L3 caches achieve a better performance in memory-bound applications that are sensitive to cache size, such as the FT and IS benchmarks [18]. It is worth noting that the benchmarking technique used in this study provides only general conclusions about the relationship between the performance of various application classes and the cache and memory subsystem architecture. Dedicated studies are required to analyze this relationship.
Because MPI technology usually does not use shared memory and all interaction between computational processes is carried out by message passing, MPI-based parallel applications show better data locality than their OpenMP counterparts [18]. This results in greater MPI performance on CPUs with a large cache; it also boosts the performance of SMT (EP, MG, and LU Class C and CG Class D benchmarks on IBM POWER systems) due to fewer cache conflicts at a greater number of processes. On the other hand, applications developed using OpenMP and MPI technologies demonstrate similar performance in most cases when running on systems with a relatively small cache (Intel Xeon Platinum 8160). An OpenMP performance improvement can be observed only when intensive interaction between threads is required (FT and IS benchmarks).