Time–Energy Correlation for Multithreaded Matrix Factorizations

Abstract: The relationship between time and energy is an important aspect of energy savings in modern multicore architectures. In this paper, we investigated and analyzed the correlation between time and energy. We compared the execution time and energy consumption of the LU factorization algorithms (versions with and without pivoting) and the Cholesky factorization from the Math Kernel Library (MKL) on a multicore machine. To reduce the energy consumption of these multithreaded factorizations, the Dynamic Voltage and Frequency Scaling (DVFS) technique was used. This technique allows the clock frequency to be scaled without changing the implementation. In particular, we studied the correlations between time and energy using two metrics: the Energy Delay Product (EDP) and Greenup, Powerup, and Speedup (GPS-UP). An experimental evaluation was performed on an Intel Xeon Gold multicore machine as a function of the number of threads and the clock speed. Our test results showed that scalability in terms of execution time, expressed by the Speedup metric, had values close to a linear function of the number of threads. In contrast, scalability in terms of energy consumption, expressed by the Greenup metric, had values close to a logarithmic function of the number of threads. The EDP and GPS-UP metrics allowed us to evaluate the impact of the optimized code (DVFS and an increased number of threads) on time and energy consumption and to determine the green category representing energy savings without loss of performance.


Introduction
Time-energy correlation is one of the main requirements to be considered when implementing parallel software on multicore machines, especially for numerical algorithms involving many matrix computations. To save the time of matrix factorization on multicore machines, multithreaded code is required. To reduce the energy of these multithreaded factorizations, the Dynamic Voltage and Frequency Scaling (DVFS) technique is used. This technique allows the clock frequency to be scaled without changing the implementation. The scalability feature allows an increasing number of threads to be used on a multicore machine in the hope that both time and energy efficiency will increase rather than decrease. The classical approach to scalability in parallel processing focuses on performance in terms of execution time. In this work, we wanted to study scalability in terms of two criteria: the runtime of the numerical algorithm and its energy consumption. The importance of considering several criteria in relation to scalability in parallel processing is shown in [1]. A distinction is made between two basic concepts related to scalability: scalability in the strong sense and scalability in the weak sense. In this paper, we will only consider scalability in the strong sense.
A deep understanding of the scalability of execution time and energy consumption, along with the interrelationships between them, can enable targeted improvements that reduce both runtime and energy consumption across application domains.
It is particularly important to analyze applications that intentionally use cache mechanisms. Such applications typically come from the numerical domain of linear algebra and involve matrix computations. Linear algebra is a key component of many numerical algorithms that address a range of scientific and engineering challenges.
Over time, the Basic Linear Algebra Subroutines (BLAS) [2] became the de facto standard interface for linear algebraic operations. Among the widespread BLAS libraries, the Math Kernel Library (MKL) [3] occupies a prominent place. In the MKL, you can find implementations of matrix factorization techniques, including the LU factorization and the Cholesky factorization. All of these factorization implementations, rooted in the Level 3 BLAS, are cache-friendly: they minimize memory accesses and make better use of spatial data locality. The conventional MKL approach for parallel matrix factorizations in cache-dependent systems uses fixed-size blocks that fit in the cache, thus distributing the computational load evenly across the threads. We expect that the obtained results can be generalized to other algorithms from this group with a cache-friendly implementation, e.g., the QR factorization and others.
Currently, the MKL is focused on optimizing runtime performance; energy savings are not yet taken into account. While it is widely recognized that reducing the computation time usually results in energy savings, this is not the only route to energy efficiency. Therefore, to increase the energy efficiency of the algorithms contained in the MKL without changing their existing implementations on multicore architectures, this study used the technique of Dynamic Voltage and Frequency Scaling (DVFS) [4]. DVFS reduces the clock frequency and voltage level of various computing node components (CPU, DRAM, etc.) at the cost of some performance degradation. By adjusting the voltage and frequency of the processor depending on the current load, DVFS allows power consumption to be managed effectively, contributing to significant energy savings. However, it should be remembered that reducing the frequency and voltage of the processor to save energy may adversely affect applications that require high computing power, reducing their performance. In addition, the use of DVFS in one system component may affect the overall performance of the system, as differences in the performance adjustment of individual components can lead to inefficient synchronization of activities.
In this article, we ran tests on a multicore machine with two Intel Xeon Gold 5218R server processors with a base clock of 2.10 GHz, which are part of the Xeon Scalable series and form a Non-Uniform Memory Access (NUMA) architecture. The NUMA architecture is a commonly used concept in multicore and multiprocessor computing systems. In NUMA systems, memory is divided into multiple areas, each associated with a specific set of processor cores. This means that access to local memory is faster than access to remote memory, which introduces non-uniform delays (hence the name "Non-Uniform Memory Access"). NUMA is one of the key aspects that affects the performance and optimization of multiprocessor systems.

Related Work
This section reviews the literature on numerical linear algebra computations (in particular, matrix factorizations) on multicore machines in the context of time and energy. The literature study focuses on two aspects. The first is how energy efficiency is achieved in numerical linear algebra software. There are three ways: by making changes to the implementation, by autotuning the software parameters for a given hardware platform, and by adjusting the operating system settings (e.g., DVFS). The second aspect is the different ways of assessing execution time and consumed energy.
In the article [5], different techniques related to algorithm transformation were used to reduce static and dynamic energy consumption on multicore machines with shared memory. In particular, one of the most popular matrix factorizations, the LU factorization, was considered. The most energetically expensive instructions in the LU factorization code were identified, and a code transformation was then applied to reduce their number. This work investigated the effect of the number of threads on dynamic and total energy consumption and performance, as well as the number of MFlops/W for different matrix sizes at a fixed number of threads.
There are also papers that have analyzed the energy impact of numerical linear algebra algorithms without changing the code, controlling only the software parameters (e.g., block size). The Automatically Tuned Linear Algebra Software (ATLAS) [6] is an example of automatically tuning a library to the architecture of the machine to reduce the execution time. The paper [7] proposed a method for tuning the ATLAS library in which the execution-time tuning process is replaced by tuning for energy consumption. That work investigated the performance and the number of MFlops/J, as well as the execution time and energy consumption, in single and double precision for different matrix sizes.
At the operating system level, the approach to optimizing energy consumption focuses on minimizing the energy consumption of the entire node through techniques such as DVFS. The DVFS technique has a very high potential to save energy, as shown in [8][9][10]. The article [8] focuses on the DVFS technique as available on multiple multicore platforms. The impact of the number of threads executing a multithreaded application and of different clock frequency values was considered. Only the Energy Delay Product (EDP) model, relating power to the square of the execution time, was used to assess energy efficiency.
In our work, we extend the study to include the scalability, in the energy sense, of multithreaded applications, measured by the Greenup and Powerup metrics described in the next section, for numerical linear algebra problems with fixed problem dimensions. This allows us to investigate scalability, as well as the correlations between runtime and energy consumption.

Metrics
Energy consumption is the product of runtime and consumed power. Energy can be saved in various ways: by shortening the runtime, by reducing the power consumption, or by trading one against the other, i.e., extending the runtime while reducing the power consumption by a larger factor or, vice versa, increasing the power consumption slightly while reducing the runtime. Therefore, assessing energy efficiency from the runtime or the energy consumption alone is insufficient and sometimes causes confusion.
In the current literature, we can find metrics that allow for a deeper analysis of energy efficiency.
Gonzalez and Horowitz introduced the Energy Delay Product (EDP) metric [11], which is dedicated to the simultaneous analysis of runtime and energy consumption. The EDP is defined as the product of energy and runtime: EDP = E * T.
In [12], other metrics were introduced, namely Speedup, Greenup, and Powerup (GPS-UP), together with a code evaluation that assigns the optimized code to one of eight categories depicted in the GPS-UP Software Energy Efficiency Quadrant Graph (Figure 1), where the green categories mean energy savings and the red categories mean energy loss compared to the non-optimized code. The Speedup metric is well known in the literature and used to analyze performance in parallel programming between different code implementations. Assume that we have two implementations of an algorithm: a non-optimized (base) one with runtime T_B and consumed energy E_B and an optimized one with runtime T_O and consumed energy E_O. The metrics are then defined as Speedup = T_B / T_O, Greenup = E_B / E_O, and Powerup = Speedup / Greenup, which is the ratio of the average powers of the optimized and base versions. It is well known that the lower the value of the EDP metric, the better the efficiency, but this metric sometimes fails (especially when optimizing with DVFS), favoring optimizations that run faster despite using more energy. The GPS-UP metrics provide a deeper analysis of the relationship between performance and energy efficiency. In the above formulas, in the case of optimization consisting of increasing the number of threads, T_B means the runtime of the multithreaded algorithm running on 1 thread and T_O the runtime of the multithreaded algorithm running on p threads (p > 1). In the case of optimization using the DVFS technique, T_B means the runtime of the algorithm running at the highest possible frequency and T_O the runtime of the algorithm running at a lower frequency. The notations E_B and E_O are interpreted analogously for the consumed energy.
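As a minimal sketch, the metrics above can be computed from measured (runtime, energy) pairs. The record and function names below are hypothetical; the paper defines no API, only the formulas:

```cpp
#include <cassert>

// Hypothetical measurement record: runtime T in seconds, energy E in Joules.
struct Run {
    double time_s;
    double energy_j;
};

// Speedup = T_B / T_O
double speedup(const Run& base, const Run& opt) {
    return base.time_s / opt.time_s;
}

// Greenup = E_B / E_O
double greenup(const Run& base, const Run& opt) {
    return base.energy_j / opt.energy_j;
}

// Powerup = Speedup / Greenup (the ratio of average powers P_O / P_B).
double powerup(const Run& base, const Run& opt) {
    return speedup(base, opt) / greenup(base, opt);
}

// EDP = E * T; normalized EDP = EDP_O / EDP_B (lower is better).
double edp(const Run& m) { return m.energy_j * m.time_s; }
double normalized_edp(const Run& base, const Run& opt) {
    return edp(opt) / edp(base);
}
```

With these definitions, an optimization saves energy (a green category) exactly when Greenup > 1, regardless of whether it also runs faster.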

Algorithms
We will briefly introduce the LU without pivoting, LU with pivoting, and Cholesky algorithms used to solve systems of linear equations. The LU factorization transforms a square nonsingular matrix A into a product of two matrices, A = LU, where L and U are lower and upper triangular matrices, respectively.
Implementing LU in the above form is very rare in practice due to the risk of numerical problems [13]. They can be removed by simply rearranging the rows of matrix A. The LU factorization without pivoting (without rearranging the rows) exists, in particular, if the matrix A is strictly diagonally dominant. The LU factorization with pivoting has the form PA = LU, where P is a permutation matrix. The Cholesky factorization is defined only for A Hermitian and positive-definite and has the form A = LL^H, where L is a lower triangular matrix. In this article, we investigate the LAPACK [14] implementations of these factorizations from the MKL, namely the dgetrfnpi (LU without pivoting) [15], dgetrf (LU with pivoting), and dpotrf (Cholesky) routines. These implementations are based on the BLAS, and their parallelism arises from the use of a multithreaded BLAS. The total number of floating-point operations (add, multiply, divide) is the same for the LU factorizations without and with pivoting and equals approximately (2/3)n^3. The number of floating-point comparisons equals 0 for the LU factorization without pivoting and approximately (1/2)n^2 for the LU factorization with pivoting. The number of floating-point operations in the Cholesky factorization is approximately (1/3)n^3. Here, n is the size of the factored matrix A.
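To make the structure of these factorizations concrete, the following is a minimal, unblocked Cholesky sketch for a real symmetric positive-definite matrix. It is illustrative only: MKL's dpotrf computes the same A = LL^T decomposition with a blocked, cache-friendly, multithreaded algorithm:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Unblocked Cholesky factorization A = L * L^T of a real symmetric
// positive-definite n-by-n matrix stored row-major in a flat vector.
// Returns L (lower triangular, upper part zeroed).
std::vector<double> cholesky(const std::vector<double>& a, int n) {
    std::vector<double> l(static_cast<size_t>(n) * n, 0.0);
    for (int j = 0; j < n; ++j) {
        // Diagonal entry: L(j,j) = sqrt(A(j,j) - sum_k L(j,k)^2).
        double d = a[j * n + j];
        for (int k = 0; k < j; ++k) d -= l[j * n + k] * l[j * n + k];
        l[j * n + j] = std::sqrt(d);
        // Column below the diagonal.
        for (int i = j + 1; i < n; ++i) {
            double s = a[i * n + j];
            for (int k = 0; k < j; ++k) s -= l[i * n + k] * l[j * n + k];
            l[i * n + j] = s / l[j * n + j];
        }
    }
    return l;
}
```

Counting the multiply-add pairs in the inner loops recovers the approximately (1/3)n^3 floating-point operations quoted above.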

Methodology
We tested three versions of matrix factorization: LU without pivoting, LU with pivoting, and Cholesky.We tested all algorithms without parallelization (1 thread) and in parallelized versions for 10, 20, 30, and 40 threads.
Our test dataset is a random square matrix of double-precision values with dimension n × n, n = 32,768. Therefore, our test dataset has 1,073,741,824 cells, which is 8 GB of data. All versions of the algorithm use a row-major layout and were implemented in C++ with vectorization and parallelism.
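The stated cell count corresponds to n = 32,768; the footprint arithmetic (8 bytes per double-precision cell) can be checked directly:

```cpp
#include <cassert>
#include <cstdint>

// A 32,768 x 32,768 matrix of doubles: cell count and size in bytes.
constexpr std::uint64_t kN = 32768;
constexpr std::uint64_t kCells = kN * kN;                  // matrix entries
constexpr std::uint64_t kBytes = kCells * sizeof(double);  // 8 bytes per entry
```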
Our experimental setup was a computing platform equipped with modern multicore processors: Intel(R) Xeon(R) Gold 5218R CPUs with 40 cores in total and a 2.10 GHz base clock. We used Linux kernel 4.18.0 and the AlmaLinux 8.4 operating system with Intel ICC v. 2021.5.0.
The Linux kernel supports CPU performance scaling through the CPU Frequency scaling (CPUFreq) subsystem, which consists of three layers of code: the core, scaling drivers, and scaling governors. The CPUFreq core provides a common code infrastructure and user space interfaces for all platforms that support CPU performance scaling. It defines the basic framework within which the other components operate. The scaling driver talks to the hardware. It provides the governors with information about the available P-states (or ranges of P-states in some cases) and accesses platform-specific hardware interfaces to change the P-states of the processor as requested by the governors. A governor implements an algorithm to estimate the required CPU capacity. As a rule, each governor implements one, possibly parameterized, scaling algorithm.
The default scaling driver and governor are selected automatically, but user space tools such as cpupower, acpid, laptop mode tools, or desktop GUI tools can still be used for advanced configuration.
We changed the clock frequencies using CPUFreq with the acpi_cpufreq driver. By default, the acpi_cpufreq driver uses the conservative governor, which increases or decreases the clock frequency depending on the load on the core by selecting one of several available frequencies between the minimum and the maximum supported by the processor.
We used the cpupower program to change the minimum and maximum values of the processor frequency limit. We used the following commands: cpupower frequency-set -d 1700000 and cpupower frequency-set -u 1700000 to set the minimum and maximum frequency limits to 1.7 GHz. These commands automatically change the governor to userspace, which allows one to set a specific frequency. The frequency setting was applied to all cores.
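The procedure above can be sketched as a shell session. The two cpupower calls are the commands from the text (they require root); the sysfs reads are an assumed way to verify the resulting limits and governor:

```shell
# Pin the frequency limits of all cores to 1.7 GHz (run as root).
cpupower frequency-set -d 1700000   # minimum limit: 1.7 GHz
cpupower frequency-set -u 1700000   # maximum limit: 1.7 GHz

# Verify the limits and the (now userspace) governor for a core via sysfs.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```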
We performed tests for the frequencies (P-states) available on our platform from 0.8 GHz to 2.1 GHz with a step of 0.1 GHz.We first tested the following frequencies: 0.8 GHz, 1.1 GHz, 1.4 GHz, 1.7 GHz, and 2.1 GHz.
To analyze the impact of the algorithm optimizations on energy consumption, we used measurements from the Running Average Power Limit (RAPL) interface, which is available for Intel processors. RAPL uses machine-specific registers to monitor and control energy consumption in real time. In multi-socket systems, RAPL provides results for each socket (each package) separately, also providing separate measurement values for the memory modules (DRAM) associated with each socket. Starting with the Haswell processors, which have fully integrated voltage regulators, the accuracy of the measurements returned by RAPL is improved and satisfactory [16]. During the tests, we took measurements every 1 s, and for the analysis, we took into account the total energy consumption of all sockets and their associated memory modules.
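A sketch of the energy accounting when periodically sampling RAPL energy counters (e.g., the energy_uj files under /sys/class/powercap on Linux). The counter wraps around at max_energy_range_uj, so a later sample may be numerically smaller than an earlier one; consumed_joules is a hypothetical helper name, and reading the sysfs files is omitted:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Energy consumed between two RAPL samples, in Joules. The counter is
// reported in microjoules and wraps at max_range_uj, so a smaller "after"
// value means exactly one wraparound occurred between the samples.
double consumed_joules(std::uint64_t before_uj, std::uint64_t after_uj,
                       std::uint64_t max_range_uj) {
    std::uint64_t delta_uj = (after_uj >= before_uj)
                                 ? after_uj - before_uj
                                 : max_range_uj - before_uj + after_uj;
    return static_cast<double>(delta_uj) * 1e-6;  // microjoules -> Joules
}
```

With 1 s sampling, as used in the tests, summing these deltas over all package and DRAM domains yields the total consumed energy.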

Time and Energy Consumption
Figure 2 shows the runtime and energy consumption of the individual algorithms for 1 thread, and Figure 3 shows them for the parallel versions (i.e., 10, 20, 30, and 40 threads). For one thread, the Cholesky factorization works best, using both less time and less energy. The LU factorizations without and with pivoting are similar in terms of time and energy consumption; LU without pivoting is slightly better. Because the Cholesky factorization consumes less time and energy than the LU factorizations, Figure 3 uses different scales for the x-axis. In Figure 3, we see that scaling the clock frequency down increases the runtime in all cases, while increasing the number of threads reduces it. Therefore, for our architecture, 40 threads and a frequency of 2.1 GHz were optimal in terms of time. However, if we look at the energy consumption, in some cases, lowering the clock frequency to 1.7 GHz reduced the energy consumption. We saw this for the non-parallelized versions of the algorithms (about 19%), as well as for the parallelized versions in all cases for 10 threads (3% for LU without pivoting, 2% for LU with pivoting, and 9% for Cholesky). Furthermore, for 20 threads, we noted a decrease in energy consumption at 1.7 GHz in the case of the Cholesky factorization (5%) and, even for 40 threads, in the case of the LU factorization without pivoting (1.6%). Reducing the clock frequency to 1.4 GHz and below had no benefit. Energy consumption also decreased with an increase in the number of threads used for the calculations. The only exception was the Cholesky algorithm at 2.1 GHz, which turned out to consume 1.2% less energy at 30 threads than at 40 threads. We had a similar situation at 2.0 and 1.9 GHz (Figure 4).
Due to the energy drops when changing the clock frequency to 1.7 GHz, we performed further tests to investigate what happens at the other frequencies between 1.4 GHz and 2.1 GHz. The results can be seen in Figure 4. These tests confirmed the occurrence of a local minimum of energy consumption at a frequency of 1.7 GHz for the LU without pivoting algorithm running on 40 threads: for this algorithm on 40 threads, reducing the clock frequency to 1.7 GHz gave the lowest energy consumption. Table 1 presents a selection of algorithm versions and clock frequencies for which we obtained the lowest energy consumption for all three factorizations. In the table, the first column describes the algorithm and the selected settings, the second column shows the energy consumption in Joules, and the third shows the running time of the algorithm. The fourth and fifth columns show, respectively, the loss in runtime and the gain in energy consumption in percent, compared to the algorithm and clock setting that gave the shortest time; for all three factorizations, this was 40 threads and a frequency of 2.1 GHz. We see that, for the LU factorization with pivoting, the intuitive choice of more threads and a higher clock frequency actually gave the best result in terms of both time and energy. However, the other two tested factorizations showed exceptions to this rule. If we care more about saving energy than time, we can consider other choices of the number of threads and clock settings. Our tests showed that we saved the most energy when we used 40 threads for LU without pivoting with the clock frequency reduced to 1.7 GHz, while for Cholesky, it was 30 threads with the clock frequency reduced to 2.0 GHz.

Speedup and Greenup
The graphs in Figure 5 show the Speedup (left column) and Greenup (right column) values of the algorithms run at different clock frequencies, based on our measurements. Each graph is for one algorithm (LU without pivoting, LU with pivoting, and Cholesky). The x-axis shows the number of running threads; the y-axis shows the value of the Speedup or Greenup metric. For better reference, we marked the maximum expected Speedup (which is linear in the number of threads p) and Greenup (which is logarithmic in the number of threads p) with black dashed lines in the charts. We saw, for all three algorithms, that as the number of threads increased, regardless of the frequency, we gained more in runtime than in energy consumption (time decreased faster than energy consumption). For the LU without and with pivoting algorithms, the Speedup for all frequencies was almost linear and, with the increase in the number of threads, deviated further from the maximum expected value. In terms of Speedup, the frequency of 0.8 GHz was the most favorable and 2.1 GHz the least. In general, the Greenup plots deviated much further from the maximum expected value than the Speedup plots in all cases, which confirmed our expectations. For the tested algorithms, as in the case of Speedup, the highest Greenup values were achieved at 0.8 GHz, whereas the smallest, differently than for Speedup, were achieved at 2.0 GHz. For the 2.1 GHz frequency, which had the lowest Speedup values, we obtained relatively high Greenup values (purple line in the graph).
On our architecture, for the tested algorithms, we observed a relationship between the Speedup and Greenup scaling coefficients of the form α > β, where, in our case, β belongs to the interval (2.84, 2.85). The research question that arises is: what is the value of α for other architectures?

Energy and Runtime Correlation
Since Speedup and Greenup viewed separately do not give complete information about whether an optimization is cost-effective in terms of runtime and energy consumption, we now look at metrics that give a more complete picture of the situation. We used the EDP metric [11] and the GPS-UP metrics [12] (Greenup, Powerup, Speedup). Let us first look at Powerup itself, which expresses the relationship between Speedup and Greenup. From the graphs in Figure 6, we can see that the more threads we add, the more we gain in time relative to energy (Powerup grows). We also see that Powerup was bounded from above by the curve γ · log2(p), where p is the number of threads and γ = 0.75 on our architecture for the tested algorithms.
To analyze which optimizations were cost-effective for energy consumption, let us look at the GPS-UP and EDP metrics. We first consider the optimization of increasing the number of threads. Table 2 summarizes the metrics for the LU without pivoting algorithm operating on a single thread at 2.1 GHz and its optimizations consisting of increasing the number of threads. With GPS-UP, we obtain Category 3, which is green, in every case, i.e., when increasing the number of threads to 10, 20, 30, and 40. Thus, each of these optimizations results in energy savings. In this table, the last two columns present the EDP and the normalized EDP, defined as EDP_N = EDP_O / EDP_B, where EDP_O is the EDP of the optimized version and EDP_B is the EDP of the base version.
The EDP is the product of the time and the energy used, so the smaller the EDP, the better. We present the values of the EDP metric here because it is a well-known and often-considered metric (among others, in [17][18][19]), although it can sometimes be misleading, as we can see in the results in Table 3. Thus, each of the optimizations considered in Table 2 saves energy, and the EDP indicated that the most energy savings would be achieved by the optimization with 40 threads. In the case of the LU algorithm with pivoting, we obtained analogous results.
In the case of the analogous optimizations of the Cholesky algorithm, we also obtained Category 3 based on GPS-UP (Table 3). Similarly, looking at the EDP metric, we obtained an indication of 40 threads as the best optimization. In addition, we see that the EDP values for the 40- and 30-thread optimizations differed only slightly. However, if we look at the Powerup column, we see a lower value for the 30-thread optimization. Looking at the runtime and energy consumption data, we can see that the algorithm on 30 threads works better than on 40 threads in terms of energy, but consumes more time, which is indicated by the lower Powerup. Thus, for the Cholesky algorithm at 2.1 GHz, the optimization with 30 threads is the best in terms of energy savings.
We now look at the optimizations using DVFS. We start with the parallel version of the LU without pivoting algorithm on 40 threads running at 2.1 GHz. Table 4 summarizes the results and metrics for the optimization of changing to a lower clock frequency. Looking at the EDP metric, we could conclude that these optimizations do not pay off due to too large a time loss. However, we obtained a green category from the GPS-UP metrics in two cases, for the 2.0 and 1.7 GHz frequencies. In these cases, we lost time but saved energy, while in the others, we lost both time and energy (Category 6, red). Looking at the Powerup for the green optimizations, we can choose the more energy-saving optimization; in this case, it was the frequency of 1.7 GHz. Our experiments showed that, in the case of the LU algorithm with pivoting on 40 threads running at 2.1 GHz, the optimization consisting of reducing the frequency had no benefit, and we obtained the red Category 6. The same was true for 30 and 20 threads. On the other hand, clock frequency scaling benefited this algorithm operating on 1 thread and on 10 threads, as shown in Tables 5 and 6, where the green Category 4 appears. In the case of the Cholesky algorithm and optimization by reducing the clock frequency (Table 7), we can see that, for 40 threads, reducing the clock frequency had positive effects (green category) when the frequency was set to 2.0 GHz or 1.8 GHz. Moreover, the lower Powerup value indicates that the clock frequency setting of 1.8 GHz is more favorable in terms of lower energy consumption.

Conclusions
This paper considered the concept of scalability in terms of execution time and energy consumption for three matrix factorizations (LU without pivoting, LU with pivoting, and Cholesky) from the MKL. To reduce the energy consumption of these factorizations, the DVFS technique was used, which made it possible to influence the clock frequency settings from the operating system level without interfering with the implementation code.
The study investigated the effect of two parameters, the clock frequency and the number of threads, on the execution time and energy consumption on a multicore machine. The execution time was always the shortest for the highest clock frequency and the highest number of threads for all three factorizations. We cannot say the same about the energy savings, as they depended on both the number of threads and the clock frequency.
The Speedup and Greenup values increased with an increasing number of threads and usually with a decreasing clock frequency.The experimental results showed that the Speedup value was higher than the Greenup value.The Speedup was even up to 74% higher than the Greenup for certain settings of the clock value and the number of threads.
By examining the EDP and GPS-UP metrics, a correlation between runtime and energy consumption was observed for all three factorizations. Some parameter settings (number of threads and clock frequency) allowed energy savings despite increased runtime (Category 4), while other parameter settings showed less energy consumption together with reduced runtime (Category 3). Still others showed more energy consumption with increased runtime and power savings, but not enough to save energy (Category 6). These results support the claim that minimizing energy consumption does not always derive only from minimizing the runtime of the applications.
Future work includes an investigation of the weak scalability in terms of execution time and in terms of energy consumption on different multicore machines and different applications.During the investigation of weak scalability, the problem size per processor will remain constant as more computing units are added.In addition, an important issue to be addressed would be to find a correlation between strong and weak scalability in terms of energy consumption.

Figure 2 .
Figure 2. Execution time of LU without pivoting, LU with pivoting, and Cholesky for 1 thread (left); energy consumption of LU without pivoting, LU with pivoting, and Cholesky for 1 thread (right).

Figure 3 .
Figure 3. Execution time of LU without pivoting, LU with pivoting, and Cholesky (left); energy consumption of LU without pivoting, LU with pivoting, and Cholesky (right).

Figure 5 .
Figure 5. Speedup for LU without pivoting, LU with pivoting, and Cholesky (left); Greenup of LU without pivoting, LU with pivoting, and Cholesky (right) (p denotes the number of threads).

Figure 6 .
Figure 6. Powerup for LU without pivoting, LU with pivoting, and Cholesky for a variable number of threads (p denotes the number of threads) at a fixed clock frequency.

Table 1 .
The most-energy-saving versions of the algorithms.

Table 2 .
Base-version algorithm-LU without pivoting without parallelization running at a frequency of 2.1 GHz compared to a parallel version for 10, 20, 30, and 40 threads.

Table 3 .
Base-version algorithm-Cholesky without parallelization running at a frequency of 2.1 GHz compared to a parallel version for 10, 20, 30, and 40 threads.

Table 4 .
Base-version algorithm LU without pivoting parallel version running at a frequency of 2.1 GHz on 40 threads compared to an algorithm running at lower CPU frequencies.

Table 5 .
Base-version algorithm LU with pivoting without parallelization version running at a frequency of 2.1 GHz compared to an algorithm running at lower CPU frequencies.

Table 6 .
Base-version algorithm LU with pivoting parallel version running at a frequency of 2.1 GHz on 10 threads compared to an algorithm running at lower CPU frequencies.(For 20, 30, and 40, we gained nothing by reducing the frequency, so we obtained a red category).

Table 7 .
Base-version algorithm Cholesky parallel version running at a frequency of 2.1 GHz on 40 threads compared to an algorithm running at lower CPU frequencies.