Article

Energy-Efficient Thread Mapping for Heterogeneous Many-Core Systems via Dynamically Adjusting the Thread Count

Tao Ju, Yan Zhang, Xuejun Zhang, Xiaogang Du and Xiaoshe Dong
1 School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
2 School of Media Engineering, Lanzhou University of Arts and Science, Lanzhou 730000, China
3 School of Electronics and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Energies 2019, 12(7), 1346; https://doi.org/10.3390/en12071346
Submission received: 20 February 2019 / Revised: 23 March 2019 / Accepted: 4 April 2019 / Published: 8 April 2019
(This article belongs to the Section D: Energy Storage and Application)

Abstract
Improving computing performance and reducing energy consumption are major concerns in heterogeneous many-core systems. For a multithreaded application running on such a system, the thread count directly influences both computing performance and energy consumption. In this work, we studied the interrelation between the thread count and application performance to improve total energy efficiency. A prediction model of the optimum thread count, hereafter the thread count prediction model (TCPM), was designed using regression analysis based on the running behavior of the program and the architectural features of the heterogeneous many-core system. Subsequently, a dynamic predictive thread mapping (DPTM) framework was proposed. DPTM uses the prediction model to estimate the optimum thread count and dynamically adjusts the number of active hardware threads according to the phase changes of the running program in order to achieve optimal energy efficiency. Experimental results show that DPTM obtains a nearly 49% improvement in performance and a 59% reduction in energy consumption on average, while introducing only about 2% additional overhead compared with traditional thread mapping, for PARSEC (Princeton Application Repository for Shared-Memory Computers) benchmark programs running on an Intel MIC (Many Integrated Core) heterogeneous many-core system.

1. Introduction

With the recent shift towards energy-efficient computing, heterogeneous many-core systems have emerged as a promising solution in the domain of high-performance computing [1]. In an emerging heterogeneous many-core system composed of a host processor and a co-processor, the host processor handles complex logical control tasks (i.e., task scheduling, task synchronization, and data allocation), whereas the co-processor computes large-scale parallel tasks with high computing density and simple logical branches. The two processors cooperate to compute different portions of a program to improve its energy efficiency [2]. Determining an appropriate thread count for a program that runs on both the host processor and the co-processor therefore directly affects computing performance and energy consumption.
The host processor in an emerging heterogeneous many-core system generally adopts a chip multi-processor that contains a limited number of processor cores. Setting the thread count equal to the number of available host-processor cores usually obtains the desired performance. The co-processor generally adopts an emerging many-core processor (such as a GPU or the Intel MIC), which contains many processing cores (generally tens or even hundreds) and employs simultaneous multithreading (SMT). Using too few threads will not fully exploit the computing power of the co-processor, whereas too many threads will increase the energy consumption and aggravate the contention for shared resources among multiple threads.
Figure 1 shows the variation in performance with the thread count for eight applications from the PARSEC benchmark [3] on the Intel Xeon Phi MIC heterogeneous many-core system. The test results can be divided into four types of scenarios. Case I: The program performance improves slowly with increasing thread count; when the thread count exceeds 24, the performance shows almost no further improvement, as in blackscholes and raytrace. Case II: The performance speedup increases along with increasing thread count, and the program scales well, as in freqmine and ferret. Case III: Once the thread count reaches a certain value, the program performance decreases with increasing thread count, as in bodytrack and streamcluster. Case IV: The performance speedup exhibits an irregular change with increasing thread count, as in canneal and swaptions. These observations clearly indicate the importance of an appropriate number of cores and threads for computing performance and energy efficiency in many-core systems [4].
Many previous studies have been conducted to determine an appropriate thread count for multi-threaded applications running on multi-core or many-core systems. These approaches include static settings based on prior experience or rules of thumb [5], iterative searching [6], and dynamic prediction [7,8,9]. In general, the static setting is simple and introduces no additional overhead, but it cannot correctly reflect the running behavior of applications when the input sets or running platforms change. Iterative searching finds an appropriate thread count by repeatedly testing and comparing the performance of different thread counts; this approach has high overhead and cannot reflect the dynamic behavior of the application. Dynamic prediction estimates the optimum thread count by sampling the status information of the running program; this approach can reflect the dynamic characteristics of the running program but introduces high overhead.
An appropriate thread count should be set according to the program's running behavior and the architectural characteristics of the heterogeneous many-core system. However, existing research efforts focus mainly on traditional multi-core and many-core systems without considering heterogeneity, and thus cannot be directly applied to emerging heterogeneous many-core systems.
To handle the above challenges, and on the basis of our previous work [4], we analyze the impact of the thread count on computing performance by focusing on the characteristics of applications and their dynamic behaviors when running on an Intel MIC heterogeneous system. We establish an optimum thread count prediction model (TCPM) using regression analysis on the basis of an extension of Amdahl's law. Subsequently, a dynamic predictive thread mapping (DPTM) framework is designed based on the TCPM. DPTM uses the TCPM to estimate the optimum thread count at different phases of the application through real-time sampling of hardware performance counter information. Meanwhile, DPTM dynamically adjusts the number of active hardware threads and processing cores during program execution. Evaluation results show that, compared with the traditional thread mapping policy, DPTM improves application performance by 48.6% and decreases energy consumption by 59% on average for PARSEC benchmark programs running on an Intel MIC heterogeneous system. DPTM also introduces only about 2% additional overhead on average to predict and adjust the thread count.

2. Related Work

In 2008, Suleman et al. [7] proposed a feedback-driven threading framework to dynamically control the thread count. For two types of applications, those limited by data synchronization and those limited by off-chip bandwidth, their Synchronization-Aware Threading (SAT) and Bandwidth-Aware Threading (BAT) mechanisms can predict the optimal thread count. The two threading mechanisms use an analytical model to predict the thread count without considering the performance impact of competition for shared caches, thread context switching, and thread migration. Lee et al. [8] presented a dynamic compilation system called Thread Tailor, which can dynamically stitch threads together on the basis of a communication graph and minimize synchronization overhead and contention for shared resources. However, Thread Tailor only analyzes the thread types and communication patterns offline, without considering the dynamic phase changes of the running program itself. Pusukuri et al. [5] developed the Thread Reinforcer framework, which comprehensively considers OS-level factors (such as CPU utilization, lock conflicts, thread context switch rate, and thread migration) when choosing the thread count; however, the phase changes of the application and the hardware architecture characteristics are not considered. Sasaki et al. [6] proposed a scheduler that can recognize phase changes and dynamically predict the scalability of applications. That work focuses on allocating an appropriate number of cores to each application, and the specific hardware architecture characteristics and shared-resource conflicts are not considered. Heirman et al. [9] proposed a mechanism that improves application performance and energy efficiency by matching the working set sizes and off-chip bandwidth demands of an application with the available on-chip cache capacity and off-chip bandwidth. Their CRUST mechanism is extended in another work [10] to consider SMT in the Intel Xeon Phi processor; however, the dynamic phase changes of applications are not considered. Moreover, the method activates all processing cores of the MIC processor and only regulates the hardware threads inside each processing core when running the application, so the energy consumption optimization is limited. Kanemitsu et al. [11] proposed a clustering-based task scheduling algorithm that can minimize the schedule length in a heterogeneous system and improve system performance. Their method derives the lower bound of the total execution time for each processor by taking both the system and application characteristics into account to obtain a near-optimal set of processors; the work focuses on the near-optimal set of processors without considering the effect of the number of active threads on system performance. Singh et al. [12] proposed an energy-efficient runtime mapping and thread partitioning approach: for each concurrently executing OpenCL application, the mapping process finds the appropriate number of CPU and GPU cores, and the partitioning process identifies an efficient partitioning of the application's threads between CPU and GPU cores. Birhanu et al. [13] proposed the Fastest-Thread-Fastest-Core (FTFC) dynamic thread scheduling mechanism, which dynamically and periodically measures the CPU utilization of running threads and assigns threads with high CPU utilization to cores that can deliver high performance when needed, whereas threads with low CPU utilization are assigned to low-performance cores.
However, the effect of the number of active threads on system performance and overall energy consumption is not considered in the above works. Liu et al. [14] analyzed the behavior and scalability of the Intel SCC (Single-Chip Cloud Computer), an experimental many-core processor system, by running several workloads. Their analysis indicates that the number of cores and threads should be carefully selected on the basis of the characteristics of different applications, which again demonstrates that an appropriate number of cores and threads is important for computing performance and energy efficiency in many-core systems.
Unlike previous efforts, DPTM comprehensively considers the application characteristics, the phase-change behavior of the running program, and the features of the heterogeneous many-core architecture when mapping threads to processing cores, and it can dynamically adjust the thread count while the program is running. Thus, DPTM has the potential to effectively exploit the computing power of a heterogeneous many-core system and improve the computing performance and energy efficiency of the entire system.

3. Impact Factors on Computing Performance

The factors that affect the computing performance of heterogeneous many-core systems for different applications are as follows:
  • Program characteristics. Some programs are computation intensive, for which increasing the thread count helps achieve better performance. Some programs are memory intensive, meaning that spawning more threads does not improve performance because of shared storage capacity and storage bandwidth limitations. Some programs are communication intensive, with frequent information interaction between threads; for these, setting too many threads incurs considerable lock synchronization overhead and significantly decreases performance. Moreover, different portions of an application may have different characteristics. The thread count should therefore be set dynamically according to the program characteristics to achieve optimal performance.
  • Hardware architecture and OS level impact factors. These factors mainly include the thread count, cache miss rate, bandwidth utilization, thread context switch rate, and thread migration rate.
Among these factors, the thread count is the dominant one, because the other factors change as the thread count changes. Increasing the thread count increases the cache miss rate because more threads compete for the shared cache [15]. Furthermore, additional transmission delays occur because more threads compete for the shared bandwidth. SMT has been introduced in many-core processors, where many threads run concurrently on one processing core, so the thread context switch rate increases. To fully exploit the computing resources, threads are migrated between the different processing cores of the many-core processor [16,17], and the thread migration rate increases as the thread count increases. In the spirit of principal component analysis, the above performance-impacting factors can therefore be regarded as indirectly determined by the thread count.
To achieve a better tradeoff among the different factors that affect application performance, an effective thread mapping mechanism is necessary. This can be achieved by dynamically adjusting the thread count while the program is running, according to the running behavior of the application and the characteristics of the heterogeneous many-core architecture.

4. Optimum Thread Count Prediction Model (TCPM)

The thread count is the main factor that influences program execution performance. In this section, we design an optimum thread count prediction model (TCPM) based on our previous work [4].

4.1. Notations of Performance Metrics

The performance metrics used in our prediction model include the following [18]:
  • Turnaround time (TT) refers to the total time consumed in executing the program.
  • TTn refers to the turnaround time of the program when it runs with n processing threads.
  • TT1 refers to the turnaround time of the program when it runs with a single processing thread.
  • SIP refers to the sum of instructions of the program.
  • IPS1 refers to the number of instructions executed per second when running a single thread.
  • IPSn refers to the number of instructions executed per second when running n threads.

4.2. Theoretical Basis of TCPM

The SIP is fixed regardless of whether the program executes on a single processing core or on multiple processing cores. On this basis, we propose a model that extends Amdahl's law for the multi-core era [19,20,21]. The model reflects the relationship between program performance and thread count while simultaneously accounting for the performance effects of shared-resource competition, thread synchronization, thread context switching, and thread migration.
Let f denote the relative turnaround time of the program executed with multiple processing threads versus a single processing thread; it can be calculated according to Amdahl's law as follows:
f = \frac{TT_n}{TT_1} = \frac{\alpha\omega + \beta\omega/n + \gamma\omega n}{\omega} = \alpha + \frac{\beta}{n} + \gamma n, \quad (\alpha + \beta + \gamma = 1)    (1)
where ω refers to the total amount of work (the number of tasks); n refers to the number of threads to be allocated; and α, β, and γ refer to the sequential ratio factor of the task, the parallel ratio factor of the task, and the effect factor of the extra overhead, respectively, and satisfy the constraint α + β + γ = 1.
Since
TT_n = \frac{SIP}{IPS_n} \quad \text{and} \quad TT_1 = \frac{SIP}{IPS_1},
it follows that
f = \frac{TT_n}{TT_1} = \frac{SIP/IPS_n}{SIP/IPS_1} = \frac{IPS_1}{IPS_n}    (2)
Combining Equations (1) and (2), we obtain:
f = \frac{TT_n}{TT_1} = \frac{IPS_1}{IPS_n} = \alpha + \frac{\beta}{n} + \gamma n, \quad (\alpha + \beta + \gamma = 1)    (3)
Given that TT1 and TTn can be obtained only after program execution finishes, these two metrics cannot be used directly to predict the thread count. However, IPS1 and IPSn can be obtained dynamically while the program executes, and can therefore be used as the experimental values for calculating the unknown coefficients α, β, and γ by least-squares fitting.

4.3. TCPM Establishment

  • The values of IPS1 and IPSn can be collected by sampling the program as it runs with different numbers of processing cores and threads, after which Equation (2) can be used to calculate the value of f. After obtaining multiple (f, n) pairs, the coefficients α, β, and γ can be calculated by least squares. The process is detailed as follows.
    (a)
    Using Equation (3), we obtain the following:
    f(n) = \alpha + \frac{\beta}{n} + (1 - \alpha - \beta) \times n    (4)
    where n is the thread count.
    (b)
    Using the least squares, we obtain the following:
    S = \sum_{n=1}^{N} \left\{ f_n - \left[ \alpha + \frac{\beta}{n} + (1 - \alpha - \beta) n \right] \right\}^2 = \sum_{n=1}^{N} \left[ f_n + (n-1)\alpha + \frac{n^2 - 1}{n}\beta - n \right]^2    (5)
    (c)
    The following equations are solved to determine the coefficients α and β (and hence γ = 1 − α − β) that minimize S:
    \frac{\partial S}{\partial \beta} = 2 \sum_{n=1}^{N} \left[ f_n + (n-1)\alpha + \frac{n^2-1}{n}\beta - n \right] \times \frac{n^2-1}{n} = 0, \quad \frac{\partial S}{\partial \alpha} = 2 \sum_{n=1}^{N} \left[ f_n + (n-1)\alpha + \frac{n^2-1}{n}\beta - n \right] \times (n-1) = 0    (6)
    Transforming Equation (6), we obtain the following:
    \alpha \sum_{n=1}^{N} \frac{(n^2-1)(n-1)}{n} + \beta \sum_{n=1}^{N} \frac{(n^2-1)^2}{n^2} = \sum_{n=1}^{N} \left( n^2 - \frac{n^2-1}{n} f_n - 1 \right), \quad \alpha \sum_{n=1}^{N} (n-1)^2 + \beta \sum_{n=1}^{N} \frac{(n^2-1)(n-1)}{n} = \sum_{n=1}^{N} \left[ n^2 - (f_n + 1) n + f_n \right]    (7)
    (d)
    Substituting the obtained (f, n) pairs into Equation (6), we construct the system of equations in the unknown coefficients α and β, solve it for α and β, and obtain γ from γ = 1 − α − β.
  • The extremum of f in Equation (3) is used to calculate the value of n that minimizes the relative turnaround time. The calculation is as follows:
    \frac{d f(n)}{d n} = -\frac{\beta}{n^2} + \gamma = -\frac{\beta}{n^2} + 1 - \alpha - \beta = 0
    n = \sqrt{\frac{\beta}{\gamma}} = \sqrt{\frac{\beta}{1 - \alpha - \beta}}    (8)
    Equation (8) denotes the final thread count prediction model.
  • By sampling real-time values of IPS1 and IPSn at different thread counts, we calculate the relative turnaround time f according to Equation (2) and obtain the coefficients α, β, and γ from Equation (6). Finally, the optimum thread count n is calculated according to Equation (8). A minimal code sketch of this fitting and prediction procedure is given after this list.
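To make the above procedure concrete, the following minimal C sketch (illustrative only; the function name, structure, and sample values are our own and not the authors' implementation) fits α and β by solving the two normal equations of Equation (7) with Cramer's rule and then evaluates Equation (8):

#include <math.h>
#include <stdio.h>

/* Fit the TCPM coefficients from sampled (n, f) pairs by solving the 2x2
 * normal equations of Equation (7), then return the predicted optimum
 * thread count n = sqrt(beta / (1 - alpha - beta)) from Equation (8). */
static double tcpm_predict_threads(const int *n, const double *f, int samples)
{
    double a11 = 0, a12 = 0, a21 = 0, a22 = 0, b1 = 0, b2 = 0;
    for (int i = 0; i < samples; i++) {
        double ni = (double)n[i];
        double p  = (ni * ni - 1.0) / ni;          /* (n^2 - 1) / n              */
        a11 += p * (ni - 1.0);                     /* coefficient of alpha, row 1 */
        a12 += p * p;                              /* coefficient of beta,  row 1 */
        a21 += (ni - 1.0) * (ni - 1.0);            /* coefficient of alpha, row 2 */
        a22 += p * (ni - 1.0);                     /* coefficient of beta,  row 2 */
        b1  += ni * ni - p * f[i] - 1.0;           /* right-hand side,      row 1 */
        b2  += ni * ni - (f[i] + 1.0) * ni + f[i]; /* right-hand side,      row 2 */
    }
    double det   = a11 * a22 - a12 * a21;          /* Cramer's rule, 2x2 system   */
    double alpha = (b1 * a22 - a12 * b2) / det;
    double beta  = (a11 * b2 - a21 * b1) / det;
    double gamma = 1.0 - alpha - beta;
    return sqrt(beta / gamma);                     /* Equation (8)                */
}

int main(void)
{
    /* Illustrative f = IPS1/IPSn samples at the six predefined thread counts
     * of Section 6.2 (made-up values for demonstration only). */
    int    n[] = { 8, 24, 48, 120, 168, 240 };
    double f[] = { 0.15, 0.09, 0.09, 0.15, 0.19, 0.26 };
    printf("predicted optimum thread count: %.0f\n",
           tcpm_predict_threads(n, f, 6));
    return 0;
}

In DPTM, this fit is repeated with the IPS samples accumulated so far whenever a phase change is detected (Section 5).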
Our proposed model has the following advantages over existing models (e.g., those based on statistical regression or machine learning): simplicity, low overhead, and support for dynamic, real-time prediction of the thread count. Machine-learning-based models usually have high prediction accuracy but require long training and learning times, thus introducing high additional overhead. Such a model can obtain good accuracy in static prediction, but it cannot properly adapt to dynamic prediction, because it must be re-trained whenever the program inputs, the program running characteristics, or the running platform change, which introduces overhead that hinders the feasibility of dynamically predicting the thread count. Statistical-regression-based prediction models generally use multivariate regression analysis; they achieve better prediction efficiency than machine-learning models and suitable accuracy in static prediction. However, because they require a large number of performance metric samples and complex model computation, they easily incur high overhead, which restricts their application to dynamic, real-time prediction.
In contrast, our model directly incorporates the principal factor, IPS, in its construction, while the other performance-influencing factors are considered when judging program phase changes in order to improve prediction accuracy. Since the proposed model reaches a good tradeoff between overhead and prediction accuracy, it can achieve effective, dynamic, and real-time prediction of the optimal thread count for the heterogeneous many-core system.

5. DPTM Mechanism

We designed a dynamic predictive thread mapping (DPTM) mechanism based on the TCPM prediction model. DPTM dynamically regulates the thread count in real-time during program execution to improve the application performance and energy efficiency. The detailed DPTM process is shown in Figure 2.
First, DPTM pre-runs a portion of the program and samples the performance metric values IPS1 and IPSn to calculate the prediction model parameters α, β, and γ. DPTM then uses the prediction model to compute the predicted thread count, and the program is re-executed using the estimated thread count. The program's running status is continuously monitored through five collected system-level performance metrics: the CPU utilization, thread context switch, thread migration, cache utilization, and bandwidth utilization rates. Once a phase change of the program is detected, the thread count is re-calculated according to the current values of IPS1 and IPSn. The program's running threads are dynamically adjusted to keep the different factors affecting program performance (i.e., thread context switching, thread migration, cache misses, and shared bandwidth utilization) in a reasonable range, and to prevent program performance from being degraded by unreasonable shared-resource contention, thread synchronization, and transmission delays. Moreover, idle hardware threads and processing cores are deactivated by incorporating other runtime power management approaches in order to maintain computing performance while reducing energy consumption [22,23,24].

6. DPTM Framework and Implementation

6.1. DPTM Framework

DPTM targets heterogeneous many-core systems composed of the host processor (CPU) and co-processor (MIC). Figure 3 shows the DPTM framework.
The CPU and the MIC co-processor synergistically compute the workload of each application using the offload mode of the Intel MIC heterogeneous system [25]. The host processor is in charge of task allocation and of controlling the entire program run, while the MIC co-processor is in charge of the loop portions of the program. During workload execution, whenever a loop portion is encountered, the CPU offloads it to the MIC co-processor for parallel execution. When the MIC co-processor finishes the computation, it returns the results to the CPU, and the program continues to execute under CPU control. Given that the major computing work of most workloads is concentrated in the loop portions, inserting offload statements around the OpenMP loop parts dispatches those loops to the MIC co-processor and achieves synergistic computation by the CPU and the MIC co-processor, as sketched below.
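As an illustration of this offload-based cooperation, the following minimal sketch (placeholder arrays and loop body; not code from the paper) dispatches an OpenMP loop to the MIC co-processor with the Intel offload pragma while the host keeps control of the remaining program:

#include <omp.h>
#include <stdio.h>

#define N 1000000
float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Loop portion: offloaded by the CPU to the MIC co-processor and
     * executed there in parallel by an OpenMP thread team. */
    #pragma offload target(mic) in(a, b) out(c)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* The host continues with the sequential/control portion once the
     * computing results have been returned from the co-processor. */
    printf("c[10] = %f\n", c[10]);
    return 0;
}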
Under the DPTM framework, the CPU master process calculates the optimum thread count by using the TCPM prediction model. The optimal thread count estimate is based mainly on the real-time status information of the running program. The runtime system continuously samples the program status information and detects phase changes of the program running on the MIC co-processor. Once an evident phase change is detected, the current status information is returned to the CPU, the CPU master process recalculates the optimum thread count, and the parallelism of the program is dynamically adjusted. This process iterates until the computing task is completed.

6.2. Sampling the Status Information

To detect program phase changes, DPTM first samples performance status information. This status information reflects the program behavior and the processor architecture characteristics and is the basis of the prediction model; the quality of the sampled values directly affects the accuracy of predicting the optimum thread count.
Therefore, sampling accuracy and efficiency must be ensured. Over the entire dynamic thread mapping process, the status information that must be sampled includes the following: IPS1 and IPSn, the thread context switch rate, the thread migration rate, the cache miss rate, CPU utilization (CPU cycles), and bandwidth utilization (bus cycles). This status information can be sampled through real-time access to the performance monitoring unit using the perf tools provided by the Linux kernel [26,27]. However, using the perf tools directly introduces additional overhead because of the system calls involved. To decrease this overhead, we access the performance counters directly with the rdpmc instruction from user space, using a kernel module that sets the processor's CR4.PCE configuration bit [28]. The status information sampling interval is set to 100 milliseconds according to empirical values [7,29]. The sampled status information includes context switches, thread migrations, cache misses, CPU cycles, and bus cycles, of which the last two sampled status values are saved to improve the effectiveness of the sampling information. DPTM collects the IPS status values at six predefined thread counts (8, 24, 48, 120, 168, and 240) when the program starts executing. The specific IPS value is obtained by sampling three performance counter elements: the instruction count, CPU cycles, and CPU clock, giving IPS = instruction_number / (cpu_cycles / cpu_clock), as illustrated below. The collected IPS samples are used to predict the optimal thread count on the CPU side and are therefore saved in global data structures. The IPS status information continues to be sampled at the same interval during the subsequent program run, like the other performance counter information; unlike the other metrics, however, all IPS samples are saved during program execution to ensure that sufficient information is available to predict the optimum thread count.
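For clarity, the IPS value described above can be derived from the three sampled counter elements as in the following small helper (an assumed function name; the counter values themselves come from the rdpmc-based sampling):

#include <stdio.h>

/* Derive IPS from the three sampled counter elements:
 * IPS = instruction_number / (cpu_cycles / cpu_clock),
 * where cpu_clock is the processor clock frequency in cycles per second. */
double compute_ips(unsigned long long instruction_number,
                   unsigned long long cpu_cycles,
                   unsigned long long cpu_clock)
{
    double elapsed_seconds = (double)cpu_cycles / (double)cpu_clock;
    return (double)instruction_number / elapsed_seconds;
}

int main(void)
{
    /* Arbitrary illustrative counter values. */
    printf("IPS = %.3e\n",
           compute_ips(2000000000ULL, 1100000000ULL, 1100000000ULL));
    return 0;
}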

6.3. Detecting the Phase Changes of the Running Program

Changes in the program input and computing workload cause program phase changes. Dynamically adjusting the processing cores allocated to the program according to the computing resource requirements of different execution phases is beneficial for improving computing resource utilization and lowering energy consumption. DPTM achieves this goal by detecting program phase changes in real time; the evident changes mostly occur in the different loop parts of the program. DPTM detects program phase changes using the following metrics: the thread context switch rate, thread migration rate, cache miss rate, CPU utilization, and bandwidth utilization.
During program execution, DPTM reads the performance counter information, which includes the context switch rate, thread migration rate, cache miss rate, CPU cycles, and bus cycles, once every 100 milliseconds [29]. DPTM then compares the current status information with the previously saved values and computes the relative change of every performance metric: Δcpu-cycles, Δcontext-switches, Δthread-migration, Δcache-misses, and Δbus-cycles. The corresponding threshold values, set in advance according to empirical values, are Thresholdcpu-cycle, Thresholdcontext-switches, Thresholdthread-migration, Thresholdbus-cycles, and Thresholdcache-miss.

6.3.1. Threshold Values of the Performance Metric

The specific threshold values of the performance metrics were obtained by empirical observation [8,28]. The detailed process is as follows. (1) Five programs from the PARSEC benchmarks, namely bodytrack, x264, canneal, blackscholes, and streamcluster, were selected; together they cover all five performance metrics. The CPU cycle metric is important for blackscholes, thread migration for bodytrack, thread context switching for x264, bandwidth for canneal, and cache misses for streamcluster. (2) These five representative benchmark programs were run and profiled with the native input sets. The relative threshold value of each profiled performance metric was chosen such that, beyond this change rate, program performance becomes markedly more sensitive to the thread count and changes rapidly. The resulting relative thresholds obtained through experimental measurement are: Thresholdcpu-cycle = 60%, Thresholdthread-migration = 50%, Thresholdbus-cycles = 50%, Thresholdcache-miss = 30%, and Thresholdcontext-switches = 30%. These thresholds are rough empirical values that lack a rigorous theoretical basis; determining more reliable thresholds is left for future work.

6.3.2. Detection Algorithm of the Program Phase Changes

The algorithm first compares the current sampled CPU cycles with the latest stored value. If Δcpu-cycles is lower than Thresholdcpu-cycle (i.e., no evident change has taken place in the CPU utilization rate), the program continues to run as before. If Δcpu-cycles is larger than or equal to Thresholdcpu-cycle (Δcpu-cycles ≥ Thresholdcpu-cycle), an evident change has occurred in the CPU utilization; the cause can be that the computing task or the program's running characteristics have changed, so the current threads can no longer properly utilize the computing resources. The algorithm then examines the context switch change rate Δcontext-switches, the thread migration change rate Δthread-migration, the bandwidth utilization change rate Δbus-cycles, and the cache miss change rate Δcache-misses to determine whether a phase change has occurred in the program.
The specific arbitration process is as follows (a code sketch is given below). (1) If Δcontext-switches > Thresholdcontext-switches (i.e., the original thread count no longer matches the current computing resources, which results in a large change in context switches), then the program running phase has changed. (2) If Δthread-migration > Thresholdthread-migration && Δbus-cycles > Thresholdbus-cycles (i.e., the original thread count no longer matches the processing cores, which results in many thread migrations between processing cores and excessive bandwidth utilization), then the program running phase has changed from computing intensive to memory intensive (or vice versa); the thread count should be adjusted to better match the current program running characteristics. (3) If Δcache-misses > Thresholdcache-miss (i.e., the original thread count no longer shares the current cache resources properly, which results in a high cache miss rate), then the program running phase has changed and the thread count should be adjusted to match the current phase.
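A compact sketch of this arbitration logic is given below (the threshold constants are the empirical values of Section 6.3.1, and the relative change rates are assumed to have been computed beforehand from the current and previously saved samples):

#include <stdbool.h>
#include <stdio.h>

#define THRESHOLD_CPU_CYCLES       0.60
#define THRESHOLD_CONTEXT_SWITCHES 0.30
#define THRESHOLD_THREAD_MIGRATION 0.50
#define THRESHOLD_BUS_CYCLES       0.50
#define THRESHOLD_CACHE_MISSES     0.30

static bool phase_changed(double d_cpu_cycles, double d_context_switches,
                          double d_thread_migration, double d_bus_cycles,
                          double d_cache_misses)
{
    /* No evident change in CPU utilization: keep running as before. */
    if (d_cpu_cycles < THRESHOLD_CPU_CYCLES)
        return false;

    /* CPU utilization changed noticeably: arbitrate on the remaining metrics. */
    if (d_context_switches > THRESHOLD_CONTEXT_SWITCHES)
        return true;                 /* case (1): thread count no longer fits     */
    if (d_thread_migration > THRESHOLD_THREAD_MIGRATION &&
        d_bus_cycles > THRESHOLD_BUS_CYCLES)
        return true;                 /* case (2): compute/memory intensity shift  */
    if (d_cache_misses > THRESHOLD_CACHE_MISSES)
        return true;                 /* case (3): cache sharing no longer fits    */
    return false;
}

int main(void)
{
    /* Example: CPU utilization changed by 70% and cache misses by 35%. */
    printf("phase change: %d\n",
           phase_changed(0.70, 0.10, 0.20, 0.10, 0.35));
    return 0;
}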

6.4. DPTM Framework Implementation

Under the DPTM framework, the CPU master process calculates the optimum thread count using the prediction model. The optimum threads are mapped to specific processing cores by binding threads to cores, and idle processing cores are turned off or set inactive to lower the energy consumption. The CPU master process dynamically adjusts the thread count of the program running on the MIC co-processor according to the runtime phase changes. The DPTM framework prototype is realized as a dynamic runtime library that extends the Intel OpenMP Runtime Library [30].
The detailed DPTM prototype system implementation on the MIC heterogeneous system is shown in Figure 4.
The DPTM prototype system mainly includes the following five parts; the entire program runs in offload mode [25].
  • HOST SIDE 1 code: The CPU master process first executes the program. When it encounters a loop part, it uses #pragma offload target (MIC SIDE 1) to offload the loop code to the MIC co-processor for execution, and then performs #pragma offload_wait target(mic) to wait for the MIC's execution results.
  • MIC SIDE 1 code: The MIC first sets the control register bit CR4.PCE by calling init_module() so that the rdpmc instruction can be accessed directly from user space to obtain performance status information. It then pre-runs the program by calling pre_running_program(). The status information is read, collected, and returned to the CPU master process by calling read_pmc(), collect_status_information(), and return_status_information (HOST SIDE 2), respectively.
  • HOST SIDE 2 code: The CPU master process predicts the optimal thread count by calling predicting_optimal_number_threads (status_info, opt_number_thread) and then uses the pragma clause #pragma offload_transfer target (MIC SIDE 2) in (opt_number_threads) to send the optimum thread count to the MIC co-processor side to control the parallelism of the loop code running on the MIC co-processor.
  • MIC SIDE 2 code: The MIC co-processor re-executes the loop code according to the optimum thread count while continuously detecting the phase changes by calling detecting_running_exception.
  • HOST SIDE 3 code: The CPU master process continues to execute the subsequent portion of the program after receiving the computing results from the MIC side. If another loop part is encountered, it is offloaded to the MIC co-processor for execution according to the same mechanism. The program iterates in this way until the application finishes. A condensed sketch of the interaction between HOST SIDE 2 and MIC SIDE 2 is given below.
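The following condensed sketch (hypothetical function and variable names, placeholder loop body; not the actual DPTM library code) illustrates how HOST SIDE 2 can transfer the predicted thread count to MIC SIDE 2, where it bounds the parallelism of the offloaded loop:

#include <omp.h>

/* HOST SIDE 2 / MIC SIDE 2 interaction: send the predicted count through an
 * offload clause and use it to size the OpenMP team on the co-processor. */
void run_loop_with_predicted_threads(int opt_number_threads,
                                     float *data, int len)
{
    #pragma offload target(mic) in(opt_number_threads, len) inout(data : length(len))
    {
        /* MIC SIDE 2: restrict the OpenMP team to the predicted thread count. */
        omp_set_num_threads(opt_number_threads);
        #pragma omp parallel for
        for (int i = 0; i < len; i++)
            data[i] *= 2.0f;        /* placeholder loop body */
    }
}

int main(void)
{
    static float data[1024];
    run_loop_with_predicted_threads(60, data, 1024);
    return 0;
}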

7. Experimental Evaluation

7.1. Experimental Environment

Experimental platform. The experiments were conducted on an Intel MIC heterogeneous many-core system. The system consists of two eight-core Intel E5-2670 CPUs and one Intel Xeon Phi 7110P MIC co-processor, with 64 GB of memory and a 300 GB hard disk. The main memory and the co-processor are connected by a PCI-E x16 bus, whose maximum data transmission speed is up to 16 GB/s. The OS is Red Hat Enterprise Linux Server release 6.3, and the software development environment is Intel parallel_studio_xe_2013_update3_intel64. The performance metrics were obtained using the PAPI 5.4.1 performance measurement tool [31,32].
Benchmarks. We used the following ten programs from the PARSEC suite [3]: blackscholes, freqmine, canneal, streamcluster, ferret, x264, raytrace, bodytrack, swaptions, and vips. The native input sets were used for all benchmark programs in our experiment.

7.2. TCPM Prediction Accuracy Evaluation

To evaluate the prediction accuracy of the TCPM, we tested the performance speedup of different benchmark programs relative to the serial program under three strategies: the Optimal strategy, which refers to the best measured performance speedup and the corresponding thread count; the OS_Default strategy, which uses the maximum number of hardware threads supported by the Intel MIC co-processor; and the optimum thread count prediction model (TCPM).
We evaluated the performance differences among TCPM, Optimal, and OS_Default by testing the performance speedup of the different benchmark programs with multiple threads against the serial program. Table 1 shows the performance speedup and corresponding thread count for the three strategies. Figure 5 compares the relative performance ratio of TCPM and OS_Default, where the relative performance ratio is the ratio of the speedup of OS_Default or TCPM to that of the Optimal strategy. For example, for streamcluster the TCPM speedup of 7.25 against the Optimal speedup of 7.35 gives a relative performance ratio of about 0.99 (Table 1).
As shown in Table 1, the average thread count of all benchmark programs under the three strategies (Optimal, OS_Default, and TCPM) is 161, 240, and 118, respectively, and the average speedup over the serial program is 18.43, 13.33, and 17.85, respectively. Overall, TCPM triggers the fewest threads while maintaining good performance. As shown in Figure 5, the average speedup of OS_Default reaches 73% of the Optimal, whereas the average speedup of TCPM reaches 97% of the Optimal, which demonstrates that the TCPM has good prediction accuracy.

7.3. DPTM Evaluation

To evaluate the effectiveness of DPTM, we first tested the speedup of the different benchmark programs using the OS default mapping mechanism (OS_Default), in which the thread count equals the number of processing cores; the thread count was therefore set to 240, the maximum number of hardware threads supported by the Intel MIC co-processor. Second, we measured the best performance speedup and the corresponding thread count of each benchmark program, which serves as an ideal thread count setting (Optimal). Third, we tested the performance speedup and corresponding thread count of all benchmark programs using DPTM and compared the results with the two previous measurements. Furthermore, for Optimal and DPTM, we measured the reduction in last-level cache misses normalized to the OS default mapping mechanism. Finally, we measured the energy consumption of the three thread mappings and the overhead of DPTM, and compared their energy–performance efficiencies.

7.3.1. Performance Speedup Evaluation

We evaluated the performance differences among DPTM, Optimal, and OS_Default by testing the speedup of the different benchmark programs with multiple threads over a single thread. Figure 6 shows the speedup of the benchmark programs for the three mapping approaches.
The performance speedup of DPTM is 34.35% higher than that of OS_Default and reaches 96.8% of the Optimal. For most benchmark programs, DPTM obtains better performance than OS_Default, because most of the benchmark programs are memory-bound or communication-bound applications: when too many threads are allocated, their performance can decrease due to competition for the shared cache and memory bandwidth. However, DPTM did not improve the performance of blackscholes, ferret, raytrace, and freqmine; instead, their performance decreased slightly. The main reason is that these four benchmark programs are computation-intensive applications whose performance keeps increasing as the thread count increases, so using the maximum thread count (OS_Default) yields the optimal performance.

7.3.2. Cache Miss Evaluation

Figure 7 shows the relative reduction in last-level cache misses normalized to the OS_Default mapping mechanism; a smaller normalized value is better. The average cache misses of DPTM decreased by 8% compared with OS_Default, and those of Optimal decreased by 12%. Overall, the reduction in L2 (last-level) cache misses achieved by DPTM is superior to OS_Default and close to Optimal.

7.3.3. Energy Consumption Evaluation

We measured the power consumption of the Xeon Phi co-processor while running each benchmark program under the different mapping methods, incorporating other runtime power management approaches [33,34]. The measurement was performed by periodically reading the power information from the Linux system file /sys/class/micras/power in a background thread at 100-millisecond intervals, as sketched below. We then computed the relative energy consumption of each benchmark program under DPTM and Optimal mapping.
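A minimal sketch of such a background sampling thread is shown below (POSIX threads; the log file name and the placeholder benchmark run are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int sampling = 1;

/* Background thread: read the MIC power reading from the micras sysfs file
 * every 100 milliseconds and append it to a log for later integration. */
static void *power_sampler(void *arg)
{
    FILE *log = (FILE *)arg;
    char line[256];
    while (sampling) {
        FILE *f = fopen("/sys/class/micras/power", "r");
        if (f) {
            if (fgets(line, sizeof(line), f))
                fputs(line, log);
            fclose(f);
        }
        usleep(100 * 1000);          /* 100-millisecond sampling interval */
    }
    return NULL;
}

int main(void)
{
    FILE *log = fopen("power.log", "w");
    if (!log)
        return 1;
    pthread_t tid;
    pthread_create(&tid, NULL, power_sampler, log);
    sleep(5);                        /* placeholder for the benchmark run */
    sampling = 0;
    pthread_join(tid, NULL);
    fclose(log);
    return 0;
}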
The relative energy consumption was quantified using the normalized energy consumption metric, defined as follows:
Normalized\ Energy\ Consumption = \frac{EnergyConsumption}{EnergyConsumption_{240\_threads}} = \frac{Power \times T}{Power_{240\_threads} \times T_{240\_threads}}
where EnergyConsumption is the energy consumed when running a benchmark program under a given thread mapping (here, Optimal or DPTM), Power is the corresponding power, and T is the corresponding execution time; EnergyConsumption240_threads, Power240_threads, and T240_threads are the corresponding quantities when running the same benchmark program under the OS_Default mapping.
As shown in Figure 8, the mean normalized energy consumption of Optimal is 53.7% of that of OS_Default, while that of DPTM is only 41%. The reason is that DPTM can dynamically regulate the thread count and allocate a reasonable number of threads at different program phases; it thus avoids activating unnecessary hardware threads and processing cores and lowers the total energy consumption over the entire program execution.

7.3.4. Energy–Performance Efficiency Evaluation

In order to evaluate the energy–performance efficiency of the thread mappings, we defined the two metrics of energy–performance efficiency and normalized energy efficiency as follows:
Energy\text{-}performance\ efficiency = \frac{Performance}{EnergyConsumption} = \frac{Speedup}{EnergyConsumption} = \frac{Speedup}{Power \times T} = \frac{1}{P_0} \times \frac{Speedup}{N_t \times \bar{T}} = R \times \frac{Speedup}{N_t \times \bar{T}}    (12)
where Power = P_0 \times N_t and R = 1/P_0; P_0 is the power per unit time of each running thread (i.e., a constant), \bar{T} is the average execution time of each thread, and N_t is the thread count. Furthermore, we define the normalized energy efficiency by removing the constant parameter R from Equation (12):
Normalized\ energy\ efficiency = \frac{Speedup}{N_t \times \bar{T}}
A higher value of the normalized energy efficiency means more efficient use of energy.
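Both metrics can be computed directly from the measured quantities, as in the following small sketch (the numeric inputs in main are arbitrary illustrations, not measured values from the paper):

#include <stdio.h>

/* Normalized energy consumption of a mapping relative to the 240-thread
 * OS_Default run: (Power * T) / (Power_240 * T_240). */
static double normalized_energy(double power, double time,
                                double power_240, double time_240)
{
    return (power * time) / (power_240 * time_240);
}

/* Normalized energy efficiency: Speedup / (Nt * T_avg), i.e., Equation (12)
 * with the constant factor R removed. */
static double normalized_energy_efficiency(double speedup, int n_threads,
                                           double avg_thread_time)
{
    return speedup / (n_threads * avg_thread_time);
}

int main(void)
{
    printf("normalized energy consumption: %.2f\n",
           normalized_energy(100.0, 60.0, 150.0, 90.0));
    printf("normalized energy efficiency:  %.4f\n",
           normalized_energy_efficiency(10.0, 100, 1.0));
    return 0;
}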
Figure 9 shows that DPTM's energy–performance efficiency is better than that of OS_Default and Optimal for all benchmark programs, except for streamcluster, where it is slightly lower than Optimal, and vips, where it equals Optimal. The comparison of the geometric means for the three mapping mechanisms shows that DPTM achieves the highest energy–performance efficiency.

7.3.5. Overhead Evaluation

The overhead time is mainly caused by the status information transmission time and the thread count estimation time, which occur when adjusting the number of threads; we refer to their sum as DPTM's adjusting time. The additional overhead, used to evaluate the influence of the overhead time on overall program performance, is the adjusting time divided by the total running time of the program; for streamcluster, for example, an adjusting time of 2.82 against a total time of 70.95 gives an additional overhead of 3.97% (Table 2). We tested the overhead of DPTM on the different benchmark programs; Table 2 shows the results for the 10 benchmark programs. The average additional overhead introduced by DPTM is only 2.03%, which is negligible relative to the energy efficiency gains obtained.
In addition, to evaluate the validity of DPTM, we compared its performance improvement rate and energy reduction rate, together with its overhead, against the OS_Default strategy.
Figure 10 shows the performance improvement, energy reduction, and additional overhead ratios of DPTM compared with OS_Default for all benchmark programs. From Figure 10, we can see that the average performance of the ten benchmark programs improves by 48.6% and the average energy consumption is reduced by 59%. The performance of four benchmark programs (blackscholes, ferret, raytrace, and freqmine) shows no improvement and instead decreases slightly, but the energy consumption of each of these programs declines markedly; the energy reduction is much larger than the performance degradation plus the additional overhead. The reason is that DPTM predicts and sets the smallest number of threads that still delivers approximately optimal performance, thereby obtaining optimal energy efficiency.

8. Conclusions

In this work, we investigated the impact of the thread count on program performance and energy efficiency in a heterogeneous many-core system. An optimum thread count prediction model (TCPM) was proposed using regression analysis on the basis of an extension of Amdahl's law. Based on the TCPM, a dynamic predictive thread mapping (DPTM) framework was proposed for the Intel MIC heterogeneous many-core system to improve program performance and lower system energy consumption. Experimental evaluation shows that the DPTM framework is effective in improving program performance and lowering energy consumption. DPTM dynamically regulates the active hardware threads and processing cores according to the program behavior, phase changes, and the dynamic computing resource requirements during program execution, and it achieves high computing performance and low energy consumption at the cost of a negligible additional overhead.
Ongoing and future work focuses on extending the DPTM framework to other heterogeneous architectures and applying the approach to mixed-mode workloads. In addition, we will explore the phase-change threshold parameters and directly compare DPTM with other mapping approaches.

Author Contributions

Conceptualization, T.J. and X.D.; Data curation, T.J., Y.Z., and X.Z.; Formal analysis, T.J. and X.Z.; Investigation, T.J.; Methodology, T.J.; Software, T.J. and X.D.; Validation, T.J.; Visualization, Y.Z.; Writing—review and editing, T.J. and Y.Z.

Funding

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61862037, 61762058, and 61861024, and the Key Laboratory Opening Project of Opto-Technology and Intelligent Control Ministry of Education under Grant No. KFKT2016-7.

Acknowledgments

The authors would like to thank the anonymous reviewers for their useful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Brodtkorb, A.R.; Dyken, C.; Hagen, T.R.; Hjelmervik, J.M.; Storaasli, O.O. State-of-the-art in heterogeneous computing. Sci. Programm. 2010, 18, 1–33. [Google Scholar] [CrossRef]
  2. Ju, T.; Dong, X.; Chen, H.; Zhang, X. DagTM: An Energy-Efficient Threads Grouping Mapping for Many-Core Systems Based on Data Affinity. Energies 2016, 9, 754. [Google Scholar] [CrossRef]
  3. Bienia, C.; Kumar, S.; Singh, J.P.; Li, K. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), Toronto, ON, Canada, 25–29 August 2008; pp. 72–81. [Google Scholar] [CrossRef]
  4. Ju, T.; Weiguo, W.W.; Chen, H.; Zhu, Z.; Dong, X. Thread Count Prediction Model: Dynamically Adjusting Threads for Heterogeneous Many-Core Systems. In Proceedings of the 21st IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, 14–17 December 2015; pp. 459–464. [Google Scholar] [CrossRef]
  5. Pusukuri, K.K.; Gupta, R.; Bhuyan, L.N. Thread reinforcer: Dynamically determining number of threads via os level monitoring. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), Austin, TX, USA, 6–8 November 2011; pp. 116–125. [Google Scholar] [CrossRef]
  6. Sasaki, H.; Tanimoto, T.; Inoue, K.; Nakamura, H. Scalability-based manycore partitioning. In Proceedings of the 21st ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, USA, 19–23 September 2012; pp. 107–116. [Google Scholar] [CrossRef]
  7. Suleman, M.A.; Qureshi, M.K.; Patt, Y.N. Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs. In Proceedings of the 13th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Seattle, WA, USA, 1–5 March 2008; pp. 227–286. [Google Scholar] [CrossRef]
  8. Lee, J.; Wu, H.; Ravichandran, M.; Clark, N. Thread tailor: Dynamically weaving threads together for efficient, adaptive parallel applications. In Proceedings of the 37th ACM Annual International Symposium on Computer architecture (ISCA), Saint-Malo, France, 19–23 June 2010; pp. 270–279. [Google Scholar] [CrossRef]
  9. Heirman, W.; Carlson, T.E.; Van Craeynest, K.; Eeckhout, L.; Hur, I.; Jaleel, A. Undersubscribed threading on clustered cache architectures. In Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, USA, 15–19 February 2014; pp. 678–689. [Google Scholar] [CrossRef]
  10. Heirman, W.; Carlson, T.E.; Craeynest, K.V.; Hur, I.; Jaleel, A.; Eeckhout, L. Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor. In Proceedings of the 4th ACM International Workshop on Runtime and Operating Systems for Supercomputers, Munich, Germany, 10 June 2014. [Google Scholar] [CrossRef]
  11. Kanemitsu, H.; Hanada, M.; Nakazato, H. Clustering-Based Task Scheduling in a Large Number of Heterogeneous Processors. IEEE Trans. Parallel Distrib. Syst. 2016, 27, 3144–3157. [Google Scholar] [CrossRef]
  12. Singh, A.K.; Basireddy, K.R.; Merrett, G.V.; Al-Hashimi, B.M.; Prakash, A. Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs. ACM Trans. Embed. Comput. Syst. 2017, 16. [Google Scholar] [CrossRef]
  13. Birhanu, T.M.; Choi, Y.J.; Li, Z.; Sekiya, H.; Komuro, N. Efficient Thread Mapping for Heterogeneous Multicore IoT Systems. Mobile Inf. Syst. 2017, 1–8. [Google Scholar] [CrossRef]
  14. Liu, C.; Thanarungroj, P.; Gaudiot, J.L. How many cores do we need to run a parallel workload: A test drive of the Intel SCC platform? J. Parallel Distrib. Comput. 2014, 74, 2582–2595. [Google Scholar] [CrossRef]
  15. Donyanavard, B.; Mück, T.; Sarma, S.; Dutt, N. SPARTA: Runtime task allocation for energy efficient heterogeneous many-cores. In Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, Pittsburgh, PA, USA, 2–7 October 2016. [Google Scholar] [CrossRef]
  16. Pusukuri, K.K.; Vengerov, D.; Fedorova, A.; Kalogeraki, V. FACT: A framework for adaptive contention-aware thread migrations. In Proceedings of the 8th ACM International Conference on computing frontiers, Ischia, Italy, 3–5 May 2011; pp. 1–10. [Google Scholar] [CrossRef]
  17. Tillenius, M.; Larsson, E.; Badia, R.M.; Martorell, X. Resource-Aware Task Scheduling. ACM Trans. Embed. Comput. Syst. 2014, 14, 1–25. [Google Scholar] [CrossRef]
  18. Eyerman, S.; Eeckhout, L. System-level performance metrics for multiprogram workloads. IEEE Micro 2008, 28, 42–53. [Google Scholar] [CrossRef]
  19. Sun, X.; Chen, Y. Reevaluating Amdahl’s law in the multicore era. J. Parallel Distribut. Comput. 2010, 70, 183–188. [Google Scholar] [CrossRef]
  20. Huang, T.; Zhu, Y.; Yin, X.; Wang, X.; Qiu, M. Extending Amdahl’s law and Gustafson’s law by evaluating interconnections on multi-core processors. J. Supercomput. 2013, 66, 305–319. [Google Scholar] [CrossRef]
  21. Rafiev, A.; Al-Hayanni, M.A.N.; Xia, F.; Shafik, R.; Romanovsky, A.; Yakovlev, A. Speedup and Power Scaling Models for Heterogeneous Many-Core Systems. IEEE Trans. Multi-Scale Comput. Syst. 2018, 99, 1–14. [Google Scholar] [CrossRef]
  22. Etinski, M.; Corbalan, J.; Labarta, J.; Valero, M. Understanding the future of energy-performance trade-off via DVFS in HPC environments. J. Parallel Distrib. Comput. 2012, 72, 579–590. [Google Scholar] [CrossRef]
  23. Rodrigues, R.; Koren, I.; Kundu, S. Does the Sharing of Execution Units Improve performance/Power of Multicores? ACM Trans. Embed. Comput. Syst. 2015, 14, 1–24. [Google Scholar] [CrossRef]
  24. Ma, J.; Yan, G.; Han, Y.; Li, X. An analytical framework for estimating scale-out and scale-up power efficiency of heterogeneous many-cores. IEEE Trans. Comput. 2016, 65, 367–381. [Google Scholar] [CrossRef]
  25. Newburn, C.J.; Dmitriev, S.; Narayanaswamy, R.; Wiegert, J.; Murty, R.; Chinchilla, F. Offload Compiler Runtime for the Intel® Xeon Phi Coprocessor. In Proceedings of the 27th IEEE Parallel and Distributed Processing Symposium Workshops & PhD Forum, Cambridge, MA, USA, 20–24 May 2013; pp. 1213–1225. [Google Scholar] [CrossRef]
  26. Yasin, A. A top-down method for performance analysis and counters architecture. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, USA, 23–25 March 2013; pp. 35–44. [Google Scholar] [CrossRef]
  27. Knauerhase, R.; Brett, P.; Hohlt, B.; Li, T.; Hahn, S. Using OS observations to improve performance in multicore systems. IEEE Micro. 2008, 28, 54–66. [Google Scholar] [CrossRef]
  28. Tallent, N.R.; Mellor-Crummey, J.M. Effective performance measurement and analysis of multithreaded applications. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Raleigh, NC, USA, 14–18 April 2009; pp. 229–240. [Google Scholar] [CrossRef]
  29. Zhang, W.; Li, J.; Yi, L.; Chen, H. Multilevel Phase Analysis. ACM Trans. Embed. Comput. Syst. 2015, 14, 1–29. [Google Scholar] [CrossRef]
  30. Intel OpenMP Runtime Library. Available online: http://www.openmprtl.org/ (accessed on 20 November 2017).
  31. Weaver, V.M.; Johnson, M.; Kasichayanula, K.; Ralph, J.; Luszczek, P.; Terpstra, D. Measuring Energy and Power with PAPI. In Proceedings of the IEEE International conference on Parallel Processing Workshops, Pittsburgh, PA, USA, 10–13 September 2012; pp. 262–268. [Google Scholar] [CrossRef]
  32. Terpstra, D.; Jagode, H.; You, H.; Dongarra, J. Collecting performance data with PAPI-C. Tools High Perform. Comput. 2010, 157–173. [Google Scholar] [CrossRef]
  33. Ge, R.; Feng, X.; Song, S.; Chang, H.-C.; Li, D.; Cameron, K.W. Powerpack: Energy profiling and analysis of high-performance systems and applications. IEEE Trans. Parallel Distrib. Syst. 2009, 21, 658–671. [Google Scholar] [CrossRef]
  34. Reddy, B.K.; Singh, A.K.; Biswas, D.; Merrett, G.V.; Al-Hashimi, B.M. Inter-cluster Thread-to-core Mapping and DVFS on Heterogeneous Multi-cores. IEEE Trans. Multi-Scale Comput. Syst. 2018, 4, 369–382. [Google Scholar] [CrossRef]
Figure 1. The impact of thread count on performance. (a) Case I: The program performance improves slowly with increasing thread count. (b) Case II: The performance speedup increases along with increasing thread count. (c) Case III: The program performance decreases with increasing thread count once the thread count reaches a certain value. (d) Case IV: The performance speedup exhibits an irregular change with increasing thread count.
Figure 2. Dynamic predictive thread mapping mechanism.
Figure 3. DPTM framework on the MIC heterogeneous system.
Figure 4. DPTM implementation.
Figure 5. The comparison of relative performance ratio between OS_Default and the thread count prediction model (TCPM).
Figure 6. Performance speedup comparison.
Figure 7. Reduction of the last-level cache misses.
Figure 8. Relative energy consumption compared to the OS default mapping strategy.
Figure 9. Energy–performance efficiency comparison. (a) Benchmark programs: blackscholes, streamcluster, raytrace, and canneal; (b) Benchmark programs: vips, x264, and swaptions; (c) Benchmark programs: bodytrack, freqmine, and ferret.
Figure 10. Performance, energy reduction, and overhead improvement ratio compared to the OS_Default.
Table 1. Speedup and corresponding thread count of the three strategies.

Benchmark | Optimal: # of Threads | Optimal: Speedup | OS_Default: # of Threads | OS_Default: Speedup | TCPM: # of Threads | TCPM: Speedup
blackscholes | 240 | 4.57 | 240 | 4.57 | 144 | 4.53
ferret | 192 | 17.33 | 240 | 17.04 | 168 | 17.21
streamcluster | 72 | 7.35 | 240 | 3.01 | 72 | 7.25
raytrace | 192 | 3.16 | 240 | 3.1 | 144 | 3.1
bodytrack | 120 | 17.94 | 240 | 8.21 | 96 | 16.27
vips | 72 | 34.41 | 240 | 18.58 | 72 | 34.35
canneal | 144 | 5.66 | 240 | 2.73 | 48 | 5.22
freqmine | 240 | 29.87 | 240 | 29.87 | 216 | 28.31
x264 | 168 | 23.48 | 240 | 16.93 | 72 | 23.41
swaptions | 168 | 54.35 | 240 | 38.02 | 144 | 52.21
Average | 161 | 18.43 | 240 | 13.33 | 118 | 17.85
Table 2. The overhead of DPTM on 10 benchmark programs.

Benchmark Program | Total Time | DPTM Adjusting Time | Additional Overhead
blackscholes | 192.63 | 0.83 | 0.43%
ferret | 20.80 | 0.08 | 0.38%
streamcluster | 70.95 | 2.82 | 3.97%
raytrace | 87.04 | 0.37 | 0.43%
bodytrack | 37.05 | 1.06 | 2.86%
vips | 6.13 | 0.12 | 1.96%
canneal | 148.18 | 5.51 | 3.72%
freqmine | 37.23 | 0.15 | 0.40%
x264 | 6.90 | 0.13 | 1.88%
swaptions | 7.92 | 0.34 | 4.29%
Average |  |  | 2.03%
