Energy-Efficient Thread Mapping for Heterogeneous Many-Core Systems via Dynamically Adjusting the Thread Count

Abstract: Improving computing performance and reducing energy consumption are major concerns in heterogeneous many-core systems. The thread count directly influences the computing performance and energy consumption for a multithread application running on a heterogeneous many-core system. For this work


Introduction
With the recent shift towards energy-efficient computing, the heterogeneous many-core system has emerged as a promising solution in the domain of high-performance computing [1]. In the emerging heterogeneous many-core systems composed of a host processor and a co-processor, the host processor handles complex logical control tasks (i.e., task scheduling, task synchronization, and data allocation), and the co-processor computes large-scale parallel tasks with high computing density and simple logical branches. These two processors cooperate to compute different portions of a program to improve the program's energy efficiency [2]. The thread count chosen for a program that runs on both the host and the co-processor directly affects its computing performance and energy consumption.
The host processor in an emerging heterogeneous many-core system generally adopts chip multi-processors that contain a limited number of processor cores. Setting the thread count equal to the number of available processor cores of the host processor can obtain the desired performance. The co-processor generally adopts an emerging many-core processor (such as a GPU or Intel MIC), which contains many processing cores (generally tens or even hundreds of cores) and employs simultaneous multithreading (SMT). Using too few threads will not fully exploit the computing power of the co-processor, whereas using too many threads will increase the energy consumption and aggravate the contention for shared resources among multiple threads.
Figure 1 shows the variations in the performance with the thread count for the eight applications from the PARSEC benchmark [3] on the Intel Xeon Phi MIC heterogeneous many-core system. The test results can be divided into four types of scenarios. Case I: The program performance improves slowly with increasing thread count. When the thread count exceeds 24, the performance shows almost no further change, as shown in blackscholes and raytrace. Case II: The performance speedup increases along with increasing thread count, and the program has good scalability, as shown in freqmine and ferret. Case III: When the thread count reaches a certain value, the program performance decreases with increasing thread count, as shown in bodytrack and streamcluster. Case IV: The performance speedup exhibits an irregular change with increasing thread count, as shown in canneal and swaptions. These observations clearly indicate the importance of the appropriate number of cores and threads for computing performance and energy efficiency in many-core systems [4].
Many previous studies have been conducted to determine an appropriate thread count for multi-threaded applications running on multi-core or many-core systems. These approaches include a static setting based on prior experience or rules of thumb [5], iterative searching [6], and dynamic predicting [7][8][9]. In general, the static setting is simple and does not introduce additional overhead, but it cannot correctly reflect the running behaviors of applications as the input sets and running platforms change. Iterative searching looks for the appropriate thread count by repeatedly testing and comparing the performance of different thread counts. This approach has high overhead and cannot reflect the dynamically changing behavior of the application. Dynamic predicting estimates the optimum thread count by sampling the status information of the running program. This approach can reflect the dynamic characteristics of the running program, but introduces high overhead.
An appropriate thread count should be set relying on the program running behaviors and heterogeneous many-core architecture characteristics. However, existing research efforts focus mainly on traditional multi-core and many-core systems without considering the heterogeneous properties and are not applicable to the emerging heterogeneous many-core systems.
To handle the challenges above and on the basis of our previous paper [3], we analyze the impact of thread count on computing performance by focusing on the characteristics of the applications and their dynamic behaviors when they are running on an Intel MIC heterogeneous system. We establish an optimum thread count prediction model (TCPM) using regression analysis on the basis of extending Amdahl's law. Subsequently, a dynamic predictive thread mapping (DPTM) framework is designed based on the TCPM. DPTM uses the TCPM to estimate the optimum thread count at different phases of the application through a real-time sampling of hardware performance counter information. Meanwhile, DPTM dynamically adjusts the number of active hardware threads and processing cores during program execution. Evaluation results show that DPTM improves the application performance by 48.6% and decreases energy consumption by 59% on average for PARSEC benchmark programs running on an Intel MIC heterogeneous system compared with the traditional thread mapping policy. DPTM also introduces about 2% additional overhead on average to predict and adjust the thread count.

Related Work
In 2008, Suleman et al. [7] proposed a feedback-driven threading framework to dynamically control the thread count. For two types of applications that are limited by data synchronization and off-chip bandwidth, their Synchronization-Aware Threading (SAT) and Bandwidth-Aware Threading (BAT) mechanisms can predict the optimal thread count. The two threading mechanisms use an analytical model to predict the thread count without considering the performance impact of competition for shared caches, thread context switching, and thread migration. Lee et al. [8] presented a dynamic compilation system called Thread Tailor, which can dynamically stitch threads together on the basis of a communication graph and minimize synchronization overhead and contention for shared resources. However, Thread Tailor only analyzes the thread types and communication patterns offline without considering the dynamic phase changes of the running program itself. Pusukuri et al. [5] developed the Thread Reinforcer framework, which comprehensively considers the OS level factors (such as CPU utilization, lock conflicts, thread context switch rate, and thread migration) that affect the thread count. However, the phase changes of the application and the hardware architecture characteristics are not considered. Sasaki et al. [6] proposed a scheduler that can recognize phase changes and dynamically predict the scalability of an application. That work focuses on allocating an appropriate number of cores for each application, and the specific hardware architecture characteristics and shared resource conflicts are not considered. Heirman et al. [9] proposed a mechanism, CRUST, to improve application performance and energy efficiency by matching the working set sizes and off-chip bandwidth demands of an application with the available on-chip cache capacity and off-chip bandwidth. CRUST is extended in another work [10] to consider the SMT in the Intel Xeon Phi processor. However, the dynamic phase changes of applications are not considered. Moreover, the method activates all the processing cores of the MIC processor and only regulates the hardware threads inside each processing core when running the application. Thus, the energy consumption optimization is limited. Kanemitsu et al. [11] proposed a clustering-based task scheduling algorithm, which can minimize the schedule length in a heterogeneous system and improve system performance. The proposed method derives the lower bound of the total execution time for each processor by taking both the system and application characteristics into account to obtain the near-optimal set of processors. That work focuses on the near-optimal set of processors without considering the effect of the number of active threads on system performance. Singh et al. [12] proposed an energy-efficient runtime mapping and thread partitioning approach. For each concurrently executing OpenCL application, the proposed mapping process finds the appropriate number of CPU cores and GPU cores, and the partitioning process identifies an efficient partitioning of the application's threads between CPU and GPU cores. Birhanu et al. [13] proposed the Fastest-Thread-Fastest-Core (FTFC) dynamic thread scheduling mechanism. By dynamically and periodically measuring the CPU utilization of running threads, those threads that exhibit high CPU utilization are assigned to cores that can deliver high performance when needed, whereas those threads with low CPU utilization are assigned to low performance cores. However, the effect of the number of active threads on system performance and overall energy consumption is not considered in the above works. Liu et al. [14] analyzed the behavior and scalability of the Intel experimental many-core processor system, the SCC (Single-Chip Cloud Computer), by running several workloads. Their analysis indicates that the number of cores and threads should be carefully selected on the basis of the characteristics of different applications. This analysis also demonstrates that the appropriate number of cores and threads is important for computing performance and energy efficiency in many-core systems.
Unlike previous efforts, DPTM comprehensively considers the application characteristics, the phase change behaviors of the running program, and the features of heterogeneous many-core system architectures during the mapping of threads to the processing cores. DPTM can dynamically adjust the thread count while the program is running. Thus, DPTM has the potential of effectively exploiting the computing power of a heterogeneous many-core system to improve the computing performance and energy efficiency of the entire system.

Impact Factors on Computing Performance
The impact factors on heterogeneous many-core systems' computing performance for the different applications are as follows:

1. Program characteristics. Some programs are computation intensive. The increase in thread count can help achieve better performance. Some programs are memory intensive, that is, spawning more threads does not improve the performance due to the shared storage capacity and storage bandwidth limitations. Some programs are communication intensive, that is, they involve frequent information interaction between threads; thus, setting too many threads incurs considerable lock synchronization overhead and significantly decreases performance. Moreover, the different portions of some applications have different characteristics. We should dynamically set the thread count according to program characteristics to achieve optimal performance.

2. Hardware architecture and OS level impact factors. These factors mainly include the thread count, cache miss rate, bandwidth utilization, thread context switch rate, and thread migration rate.
The thread count is a major impact factor among the abovementioned factors, because the other factors change as the thread count changes. Increasing the thread count increases the cache miss rate because more threads compete for the shared cache [15]. Furthermore, additional transmission delays occur because more threads compete for the shared bandwidth. SMT has been introduced in the many-core processor, where many threads concurrently run on one processing core; thus, the thread context switch rate increases. To fully exploit the computing resources, thread migration is performed between the different processing cores in the many-core processor [16,17]. The thread migration rate increases as the thread count increases. According to principal component analysis, the above performance-impacting factors can be indirectly determined by the thread count.
In order to achieve a better tradeoff between the different factors that affect the application performance, an effective thread mapping mechanism is necessary. This can be achieved by dynamically adjusting the thread count while programs are running according to the running behaviors of applications and the characteristics of heterogeneous many-core architectures.

Optimum Thread Count Prediction Model (TCPM)
The thread count is the main factor that influences program execution performance. In this section, we design an optimum thread count prediction model (TCPM) based on our previous work [4].

Notations of Performance Metrics
The performance metrics used in our prediction model include the following [18]:

• Turnaround time (TT) refers to the total time consumed in executing the program.

• $TT_n$ refers to the turnaround time of the program that runs n processing threads.

• $TT_1$ refers to the turnaround time of the program that runs a single processing thread.

• SIP refers to the sum of instructions of a program.

• $IPS_1$ refers to the number of instructions executed per second when running a single thread.

• $IPS_n$ refers to the number of instructions executed per second when running n threads.

Theoretical Basis of TCPM
Regardless of whether the program is executed on a single processing core or on multiple processing cores, the SIP is fixed. On this basis, we propose a model by extending Amdahl's law in the multi-core era [19][20][21]. The model reflects the relationship between program performance and thread count. The performance effects of shared resource competition, thread synchronization, thread context switching, and thread migration are simultaneously considered.
Let f denote the turnaround time of the program executed with multiple processing threads relative to its turnaround time with a single processing thread; it can be calculated according to Amdahl's law as follows: where ω refers to the total number of tasks; n refers to the number of threads to be allocated; α, β, and γ refer to the sequential ratio factor of the task, the parallel ratio factor of the task, and the effect factor of the extra overhead, respectively; α, β, and γ satisfy the constraint condition α We can obtain the following by using Equations (1) and (2): Given that $TT_1$ and $TT_n$ can be obtained only after the program execution is finished, using the two metrics to predict the thread count is infeasible. However, $IPS_1$ and $IPS_n$ can be obtained dynamically during the execution of a program and can be used as the experimental values for calculating the unknown coefficients α, β, and γ by the least squares method.
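As a reading aid, the quantities defined above can be related as follows. The first line follows directly from the definitions (SIP is fixed, so each turnaround time is SIP divided by the corresponding IPS). The second line is only an assumed functional form, with a constant sequential term, a parallel term that shrinks with n, and an overhead term that grows linearly with n; it is a sketch consistent with the description of α, β, and γ, not a confirmed restatement of the model's equations.

```latex
% Follows from TT = SIP / IPS with SIP fixed:
f = \frac{TT_n}{TT_1} = \frac{SIP / IPS_n}{SIP / IPS_1} = \frac{IPS_1}{IPS_n}

% Assumed Amdahl-style form (illustrative only):
f(n) \approx \alpha + \frac{\beta}{n} + \gamma\, n
```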

1. The values of $IPS_1$ and $IPS_n$ can be collected by sampling and testing the program that runs at different numbers of processing cores and threads, for which Equation (2) can be used to calculate the value of f. After obtaining multiple pairs of values (f, n), the coefficients α, β, and γ can be calculated using the least squares method. The process is detailed as follows.
(a) Using Equation (3), we obtain the following: where n is the thread count.
(b) Using the least squares method, we obtain the following: The following equation is solved to calculate the coefficients α, β, and γ that minimize the deviation value S: Transforming Equation (6), we obtain the following: Substituting the obtained pairs of values (f, n) into Equation (6), we can construct a system of equations in the unknown coefficients α, β, and γ, and then solve it to calculate these coefficients.

2. The extreme value theorem is used to calculate the value n that minimizes the relative turnaround time f in Equation (3). The calculating process is as follows: Equation (8) denotes the final thread count prediction model.

3. By sampling real-time values of $IPS_1$ and $IPS_n$ with different thread numbers, we can calculate the relative turnaround time f according to Equation (2) combined with Equation (6) for obtaining the coefficients α, β, and γ. Finally, the thread count n can be calculated according to Equation (8), which is the optimum thread count.
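The three steps above can be sketched in a few lines. This is a minimal illustration that assumes the model has the form f(n) = α + β/n + γ·n (an assumption, not a confirmed restatement of Equation (3)); under that assumption, setting df/dn = 0 gives the minimizer n = sqrt(β/γ). The function names `fit_tcpm` and `optimum_thread_count` and the synthetic sample values are hypothetical.

```python
import math

def fit_tcpm(samples):
    """Least-squares fit of the ASSUMED model f(n) = alpha + beta/n + gamma*n.

    `samples` is a list of (n, f) pairs, where f = IPS_1 / IPS_n.
    """
    if len({n for n, _ in samples}) < 3:
        raise ValueError("need at least three distinct thread counts")
    # Design-matrix rows for the basis functions [1, 1/n, n].
    rows = [(1.0, 1.0 / n, float(n)) for n, _ in samples]
    ys = [f for _, f in samples]
    # Normal equations (A^T A) x = A^T y as an augmented 3x4 matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def optimum_thread_count(beta, gamma, max_threads=240):
    """Thread count minimizing beta/n + gamma*n on [1, max_threads]."""
    if beta > 0 and gamma > 0:
        # Interior minimum from df/dn = -beta/n^2 + gamma = 0.
        return max(1, min(max_threads, round(math.sqrt(beta / gamma))))
    # No interior minimum: the best value lies at one of the endpoints.
    return min((1, max_threads), key=lambda n: beta / n + gamma * n)

# Synthetic (n, f) pairs at the six predefined thread counts; values are illustrative.
samples = [(n, 0.05 + 0.9 / n + 0.0001 * n) for n in (8, 24, 48, 120, 168, 240)]
alpha, beta, gamma = fit_tcpm(samples)
print(optimum_thread_count(beta, gamma))
```

Rounding the continuous minimizer to the nearest integer is a simplification; a real runtime would likely also snap the result to a thread count that the hardware can schedule evenly across cores.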
Our proposed model has the following advantages over existing models (e.g., those using statistical regression and machine learning): it is simple, incurs lower overhead, and supports dynamic real-time prediction of the thread count. A machine learning-based model usually has a high prediction accuracy but requires a long training and learning time, thus introducing high additional overhead. Such a model can obtain a better prediction accuracy in static prediction, but it cannot properly adapt to dynamic prediction. The reason is that the model needs to be re-trained when the program inputs, the program running characteristics, or the program running platforms change, which introduces a higher overhead that hinders the feasibility of dynamically predicting the thread count. Furthermore, the statistical regression-based prediction models generally use multivariate regression analysis. These models can obtain a better prediction efficiency compared with the machine learning model, and have suitable prediction accuracy in a static prediction circumstance. However, given the need for a large number of performance metric samplings and complex model computation, these models easily lead to high overhead, which restricts their application in dynamic and real-time prediction.
In contrast, in constructing our model, the principal factor IPS is already considered. Meanwhile, in order to improve the prediction accuracy, the other performance influence factors are also considered in judging the program phase change. Since our proposed model is able to reach a good tradeoff between overhead and prediction accuracy, it can achieve an effective, dynamic, and real-time prediction of the optimal thread count for the heterogeneous many-core system.

DPTM Mechanism
We designed a dynamic predictive thread mapping (DPTM) mechanism based on the TCPM prediction model. DPTM dynamically regulates the thread count in real time during program execution to improve the application performance and energy efficiency. The detailed DPTM process is shown in Figure 2. The running threads of the program are dynamically adjusted to ensure that the different factors affecting program performance (i.e., thread context switch, thread migration, cache miss, and shared bandwidth utilization) remain in a reasonable state. This also prevents the program performance from being affected by unreasonable shared resource contention, thread synchronization, and transmission delay. Moreover, the idle hardware threads and processing cores are deactivated by incorporating other runtime power management approaches in order to maintain computing performance while reducing energy consumption [22][23][24].

DPTM Framework
DPTM targets heterogeneous many-core systems composed of the host processor (CPU) and co-processor (MIC). Figure 3 shows the DPTM framework. The CPU and MIC co-processor synergistically compute the workload of each application using the offload mode on the Intel MIC heterogeneous system [25]. The host processor is in charge of the task allocation and the control of the entire program run, and the MIC co-processor is in charge of the loop portion of the program. During the execution of the workload, whenever a loop portion is encountered, it is offloaded by the CPU to the MIC co-processor to be executed in parallel. When the MIC co-processor finishes the computation, it returns the computing results to the CPU. The program then continues to execute under the CPU control. Given that the major computing task is focused on the loop portions for most of the workloads, inserting the offload statements in the OpenMP loop parts can dispatch the loop to the MIC co-processor and achieve the synergistic computation of the CPU and MIC co-processor.
Under the DPTM framework, the CPU master process calculates the optimum thread count by using the prediction model TCPM. The optimal thread count estimate is based mainly on the real-time status information of the running program. The runtime system continuously samples the program status information and detects the phase change of the program run on the MIC co-processor. Once an evident phase change is detected, the current status information is returned to the CPU. The CPU master process recalculates the optimum thread count, and then dynamically regulates the parallelism of the program. On the basis of this framework, this process iterates until the computing task is completed.

Sampling the Status Information
In order to detect program phase changes, DPTM first samples the performance status information. The performance status information reflects the program behaviors and processor architecture characteristics, and is the basis of the prediction model. The effectiveness of sampled values directly impacts the accuracy of predicting the optimum thread count.
Therefore, the sampling accuracy and efficiency must be ensured. In the entire process of the dynamic thread mapping, the status information that must be sampled includes the following: $IPS_1$ and $IPS_n$, thread context switch rate, thread migration rate, cache miss rate, CPU utilization (CPU cycles), and bandwidth utilization (bus cycles). The status information can be sampled through real-time access to the performance monitoring unit using the perf tools provided by the Linux OS kernel [26,27]. However, using the perf tools directly introduces additional overhead because of the system calls. In order to decrease the additional overhead, we execute the rdpmc instruction directly from user space to obtain the performance status information, which is enabled by a kernel module that changes the processor's CR4.PCE configuration bit [28]. The system status information sampling interval is set to 100 milliseconds, an empirical value [7,29]. The sampled status information includes context switches, thread migrations, cache misses, CPU cycles, and bus cycles, of which the last two sampled status values are saved to improve the effectiveness of the sampled information. DPTM collects the IPS status values at the six different predefined thread counts of 8, 24, 48, 120, 168, and 240 when the program starts executing. The specific IPS value is obtained by sampling three performance counter information elements: instruction number, CPU cycles, and CPU clock. The formula is IPS = instruction_number/(cpu_cycles/cpu_clock). The collected IPS sampled values are used to predict the optimal thread count on the CPU side; therefore, these IPS sampled values should be saved in global data structures. The IPS status information continues to be sampled at a fixed time interval in the subsequent program run, similar to the other performance counter information. Unlike the other counters, however, all IPS sampled values are saved during the program execution to ensure that sufficient information is obtained to predict the optimum thread count.
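The IPS formula above can be made concrete with a short sketch. The function names and the counter values are hypothetical, and the 1.1 GHz clock is only an illustrative figure; the relative turnaround time f = IPS_1/IPS_n follows from the fact that SIP is fixed.

```python
def ips(instruction_number, cpu_cycles, cpu_clock_hz):
    """Instructions per second from three sampled counters.

    cpu_cycles / cpu_clock_hz is the elapsed time in seconds.
    """
    elapsed_seconds = cpu_cycles / cpu_clock_hz
    return instruction_number / elapsed_seconds

def relative_turnaround(ips_1, ips_n):
    """f = TT_n / TT_1 = IPS_1 / IPS_n, because SIP is the same in both runs."""
    return ips_1 / ips_n

# Illustrative counter readings (not measured data).
ips_1 = ips(2_000_000_000, 1_100_000_000, 1.1e9)    # single-thread sample
ips_48 = ips(40_000_000_000, 1_100_000_000, 1.1e9)  # sample at 48 threads
print(relative_turnaround(ips_1, ips_48))
```

A value of f below 1 means the multi-threaded configuration retires instructions faster than the single-thread baseline; the pairs (n, f) collected at the predefined thread counts are what the least squares fit consumes.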

Detecting the Phase Changes of the Running Program
Changes in program input and computing workload cause the program phase to change. Dynamically adjusting the processing cores allocated to the program according to the requirements of the computing resources at different execution phases is beneficial to improving the computing resource utilization and lowering the energy consumption. DPTM achieves this goal by detecting the program phase changes in real time. The evident changes mostly occur in the different loop parts of the program. DPTM detects the program phase changes using the following metrics: thread context switch rate, thread migration rate, cache miss rate, CPU utilization, and bandwidth utilization.
DPTM reads the performance counter information during program execution, which includes the context switch rate, thread migration rate, cache miss rate, CPU cycles, and bus cycles, once every 100 milliseconds [29]. DPTM then compares the current status information with the previously saved values and computes the relative change of every performance metric: ∆cpu-cycles, ∆context-switches, ∆thread-migration, ∆cache-misses, and ∆bus-cycles. The performance metric threshold values are set in advance according to empirical values. The performance metric thresholds are $\text{Threshold}_{\text{cpu-cycle}}$, $\text{Threshold}_{\text{context-switches}}$, $\text{Threshold}_{\text{thread-migration}}$, $\text{Threshold}_{\text{bus-cycles}}$, and $\text{Threshold}_{\text{cache-miss}}$.
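The comparison step can be sketched as follows. The text does not spell out how the relative change is computed, so this example assumes the absolute difference divided by the previously saved value; the function name, the zero-baseline handling, and the counter readings are all hypothetical.

```python
def relative_changes(previous, current):
    """Return |current - previous| / previous for every metric in `previous`."""
    changes = {}
    for name, old in previous.items():
        new = current[name]
        if old == 0:
            # Assumption: a counter that leaves zero counts as an unbounded change.
            changes[name] = 0.0 if new == 0 else float("inf")
        else:
            changes[name] = abs(new - old) / old
    return changes

# Two consecutive 100 ms samples (illustrative values, not measured data).
previous = {"cpu-cycles": 1_000_000, "context-switches": 200,
            "thread-migration": 40, "cache-misses": 5_000, "bus-cycles": 80_000}
current = {"cpu-cycles": 1_600_000, "context-switches": 270,
           "thread-migration": 40, "cache-misses": 4_000, "bus-cycles": 100_000}
deltas = relative_changes(previous, current)
print(deltas)
```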

Threshold Values of the Performance Metric
The specific threshold values of the performance metrics are obtained by empirical observation [8,28]. The detailed process is described as follows. (1) Five programs from the PARSEC benchmarks, namely, bodytrack, x264, canneal, blackscholes, and streamcluster, were selected, which cover all five of the performance metrics. The CPU cycle is important for blackscholes, the thread migration is important for bodytrack, the thread context switch is important for x264, the bandwidth is important for canneal, and the cache miss is important for streamcluster. (2) The above five representative benchmark programs were run and profiled with the native input sets. The relative threshold value of each profiled performance metric was set to the change rate beyond which the program performance became more sensitive to the thread count and exhibited a rapid change. We obtained the following relative threshold values of performance metrics through experimental measurement: $\text{Threshold}_{\text{cpu-cycle}}$ was 60%, $\text{Threshold}_{\text{thread-migration}}$ was 50%, $\text{Threshold}_{\text{bus-cycles}}$ was 50%, $\text{Threshold}_{\text{cache-miss}}$ was 30%, and $\text{Threshold}_{\text{context-switches}}$ was 30%. The above threshold values are only rough empirical observations, which lack a rigorous theoretical basis. Determining more reliable thresholds is left to our future work.

Detection Algorithm of the Program Phase Changes
The algorithm first compares the current sampled CPU cycles with the latest stored value. If ∆cpu-cycles is lower than $\text{Threshold}_{\text{cpu-cycle}}$ (i.e., no evident change has taken place in the CPU utilization rate), the program continues to run as before. If ∆cpu-cycles is larger than or equal to $\text{Threshold}_{\text{cpu-cycle}}$ (∆cpu-cycles ≥ $\text{Threshold}_{\text{cpu-cycle}}$), an evident change has occurred in the CPU utilization. The cause of this change can be that the computing task or the program running characteristics have changed, so the current threads cannot properly utilize the computing resources. The algorithm then examines the context switch change rate ∆context-switches, thread migration change rate ∆thread-migration, bandwidth utilization change rate ∆bus-cycles, and cache miss change rate ∆cache-misses to determine whether a phase change has occurred in the program.
The specific arbitration process is as follows. (1) If ∆context-switches > $\text{Threshold}_{\text{context-switches}}$ (i.e., the original thread count does not adapt well to the current computing resources, which results in a larger context switch change), then the program running phase has changed. (2) If ∆thread-migration > $\text{Threshold}_{\text{thread-migration}}$ and ∆bus-cycles > $\text{Threshold}_{\text{bus-cycles}}$ (i.e., the original thread count does not match well with the processing cores, which results in a large number of thread migrations between the different processing cores and excessive bandwidth utilization), then the program running phase has changed from computing intensive to memory intensive (or vice versa). The thread count should be adjusted to better adapt to the current program running characteristics under this condition. (3) If ∆cache-misses > $\text{Threshold}_{\text{cache-miss}}$ (i.e., the original thread count does not properly share the current cache resource, which results in a high cache miss rate), then the program running phase has changed and the thread count should be adjusted to adapt well to the current program running phase.
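The gate on CPU cycles followed by the three arbitration rules can be sketched as a single predicate. The thresholds are the empirical values reported in the text; the function name and the dictionary keys are hypothetical, and the input is assumed to hold the relative change of each metric as a fraction.

```python
# Empirical relative thresholds reported in the text, expressed as fractions.
THRESHOLDS = {
    "cpu-cycles": 0.60,
    "thread-migration": 0.50,
    "bus-cycles": 0.50,
    "cache-misses": 0.30,
    "context-switches": 0.30,
}

def phase_changed(delta, thresholds=THRESHOLDS):
    """Return True when the sampled relative changes indicate a phase change."""
    # Gate: without an evident change in CPU utilization, keep running as before.
    if delta["cpu-cycles"] < thresholds["cpu-cycles"]:
        return False
    # Rule (1): context switching changed markedly.
    if delta["context-switches"] > thresholds["context-switches"]:
        return True
    # Rule (2): thread migration AND bandwidth utilization both changed markedly.
    if (delta["thread-migration"] > thresholds["thread-migration"]
            and delta["bus-cycles"] > thresholds["bus-cycles"]):
        return True
    # Rule (3): cache misses changed markedly.
    return delta["cache-misses"] > thresholds["cache-misses"]

sample = {"cpu-cycles": 0.70, "context-switches": 0.10,
          "thread-migration": 0.60, "bus-cycles": 0.55, "cache-misses": 0.05}
print(phase_changed(sample))
```

Checking CPU cycles first keeps the common case cheap: the remaining counters are only compared when the CPU utilization has already moved past its threshold.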

DPTM Framework Implementation
Under the DPTM framework, the CPU master process calculates the optimum thread count using the prediction model. The optimum number of threads is mapped to specific processing cores by binding threads to cores, and idle processing cores are turned off or set as inactive to lower the energy consumption. The CPU master process dynamically adjusts the thread count of the program run on the MIC co-processor according to the program runtime phase changes. The DPTM framework prototype system is implemented as a dynamic runtime library by extending the Intel OpenMP Runtime Library [30].
The detailed DPTM prototype system implementation on the MIC heterogeneous system is shown in Figure 4. The DPTM prototype system mainly includes the following five parts. The entire program runs in the offload mode [26].

1. HOST SIDE 1 code: The CPU master process first executes the program. When encountering the loop part, it uses the #pragma offload target (MIC SIDE 1) to offload the loop code to the MIC co-processor for execution. It then performs the #pragma offload_wait target (mic) to wait for the execution results of the MIC.

2. MIC SIDE 1 code: The MIC first sets the control register CR4.PCE status by calling init_module() so that rdpmc can be accessed directly from user space to obtain the performance status information. It then pre-runs the program by calling the function pre_running_program(). The status information is then read, collected, and returned to the CPU master process by calling the functions read_pmc(), collect_status_information(), and return_status_information (HOST SIDE 2), respectively.

3. HOST SIDE 2 code: The CPU master process predicts the optimal thread count by calling predicting_optimal_number_threads (status_info, opt_number_thread) and then uses the pragma clause #pragma offload_transfer target (MIC SIDE 2) in (opt_number_threads) to send the optimum thread count to the MIC co-processor side to control the parallelism of the loop code running on the MIC co-processor.

4. MIC SIDE 2 code: The MIC co-processor re-executes the loop code according to the optimum thread count while continuously detecting the phase changes by calling detecting_running_exception.

5. HOST SIDE 3 code: The CPU master process continues to execute the subsequent portion of the program after receiving the computing results from the MIC side. If another loop part is encountered, it is offloaded to the MIC co-processor for execution according to the previous running mechanism. The program iterates in this way until the application is finished.
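The control flow of the five parts above can be mimicked with a small host-side simulation. This is only a schematic sketch: it does not use the Intel offload pragmas, it advances chunk by chunk instead of re-executing loop code, and every name in it (`run_with_dptm`, the callbacks, the chunk labels, and the thread counts 240 and 48) is hypothetical.

```python
def run_with_dptm(chunks, pre_run, predict, execute, phase_changed):
    """Simulate the predict / run / detect / re-predict cycle.

    pre_run()              -> initial status info (MIC SIDE 1)
    predict(status)        -> thread count        (HOST SIDE 2)
    execute(chunk, n)      -> new status info     (MIC SIDE 2)
    phase_changed(old, new)-> bool                (MIC SIDE 2 detection)
    """
    status = pre_run()
    threads = predict(status)
    schedule = []
    for chunk in chunks:
        new_status = execute(chunk, threads)
        schedule.append((chunk, threads))
        if phase_changed(status, new_status):
            # Status goes back to the host, which recomputes the thread count.
            threads = predict(new_status)
        status = new_status
    return schedule  # HOST SIDE 3 would continue with the results here

# Stub callbacks standing in for the real sampling and prediction code.
schedule = run_with_dptm(
    chunks=["compute-1", "compute-2", "memory-1", "memory-2", "compute-3"],
    pre_run=lambda: {"kind": "compute"},
    predict=lambda status: 240 if status["kind"] == "compute" else 48,
    execute=lambda chunk, n: {"kind": chunk.split("-")[0]},
    phase_changed=lambda old, new: old["kind"] != new["kind"],
)
print(schedule)
```

The simulation makes one property of a reactive scheme visible: the first chunk after a phase change still runs with the previous thread count, because the change is only detected from the status sampled during that chunk.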

Experimental Environment
Experimental platform. The experiment was conducted on an Intel MIC heterogeneous many-core system. The system consists of two eight-core Intel E5-2670 CPUs and one Intel Xeon Phi 7110P MIC co-processor with 64 GB of memory and a 300 GB hard disk. The main memory and the co-processor are connected by a PCI-E x16 bus, whose maximum data transmission speed is up to 16 GB/s. The OS is Red Hat Enterprise Linux Server release 6.3, and the software development environment is Intel parallel_studio_xe_2013_update3_intel64. The performance metrics were obtained using the PAPI_5.4.1 performance measurement tool [31,32].

TCPM Prediction Accuracy Evaluation
To evaluate the prediction accuracy of the TCPM, we tested the performance speedup of different benchmark programs over the serial program under three strategies: the Optimal strategy, which refers to the best performance speedup and the corresponding thread count; the OS_Default strategy, which uses the maximum number of hardware threads supported by the Intel MIC co-processor; and the optimum thread count prediction model (TCPM).
We evaluated the performance differences of the TCPM, Optimal, and OS_Default by testing the performance speedup of the different benchmark programs at multiple threads against the serial program. Table 1 shows the performance speedup and corresponding thread count for the above three strategies. Figure 5 shows the comparison of the relative performance ratio between TCPM and OS_Default. The relative performance ratio refers to the ratio of the speedup of the OS_Default or the TCPM strategy to that of the Optimal strategy. As shown in Table 1, the average thread count of all benchmark programs used by the three strategies (Optimal, OS_Default, and TCPM) is 161, 240, and 118, respectively. The average speedup over the serial program of the three strategies is 18.43, 13.33, and 17.85, respectively. Overall, the TCPM uses the fewest threads while achieving good performance. As shown in Figure 5, the average speedup of the OS_Default reached 73% of the Optimal, and the average speedup of the TCPM reached 97% of the Optimal, which demonstrates that the TCPM has good prediction accuracy.
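As a quick consistency check, the reported percentages can be compared with the ratios of the average speedups quoted above. This uses the ratio of averages, which need not equal the average of the per-benchmark ratios plotted in Figure 5, so agreement here is only indicative.

```latex
\frac{\text{OS\_Default}}{\text{Optimal}} = \frac{13.33}{18.43} \approx 0.723,
\qquad
\frac{\text{TCPM}}{\text{Optimal}} = \frac{17.85}{18.43} \approx 0.969
```

Both values round to the 73% and 97% stated in the text.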

DPTM Evaluation
To evaluate the effectiveness of DPTM, we first tested the speedup of the benchmark programs using the OS default mapping mechanism (OS_Default). In this mapping mechanism, the thread count equals the number of logical processing cores (hardware threads) available; the thread count was therefore set to 240, which is the maximum number of hardware threads supported by the Intel MIC co-processor. Second, we tested the best speedup and the corresponding thread count of all benchmark programs, which serves as the ideal thread count setting (Optimal). Third, we tested the speedup and the corresponding thread count of all benchmark programs using DPTM and compared them with the two previous measurements. Furthermore, for the Optimal and DPTM, we measured the reduction ratio of the last-level cache misses normalized to the OS default mapping mechanism. Finally, we measured the energy consumption of the three types of thread mapping and the overhead of DPTM, and compared their energy-performance efficiencies.

Performance Speedup Evaluation
We evaluated the performance differences of DPTM, Optimal, and OS_Default by testing the speedup of the benchmark programs at multiple threads over a single thread. Figure 6 shows the speedup of the benchmark programs for the three mapping approaches. The speedup of DPTM is 34.35% higher than that of the OS_Default and reaches 96.8% of the Optimal. For most benchmark programs, DPTM performed better than the OS_Default. The reason is that most of the benchmark programs are memory-bound or communication-bound applications. When too many threads are allocated, their performance may decrease because the threads compete for the shared cache and memory bandwidth. However, DPTM did not improve the performance of blackscholes, ferret, raytrace, and freqmine; instead, their performance decreased slightly. The main reason is that these four benchmark programs are compute-intensive applications whose performance shows an increasing trend as the thread count increases. Thus, using the maximum thread count (OS_Default) yielded the optimal performance.

Cache Miss Evaluation
Figure 7 shows the last-level cache misses of DPTM and Optimal normalized to the OS_Default mapping mechanism. The smaller the normalized value, the better. On average, DPTM reduced the cache misses by 8% compared to the OS_Default, and the Optimal reduced them by 12%. Overall, DPTM incurs fewer last-level (L2) cache misses than the OS_Default and comes close to the Optimal.

Energy Consumption Evaluation
We measured the power consumption of the Xeon Phi co-processor when running each benchmark program under the different mapping methods, drawing on existing runtime power management approaches [33,34]. The measurement was performed by a background thread that periodically read the power information from the Linux system file /sys/class/micras/power at 100-millisecond intervals. We then computed the relative energy consumption of each benchmark program under the DPTM and Optimal mappings.
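A background sampler of this kind can be sketched as follows. This is a minimal illustration, not the measurement code used in the experiment: the class `PowerSampler`, the helper `integrate_energy`, and the `read_power` callable are hypothetical names. On the platform described, `read_power` would parse /sys/class/micras/power, whose layout is not specified here, so the usage example substitutes a constant reading.

```python
import threading
import time
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in watts)


class PowerSampler:
    """Polls a power reading in a background thread at a fixed interval."""

    def __init__(self, read_power: Callable[[], float], interval_s: float = 0.1):
        self._read_power = read_power  # hypothetical reader supplied by the caller
        self._interval_s = interval_s  # 0.1 s matches the 100 ms interval in the text
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples: List[Sample] = []

    def _run(self) -> None:
        while not self._stop.is_set():
            self.samples.append((time.monotonic(), self._read_power()))
            # Sleep for one interval, but wake immediately when asked to stop.
            self._stop.wait(self._interval_s)

    def __enter__(self) -> "PowerSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()


def integrate_energy(samples: List[Sample]) -> float:
    """Trapezoidal estimate of energy in joules from (time, power) samples."""
    return sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )


if __name__ == "__main__":
    # Stand-in reader returning a constant 100 W; a real reader is platform specific.
    with PowerSampler(read_power=lambda: 100.0) as sampler:
        time.sleep(0.35)  # the benchmark would run here
    print(f"{len(sampler.samples)} samples, {integrate_energy(sampler.samples):.1f} J")
```

Separating the sampling thread from the integration step keeps the energy estimate a pure function of the recorded samples, which makes it easy to check independently of timing.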
The relative energy consumption was quantified using the metric normalized energy consumption in our evaluation. The specific definition of the metric is as follows:

$$\text{Normalized energy consumption} = \frac{\text{EnergyConsumption}}{\text{EnergyConsumption}_{240\_threads}} = \frac{\text{Power} \times T}{\text{Power}_{240\_threads} \times T_{240\_threads}}$$

where EnergyConsumption is the energy consumption of running a benchmark program under a given thread mapping (here, the Optimal or DPTM); Power is the corresponding power; T is the corresponding execution time; EnergyConsumption_{240_threads} is the energy consumption of running the same benchmark program using the OS_Default mapping; Power_{240_threads} is the corresponding power; and T_{240_threads} is the corresponding execution time.
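As a small worked sketch of this metric, the function below divides the energy of a mapping (power times execution time) by the energy of the 240-thread OS_Default run. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not measurements from the experiment.

```python
def energy(power_w: float, time_s: float) -> float:
    """Energy in joules for an average power (W) sustained over a run time (s)."""
    return power_w * time_s


def normalized_energy(
    power_w: float, time_s: float, power_240_w: float, time_240_s: float
) -> float:
    """Energy of a mapping relative to the 240-thread OS_Default run."""
    baseline = energy(power_240_w, time_240_s)
    if baseline <= 0:
        raise ValueError("baseline energy must be positive")
    return energy(power_w, time_s) / baseline


# Hypothetical measurements, for illustration only (not from the paper).
value = normalized_energy(power_w=90.0, time_s=40.0, power_240_w=120.0, time_240_s=50.0)
print(f"normalized energy consumption: {value:.2f}")
```

A value below 1 means the mapping consumed less energy than the OS_Default run.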
As shown in Figure 8, the mean normalized energy consumption of the Optimal is 53.7% of the OS_Default, whereas that of DPTM is only 41%. The reason is that DPTM dynamically regulates the thread count and allocates an appropriate number of threads in each program phase. Thus, it avoids activating unnecessary hardware threads and processing cores and lowers the total energy consumption over the entire program execution.

Energy-Performance Efficiency Evaluation
To evaluate the energy-performance efficiency of the thread mapping, we defined two metrics: the energy-performance efficiency and the normalized energy efficiency. The energy-performance efficiency is defined as follows:

$$\text{Energy-performance efficiency} = \frac{\text{Speedup}}{\text{Power} \times T} = R \times \frac{\text{Speedup}}{N_t \times T} \quad (12)$$

where $\text{Power} = P_0 \times N_t$ and $R = 1/P_0$; $P_0$ is the power per unit time of each running thread (i.e., a constant), $T$ is the average execution time of each thread, and $N_t$ is the thread count. Furthermore, we defined the normalized energy efficiency by removing the constant parameter $R$ in Equation (12) as follows:

$$\text{Normalized energy efficiency} = \frac{\text{Speedup}}{N_t \times T} \quad (13)$$

A higher value of the normalized energy efficiency means more efficient use of energy. Figure 9 shows that DPTM's energy-performance efficiency is better than that of the OS_Default and the Optimal in all benchmark programs, except for streamcluster, where DPTM's energy-performance efficiency was slightly lower than that of the Optimal, and vips, where it equaled that of the Optimal. The comparison of the geometric average for the three mapping mechanisms shows that DPTM achieved a higher energy-performance efficiency.
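The normalized energy efficiency of Equation (13) can be illustrated with a short sketch. The function names and the two sample configurations are hypothetical; the second function assumes, as stated in the text, that total power is the per-thread constant multiplied by the thread count.

```python
def normalized_energy_efficiency(
    speedup: float, thread_count: int, avg_thread_time_s: float
) -> float:
    """Equation (13): Speedup / (N_t * T)."""
    if thread_count <= 0 or avg_thread_time_s <= 0:
        raise ValueError("thread_count and avg_thread_time_s must be positive")
    return speedup / (thread_count * avg_thread_time_s)


def energy_performance_efficiency(
    speedup: float, thread_count: int, avg_thread_time_s: float, p0_w: float
) -> float:
    """Speedup per unit energy, assuming Power = P0 * N_t, so that R = 1 / P0."""
    if p0_w <= 0:
        raise ValueError("p0_w must be positive")
    return normalized_energy_efficiency(speedup, thread_count, avg_thread_time_s) / p0_w


# Hypothetical configurations, for illustration only (not from the paper).
many_threads = normalized_energy_efficiency(speedup=12.0, thread_count=240, avg_thread_time_s=10.0)
fewer_threads = normalized_energy_efficiency(speedup=16.0, thread_count=120, avg_thread_time_s=8.0)

print(f"240 threads: {many_threads:.4f}")
print(f"120 threads: {fewer_threads:.4f}")
```

Because $R$ is a constant, ranking mappings by the normalized metric gives the same order as ranking them by the energy-performance efficiency, which is why the constant can be dropped.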

Overhead Evaluation
The overhead time was mainly caused by the status information transmission time and the thread count estimation time, both of which are incurred when the number of threads is adjusted. We used DPTM's adjusting time to represent them. The additional overhead, which measures the influence of the overhead time on overall program performance, is equal to DPTM's adjusting time divided by the total running time of the program. We tested the overhead of DPTM on different benchmark programs. Table 2 shows the overhead of DPTM on 10 benchmark programs. The average additional overhead introduced by DPTM is only 2.03%, which is negligible relative to the energy efficiency gained. In addition, we compared the performance improvement rate and the energy reduction rate with the overhead of DPTM against the OS_Default strategy in order to evaluate the validity of DPTM.
Figure 10 shows the performance improvement, energy reduction, and additional overhead ratios of DPTM compared with the OS_Default across all benchmark programs. From Figure 10, we can see that the average performance of the ten benchmark programs improved by 48.6%, and the average energy consumption was reduced by 59%. The performance of four benchmark programs (i.e., blackscholes, ferret, raytrace, and freqmine) did not improve and instead decreased slightly, but the energy consumption of each of these programs declined markedly. Moreover, the energy reduction is much larger than the performance degradation and the additional overhead. The reason is that DPTM predicts and sets the smallest number of threads that still delivers near-optimal performance and energy efficiency.
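The additional overhead defined above is a simple ratio, sketched below. The function name and the timing values are hypothetical and serve only to show the arithmetic; they are not entries from Table 2.

```python
from statistics import mean


def additional_overhead(adjusting_time_s: float, total_time_s: float) -> float:
    """DPTM's adjusting time as a fraction of the total running time."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    return adjusting_time_s / total_time_s


# Hypothetical (adjusting time, total running time) pairs in seconds.
timings = {"bench_a": (1.5, 75.0), "bench_b": (0.6, 20.0)}

overheads = {name: additional_overhead(adj, total) for name, (adj, total) in timings.items()}
average_overhead = mean(overheads.values())

for name, value in overheads.items():
    print(f"{name}: {value:.2%}")
print(f"average: {average_overhead:.2%}")
```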

Conclusions
In this work, we investigated the impact of thread count on program performance and energy efficiency in a heterogeneous many-core system. An optimum thread count prediction model (TCPM) was proposed by using regression analysis on the basis of an extension of Amdahl's law. Based on TCPM, a dynamic predictive thread mapping (DPTM) framework was proposed for the Intel MIC heterogeneous many-core system in order to improve program performance and lower system energy consumption. Experimental evaluation results show that the DPTM framework is effective in terms of improving program performance and lowering energy consumption. DPTM can be used to dynamically regulate the active hardware threads and processing cores according to the program behavior, phase changes, and the dynamic requirements for computing resources during program execution. It achieves high computing performance and low energy consumption at the cost of a negligible additional overhead.
Ongoing and future work is focused on extending the DPTM framework to other heterogeneous architectures and applying the approach to mixed-mode workloads. In addition, we will explore the phase change threshold parameters and directly compare DPTM with other mapping approaches.

Figure 1. The impact of thread count on performance. (a) Case I: The program performance improves slowly with increasing thread count. (b) Case II: The performance speedup increases along with increasing thread count. (c) Case III: When the thread count reaches a certain value, the program performance decreases with increasing thread count. (d) Case IV: The performance speedup exhibits an irregular change with increasing thread count.

Figure 2. Dynamic predictive thread mapping mechanism. First, DPTM pre-runs a portion of the program and samples the program performance metric values $IPS_1$ and $IPS_n$ to calculate the prediction model parameters α, β, and γ. DPTM then uses the prediction model to compute the predicted thread count. The program is then re-executed using the estimated thread count. The program running status is continuously monitored using five collected system-level performance metrics: the CPU utilization, thread context switch, thread migration, cache utilization, and bandwidth utilization rates. Once a phase change of the program is detected, the thread count is re-calculated according to the current values of $IPS_1$

Figure 3. DPTM framework on the MIC heterogeneous system.

Figure 5. The comparison of relative performance ratio between OS_Default and the thread count prediction model (TCPM).

Figure 7. Reduction of the last-level cache misses.

Figure 8. Relative energy consumption compared to OS default mapping strategy.

Figure 10. Performance improvement, energy reduction, and additional overhead ratios compared to the OS_Default.

Table 1. Speedup and corresponding thread count of the three strategies.

Table 2. The overhead of DPTM on 10 benchmark programs.