DTaPO: Dynamic Thermal-Aware Performance Optimization for Dark Silicon Many-Core Systems

Abstract: Future many-core systems need to handle high power density and chip temperature effectively. Some cores in a many-core system must be turned off, or kept 'dark', to manage chip power and thermal density. This phenomenon is known as the dark silicon problem, and it prevents many-core systems from utilizing, and gaining improved performance from, a large number of processing cores. This paper presents DTaPO, a dynamic thermal-aware performance optimization technique that optimizes the performance of dark silicon many-core systems under a temperature constraint. The proposed technique uses both task migration and dynamic voltage frequency scaling (DVFS) to optimize the performance of a many-core system while keeping the system temperature within a safe operating limit. Task migration puts hot cores in low-power states and moves their tasks to cooler dark cores to aggressively reduce chip temperature while maintaining high overall system performance. To reduce the task migration overhead due to cold start, the source core (i.e., the active core) keeps its L2 cache content during the initial migration phase, and the destination core (i.e., the dark core) can access it to reduce the impact of cold start misses. Moreover, the proposed technique limits task migration to cores that share the last level cache (LLC). When a major thermal violation occurs and no cooler cores are available, DVFS is used to gradually reduce the temperature of the hot cores by reducing their frequency. Experimental results for different threshold temperatures show that DTaPO can keep the average system temperature below the thermal limit. In particular, the execution time penalty is reduced by up to 18% compared with using only DVFS for all thermal thresholds, and the average peak temperature is reduced by up to 10.8 °C. In addition, the experimental results show that DTaPO improves the system's performance by up to 80% compared to optimal sprinting patterns (OSP) and reduces the temperature by up to 13.6 °C.


Introduction
Moore's law [1] predicts that the number of transistors on a chip would double every two years, and Dennard scaling [2] predicts that power downscaling is proportional to technology size. These two laws are the key concepts for increasing processor performance. As technology size reduces, it becomes difficult to scale down the supply voltage as it nears the threshold voltage. Thus, further increases in frequency are infeasible due to increasing power densities that directly contribute to increasing chip temperature. As a solution, more cores on a single chip are integrated to increase processing

Related Work
The dark silicon problem is due to the increase in many-core systems' power densities as technology sizes shrink, leading to thermal violations. In recent years, the dark silicon problem has received considerable attention as a real many-core system problem that needs to be carefully addressed. It prevents a many-core system from utilizing all available cores for maximum performance [16]. Many techniques have been proposed to optimize the performance of multi/many-core systems in the dark silicon era. These techniques can be classified as either techniques that optimize performance under a power budget constraint or techniques that optimize performance under a thermal constraint.
The power budget techniques in [6,8,9,17] consider a fixed per-chip power budget, the thermal design power (TDP), or a per-core power budget, the thermal safe power (TSP). These techniques aim to avoid thermal violations by running the system under the TDP/TSP power budget. However, using only a power budget may lead to a severe thermal violation because it excludes transient temperatures and heat transfer among cores [10]. On the other hand, thermal constraint techniques deal directly with the chip temperature. Thus, both transient temperatures and heat transfer among cores are considered to avoid thermal violations. A dynamic thermal management (DTM) technique should be used to prevent thermal violations at run-time. Task migration and DVFS are the two most widely used DTM techniques. The following paragraphs discuss related work that used these DTM techniques to maximize performance under the thermal constraint for dark silicon.
Some approaches use task migration and application mapping to improve dark silicon many-core systems. Shafique et al. [13] proposed a variability-aware dark silicon management technique called DaSiM. DaSiM uses both thread mapping and dark silicon patterning to lower the maximum temperature by modeling core-to-core leakage power variations. Consequently, it can activate or boost more cores. DaSiM aims to avoid a thermal violation by providing a lightweight run-time prediction technique for a steady-state temperature. This technique can predict the temperature distribution for a candidate solution. When a thermal violation occurs, DaSiM uses task migration or power-gating to deal with the thermal violation.
Similarly, Kanduri et al. [18] proposed a dark silicon aware run-time application mapping approach for many-core systems that distributes power density uniformly on the chip by patterning the active and dark cores. This approach keeps the active cores' temperature within a safe operating temperature to provide a higher power budget and better use of the resources. Wang et al. [19,20] also proposed a static and dynamic patterning approach for many-core systems to find the location and number of active and dark cores for each application. This approach aims to optimize the system throughput by optimizing performance, communication cost, and waiting time. All of these approaches used task migration only. However, in some cases, task migration alone cannot ensure operation at a safe temperature. Some cores must then be turned off, which leads to higher performance degradation than using DVFS.
Other approaches use DVFS and application mapping to improve dark silicon many-core systems. Khdr et al. [12] proposed a design-time resource management technique called DsRem for maximizing the performance of dark silicon many-core systems. It maps applications among active cores such that the maximum steady-state temperature for all cores is kept lower than the threshold temperature. DsRem schedules the number of active cores and their DVFS levels based on the thread-level parallelism (TLP) and instruction-level parallelism (ILP) characteristics of the running applications. However, DsRem has a high scheduling time overhead that may not be suitable for run-time performance optimization.
Another DVFS technique to improve dark silicon many-core systems' performance is to increase the cores' frequencies for a short period. This technique is called computational sprinting [7,11,21-27]. Raghavan et al. [11] proposed a computational sprinting technique that improves many-core system performance by activating all cores (including dark cores) at maximum frequency for a sub-second period to speed up parallel computation and later deactivating them to cool down the chip. During the sprinting mode, power consumption significantly exceeds the TDP budget. Raghavan et al. [21] proposed adaptive sprint pacing to adapt the sprint pattern dynamically at run-time. This technique runs all cores at maximum speed until half of the thermal constraints and then reduces the cores' speed to half. Wang et al. [27] proposed a technique to find the optimal sprinting patterns (OSP) that keep the many-core system sprinting without resting to maximize the system performance.
A few studies used a combination of task migration and DVFS, such as [8,28,29]. Hanumaiah et al. [28] proposed a technique for maximizing system performance by finding the optimal DVFS levels of the cores under a thermal constraint. However, this technique ignores the heat generated from the neighboring cores to reduce the computation overhead. In a many-core system, heat exchange among cores is significant and cannot be ignored. Wang et al. [8] proposed a hierarchical dynamic thermal management technique for many-core systems. This technique uses both task migration and DVFS with model predictive control to maximize many-core system performance. However, this technique did not consider the dark silicon problem. It reduces the voltage/frequency level of cores and does not turn them completely off for aggressive thermal reduction. Wang et al. [29] proposed a technique that considered the dark silicon problem. This technique swaps the tasks among active cores to balance the thermal/power of the chip. However, swapping tasks between two active cores doubles the migration overhead since the tasks migrate in both directions.
Our proposed technique differs from the techniques mentioned above in that it leverages dark silicon by moving a task from a hot core to a dark core. Previous studies avoid migration to dark cores because of the overhead of cold start cache misses. However, in modern many-core systems such as the Intel Xeon Phi [30], a core passes through multiple low-power states before it is completely shut off. During the first low-power state, its L2 cache stays active, and the destination core can access data from it rather than from the shared L3 cache or the main memory. The proposed technique combines task migration and DVFS to optimize many-core system performance while the system temperature is kept in a safe operating range. The tasks are moved in only one direction after activating the dark core, i.e., a task is only moved from an active core to the dark core. The proposed DTaPO differs from our previous proof-of-concept work [31] as follows:
• A cluster-based task migration is used to reduce the number of memory accesses due to L2 cache misses, as migration is only allowed between cores sharing the LLC.
• The cold start overhead is considered, as it can significantly affect the system performance.
• The proposed DTaPO's performance is evaluated under several threshold temperatures.
• A mix of four compute- and memory-intensive applications is used with more cores to expose the task migration overhead, which cannot be observed with compute-intensive applications alone.

Proposed DTaPO Methodology
This section presents the methodology of the proposed technique. The technique aims to dynamically optimize the performance of many-core systems under thermal constraints. Therefore, it should be computationally light. DTM techniques (i.e., task migration and DVFS) are good candidates for managing chip temperature at run-time. The proposed technique requires that the many-core system supports multiple low-power states. A short background on the core power states is given in the next subsection.

Core Power States
Modern many-core processors support several low-power states called C-states [32]. The C-states are named C0, C1, C2, ..., Cn, where the value of n depends on the processor design. The C0 state is the active state, in which the core is in execution mode. As the C-state goes deeper, more power-reducing actions are taken by shutting down more core components, as illustrated in Figure 1. However, the wake-up latency increases as the C-state goes deeper. According to the ACPI standard [32], the C1 state reduces the core voltage and turns off the core's clock while maintaining the contents of the L1/L2 caches. In the C2 state, the contents of the L1/L2 caches are flushed to the LLC. In the C3 state, the core is completely off.
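The C-state behavior described above can be summarized in a short Python sketch. It encodes only the four states as this section describes them; the names `CState` and `retains_private_caches` are illustrative and not part of any processor or ACPI interface.

```python
from enum import IntEnum


class CState(IntEnum):
    """Core power states as described in this section (deeper = less power)."""
    C0 = 0  # active: the core is executing
    C1 = 1  # clock off, voltage reduced, L1/L2 contents maintained
    C2 = 2  # L1/L2 contents flushed to the LLC
    C3 = 3  # core completely off


def retains_private_caches(state: CState) -> bool:
    """Return True if the core's L1/L2 contents are still held in this state."""
    return state in (CState.C0, CState.C1)


def deeper(a: CState, b: CState) -> CState:
    """Return the deeper of two states (lower power, longer wake-up latency)."""
    return max(a, b)
```

Under this model, a source core parked in C1 can still serve its L2 contents to a migration destination, whereas a core in C2 or C3 cannot.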

Proposed Framework
The proposed technique uses task migration to lower the temperature of hot active cores aggressively by moving tasks to cooler dark cores after turning them on, as shown in Figure 2a.
Moving the tasks to dark cores instead of other active cores reduces task migration overhead because the tasks move in only one direction, from active to dark cores. To reduce the cold start overhead due to L1/L2 cache misses, the proposed technique assumes that the source core enters the C1 state during the initial phase of migration. As mentioned in the previous section, a core maintains the contents of its L1/L2 caches in the C1 power state. Thus, the cache controller can fetch data for the destination core from the source core's L2 cache instead of the last level cache (LLC) or the main memory. Moreover, the cores are arranged in clusters, where all cores belonging to a cluster share the LLC, and task migration is cluster-based, i.e., tasks are migrated only among cores that share the same LLC. Cluster-based task migration reduces the number of memory accesses due to cache misses because the data are already available to all cores that share an LLC and need not be fetched from the main memory. When no cooler cores are available, DVFS is used to gradually reduce the temperature of the hot cores by lowering the voltage/frequency of the active cores until their temperature falls below the threshold temperature, as shown in Figure 2b. As the proposed technique optimizes the performance of dark silicon many-core systems, we assume that only 50% of the cores can be active simultaneously. Figure 3 shows an overview of the proposed technique. The active and dark cores are arranged in a chessboard pattern so that each active core is surrounded by dark cores and dissipates heat better [10]. Although the chessboard pattern increases the communication latency by one hop for each active core, its peak chip temperature is lower than that of the contiguous pattern. To map the upcoming task on the chessboard pattern, the following formula is used.
where l is the location of the new task, c is the location of the current task, and w is the width of the network-on-chip (NoC) mesh. Equation (1) ensures that all tasks are mapped to the active cores, which are surrounded by the dark cores. To ensure that the tasks are migrated in a cluster-based manner, Equation (2) is used.
where d is the location of the destination core and s is the location of the source core.

Periodically, the proposed technique monitors many-core system parameters such as the number of active cores, the number of dark cores, the current frequency setting, and the transient temperature. The transient temperature is sensed instead of the steady-state temperature so that the thermal management captures the actual core temperature, which might exceed the steady-state temperature [33]. If the transient temperature exceeds a predefined threshold temperature and the temperature of the dark cores is below the threshold temperature by a safe margin, the proposed technique moves tasks from active cores to dark cores. The safe margin prevents tasks from moving when the dark cores' temperature is only just below the threshold temperature.
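The chessboard placement, the cluster restriction, and the safe-margin check described above can be illustrated with the following Python sketch. It is not a reproduction of Equations (1) and (2): the row-major core numbering, the parity rule in `is_active_slot`, and the assumption that each group of eight consecutive core IDs shares one LLC are hypothetical choices made only for illustration.

```python
MESH_WIDTH = 8    # 8 x 8 mesh, matching the evaluated 64-core system
CLUSTER_SIZE = 8  # assumption: eight consecutive core IDs share one LLC


def is_active_slot(core_id: int, width: int = MESH_WIDTH) -> bool:
    """Chessboard rule (assumed): a slot is initially active if row + col is even."""
    row, col = divmod(core_id, width)
    return (row + col) % 2 == 0


def mesh_neighbors(core_id: int, width: int = MESH_WIDTH) -> list[int]:
    """Return the IDs of the north, south, west, and east neighbors in the mesh."""
    row, col = divmod(core_id, width)
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [r * width + c for r, c in candidates
            if 0 <= r < width and 0 <= c < width]


def same_cluster(src: int, dst: int, cluster_size: int = CLUSTER_SIZE) -> bool:
    """Cluster restriction: source and destination must share the same LLC."""
    return src // cluster_size == dst // cluster_size


def may_migrate(t_src: float, t_dst: float, t_thr: float, margin: float) -> bool:
    """Migrate only if the source is too hot and the destination is safely cool."""
    return t_src > t_thr and t_dst < t_thr - margin
```

With these assumptions, exactly half of the 64 slots are active, every mesh neighbor of an active slot is dark, and each row of the mesh forms one cluster containing four active and four dark cores.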

Proposed Algorithm
Given M multi-threaded applications running on a many-core system that consists of N cores, of which only half can be active, the proposed technique aims to minimize the total completion time, or makespan ($M_t$), while keeping the chip temperature below a predefined threshold temperature $T_{thr}$. This aim can be expressed mathematically as follows:

$$\min M_t \quad \text{subject to} \quad T_i < T_{thr}, \quad \forall i \in \{1, \ldots, N\}$$

where $T_i$ is the transient temperature of core i; the transient temperatures of all active and dark cores are checked. Table 1 describes all symbols that are used in our algorithm.

The proposed algorithm (DTaPO) is detailed in Algorithm 1. DTaPO monitors the following parameters: the transient temperature of all cores T, the set of all active cores $A = \{a_0, \ldots, a_n\}$, the set of all dark cores $D = \{d_0, \ldots, d_n\}$, and the set of current frequency levels of all cores F. In addition, DTaPO reads a threshold temperature $T_{thr}$ and a threshold frequency $F_{thr}$ from a configuration file. Periodically, the proposed technique monitors the transient temperature of each active core. If the temperature exceeds the threshold temperature, DTaPO checks the temperature of a destination dark core. If the temperature of the dark core is less than the threshold temperature by a safe margin ε, DTaPO marks the task as movable by setting $Q[t_i] = 1$ (Lines 4-6). If the temperature of the dark core is not less than the threshold temperature but is less than the active core temperature by ε, DTaPO reduces the frequency of the dark core by one frequency level step, called ζ, and marks the task as movable (Lines 7-10). Otherwise, it reduces only the frequency level of the high-temperature cores by ζ and marks the task as unmovable by setting $Q[t_i] = 0$ (Lines 11-14).

If the temperature is below the threshold temperature by the safe margin ε and the frequency does not exceed the threshold frequency, the proposed technique increases the frequency level by one step to enhance the performance (Lines 16-19). Lastly, if all tasks are movable, DTaPO performs a task migration by moving tasks from the active cores to the dark cores after activating the dark cores within the same cluster and putting the active cores in a low-power state (Lines 20-24).
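The decision rules described above can be sketched in Python as a single control-period step. This is a reading of the prose, not the authors' Algorithm 1: the function name `dtapo_step`, the `partner` mapping from each active core to a candidate dark core in its cluster, the lower frequency bound `f_min`, and the interpretation that migration is triggered when every hot core's task is marked movable are all assumptions made for illustration.

```python
def dtapo_step(temps, freqs, partner, t_thr, f_thr, eps, zeta, f_min=0):
    """Apply one control period of the (assumed) DTaPO decision rules.

    temps   -- dict: core ID -> transient temperature
    freqs   -- dict: core ID -> current frequency (updated in place)
    partner -- dict: active core ID -> candidate dark core ID (same cluster)
    Returns a list of (source, destination) migrations to perform.
    """
    movable = {}  # plays the role of Q[t_i] for tasks on hot cores
    for active, dark in partner.items():
        if temps[active] > t_thr:
            if temps[dark] < t_thr - eps:
                # Dark core is safely cool: the task can move as it is.
                movable[active] = True
            elif temps[dark] < temps[active] - eps:
                # Dark core is cooler than the hot core: move at reduced speed.
                freqs[dark] = max(f_min, freqs[dark] - zeta)
                movable[active] = True
            else:
                # No cooler destination: fall back to DVFS on the hot core.
                freqs[active] = max(f_min, freqs[active] - zeta)
                movable[active] = False
        elif temps[active] < t_thr - eps and freqs[active] + zeta <= f_thr:
            # Thermal headroom is available: raise the frequency by one step.
            freqs[active] += zeta
    if movable and all(movable.values()):
        return [(src, partner[src]) for src in movable]
    return []
```

After a migration, the caller would swap the roles of the two cores (the destination becomes active, and the source enters a low-power state) before the next control period.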

Algorithm 1 DTaPO algorithm
Input: A, D, F, T, $T_{thr}$, $F_{thr}$, ε, and ζ
Output: New sets of A, D, and F
1 while true do

Experimental Evaluation
This section presents the experimental setup, results, and discussion of the proposed work.

Experimental Setup
A 64-core NoC-based many-core system with shared memory was used to evaluate our proposed work. The simulated cores have a homogeneous microarchitecture, i.e., they have the same instruction set architecture (ISA), and heterogeneous frequencies, i.e., each core can run at a different frequency. The maximum frequency of each core is set to 4 GHz. Figure 4 shows the floorplan of the simulated system. The size of each core is 2.95 mm × 2.95 mm, and its power profile is calculated using McPAT [34] for 22-nm technology. Every core has a private 32 KB L1 data cache, a 32 KB L1 instruction cache, and a 512 KB L2 cache. Every eight cores share an 8 MB L3 cache. The 64 cores are connected through an 8 × 8 mesh NoC. A summary of the system configuration is presented in Table 2.
The experimental setup of the proposed work is shown in Figure 5. We used the LifeSim simulation tool [35] for many-core systems. LifeSim combines the Sniper [36], McPAT [34], and HotSpot [37] simulators. Sniper is a parallel x86-64 multi/many-core simulator. It offers fast simulation with an average performance error of 25% compared to real hardware. McPAT is widely used, including in recent work [38,39], for integrated power, area, and timing modeling, as it provides comprehensive low-level configuration details for multi/many-core processors. Other power models such as Wattch [40] and PowerTrain [41] can also be used to estimate the cores' power profiles because our proposed algorithm deals with the cores' transient temperatures, which do not change immediately with power changes. HotSpot is the most widely used thermal simulator. It is based on the popular stacked-layer packaging scheme of modern very-large-scale integration (VLSI) systems (see Figure 6). We modified Sniper to support dark silicon modeling. Specifically, we modified Sniper's scheduler to schedule the tasks only on the active cores using a core mask pattern. In addition, we modified McPAT to report only the static power of the dark cores. Moreover, we modeled the wake-up latency from a dark state to an active state by adding 100 µs, which is near the practical wake-up latency of real processors as reported in [42].

Based on the system configuration provided to the simulation tool (i.e., the number of cores and the floorplan), Sniper generates performance traces by running multi-threaded applications from the Splash-2 benchmark suite [43] and the PARSEC benchmark suite [44]. McPAT uses these traces to estimate the power consumption of each core. As the whole core is turned off, the total power of the core (including L1 and L2 power) is used as an input to HotSpot. The L3 cache's power is excluded because the L3 cache is located outside the core and is shared among the cluster's cores.
The processing core generates the most heat, and, thus, the core's power density is not uniformly distributed. As the temperature is measured for each core, the non-uniform power density will not affect our estimation. After a specified period, termed the control period, HotSpot uses both the power traces and the cores' steady-state temperature for the previous control period (i.e., the cores' initial temperature for the current control period) to estimate the transient temperature. Based on the generated transient temperature, the proposed DTaPO algorithm decides whether to migrate the tasks or to reduce the voltage/frequency level.
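To make the role of the initial temperature in a transient estimate concrete, the following Python sketch uses a textbook single-node lumped RC approximation. It is not HotSpot's solver, which models the full floorplan and package stack; the function name and the thermal resistance, capacitance, and ambient temperature parameters are hypothetical.

```python
import math


def transient_temperature(t_init, power_w, dt_s, r_th, c_th, t_amb):
    """First-order RC estimate of a core's temperature after dt_s seconds.

    t_init  -- temperature at the start of the control period (deg C)
    power_w -- average power during the control period (W)
    r_th    -- thermal resistance to ambient (K/W), hypothetical value
    c_th    -- thermal capacitance (J/K), hypothetical value
    t_amb   -- ambient temperature (deg C)
    """
    t_steady = t_amb + power_w * r_th  # temperature reached if power is held
    tau = r_th * c_th                  # thermal time constant (s)
    return t_steady + (t_init - t_steady) * math.exp(-dt_s / tau)
```

In this approximation, the temperature moves exponentially from its initial value toward the steady-state value, which is why a core's temperature does not change immediately when its power changes.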
In our experiment, a mix of four compute- and memory-intensive applications was used to evaluate the proposed technique's efficiency. Radix and ocean are from Splash-2, and blackscholes and bodytrack are from PARSEC. Compute-intensive applications are good candidates for validating our DTaPO algorithm because their tasks are high-temperature tasks, which increase the core temperature [45]. On the other hand, memory-intensive applications expose the task migration overhead of the proposed technique. Table 3 shows the number of instructions and cycles (in millions) and the number of memory accesses of the applications used, where each application has eight threads. Ocean and bodytrack have the highest number of memory accesses.

Table 3. Application characteristics.

The four applications were simulated on 64 cores, of which 32 are active and 32 are dark. The values of ε, ζ, and $F_{thr}$ were determined empirically by varying them until the system achieved an optimal switching frequency (i.e., the system does not keep switching between the active and dark cores). In our experiment, the values of ε, ζ, and $F_{thr}$ are set to 5 °C, 200 MHz, and 3.6 GHz, respectively. The configuration parameters of HotSpot are detailed in Table 4. The proposed DTaPO technique was evaluated using several threshold temperatures, ranging from 60 to 85 °C, to analyze the influence of the threshold temperature. In addition, to obtain more accurate results, the experiments were conducted ten times, and the average results are reported.

Experimental Results and Discussion
Our previous paper [31] shows that the use of only task migration cannot keep the average chip temperature below the threshold temperature. Therefore, in this paper, the proposed technique and DVFS are evaluated at several threshold temperatures.
Using task migration in the dark silicon era can utilize all cores, i.e., active and dark cores, by swapping tasks among active and dark cores, as illustrated in Figure 7. However, task migration incurs a system performance overhead due to three factors [46]:
1. A fixed time is spent on storing and restoring the architectural state.
2. Time is spent draining and refilling a core's pipeline.
3. Time is spent restoring the data from memory due to cache misses.
The last factor is considered very large compared to the first two. In our work, 1000 cycles are added as a penalty for the first factor for each task migration, while the last two factors are already modeled in Sniper, as mentioned in [46].
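As a rough worked example of the fixed costs involved, the Python sketch below converts the 1000-cycle state penalty to time and adds the 100 µs wake-up latency used in the experimental setup. Charging both costs to every migration is an assumption made here for illustration, and the function name is hypothetical.

```python
def fixed_migration_overhead_us(freq_hz, state_cycles=1000, wakeup_us=100.0):
    """Fixed per-migration cost in microseconds.

    state_cycles -- cycles charged for storing/restoring architectural state
    wakeup_us    -- latency to bring the dark destination core to the active state
    """
    state_us = state_cycles / freq_hz * 1e6  # convert cycles to microseconds
    return wakeup_us + state_us
```

At the maximum frequency of 4 GHz, the 1000-cycle penalty amounts to 0.25 µs, so the fixed cost is dominated by the wake-up latency; the cache-miss cost, which is modeled by the simulator, comes on top of this.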
Our proposed technique uses cluster-based task migration to reduce the task migration overhead, so that tasks are migrated only among cores that share an L3 cache. Figure 8 shows the normalized memory accesses when running the four applications using the proposed technique with and without clustering. Cluster-based task migration reduces the number of memory accesses by 8% compared with no clustering. In addition, the increase in the number of memory accesses using the proposed technique is only 1% compared with using no DTM. Table 5 shows the average number of times that task migration and DVFS were used in DTaPO at different thermal thresholds. The number of task migrations and DVFS invocations decreases as the threshold temperature increases.

Figure 9 shows that the chessboard pattern keeps the active cores cooler than a contiguous pattern. The peak chip temperature in the chessboard pattern is about 5 °C lower than in the contiguous pattern. In addition, the performance degradation due to increasing the communication latency by one hop for each active core is insignificant, at only 1%.

Figure 10 shows the penalty of using the proposed technique and DVFS only compared with using no DTM, as well as the maximum, average, and minimum temperatures when running the four applications (radix, ocean, blackscholes, and bodytrack). The 10th and 90th percentiles are used to represent the minimum and maximum temperatures because the transient temperatures fluctuate strongly, as shown in Figure 11. The proposed technique (DTaPO) always keeps the average temperature below the threshold temperature with up to 18% lower penalty than using only DVFS, for all threshold temperatures. Moreover, the average peak temperature is reduced by 10.8 °C. The results also show that the slowdown is highly dependent on the threshold temperature. The slowdown increases as the threshold decreases because DTaPO is invoked more often to keep the chip temperature in a safe operating range.
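The percentile-based summary of a fluctuating temperature trace can be sketched with Python's standard library. The function name and the choice of the inclusive interpolation method are illustrative assumptions; the source does not state how the percentiles were computed.

```python
import statistics


def temperature_summary(samples):
    """Summarize a temperature trace by its 10th percentile, mean, and 90th percentile."""
    # quantiles(..., n=10) returns the nine cut points between deciles.
    deciles = statistics.quantiles(samples, n=10, method="inclusive")
    return {
        "p10": deciles[0],   # robust stand-in for the minimum
        "mean": statistics.fmean(samples),
        "p90": deciles[-1],  # robust stand-in for the maximum
    }
```

Using the 10th and 90th percentiles rather than the raw extremes keeps short-lived spikes in the transient trace from dominating the reported range.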
To compare our work with the state of the art, the OSP algorithm [27], which shares the goal of maximizing the many-core system's performance under a thermal constraint, was re-implemented in this experiment. The concept of OSP is to run the many-core system at an optimal frequency that keeps the system sprinting without violating the thermal limit and entering the resting mode. To make a fair comparison, TDP is added as a constraint to the DTaPO algorithm so that the total power consumption does not exceed the TDP. The value of TDP was set to 42 W as in [27]. The values of ζ and ε were chosen to ensure that the system does not enter the resting mode. In our experiments, ζ and ε were set to 200 MHz and 0.5 °C, respectively.
The comparative results when running ten applications from Splash-2 and PARSEC on the simulation framework described in Section 4 are shown in Figure 12. Figure 12a shows the computational efficiency in terms of normalized completion time. In most of the studied applications, our DTaPO algorithm outperforms OSP in computational efficiency. It reduces the completion time over OSP by 80% in Ocean, 46% in FFT, 36% in Raytrace, 36% in Cholesky, 28% in Canneal, 21% in Blackscholes, and 2% in Fluidanimate. These applications have high instruction-level parallelism. Thus, they benefit from running 50% of the cores (32 cores) at a high frequency in DTaPO compared to running 100% of the cores (64 cores) at a low frequency in OSP. Although OSP reduces the completion time over DTaPO by 29% in Radix, 2% in Dedup, and 1.5% in Bodytrack, our algorithm reduces the peak temperature in all studied applications by up to 13.6 °C, as shown in Figure 12b. This is because DTaPO uses a chessboard pattern that gives better heat dissipation.

Conclusions
This paper proposes DTaPO, a dynamic thermal-aware performance optimization technique for dark silicon many-core systems. The proposed technique uses both task migration and DVFS to optimize many-core system performance under a thermal constraint. Task migration aggressively reduces the many-core system temperature by putting the hot cores in low-power states and moving their tasks to dark cores that share the LLC. Putting the source core in a low-power state reduces the overhead due to cold start cache misses. Moreover, limiting task migration to cores that share the LLC reduces cache misses for memory-intensive applications. Task migration enhances core utilization in the dark silicon many-core system by swapping tasks among active and dark cores. When cooler cores are unavailable, DVFS is used to keep the core temperature in a safe operating range. The simulation results show that the average system operating temperature is kept below the threshold temperature with up to 18% less penalty than using only DVFS for all threshold temperatures. In addition, the simulation results show that the slowdown is highly dependent on the threshold temperature. The slowdown increases as the threshold decreases because DTaPO is invoked more often to keep the chip temperature in a safe operating range. Moreover, the average peak temperature is reduced by 10.8 °C. Compared with the state of the art, our algorithm improves the system's performance by up to 80% in computational efficiency, especially for tasks with high instruction-level parallelism. In addition, it improves the temperature efficiency by up to 13.6 °C. In future work, we plan to integrate a patterning model into our technique to increase the number of active cores without violating the thermal limit.