This section presents the experimental setup, results, and discussion of the proposed work.
4.1. Experimental Setup
A 64-core NoC-based many-core system with shared memory was used to evaluate our proposed work. The simulated cores have a homogeneous microarchitecture, i.e., they share the same instruction set architecture (ISA), but heterogeneous frequencies, i.e., each core can run at a different frequency. The maximum frequency of each core is set to 4 GHz.
Figure 4 shows the floorplan of the simulated system. The size of each core is 2.95 mm × 2.95 mm, and its power profile is calculated using McPAT [34] for 22-nm technology. Every core has a private 32 KB L1 data cache, a 32 KB L1 instruction cache, and a 512 KB L2 cache. Every eight cores share an 8 MB L3 cache. The 64 cores are connected through a mesh NoC. A summary of the system configuration is presented in Table 2.
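For concreteness, the configuration summarized in Table 2 can be captured in a small dictionary, a sketch in which the field names are our own; the values come from the text above:

```python
# Summary of the simulated system configuration described above.
# Field names are illustrative; values are taken from the section text.
system_config = {
    "num_cores": 64,
    "max_freq_ghz": 4.0,
    "core_size_mm": (2.95, 2.95),
    "technology_nm": 22,
    "l1_dcache_kb": 32,
    "l1_icache_kb": 32,
    "l2_cache_kb": 512,
    "l3_cache_mb_per_cluster": 8,
    "cores_per_l3_cluster": 8,
    "noc_topology": "mesh",
}

# Derived quantities: number of L3 clusters and total L3 capacity.
num_clusters = system_config["num_cores"] // system_config["cores_per_l3_cluster"]
total_l3_mb = num_clusters * system_config["l3_cache_mb_per_cluster"]
```

With 64 cores and eight cores per cluster, this yields eight L3 clusters, i.e., 64 MB of L3 cache in total.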
The experimental setup of the proposed work is shown in Figure 5. We used the LifeSim simulation tool [35] for many-core systems. LifeSim integrates the Sniper [36], McPAT [34], and HotSpot [37] simulators. Sniper is a parallel x86-64 multi-/many-core simulator; it offers fast simulation with a small average performance error compared to real hardware. McPAT is widely and recently used [38,39] for integrated power, area, and timing modeling, as it provides comprehensive low-level configuration details for multi-/many-core processors. Other power models, such as Wattch [40] and PowerTrain [41], can also be used to estimate the cores' power profiles, since our proposed algorithm deals with the cores' transient temperatures, which do not change immediately with power changes. HotSpot is the most widely used thermal simulator. It is based on the popular stacked-layer packaging scheme of modern very-large-scale integration (VLSI) systems (see Figure 6). We modified Sniper to support dark silicon modeling. Specifically, we modified Sniper's scheduler to schedule tasks only on the active cores using a core mask pattern. In addition, we modified McPAT to report only the static power of the dark cores. Moreover, we modeled the wake-up latency from the dark state to the active state by adding 100 μs, which is near the practical wake-up latency of real processors, as reported in [42].
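The scheduler and wake-up modifications described above can be sketched as follows. This is an illustrative model only, with our own function names and data structures; the real changes live inside Sniper's scheduler and McPAT's power model.

```python
# Sketch of the dark-silicon modifications: tasks are placed only on
# cores marked active in a core mask, and waking a dark core pays a
# fixed wake-up latency (100 us, near practical values per [42]).
WAKE_UP_LATENCY_US = 100

def wake_core(core_state):
    """Waking a dark core pays the wake-up latency; an already
    active core pays nothing."""
    if core_state == "dark":
        return "active", WAKE_UP_LATENCY_US
    return "active", 0

def schedule_on_mask(tasks, core_mask):
    """Schedule tasks only on cores whose mask bit is 1 (active),
    mirroring the modified scheduler's core mask pattern."""
    active_cores = [i for i, bit in enumerate(core_mask) if bit]
    # Simple round-robin over the active cores only.
    return {task: active_cores[i % len(active_cores)]
            for i, task in enumerate(tasks)}
```

A core mask with alternating bits, for example, confines all task placements to the even-numbered cores while the odd-numbered cores stay dark.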
Based on the system configurations provided to the simulation tool (i.e., the number of cores and the floorplan), Sniper generates performance traces by running multi-threaded applications from the Splash-2 [43] and PARSEC [44] benchmark suites. McPAT uses these traces to estimate the power consumption of each core. Since the whole core is turned off in the dark state, the total power of the core (including the L1 and L2 power) is used as an input to HotSpot. The L3 cache's power was excluded, as it is located outside the core and is shared among the cluster's cores. The processing core generates the most heat, and thus the core's power density is not uniformly distributed; however, since the temperature is measured for each core, the non-uniform power density does not affect our estimation. After a specified period, termed the control period, HotSpot uses both the power traces and the cores' steady-state temperatures from the previous control period (i.e., the cores' initial temperatures for the current control period) to estimate the transient temperature. Based on the generated transient temperature, the proposed DTaPO algorithm decides either to migrate tasks or to reduce the voltage/frequency level.
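The per-control-period decision can be sketched as below. This is a hedged illustration, not the actual DTaPO policy: the 5 °C margin used here to choose between migration and DVFS, and the 80 °C threshold, are stand-in values of our own.

```python
# Illustrative per-control-period decision: given the transient
# temperatures HotSpot produced for the active cores, pick a DTM action.
T_THRESHOLD_C = 80.0  # example threshold temperature (stand-in value)

def dtapo_step(transient_temps_c, threshold=T_THRESHOLD_C):
    """Return the DTM action for one control period."""
    hottest = max(transient_temps_c)
    if hottest <= threshold:
        return "none"          # chip is in the safe operating range
    if hottest - threshold > 5.0:
        return "migrate"       # far over threshold: swap with a dark core
    return "dvfs"              # slightly over: lower the V/f level
```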
In our experiment, a mix of four compute- and memory-intensive applications was used to evaluate the proposed technique's efficiency: radix and ocean from Splash-2, and blackscholes and bodytrack from PARSEC. Compute-intensive applications are good candidates to validate our DTaPO algorithm because their tasks are high-temperature tasks, which can noticeably increase the core temperature [45]. On the other hand, memory-intensive applications expose the task migration overhead of the proposed technique. Table 3 shows the number of instructions and cycles (in millions) and the number of memory accesses of the used applications, where each application has eight threads. It can be noticed that ocean and bodytrack have the highest numbers of memory accesses.
The four applications were simulated on 64 cores, where 32 cores are active and 32 are dark. The values of the algorithm's three parameters are determined empirically by tuning them until the system achieves an optimal switching frequency (i.e., the system does not keep switching between the active and dark cores); in our experiment, they are set to 5 °C, 200 MHz, and 3.6 GHz, respectively. The configuration parameters of HotSpot are detailed in Table 4. The proposed DTaPO technique was evaluated at several threshold temperatures, ranging from 60 to 85 °C, to analyze the influence of the threshold temperature. In addition, to obtain more reliable results, each experiment was conducted ten times, and the average results are reported.
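The evaluation protocol above (a threshold sweep with ten averaged repetitions) can be sketched as follows; `run_experiment` is a hypothetical stand-in for one simulation run.

```python
# Sketch of the evaluation protocol: sweep the threshold temperature
# from 60 to 85 C and average ten repetitions per setting.
def average_over_runs(run_experiment, thresholds=range(60, 86, 5), repeats=10):
    """Return {threshold: mean result over `repeats` runs}."""
    results = {}
    for t in thresholds:
        samples = [run_experiment(t) for _ in range(repeats)]
        results[t] = sum(samples) / repeats
    return results
```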
4.2. Experimental Results and Discussion
Our previous paper [31] shows that using only task migration cannot keep the average chip temperature below the threshold temperature. Therefore, in this paper, the proposed technique and DVFS are evaluated at several threshold temperatures.
Using task migration in the dark silicon era can utilize all cores, i.e., both active and dark cores, by swapping tasks among them, as illustrated in Figure 7. However, task migration incurs a system performance overhead due to three factors [46]:
A fixed time is spent on storing and restoring the state of architecture.
Time is spent draining and refilling a core’s pipeline.
Time is spent restoring the data from memory due to cache misses.
The last factor is very large compared to the first two. In our work, 1000 cycles are added as a penalty for the first factor on each task migration, while the last two factors are already modeled in Sniper, as mentioned in [46].
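The migration cost accounting described above can be sketched as below; the pipeline and cache-miss terms are modeled inside Sniper, so they appear here only as symbolic inputs.

```python
# Migration overhead model: a fixed 1000-cycle penalty per migration
# for saving/restoring architectural state, plus pipeline drain/refill
# and cache-miss costs (the latter two already modeled in Sniper).
FIXED_STATE_PENALTY_CYCLES = 1000

def migration_overhead(num_migrations, pipeline_cycles=0, cache_miss_cycles=0):
    """Total migration overhead in cycles for a run with the given
    number of task migrations."""
    fixed = num_migrations * FIXED_STATE_PENALTY_CYCLES
    return fixed + pipeline_cycles + cache_miss_cycles
```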
Our proposed technique uses cluster-based task migration to reduce the task migration overhead, so that tasks are migrated only among cores that share an L3 cache. Figure 8 shows the normalized memory accesses when running the four applications using the proposed technique with and without clustering. Cluster-based task migration reduces the number of memory accesses compared with no clustering, and the increase in memory accesses when using the proposed technique is small compared with using no DTM at all.
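The cluster restriction can be sketched as follows: a task on a hot core may only move to a dark core in the same eight-core L3 cluster, so the warmed-up L3 contents remain reachable after the move.

```python
# Cluster-based migration: candidate targets are restricted to dark
# cores in the same 8-core L3 cluster as the hot core.
CORES_PER_CLUSTER = 8

def cluster_of(core_id):
    """Index of the L3 cluster a core belongs to."""
    return core_id // CORES_PER_CLUSTER

def migration_targets(hot_core, dark_cores):
    """Dark cores eligible to receive the hot core's task."""
    return [c for c in dark_cores if cluster_of(c) == cluster_of(hot_core)]
```

For example, a hot core 2 (cluster 0) can only migrate its task to dark cores 0 through 7; dark cores in other clusters are excluded.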
Table 5 shows the average number of times that task migration and DVFS were invoked by DTaPO at different thermal thresholds. It can be seen that the numbers of task migrations and DVFS invocations decrease as the threshold temperature increases.
Figure 9 shows that using the chessboard pattern keeps the active cores cooler than using a contiguous pattern: the peak chip temperature with the chessboard pattern is about 5 °C lower than with the contiguous pattern. In addition, the performance degradation due to increasing the communication latency by one hop for each active core is insignificant.
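The two activation patterns compared in Figure 9 can be sketched for the 8×8 mesh as below: the chessboard pattern interleaves active and dark cores so every active core is surrounded by dark neighbors, while the contiguous pattern packs all active cores into one half of the chip.

```python
# Activation masks for an 8x8 mesh: 1 = active core, 0 = dark core.
MESH_DIM = 8

def chessboard_mask():
    """Alternate active and dark cores in both dimensions."""
    return [[(r + c) % 2 for c in range(MESH_DIM)] for r in range(MESH_DIM)]

def contiguous_mask():
    """Pack all active cores into the top half of the mesh."""
    return [[1 if r < MESH_DIM // 2 else 0 for c in range(MESH_DIM)]
            for r in range(MESH_DIM)]
```

Both masks activate exactly 32 of the 64 cores; in the chessboard mask, no two horizontally adjacent cores are both active, which is what improves heat dissipation.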
Figure 10 shows the penalty of using the proposed technique and of using DVFS only, compared with using no DTM, as well as the maximum, average, and minimum temperatures when running the four applications (radix, ocean, blackscholes, and bodytrack). The 10th and 90th percentiles are used to represent the minimum and maximum temperatures because of the high fluctuation of the transient temperatures, as shown in Figure 11.
It is observed that the proposed technique (DTaPO) always keeps the average temperature below the threshold temperature with a lower penalty compared with using only DVFS, for any threshold temperature, and the average peak temperature is also reduced. Furthermore, the results show that the slowdown is highly dependent on the threshold temperature: it increases as the threshold decreases, due to the increase in the number of calls to DTaPO to keep the chip temperature in a safe operating range.
To compare our work with the state of the art, the OSP algorithm [27], which shares the same goal of maximizing the many-core system's performance under a thermal constraint, was re-implemented in this experiment. The concept of OSP is to run the many-core system at an optimal frequency that keeps the system sprinting without violating the thermal limit and entering the resting mode. To make a fair comparison, TDP was added as a constraint to the DTaPO algorithm, such that the total power consumption does not exceed the TDP. The value of TDP was set to 42 W, as in [27]. The values of the two OSP parameters were chosen to ensure that the system does not enter the resting mode; in our experiments, the first was set to 200.
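The TDP constraint added to DTaPO for this comparison can be sketched as a simple admission check; the function name is ours, not part of the OSP or DTaPO implementations.

```python
# TDP constraint for the fair comparison with OSP: the total chip
# power must not exceed the TDP (42 W, as in [27]).
TDP_W = 42.0

def within_tdp(core_powers_w, tdp=TDP_W):
    """True iff the sum of per-core powers respects the TDP budget."""
    return sum(core_powers_w) <= tdp
```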
The comparative results when running ten applications from Splash-2 and PARSEC on the simulation framework described in Section 4 are shown in Figure 12.
Figure 12a shows the computational efficiency in terms of normalized completion time. In most of the studied applications, our DTaPO algorithm outperforms OSP in computational efficiency, reducing the completion time relative to OSP in Ocean, FFT, Raytrace, Cholesky, Canneal, Blackscholes, and Fluidanimate. These applications have high instruction-level parallelism; thus, they benefit from running half of the cores (32 cores) at a high frequency in DTaPO, compared to running all of the cores (64 cores) at a low frequency in OSP. Although OSP reduces the completion time over DTaPO in Radix, Dedup, and Bodytrack, our algorithm reduces the peak temperature in all studied applications, as shown in Figure 12b. This is because DTaPO uses a chessboard pattern that gives better heat dissipation.