Power-Time Exploration Tools for NMP-Enabled Systems

Abstract: Recently, dramatic improvements in memory performance have been in high demand for data-demanding application services such as deep learning, big data, and immersive videos. To this end, throughput-oriented memory such as high bandwidth memory (HBM) and hybrid memory cube (HMC) has been introduced to provide a high bandwidth. For its effective use, various research efforts have been conducted. Among them, near-memory-processing (NMP) is a concept that exploits bandwidth and power consumption by placing computation logic near the memory. In the NMP-enabled system, a processor hierarchy consisting of hosts and NMPs is formed based on the distance from the main memory. In this paper, an evaluation tool is proposed to obtain the optimal design decision considering the power-time trade-off in the processor hierarchy. Every time the operating condition and constraints change, the decision of task-level offloading is dynamically made. For a realistic NMP-enabled system environment, the relationship among HBM, host, and NMP should be carefully considered. Hosts and NMPs are almost hidden from each other and the communications between them are extremely limited. In the simulation results, popular benchmarks and a machine learning application are used to demonstrate power-time trade-offs depending on applications and system conditions.


Introduction
For decades, efforts have been made to increase both the processor speed and memory size. Consequently, the memory bottleneck problem has become increasingly serious and is a critical issue that must be overcome urgently to improve overall system performance. Recently, data-intensive applications such as deep learning, big data, and immersive video have attracted attention, and a significant improvement in memory performance is in high demand. Hence, through-silicon via (TSV)-based stacked DRAM memory such as the high bandwidth memory (HBM) [1] or hybrid memory cube (HMC) [2] has been introduced to provide a high bandwidth with a wide I/O. This next-generation memory has a structure in which multiple layers of DRAM dies are stacked on a base logic layer, and interlayer communication is achieved through high-speed TSV technology. Unlike conventional memory, it provides a high bandwidth with low power consumption.
Various research efforts have been conducted for the effective use of the HBM and HMC. Among them, the concept of near-memory-processing (NMP) changes the traditional relationship between a processor and memory for data-intensive applications. As shown in Figure 1a, the traditional processor-centered approach has a deep memory hierarchy with several levels of cache. This hierarchy utilizes the data locality of the applications. The distance between the data location and the processor is determined by how soon the data will be used. In such a structure, it is essential to provide the necessary data quickly. In Figure 1b, the NMP exploits the bandwidth and power consumption by placing the computation logic near the memory. Hosts outside the memory and NMPs inside the memory form a processor hierarchy depending on the distance from the memory. Usually, lightweight processors are considered for the NMP, whereas powerful processors are suitable for the host. When a large amount of data requires a simple calculation or when one data point does not require repeated computations, it is appropriate to use the NMP. On the contrary, when a highly complex computation is necessary or when a sophisticated cache coherence protocol is essential among processors, it is reasonable to handle it at the host, at the cost of an additional overhead to pass through the memory I/O interface. Over the decades, a number of studies regarding optimal cache size, type, and replacement policy for the processor-centered approach [3][4][5][6] have been reported. Furthermore, many design exploration and evaluation methods have been attempted to optimize the performance of multi-core environments [7,8]. Therefore, it is opportune to study various tools and schemes for system optimization and performance evaluation considering the processor hierarchy in the memory-centered approach.

Previous Works and Problem Statements
With the commercialization of three-dimensional (3D) stacked memory, research on the NMP is at a new turning point. Recently, the accurate modeling of NMP performance has been the primary research topic because no products or simulators are available that can implement NMP. Some studies have estimated and modeled the expected execution time and energy efficiency in an NMP-enabled system using the actual results obtained from existing processors [9,10]. They compared the performance of a host and an NMP on the assumption that the NMP has a much higher memory bandwidth than the host [9][10][11][12]. For specific benchmarks, the NMP showed competitiveness by demonstrating the possibility of performance enhancement due to its high bandwidth advantage [9,10,13]. Various computation-bound or memory-bound benchmarks were tested on hosts and NMPs. Subsequently, the performance in terms of execution time and energy efficiency was analyzed in relation to the benchmark characteristics. Depending on the nature of the systems or benchmarks, running on the host may have an absolute performance advantage over the NMP processor and vice versa. However, it is typical to obtain the optimal solution through NMP offloading in the host-NMP heterogeneous system [14][15][16]. Operation-level NMP offloading was tested using the MapReduce benchmark, where the trends of power consumption and execution time were analyzed to obtain the optimal NMP offloading point [14,15]. Similarly, NMP offloading was tried at the operation level using a graph processing benchmark [16]. Given the target benchmark, the performance of an NMP-enabled system is evaluated with various configurations. By changing the memory bandwidth, operating frequency, number of processors, and cache size, the NMP structure that is suitable for the target benchmark is evaluated [10,17,18]. The energy-delay product is a well-known indicator used to compare various NMP structures in two areas of interest: execution time and energy efficiency [10]. Furthermore, attempts have been made to find the point at which execution time and data parallelism are best by changing the number and configuration of NMPs [18]. In References [19][20][21], the possible benefits from enabling in-memory computations are analyzed using various architectures and applications.
Investigations on the effective use of HBM are still in their infancy. The problems to be addressed are summarized as follows. First, the combination of HBM and NMP is expected to be widely used. However, realistic constraints are not considered carefully. For example, in the HBM, it is reasonable to assume that both the host and NMP are connected to the logic die's memory controller and have the same bandwidth. However, many previous studies assume a stacked memory environment in which the NMP has an absolute bandwidth advantage over the host. Next, the decision on which core to use is typically made at the application level. In other cases, an operation-level allocation decision is made by assuming that the NMP is on the very small logic die. These decision units are too coarse or too fine to obtain optimal offloading points. In addition, no criterion based on the power-time tradeoff exists for the design decision and evaluation, even though the primary characteristic that differentiates the NMP from the host is low power consumption. The scalability of near-memory processing [22] was studied, but no analytical tool was provided.
This paper focuses on the advantage of the power reduction due to the processor hierarchy in the NMP-enabled system. The primary contributions are as follows. First, the proposed tool makes it possible to test various design decisions at an early stage for a system with a processor hierarchy without an actual experiment. In this case, the power consumption and execution time are the main constraints to consider. Second, a design close to the optimal level for the whole application execution is found by searching for the best operating condition at the task level.

Proposed Power-Time Exploration
The HBM-based NMP-enabled system assumed in this paper has constraints that differ from those of conventional multi-core systems. First, the NMP and the host are independent processors. However, the host has absolute priority for memory access. For the host, the NMP is not explicitly present and, therefore, does not compete for memory accesses through mechanisms such as arbitration. After the memory controller on the host side sends a memory request, the deterministic response time should not be changed because of the NMP inside the memory. Next, the communication between the host and NMP is highly limited. They are not connected through a bus. Thus, a cache coherence protocol that is typically used in multi-core systems cannot be adopted. The idea that an NMP inside the memory uses an additional pin to send an interrupt signal to the host is also unwelcome due to the hardware overhead. It is, therefore, reasonable to minimize the communication overhead between the host and NMP, and between one NMP and another, by isolating the memory access time and space. Furthermore, it is necessary to reduce the synchronization overhead by using a large processing unit such as a task, rather than a small unit such as an operation.

Power-Time Optimization
In this paper, the Lagrangian core-selection technique is adopted. The popularity of this approach in various research fields is due to its effectiveness and simplicity. In this section, the Lagrangian optimization technique and its application to task-level NMP offloading are briefly reviewed. Let A be a specific application, and C be the set of cores assigned to A. The purpose of core-task allocation is to obtain a C that minimizes the whole execution time T of A. However, for the NMP-enabled system where the heat problem is critical, the total power consumption constraint P_c should be satisfied. The problem to solve can be formulated as Equation (1).

$$\min_{C} \; T(C) \quad \text{subject to} \quad P(C) \le P_c \qquad (1)$$

T(C) is the execution time when A is executed on the core set C, whereas P(C) is the power consumed at that time. In practice, rather than solving the constrained problem in Equation (1), it can be reformulated as an unconstrained minimization, which is shown in Equation (2).

$$\min_{C} \; \left\{ T(C) + \lambda P(C) \right\} \qquad (2)$$

In this case, λ is the Lagrange multiplier. The solution C* is optimal in the sense that the total execution time T(C*) has the minimum value among all combinations of core allocation options that satisfy a power consumption less than or equal to P_c. In this scenario, it is assumed that a power constraint P_c corresponds to λ [23].
If application A is partitioned into a number of tasks A_i (i = 1, ..., n), the associated core allocation decisions are independent of each other. An additive time measure is used, assuming a serialized execution manner. The minimization problem in Equation (2) can be written as Equation (3).

$$\min_{C_1, \ldots, C_n} \; \sum_{i=1}^{n} \left\{ T_i(C_i) + \lambda P_i(C_i) \right\} \qquad (3)$$

In this case, T_i(C_i) and P_i(C_i) are the time and power consumed when task A_i is allocated to core C_i, respectively. The optimum solution of Equation (3) is obtained by selecting the appropriate core C_i for each task A_i. Herein, the problem can be simplified as shown in Equation (3) by assuming task-level serial execution and considering the distinctive relation among NMP, host, and memory mentioned in the beginning of Section 3.

Power-Time Cost Using Easy-to-Use Lambda for Design Decision
In the NMP-enabled system, there are various combinations of core allocations that determine whether to execute the tasks on the host or NMP. For example, when running N tasks on a system consisting of one host and one NMP, 2^N combinations exist. Among them, the best decision should be made considering the power and time tradeoff. As N increases, the number of available core allocation combinations increases rapidly. Searching all cases is time consuming. To narrow the search range for the core allocation, the Lagrangian optimization scheme is adopted. Equation (3) is represented by the cost function as Equation (4), which assigns each task A_i the power-time cost of running it on core C_i. The best offloading decision is made by choosing, for each task, whichever of the host and NMP has the lower cost. The search complexity is reduced from 2^N to 2N.
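The per-task selection described above can be sketched in Python as follows. This is a minimal illustration, not the authors' tool: the per-task cost is taken to be the summand of Equation (3), T_i(C_i) + λ·P_i(C_i), and the function names and any (time, power) numbers passed in are hypothetical. A brute-force search over all 2^N allocations is included only to check that the 2N-evaluation selection reaches the same minimum under the additive assumption.

```python
from itertools import product

CORES = ("host", "nmp")


def task_cost(time_s, power_w, lam):
    """Power-time cost of one task on one core: T + lambda * P (summand of Eq. 3)."""
    return time_s + lam * power_w


def select_cores(tasks, lam):
    """Pick the cheaper core for each task independently.

    tasks: list of dicts mapping core name -> (time, power) for that task.
    Returns (allocation, number_of_cost_evaluations); evaluations == 2 * N.
    """
    allocation, evaluations = [], 0
    for options in tasks:
        costs = {}
        for core in CORES:
            costs[core] = task_cost(*options[core], lam)
            evaluations += 1
        allocation.append(min(costs, key=costs.get))
    return allocation, evaluations


def full_search(tasks, lam):
    """Evaluate every one of the 2**N allocations and keep the cheapest."""
    best, best_cost, evaluations = None, float("inf"), 0
    for alloc in product(CORES, repeat=len(tasks)):
        total = sum(task_cost(*t[c], lam) for t, c in zip(tasks, alloc))
        evaluations += 1
        if total < best_cost:
            best, best_cost = list(alloc), total
    return best, evaluations
```

Because the total cost is a sum of independent per-task terms, minimizing each term separately minimizes the sum, which is why the two functions agree whenever the minimum is unique.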
To solve the minimization problem shown in Equation (4), it is important to determine the λ value. Given the cost function, if T is differentiated with respect to P, the best offloading point is determined by obtaining the stationary point. In other words, the most convex point needs to be found in the power-time curve shown in Figure 2. However, in simulation-based design exploration, the power-time curve can be obtained only after testing all the numerous offloading options. As the numbers of cores and tasks increase, the computational complexity increases sharply. Note that the best offloading point is determined by the relative difference in the performance of the NMP and the host. It is observed that, for a given application, the shape and slope of the power-time curve are primarily determined by the performance ratio of the host and the NMP. The lower the performance of the NMP is, the higher the slope of the curve will be. In contrast, the higher the performance of the host is, the lower the slope of the curve will be. Fortunately, in the NMP-enabled system, both end points of the curve are easily obtained. One end point represents an application running in the host-only system, whereas the other represents running in the NMP-only system. Therefore, herein, the λ value is determined based on the performance ratio of the given host and NMP as Equation (5). This approach also satisfies the qualifying condition for λ to solve the unconstrained problem [23]. The P_c constraint corresponds to λ. The value of λ increases as the power constraint P_c is decreased, whereas λ decreases when high-performance processors are used because P_c is sufficient.
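Since the body of Equation (5) is not reproduced in this text, the following Python sketch is only one plausible reading and should be treated as an assumption: it takes λ to be the magnitude of the slope of the straight line joining the two easily obtained end points of the power-time curve (the host-only run and the NMP-only run). The function name and the example numbers are hypothetical.

```python
def lambda_from_endpoints(host_only, nmp_only):
    """Estimate lambda from the two end points of the power-time curve.

    ASSUMPTION: lambda is the magnitude of the slope between the host-only
    and NMP-only points, i.e. extra time paid per unit of power saved.
    host_only, nmp_only: (time, power) of the whole application.
    """
    t_host, p_host = host_only
    t_nmp, p_nmp = nmp_only
    if p_host <= p_nmp:
        raise ValueError("expected the host-only run to draw more power")
    return (t_nmp - t_host) / (p_host - p_nmp)


# Hypothetical end points: the NMP-only run is slower but draws less power.
lam_fast_nmp = lambda_from_endpoints((5.0, 42.0), (11.5, 9.5))
lam_slow_nmp = lambda_from_endpoints((5.0, 42.0), (20.0, 9.5))
```

Under this reading, a slower NMP yields a larger λ, which matches the stated trend that a lower NMP performance gives a steeper curve.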
Figure 2 shows an example of assigning four tasks to the host and the NMP. The horizontal axis represents the power consumption, whereas the vertical axis represents the execution time. Black dots represent the power and time results of the 2^4 choices. For the NMP in the base layer of the HBM, it is realistic to use a processor with low performance and low power when compared to the host. Therefore, offloading some tasks to the NMP increases the total execution time of the application, whereas the power consumption is decreased. A gray power-time curve consisting of the dots on the lower left represents the power and time tradeoff as tasks are offloaded to the NMP. The most convex point in the curve is the best NMP offloading choice, with the least increase in the execution time relative to the power reduction. In this paper, the best offloading point is determined by 2 × 4 = 8 runs. First, the value of λ is set according to the performance ratio of the NMP and host using Equation (5). The determined λ value is indicated by a dashed line in Figure 2. Then, the two costs of each task, executed on the host and on the NMP, are compared to decide whether or not to offload.
It is very common for the operating condition of a system to change due to the energy status or user requirements. After measuring the power and time values for each task in the initial system condition, the changed power and time values can be estimated for various system conditions. Based on the estimated power and time data, the decision of task-level offloading is dynamically made through the proposed evaluation tool based on the power-time cost. Given the same application, Figure 2a shows the case where the power constraint P_c is small. In this case, an NMP with a low-power configuration such as a low operating clock frequency is adopted. This indicates that the processor performance of the NMP is significantly lower than that of the host. Naturally, the value of λ is high. At the most convex point of this curve, three tasks are offloaded to the NMP. In Figure 2b, the required power constraint P_c is, for some reason, less severe. Thus, the operating clock frequency of the host is increased. The power-time curve is estimated from the actual results of Figure 2a. The value of λ, which is also estimated, becomes low. In this case, at the best offloading point, two tasks are offloaded to the NMP. Therefore, one task should move to the host.

NMP-Enabled System Organization
The organization of the NMP-enabled heterogeneous system assumed herein is as follows. A high-performance host processor and a low-performance NMP processor are used. The purpose of this paper is to propose a tool to measure and search the power-time performance when the HBM exhibits a processor hierarchy. Therefore, to reduce the complexity of the simulation, one host and one NMP are used. Table 1 shows the specifications of the host and NMP used in the simulation. Both are x86 processors and use a 32 KB L1 cache. To create a difference in the computing performance between the host and NMP, the host assumes an out-of-order CPU and additionally includes a 1 MB L2 cache. The NMP, meanwhile, assumes an in-order CPU and does not include an L2 cache. No bus-like channel exists between the host and the NMP, and data are exchanged only through the memory. The NMP is directly connected to the 4 GB HBM2 via TSV, whereas the host accesses the HBM2 through the interposer and the memory I/O interface. As mentioned earlier, the host must have absolute priority in the memory, and its memory access response time should not be affected by the NMP. To ensure the reliability of the system under these restrictions, the host and NMP do not access the memory simultaneously. Furthermore, the host-NMP organization requires a way to avoid cache coherence violations because the two processors do not use a bus-based cache coherence protocol. In this study, it is assumed that the cache writes back its data and flushes itself when the core accessing the memory changes. Through this process, the cache of the newly accessing core is always empty, and the data of the previous core are recorded in the main memory. Therefore, a coherence problem does not occur. The cache flush may cause a performance degradation due to an increase in cache misses. However, the overhead due to the cache flush will not be large because the data shared between tasks are already minimized when dividing an application into tasks.

Simulation Environment
It is time consuming to simulate the various NMP offloading cases of all tasks to obtain the power and time. Therefore, for simplicity, each task is performed in the host-only and NMP-only environments. Subsequently, the power and time under NMP offloading are calculated by combining the results obtained in task units. To compare the optimal offloading point for various system conditions, the operating clock frequency is changed. For the host processor, the simulation is performed for four clock frequencies of 1, 2, 3, and 4 GHz, whereas for the NMP processor, four clock frequencies of 200, 400, 600, and 800 MHz are used.
The execution time of each task is measured in the Gem5 full system mode [24], whereas the power consumptions of the processors and caches are calculated using McPAT [25]. The supply voltage changes according to the operating clock frequency [26]. A 32-nm logic technology is assumed. The DRAM power consumption is modeled for the DRAM layer and the logic layer separately. When the host accesses the HBM, the energy per bit is known to be 6 to 7 pJ/bit [27]. Conservatively, by assuming that the energy consumed by the DRAM layer and TSV is similar to that of the HMC, it is estimated that 3.7 pJ/bit [28] and 2.3 pJ/bit are consumed in the DRAM layer (including the TSV) and the logic layer, respectively. The NMP accesses the HBM directly without using an I/O interface. Thus, only the energy required for the DRAM layer access is considered for the power consumption calculation. In the case of DRAM static power, no difference is observed between the NMP and host. The power consumption is calculated using Equation (6) from the DRAM energy per bit, noting that pJ/bit = mW/Gbps.
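The unit identity pJ/bit = mW/Gbps means that dynamic DRAM power in mW is the energy per bit in pJ multiplied by the data rate in Gbps. The Python sketch below illustrates that arithmetic only; Equation (6) is not reproduced in this text, so its exact form is an assumption here. The split of 3.7 pJ/bit for the DRAM layer and 2.3 pJ/bit for the logic layer follows the figures quoted above, and the 100 Gbps data rate is a hypothetical value chosen for illustration.

```python
# Energy per bit quoted in the text (pJ/bit).
DRAM_LAYER_PJ_PER_BIT = 3.7   # DRAM layer and TSV
LOGIC_LAYER_PJ_PER_BIT = 2.3  # logic layer, paid only on the host path

# The host goes through the logic layer and I/O interface; the NMP does not.
HOST_PJ_PER_BIT = DRAM_LAYER_PJ_PER_BIT + LOGIC_LAYER_PJ_PER_BIT
NMP_PJ_PER_BIT = DRAM_LAYER_PJ_PER_BIT


def dram_dynamic_power_mw(energy_pj_per_bit, bandwidth_gbps):
    """Dynamic DRAM power in mW: (pJ/bit) * (Gbit/s) = 1e-12 J * 1e9 / s = mW."""
    return energy_pj_per_bit * bandwidth_gbps


# Hypothetical sustained data rate of 100 Gbps for both access paths.
host_mw = dram_dynamic_power_mw(HOST_PJ_PER_BIT, 100.0)
nmp_mw = dram_dynamic_power_mw(NMP_PJ_PER_BIT, 100.0)
```

At equal data rates, the NMP path draws less dynamic DRAM power than the host path by the logic-layer share of the energy per bit.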
Three applications are used for the simulation. PARSEC [29] and SPLASH-2 [30] are widely used as benchmarks in multi-core environments, and VGG-F [31] is a CNN algorithm of recent interest. To construct an application consisting of tasks with different characteristics, each benchmark of PARSEC is used as a task. PARSEC is composed of various benchmarks that are either computation-intensive or data-intensive. Different benefits can be obtained from the host or NMP depending on the benchmark, which makes PARSEC suitable for observing the performance change due to NMP offloading. In this simulation, nine benchmarks are used: bodytrack, canneal, dedup, fluidanimate, freqmine, streamcluster, swaptions, vips, and x264. SPLASH-2 is primarily used on distributed shared memory machines and is dominated by high-performance computation and graphics-intensive benchmarks. As the second application, a computation-bound application is composed of 10 benchmarks from SPLASH-2: barnes, cholesky, fft, fmm, lu_cb, ocean, radiosity, radix, volrend, and water_nsquared. For the third application, a machine-learning application known to be data-demanding is chosen. VGG-F is a type of CNN developed by the VGG group. In this study, VGG-F is divided into 11 tasks based on the layers. The 11 tasks correspond to five convolution layers, three fully connected layers, a ReLU layer with softmax, a local response normalization layer, and a max pooling layer.

Results and Discussion
The proposed cost-based search and the full search schemes are compared for PARSEC, SPLASH-2, and VGG-F. Simulations are performed on four system configurations to compare the accuracy under various power constraints. In the four configurations, the NMP and the host have operating clock frequencies of (200 MHz, 1 GHz), (400 MHz, 1 GHz), (400 MHz, 3 GHz), and (600 MHz, 4 GHz). In Figure 3, the horizontal axis represents the power consumption, whereas the vertical axis represents the execution time. The best offloading points are obtained in the four configurations and the corresponding power-time values are shown. Although the full search curve is slightly lower on the left side than the proposed curve, the two curves tend to be similar. Recall that the proposed and full search schemes need 2N and 2^N computations, respectively, given N tasks. PARSEC, SPLASH-2, and VGG-F consist of 9, 10, and 11 tasks, respectively. In the full search scheme, 512, 1024, and 2048 computations are necessary for the three applications, respectively. When the proposed scheme is used, the computational complexity is reduced by about 98%.
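The reduction figure can be checked with a few lines of Python. This is a small arithmetic sketch using only the task counts stated above; the function name is illustrative.

```python
def search_evaluations(n_tasks):
    """Return (proposed, full, reduction) for N tasks on one host and one NMP.

    proposed: 2N cost evaluations (host and NMP cost for each task)
    full:     2**N allocations in the exhaustive search
    """
    proposed = 2 * n_tasks
    full = 2 ** n_tasks
    reduction = 1.0 - proposed / full
    return proposed, full, reduction


results = {
    name: search_evaluations(n)
    for name, n in (("PARSEC", 9), ("SPLASH-2", 10), ("VGG-F", 11))
}
mean_reduction = sum(r[2] for r in results.values()) / len(results)

for name, (proposed, full, reduction) in results.items():
    print(f"{name}: {proposed} vs {full} evaluations, {reduction:.1%} fewer")
print(f"mean reduction: {mean_reduction:.1%}")
```

The per-application reductions are roughly 96.5%, 98.0%, and 98.9%, which average to about 98%.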
In Table 2, 16 system configurations are considered. The host can operate at clock frequencies of 1, 2, 3, and 4 GHz, whereas the NMP can operate at clock frequencies of 200, 400, 600, and 800 MHz. For the three applications, the λ value, the number of tasks offloaded to the NMP, and the resulting power and time values are shown. As the performance of the host or NMP processor becomes higher, the power-time curve moves toward the lower right. Thus, the λ value also decreases. At a small λ value, it can be predicted that offloading tasks to the NMP is less preferred. For example, when the NMP operates at 200 MHz, λ decreases as the host processor performance increases, and the number of offloaded tasks decreases. In most cases, the optimal power-time point moves to the lower right. However, when the performance of the host processor is low and the performance of the NMP processor is high, i.e., the performance difference between the two processors is not extremely large, the number of offloaded tasks and the optimal power-time points are not well predicted. For example, in the PARSEC application, when the host operates at 1 GHz, the λ value decreases as the operating clock frequency of the NMP increases. However, the number of offloaded tasks increases, and the optimal power-time point decision is also inconsistent. This is because the difference between selecting the host and selecting the NMP is not large when it is more important to reduce the power consumption than the execution time.
Table 3 shows the results of dynamically determined offloading when the system condition is changed. The presented data are similar to those in Table 2. However, the power-time data of all tasks required to decide the best offloading point are not obtained by simulation but are estimated. In this simulation, the initial system configuration has the host at 2 GHz and the NMP at 400 MHz. Thus, the results in the shaded cells of Table 3 are from the actual simulation data and are exactly the same as the values in Table 2. To decide the offloading point in other system configurations, the required power-time data are scaled from the initial system condition without running a simulation. For example, suppose that the operating clock frequencies of the host and NMP are increased to 4 GHz and 600 MHz, respectively. The execution time is inversely proportional to the operating clock frequency. Thus, the execution times of tasks on the host and the NMP are obtained by dividing the initial data by 2 and 1.5, respectively. The power data of the NMP and host are scaled by factors of 1.386 and 1.163 for increases of 200 MHz and 1 GHz, respectively. The scaling factors are obtained empirically. When compared with the actually obtained data in Table 2, the errors of the estimated power and time data at the best offloading point are 19% and 12%, respectively. However, the difference in the number of offloaded tasks is very small, which is, on average, 0.3 tasks.
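The estimation step can be sketched in Python as below. Two points are assumptions rather than statements from the text: execution time is taken to scale inversely with clock frequency (so a doubled clock halves the time), and the empirical power factors (1.386 per 200 MHz for the NMP, 1.163 per 1 GHz for the host) are assumed to compound once per frequency step. The function names and the baseline measurements are hypothetical.

```python
def scale_time(time_s, f_old_hz, f_new_hz):
    """Estimate execution time at a new clock, assuming time ~ 1 / frequency."""
    return time_s * f_old_hz / f_new_hz


def scale_power(power_w, f_old_hz, f_new_hz, step_hz, factor_per_step):
    """Estimate power at a new clock from an empirical per-step factor.

    ASSUMPTION: the factor compounds once for each step_hz of increase.
    """
    steps = (f_new_hz - f_old_hz) / step_hz
    return power_w * factor_per_step ** steps


# Baseline: host at 2 GHz, NMP at 400 MHz. Target: host at 4 GHz, NMP at 600 MHz.
# The baseline (time, power) values are made up for illustration.
host_time = scale_time(10.0, 2e9, 4e9)                 # half the baseline time
nmp_time = scale_time(6.0, 400e6, 600e6)               # baseline time / 1.5
host_power = scale_power(20.0, 2e9, 4e9, 1e9, 1.163)   # two 1 GHz steps
nmp_power = scale_power(2.0, 400e6, 600e6, 200e6, 1.386)  # one 200 MHz step
```

The scaled per-task values can then be fed to the same cost comparison used for the measured data, so no new simulation run is needed when the operating condition changes.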

Figure 1. Processor-centered and memory-centered approaches to overcome the memory bottleneck: (a) deep memory hierarchy and (b) processor hierarchy.

Figure 2. Power-time curve (a) when the power constraint is low and (b) when the power constraint is high.

Table 1. Host and NMP specification.

Table 2. The best offloading point from the actual data and its power-time performance.

Table 3. The best offloading point from the estimated data and its power-time performance.