CASPER: Embedding Power Estimation and Hardware-Controlled Power Management in a Cycle-Accurate Micro-Architecture Simulation Platform for Many-Core Multi-Threading Heterogeneous Processors

Despite the promising performance improvement observed in emerging many-core architectures in high performance processors, high power consumption prohibitively affects their use and marketability in the low-energy sectors, such as embedded processors, network processors and application specific instruction processors (ASIPs). While most chip architects design power-efficient processors by finding an optimal power-performance balance in their design, some use sophisticated on-chip autonomous power management units, which dynamically reduce the voltage or frequencies of idle cores and hence extend battery life and reduce operating costs. For large scale designs of many-core processors, a holistic approach integrating both these


Introduction
Emerging instruction set based multi-core processors [1][2][3][4][5] are significantly larger and more complex than their dual- and quad-core predecessors [6]. Consisting of hundreds of cores on-chip, new heterogeneous many-cores are designed with bigger on-chip caches, complex interconnection topologies, and multiple customized IP cores for better performance. A large chip area, however, implies high leakage and dynamic power dissipation. Hence architects increasingly use on-chip power controllers that apply power-saving techniques such as power-gating, clock-gating and dynamic voltage and frequency scaling (DVFS) to minimize the overall power dissipation of their designs. Existing cycle-accurate processor simulators, extensively used for performance modeling and validation of processor micro-architectures [7,8], are inadequate in two respects. They do not accurately capture how the dimensions and dynamic interactions of micro-architectural components (such as the number of cores, cache size and associativity, and interconnection network topology) affect the performance and power dissipation of simulated designs, and they do not accurately model the complex logic of the power controller. Such capabilities are essential to efficiently explore the vast micro-architectural design space of heterogeneous many-core processors, which arises from the large number of design choices for the various micro-architectural parameters, and to arrive at designs with the right balance of power and performance.
Popular contemporary cycle-accurate simulators such as MPTLSim and NepSim target superscalar architectures and network processor architectures respectively, and fall short of covering important features such as hardware multi-threading, in-order instruction pipelines, custom IP cores and heterogeneity, which are some of the fundamental micro-architectural aspects of emerging many-core designs. Hardware multi-threading in the cores is used to exploit latency hiding [8,9]. Heterogeneous cores are used to achieve a better power-performance balance, for example in the Netronome NFP-32. State-of-the-art simulators such as Simics and M5, on the other hand, cover a wide range of micro-architectures, various instruction set architectures and pre-existing processor architectures, but do not capture two key elements of processor research: (i) interfaces to control a large number of tunable micro-architectural parameters such as load miss queue size, store buffer size and branch prediction buffer size, to name a few; and (ii) cycle-accurate models of on-chip power management units which enable power gating, clock gating and DVFS. Finally, existing simulators often do not scale to hundreds of cores and do not support a complete operating software stack, which restricts their usability for a wide range of applications.
In this paper, we present the Chip multi-threading Architecture Simulator for Performance, Energy and aRea (CASPER), a SPARCV9 instruction set architecture based, cycle-accurate, power-aware simulation platform for heterogeneous multi-threading many-core processor micro-architectures. The idea is to provide a simulation platform where the user can easily modify a wide range of tunable architectural parameters to evaluate the performance and estimate the pre-silicon leakage and dynamic power dissipation of their designs. The platform also provides interfaces through which users can design and develop cycle-accurate models of power management algorithms in CASPER and evaluate strategies to increase the energy efficiency of their designs. Our primary contributions include:
(a) SPARCV9 ISA: CASPER is a simulation tool for cycle-accurate full-system simulation of the 64-bit SPARCV9 instruction set architecture. After the success of Oracle's UltraSPARC T1, T2 and T3 processors and the open-sourcing of the T1 and T2 designs via the OpenSPARC project [10], there is increased interest in SPARC instruction set based processor designs, for example SimplyRISC [10], which makes this platform an important contribution.
(b) In-Order Cores with Hardware Multi-Threading: A key shift in many-core design principles is to use a large number of simple low-power cores to exploit a high degree of parallelism, as observed in products such as Tilera [11], Intel Atom [12] and UltraSPARC T1 [13]. At the same time, hardware multi-threading is used in the cores to exploit latency hiding and increase overall throughput. However, single-thread performance critically depends upon the number of hardware threads in a core. Hence, CASPER is designed to simulate simple in-order multi-threaded cores parameterized in terms of the number of hardware threads per core.
(c) Heterogeneous Cores: Exploiting heterogeneity enables designers to achieve optimal power-performance trade-offs in their designs. For example, for a write-intensive application that executes many store instructions, a processor core with a deeper store buffer can be more energy efficient than a core designed with a large data cache [14,15]. This motivates us to design CASPER to simulate a set of heterogeneous cores where each core can be structurally different from the others. In CASPER, a core can be optimized by tuning a large number of micro-architectural parameters, namely the number of hardware threads, the number of pipeline stages, the instruction and data cache (I$/D$) size, associativity and line size, and the sizes of the instruction and data virtual-to-physical address translation buffers (I-TLB/D-TLB), branch prediction buffer, instruction miss queue, load miss queue and store buffer. Users can specify the structure of each core independently in CASPER.
(d) Shared Memory: A low-latency on-chip shared memory system, connected to the cores through an interconnection network, is used to optimize data sharing. Therefore, in addition to the per-cycle behavior of the heterogeneous multi-threading cores, the cycle-by-cycle functionality of chip-level architectural features such as the number of shared memory banks, the control logic of the memory banks and the interconnection structure are included in CASPER to accurately model the impact of all the micro-architectural features on overall processor performance and power.
(e) Pre-Silicon Power Estimation: Although popular simulators do a good job of modeling performance through cycle accuracy and functional accuracy, they fall short in power estimation. Like performance, both dynamic and leakage power depend on the dimensions and dynamic interactions of the micro-architectural features of a processor. In CASPER, hardware models of the architectural features are synthesized, placed and routed using technology files to derive dynamic and leakage power dissipation values. These values are used in a cycle-accurate manner during simulation. Hence, as files for new silicon technology generations become available, they can be used in CASPER to estimate the power dissipation of simulated designs. Currently, technology files from 90 nm to 32 nm are freely available [16] and can be used in CASPER. Power-aware simulation tools such as Wattch [17] also include both leakage and dynamic power dissipation models of the micro-architectural features of processors. We intend to compare the accuracy of our methodology with that of Wattch in future work.
(f) Power Control Unit: Dynamic power management (DPM) in multi-core processors involves a set of techniques which perform power-efficient computations under real-time constraints to achieve system throughput goals while minimizing power. DPM is executed by an integrated power management unit (PMU), which is typically implemented in software, hardware or a combination thereof. The PMU monitors and manages the power and performance of each core by dynamically adjusting its operating voltage and frequency. Hardware-controlled power management eliminates the computation overhead that the processor incurs under software-based power management while performing workload performance and power estimations. Hence hardware power management reacts faster and tracks workload performance more accurately in real time than slower-reacting software power management. In CASPER, the PMU has a hierarchical structure. The local PMU exercises clock-gating at the stages of the instruction pipeline in the cores, while the global PMU enforces a power control policy in which DVFS and power-gating of a core are decided by analyzing its utilization and its wait times due to long-latency memory accesses.
(g) Operating Stack: The full SPARCV9 instruction set implemented in CASPER makes it more usable and programmable. The Solaris 5.10 operating system kernel has been ported onto CASPER along with a complete libc software stack [18]. This enables users to run any application on a simulated processor. In this study, we have used ENePBench, a network packet processing benchmark, to evaluate many-core designs.
CASPER is written in the C++ programming language and has been flexibly threaded to take advantage of a wide variety of multi-core machines. On an Oracle T1000 server, CASPER can simulate approximately 100K instructions per second per hardware thread. The rest of the paper is organized as follows. Section 2 explains the processor model in addition to the configurable parameters in CASPER. Sections 3 and 4 explain the methodologies used to model performance and power/energy consumption in CASPER. Section 5 explains the benchmarks used, and the outputs and capabilities of the simulation platform are explained in Section 6. Section 7 discusses related work. Finally, we conclude and discuss our future work in Section 8.

Processor Micro-Architecture
The processor model used in CASPER is shown in Figure 1. Each core is organized as a single-issue in-order core with fine-grained multi-threading (FGMT) [7]. In-order FGMT cores use a single-issue six-stage pipeline [8] shared between the hardware threads, enabling designers to: (i) achieve high per-core throughput by latency hiding; and (ii) minimize the power dissipation of a core by avoiding complex micro-architectural structures such as instruction issue queues, re-order buffers and history-based branch predictors typically used in superscalar and out-of-order micro-architectures. The six-stage RISC pipeline is a variation of the basic five-stage in-order instruction pipeline: Fetch, Decode, Execute, Memory and WriteBack. The sixth stage is a single-cycle thread-switch scheduling stage, which follows a round-robin algorithm and selects an instruction from a hardware thread in the ready state to issue to the decode stage. The D-stage includes a full SPARCV9 instruction set decoder as described in [19]. The Execution Unit (EXU) contains a standard RISC 64-bit ALU, an integer multiplier and a divider, and constitutes the E-stage of our pipeline. The Load Store Unit (LSU) is the top-level module that includes the micro-architectural components of the M-stage and W-stage: the data TLB (D-TLB), data cache (D$), address space identifier queue (ASIQ), load miss queue (LMQ) and store buffer (SB). The LMQ holds outstanding D$ misses. The SB serializes store instructions following the total store order (TSO) model. Stores are write-through. The LMQ and SB are maintained separately for each hardware thread, while the D-TLB and D$ are shared. Special registers in SPARCV9 are accessed via the ASI queue, and ASI operations are categorized as long-latency operations because these instructions are asynchronous to the pipeline operation [19]. Loads and stores are resolved in the L2 cache, and the returning packets are serialized and executed in the data fetch queue (DFQ), which is also part of the LSU. The Trap Logic Unit (TLU) of the SPARCV9 architecture used in CASPER is structurally similar to that of UltraSPARC T1 [20].
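The round-robin thread-select stage described above can be illustrated with a small sketch. This is not CASPER code: the function names `select_thread` and `issue_order` and the per-cycle ready vectors are hypothetical, and the only behavior taken from the text is that one instruction is picked per cycle, in round-robin order, from a hardware thread in the ready state.

```python
from typing import List, Optional


def select_thread(ready: List[bool], last: int) -> Optional[int]:
    """Pick the next ready hardware thread after `last` in round-robin order.

    Returns None when no thread is ready (the stage issues nothing).
    """
    n = len(ready)
    for offset in range(1, n + 1):
        candidate = (last + offset) % n
        if ready[candidate]:
            return candidate
    return None


def issue_order(ready_per_cycle: List[List[bool]]) -> List[Optional[int]]:
    """Replay a sequence of per-cycle ready vectors and record the issued thread."""
    last = -1  # so that thread 0 has priority in the very first cycle
    issued: List[Optional[int]] = []
    for ready in ready_per_cycle:
        chosen = select_thread(ready, last)
        issued.append(chosen)
        if chosen is not None:
            last = chosen
    return issued
```

A thread that is waiting on a long-latency operation is simply marked not ready and is skipped, which is how the other threads hide its latency in this sketch.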
In addition to hardware multi-threading, clock-gating [21] in the pipeline stages minimizes power dissipation by eliminating dynamic power in idle stages: idle stages consume only leakage power, whereas active blocks consume both dynamic and leakage power. Long-latency operations in a hardware thread are typically blocking. In CASPER, long-latency operations such as load misses and stores are non-blocking; we call this feature hardware scouting. It takes advantage of the deeper load miss queues, in contrast to architectures such as UltraSPARC T1 [22,23] where the load miss queue contains only one entry. On average, this improves the performance of a single thread by 2-5%. The complete set of tunable core-level micro-architectural parameters is shown in Table 1. At the chip level, N_C cores are connected to the inclusive unified L2 cache via a crossbar interconnection network. The L2 cache is divided into N_B banks. N_C and N_B are parameterized in CASPER. Each L2 cache bank maintains separate queues for the core-to-L2 instruction miss, load miss and store packets from each core. An arbiter inside each bank selects packets from the different queues in a round-robin fashion. The length of the queues is an important parameter as it affects the processing time of each L2 cache access. The complete set of tunable chip-level micro-architectural parameters is shown in Table 2.
Coherency is maintained in the L2 cache following a directory-based cache coherency protocol. The L2 cache arrays contain the reverse directories of both the instruction and data L1 caches of all the cores. A read miss at the L2 cache populates an entire L2 cache line and the corresponding line in the originating L1 instruction or data cache. All subsequent reads to the same address from any other core lead to an L2 cache hit and are served directly from the L2 cache. The L2 cache is inclusive and unified; hence it contains all the blocks which are also present in the L1 instruction or data caches of the cores. Store instructions (writes) are always committed at the L2 cache first, following the write-through protocol. Hence, the L2 cache always has the most recent data. Once a write is committed in the L2 cache, all the matching directory entries are invalidated, which means that the corresponding L1 instruction and data cache entries are invalidated using special L2-cache-to-core messages.
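The write-through and invalidation behavior described above can be sketched as a toy model. The class `L2Directory` and its methods are hypothetical names chosen for illustration, caches are reduced to dictionaries keyed by line address, and timing, banking and the split between instruction and data L1 caches are ignored.

```python
from collections import defaultdict
from typing import Dict, Set


class L2Directory:
    """Toy inclusive L2 with a reverse directory of L1 sharers per line."""

    def __init__(self) -> None:
        self.data: Dict[int, int] = {}                      # line address -> value
        self.sharers: Dict[int, Set[int]] = defaultdict(set)  # line -> core ids

    def read_miss(self, core: int, line: int) -> int:
        """L1 miss from `core`: fill from L2 and record the core as a sharer."""
        value = self.data.setdefault(line, 0)  # assume memory returns 0 on first touch
        self.sharers[line].add(core)
        return value

    def write_through(self, core: int, line: int, value: int) -> Set[int]:
        """Commit a store at L2 first, then invalidate every matching L1 copy.

        Returns the set of cores that would receive an invalidation message.
        """
        self.data[line] = value
        invalidated = set(self.sharers[line])
        self.sharers[line].clear()
        return invalidated
```

Because every store is committed at the L2 before any L1 copy is dropped, a later `read_miss` from any core always observes the most recent value in this model.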

Performance Measurement
Counters are used in each core in CASPER to measure, every clock cycle, the number of completed instructions individually for each hardware thread (Instr_THREAD) as well as for the entire core (Instr_CORE). Counters are also attached to the L2 cache banks to monitor the load/store accesses from the cores. This enables the user to estimate the average wait time for loads/stores per core and per hardware thread in each core. The wait time includes the time each load/store instruction spends in the LMQ or SB respectively, the propagation time in the interconnection network and the total L2 cache access time.

Component-Level Power Dissipation Modeling
To accurately model the area and the power dissipation of the architectural components we have: (i) designed scalable hardware models of all pipelined and non-pipelined components of the processor in terms of the corresponding architectural parameters; and (ii) derived the power dissipation (dynamic + leakage) of the component models (written in VHDL) using the commercial synthesis tool Design Vision from Synopsys [24], targeting the Berkeley 45 nm Predictive Technology Model (PTM) technology library [25], and the placement and routing tool Encounter from Cadence [26]. The area and power dissipation values of the I$ and D$ are derived using Cacti 4.0 [27]. For the parameterized non-pipelined micro-architectural components in a core, such as the LMQ, SB, MIL, IFQ, DFQ and I/D-TLB, area and power are found using a 1 GHz clock and stored in lookup tables indexed by the values of the micro-architectural parameters. The values from the lookup tables are then used in the simulation to calculate the power dissipation of the core by capturing the activity factor α(t) from simulation and integrating the product of power dissipation and α(t) over simulation time. The following equation is used to calculate the power dissipation of a pipeline stage:
$P_{stage} = P_{leakage} + \alpha \cdot P_{dynamic}$
where α is the activity factor of that stage (α = 1 if that stage is active; α = 0 otherwise), which is reported by CASPER, and P_leakage and P_dynamic are the pre-characterized leakage and dynamic power dissipations of the stage respectively. The pre-characterized values of area, leakage and dynamic power of the core-level architectural blocks are shown in Table 3. The HDL models of all the core-level architectural blocks have been functionally validated using exhaustive test benches. The activity factor α is derived by tracking the switching of all the components in all the stages of the cores on a per-cycle basis. As a given instruction is executed through the multiple stages of the instruction pipeline inside a core, the simulator tracks: (i) the intra-core components that are actively involved in the execution of that instruction; and (ii) the cycles during which that instruction uses any particular pipeline stage of a given component. Any component, or any stage inside a component, is assumed to be in one of two states: idle (not involved in the execution of an instruction) or active (processing an instruction). For example, in case of a D$ load miss, the miss will be identified in the M-stage. The load instruction will then be added to the LMQ and the W-stage will be set to the idle state for the next clock cycle. A non-pipelined component is treated as a special case of a single-stage pipelined one. We consider only leakage power dissipation in the idle state and both leakage and dynamic power dissipation in the active state. Figure 3 shows the total power dissipation of a single representative pipeline stage in one component. Note that the total power reduces to just the leakage part in the absence of a valid instruction in that stage (idle), and the average dynamic power of the stage is added when an instruction is processed (active). A pipeline stage of a component switches to the active state when it receives an instruction-ready signal from its previous stage. In the absence of the instruction-ready signal, the stage switches back to the idle state. Note that the instruction-ready signal is used to clock-gate (disable the clock to all logic of) an entire component or a single pipeline stage inside the component to save dynamic power.
Hence we only consider leakage power dissipation in the absence of an active instruction. A stage holding an instruction that is waiting for a memory access, or that is stalled due to a prior long-latency operation, is assumed to be in the active state.
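The idle/active accounting described above can be sketched in a few lines. This is an illustration only, not CASPER's implementation: the function names and the power values in the usage note are hypothetical, and the sketch assumes one activity sample per clock cycle at the 1 GHz characterization clock mentioned in the text.

```python
from typing import Sequence


def stage_energy(activity: Sequence[int], p_leak_w: float, p_dyn_w: float,
                 f_hz: float = 1e9) -> float:
    """Energy in joules of one pipeline stage over a trace of per-cycle activity.

    activity[i] is 1 when the stage processes an instruction in cycle i and 0
    when it is idle (clock-gated). Leakage is always paid; dynamic power is
    added only in active cycles.
    """
    cycle_time = 1.0 / f_hz
    return sum((p_leak_w + a * p_dyn_w) * cycle_time for a in activity)


def average_power(activity: Sequence[int], p_leak_w: float, p_dyn_w: float,
                  f_hz: float = 1e9) -> float:
    """Average power in watts over the same trace."""
    total_time = len(activity) / f_hz
    return stage_energy(activity, p_leak_w, p_dyn_w, f_hz) / total_time
```

For instance, with hypothetical values of 2 mW leakage and 10 mW dynamic power, a stage that is active in 3 of 5 cycles averages 2 + 0.6 x 10 = 8 mW.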
Figure 4 illustrates the methodology of power measurement in the pipelined components in CASPER as seen during 5 clock cycles. The blue dotted lines show the amount of power dissipated by the pipeline stage as an instruction from a particular hardware thread is executed. Time increases vertically. For example, for the 5 clock cycles as shown in Figure 4

Power Dissipation Modeling of L2 Cache and Interconnection Network
The power and area of the L2 cache and the interconnection network, however, depend on the number of cores, which makes it immensely time-consuming to synthesize, place and route all possible combinations. Hence we have used multiple linear regression [28] for this purpose. The training set required to derive the regression models of dynamic power and area of the L2 cache arrays is measured by running Cacti 4.0 [27] for different configurations of L2 cache size, associativity and line size. The dynamic power dissipation of the L2 cache, measured in watts, is related to the size in megabytes, the associativity and the number of banks as shown in Equation 2:
$P_{dynamic\_L2} = \beta_0 + \beta_1 \cdot Size + \beta_2 \cdot Associativity + \beta_3 \cdot N_B$ (2)
The error of estimate found for this model ranges between 0.524 and 2.09. Errors are estimated by comparing the model-predicted and measured values of dynamic power dissipation for configurations of L2 cache size, associativity, line size and number of banks. These configurations are different from the ones used to derive the training set. Similarly, the dynamic and leakage power dissipation of a crossbar interconnection network, measured in milliwatts, is given by Equation 3 and Equation 4 respectively. The training set required to derive the regression models of dynamic and leakage power of the crossbar is derived by synthesizing its hardware model, parameterized in terms of the number of L2 cache banks and the number of cores, using Synopsys Design Vision and Cadence Encounter. Note that dynamic and leakage power are related to the square of the number of cores (N_C) and the number of cache banks (N_B), which means that the power dissipation scales super-linearly with the number of banks and cores. Thus, crossbar interconnects are not scalable. However, they provide the high bandwidth required in our many-core designs. The regression model parameters and 95% confidence intervals, as reported by the statistical tool SPSS [29], for dynamic and leakage power dissipation are shown in Tables 4 and 5 respectively. The confidence intervals show that the strength of the model is high. The models are further validated by comparing the model-predicted power values with values measured by synthesizing 5 more combinations of number of L2 cache banks and number of cores, different from the training set. We found that in these two cases, the errors of estimate range between 0.7 and 10.64.
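As an illustration of the regression step, the sketch below fits a multiple linear model by ordinary least squares in plain Python. It is not the SPSS procedure used in the paper: the function names `fit_linear` and `predict` are hypothetical, and any training data passed to it in an example would be synthetic rather than the Cacti or synthesis measurements.

```python
from typing import List, Sequence


def fit_linear(xs: Sequence[Sequence[float]], ys: Sequence[float]) -> List[float]:
    """Ordinary least squares; returns [b0, b1, ..., bk] with b0 the intercept."""
    rows = [[1.0, *x] for x in xs]           # prepend the intercept column
    k = len(rows[0])
    # Augmented normal equations [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(beta: Sequence[float], x: Sequence[float]) -> float:
    """Evaluate the fitted model for one configuration, e.g. (size, assoc, banks)."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))
```

Validation in the spirit of the text would call `predict` on configurations that were held out of the training set and compare the result against the measured value.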

Hardware Controlled Power Management in CASPER
Two hardware-controlled power management algorithms, Chipwide DVFS and MaxBIPS, as proposed in [30], are implemented in CASPER. Both algorithms re-evaluate the voltage-frequency operating levels of the different cores once every evaluation cycle. Unless explicitly stated otherwise, one evaluation cycle corresponds to 1024 processor clock cycles in our simulations. The DVFS-based global PMU (GPMU) algorithms rely on the assumption that when a given core switches from power mode A (voltage_A, frequency_A) in time interval N to power mode B (voltage_B, frequency_B) in time interval N + 1, the power and throughput in time interval N + 1 can be predicted using Equation (1). Note that the system frequency needs to scale along with the voltage to ensure that the operating frequency meets the timing constraints of the circuit, whose delay changes linearly with the operating voltage [31]. This assumes that the workload characteristics do not change from one time interval to the next, and that there are no shared resource dependencies between tasks and cores. Table 6 explains the dependencies of power and throughput on the voltage and frequency levels of the cores.
Table 6. Relationship of power and throughput in time interval N and N + 1.

The key idea of DVFS in Chipwide DVFS is to scale the voltages and frequencies of a single core or the entire processor at run-time to achieve specific throughputs while minimizing power dissipation, or to maximize throughput under a power budget. Equation (5) shows the quadratic and linear dependences of dynamic (switching) power dissipation on the supply voltage and frequency respectively:
$P_{dynamic} = \alpha \cdot C \cdot V_{dd}^{2} \cdot f$ (5)
where α is the switching probability, C is the total transistor gate (or sink) capacitance of the entire module, V_dd is the supply voltage, and f is the clock frequency. Note that the system frequency needs to scale along with the voltage to satisfy the timing constraints of the circuit, whose delay changes linearly with the operating voltage [30]. DVFS algorithms can be implemented at different levels such as the processor micro-architecture (hardware), the operating system scheduler, or the compiler [32].
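A short worked derivation may help show why DVFS is so effective. It assumes, as the text does, that frequency scales linearly with voltage; the scaling factor $s$ and the example value $s = 0.85$ are illustrative assumptions and are not taken from the paper's DVFS levels.

```latex
% Scale both supply voltage and frequency by the same factor s (0 < s <= 1):
%   V_dd' = s V_dd,   f' = s f
\begin{align}
P'_{dynamic} &= \alpha \, C \, (s V_{dd})^{2} \, (s f) \\
             &= s^{3} \, \alpha \, C \, V_{dd}^{2} \, f \\
             &= s^{3} \, P_{dynamic}
\end{align}
% If cycles per instruction stay unchanged, throughput follows frequency:
%   Throughput' = s * Throughput
% Example with s = 0.85:
%   P'_dynamic  = 0.85^3 P_dynamic ~ 0.614 P_dynamic
%   Throughput' = 0.85 Throughput
```

Under these assumptions a 15% reduction in throughput buys roughly a 39% reduction in dynamic power, which is the trade-off the power management algorithms exploit.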
Figure 5 shows a conceptual diagram implementing DVFS on a multi-core processor. Darker shaded regions represent cores operating at high voltage, while lighter shaded regions represent cores operating at low voltage. The unshaded cores are in sleep mode.
[Figure 5 labels: Voltage Supply; DC-DC Voltage Regulator; V_dd high; V_dd low.]
Chipwide DVFS is a global power management scheme that monitors the power consumption and performance of the entire chip and enforces a uniform voltage-frequency operating point for all cores to minimize power dissipation under an overall throughput budget. This approach does not need any individual information about the power and performance of each core, and simply relies on whole-chip throughput measurements to make power mode switching decisions. As a result, one high-performance core can push the entire chip over the throughput budget, thereby triggering DVFS across all cores on-chip. Scaling down the voltage and frequency of cores which are not exceeding their throughput budgets will further reduce their throughputs. This may be undesirable, especially if these cores are running threads from different applications which run at different performance levels. The pseudo-code is shown in Table 7. Cumulative power dissipation is calculated by adding the power dissipation observed in the last evaluation cycle to the total power dissipation of Core i from time T = 0. Cumulative throughput is similarly the total number of instructions committed from time T = 0 until now, including the instructions committed in the last evaluation cycle. Also, in this case the current core DVFS level is the same across all the cores.
The MaxBIPS algorithm [30] monitors the power consumption and performance at the global level and collects information about the entire chip throughput, as well as the throughput contributions of individual cores. The power mode for each core is then selected so as to minimize the power dissipation of the entire chip while maximizing the system performance, subject to the given throughput budget. The algorithm evaluates all possible combinations of power modes across the cores by examining all voltage/frequency pairs for each core, and then chooses the combination that minimizes the overall power dissipation and maximizes the overall system performance while meeting the throughput budget. The cores are permitted to operate at different voltages and frequencies in the MaxBIPS algorithm. A linear scaling of frequency with voltage is assumed in MaxBIPS [30].
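The exhaustive search described above can be sketched as follows. This is a hedged illustration rather than the pseudo-code of Table 8: the function names, the scaling factors and the prediction rule (power scaling with the cube of the voltage/frequency factor and throughput scaling linearly with it) are assumptions consistent with linear frequency-voltage scaling, and each core's inputs are taken to be its power and throughput measured at the nominal level.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def predict(power_w: float, bips: float, s: float) -> Tuple[float, float]:
    """Predicted (power, throughput) when a core moves from nominal to factor s."""
    return power_w * s ** 3, bips * s


def maxbips(cores: Sequence[Tuple[float, float]], scales: Sequence[float],
            bips_budget: float) -> Optional[Tuple[float, ...]]:
    """Search all per-core mode combinations.

    cores: (power in W, throughput in BIPS) per core at the nominal level.
    Returns the combination with the lowest predicted chip power among those
    meeting the throughput budget (ties broken by higher throughput), or None
    if no combination meets the budget.
    """
    best_key = None
    best_combo = None
    for combo in product(scales, repeat=len(cores)):
        power = 0.0
        bips = 0.0
        for (p, t), s in zip(cores, combo):
            dp, dt = predict(p, t, s)
            power += dp
            bips += dt
        if bips < bips_budget:
            continue  # violates the throughput budget
        key = (power, -bips)
        if best_key is None or key < best_key:
            best_key, best_combo = key, combo
    return best_combo
```

The search space grows as (number of levels) raised to the number of cores, so with 3 levels and 4 cores there are 81 combinations to evaluate per evaluation cycle; this is why a precomputed lookup table of combinations is attractive in hardware.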
Based on Table 6, the MaxBIPS algorithm predicts the power and throughput for all possible combinations of cores and voltage/frequency modes (vf_mode), or scaling factors, and selects the (core_i, vf_mode_j) assignment that minimizes power dissipation and maximizes throughput while meeting the required throughput budget. The pseudo-code of the MaxBIPS algorithm is shown in Table 8. The Power Mode Combination i used in the MaxBIPS algorithm is a lookup table storing all possible combinations of DVFS levels across the cores. For example, with 4 cores and 3 DVFS levels the table holds 3^4 = 81 combinations; following Table 6, it stores the predicted power consumption and throughput, based on the values observed until the last evaluation cycle, for every combination of DVFS levels across the cores in the chip. The three DVFS levels used in Chipwide DVFS and MaxBIPS DPM are shown in Table 9. These voltage-frequency pairs have been verified using the experimental setup of Section 4. Note that the performance predictions of the existing GPMU algorithms discussed in this section do not consider the bottlenecks caused by shared memory access between cores. All synchronization between the cores is resolved in the L2 cache. During the cycle-accurate simulation, the L2 cache accesses from the cores are resolved using the arbitration logic and queues in the L2 cache controllers. On an L1 cache miss, a packet is sent to the L2 cache, which returns the data to the core. A higher L1 cache miss rate hence means a longer wait time for the instructions that caused the misses, and effectively we observe the cycles per instruction (CPI) of the core to increase.
To evaluate the performance and power dissipation of candidate designs we have developed a benchmark suite called Embedded Network Packet Processing Benchmark (ENePBench), which emulates the IP packet processing tasks executed in a network router. The router workload varies according to internet usage, where a random number of IP packets arrive at random intervals. To meet a target bandwidth, the router has to: (i) process a required number of packets per second; and (ii) process individual packets within their latency constraints. The task flow is described in Figure 6. Incoming IPv6 packets are scheduled on the processing cores of the network processor (NeP) based on their packet types and priorities. Depending on the type of a packet, different header and payload processing functions process the header and payload of the packet respectively. Processed packets are either routed towards the outward queues (in case of pass-through packets) or terminated.
The packet processing functions of ENePBench are adapted from CommBench 0.5 [33]. The routing table lookup function RTR, the packet fragmentation function FRAG and the traffic monitoring function TCP constitute the packet header functions. Packet payload processing functions include encryption (CAST), error detection (REED) and JPEG encoding and decoding, as shown in Table 10. Functionally, IP packets are further classified into types TYPE0 to TYPE4, as shown in Table 11. The headers of all packets belonging to packet types TYPE0 to TYPE4 are used for IP routing table lookup (RTR), packet fragmentation management (FRAG) and traffic monitoring (TCP). The payload processing, however, differs between packet types. Packet types TYPE0, TYPE1 and TYPE2 are compute-bound packets and are processed with encryption and error detection functions. For packet types TYPE3 and TYPE4, the packet payloads are processed with both compute-bound encryption and error detection functions and data-bound JPEG encoding/decoding functions. The two broad categories of IP packets are hard real-time, termed real-time packets, and soft real-time, termed content-delivery packets. Real-time packets are assigned high priority, whereas content-delivery packets are processed with lower priority. Table 12 lists the end-to-end transmission delays associated with each packet category [34]. The total propagation delay (source to destination) is less than 150 milliseconds (ms) for real-time packets and less than 10 s for content-delivery packets [34]. In practice 10 to 15 hops are allowed per packet, which means that the worst-case processing time per intermediate router is approximately 10 ms for real-time packets and 1000 ms for content-delivery packets [34]. For each packet, the worst-case processing time in a router includes the wait time in the incoming packet queue, the packet header and payload processing time, and the wait time in the output queues [35]. Traditionally, schedulers in NePs snoop on the incoming packet queues and, upon packet arrival, generate interrupts to the processing cores. A context switch mechanism is subsequently used to dispatch packets to the individual cores for further processing. Current systems, however, use a switching mechanism to move packets directly from the incoming packet queues to the cores, avoiding expensive signal interrupts [36]. Hence, in our case we have not considered interrupt generation and context switch time when calculating the worst-case processing time. Also, due to the low propagation time in current high-bandwidth optical fiber networks, we ignore the propagation time of packets through the network wires [37]. In our methodology the individual cores are designed such that they are able to process packets within the worst-case processing time.
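The per-router latency budget quoted above can be reproduced with simple arithmetic. The sketch below is illustrative: the function names are hypothetical, and the hop counts used in the usage note (15 for real-time, 10 for content-delivery) are an assumption chosen because they reproduce the approximate 10 ms and 1000 ms figures from the stated 10 to 15 hop range.

```python
def per_hop_budget_ms(end_to_end_ms: float, hops: int) -> float:
    """Worst-case processing time allowed at each intermediate router."""
    return end_to_end_ms / hops


def meets_deadline(queue_in_ms: float, processing_ms: float,
                   queue_out_ms: float, budget_ms: float) -> bool:
    """Check a packet against its per-router budget.

    The worst-case time in a router is the wait in the incoming queue plus
    header/payload processing plus the wait in the output queue. Interrupt,
    context switch and wire propagation times are ignored, as in the text.
    """
    return queue_in_ms + processing_ms + queue_out_ms <= budget_ms
```

For example, `per_hop_budget_ms(150, 15)` gives the 10 ms budget for real-time packets and `per_hop_budget_ms(10_000, 10)` gives the 1000 ms budget for content-delivery packets.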

Verification of CASPER
Functional correctness of candidate designs simulated in CASPER is verified using a set of diagnostic codes designed to test all the possible instruction and data paths in the pipeline stages of a core. An additional set of diagnostic codes is written in SPARCV9 assembly and consists of random combinations of instructions, such that different system events such as traps, store buffer full and others are also asserted. To further verify the accuracy of CASPER, we have compared the total number of system events generated while executing 10 IP packets of ENePBench on a real-life UltraSPARC T1000 machine containing an UltraSPARC T1 processor (T1) [19] against an exact UltraSPARC T1 prototype (T1_V) simulated in CASPER. UltraSPARC T1 is the closest in-order CMT variant to our CMT designs modeled in CASPER and consists of 8 cores with 4 hardware threads per core. The processor simulated in CASPER had the same number of cores, hardware threads per core, and L1 and L2 caches as T1. Our results are tabulated in Table 13. Columns 3a, 3b, 4a, 4b, 5a, 5b and 6 in Table 13 compare the number of instructions committed, store buffer full events, I$ misses and D$ misses in T1 and T1_V. Column 6 shows that, on average, the error in the number of system events is less than 10%.
Table 13. Comparison between the number of system events for 10 IP packets in (i) a T1000 server with an UltraSPARC T1 processor and (ii) a T1 prototype simulated in CASPER.
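The error figure reported for this comparison can be understood as a relative difference between event counts on hardware and in simulation. The following is a minimal sketch under that assumption; the function names are hypothetical and the event counts are made-up placeholders, not values from Table 13.

```python
from typing import Iterable, Tuple


def percent_error(reference: int, simulated: int) -> float:
    """Relative difference of a simulated event count from the hardware count, in percent."""
    if reference == 0:
        raise ValueError("reference count must be non-zero")
    return 100.0 * abs(simulated - reference) / reference


def mean_percent_error(pairs: Iterable[Tuple[int, int]]) -> float:
    """Average percent error over (hardware, simulator) count pairs."""
    errors = [percent_error(ref, sim) for ref, sim in pairs]
    return sum(errors) / len(errors)


# Hypothetical (T1, T1_V) counts for illustration only; NOT data from the paper.
example_counts = {
    "instructions_committed": (1000, 950),
    "icache_misses": (200, 212),
}
average_error = mean_percent_error(example_counts.values())
```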

Results
The power dissipation and throughput observed by varying the key micro-architectural components, namely the number of threads per core, the data and instruction cache sizes per core, the store buffer size per thread in a core and the number of cores in the chip, are shown in Figure 7 to Figure 21. Note that Hardware Power Management is not enabled for the experiments generating the data shown in Figures 7 through 21. In each of the figures, power-performance trade-offs are shown by co-varying two micro-architectural parameters while the other parameters are kept constant at the values of the baseline architecture shown in Table 14. Cycles per instruction per core, or CPI-per-core (lower is better), is measured as the total number of clock cycles during the runtime of a workload divided by the total number of instructions committed across all the hardware threads during that time. Average cycles per instruction per thread, or CPI-per-thread, is measured as the number of clock cycles divided by the number of instructions committed in a hardware thread during the same time. All the data is based on the execution of the compute bound packet type 1 (TYPE1) described in Table 11. Figure 7, Figure 8 and Figure 9 show the power dissipation, CPI-per-core and average CPI-per-thread for N_T = 1, 4 and 8 respectively as the data cache size is scaled from 4 KB to 64 KB. In Figure 7, the increase in D$ size reduces the data miss rate and hence both CPI-per-thread and CPI-per-core improve. Power dissipation, however, increases with increasing D$ size. The figures also demonstrate the trade-offs between performance and power when the number of threads is scaled from 1 to 8.
An increase in the number of threads in a core means that the performance of each individual thread is slowed down by as many cycles as the number of threads, due to the round robin small latency thread scheduling scheme. CPI-per-core, however, is not linearly dependent on the number of threads. While factors such as increased cache sharing, increased pipeline sharing and fewer pipeline stalls improve CPI-per-core with thread scaling, factors such as increased stall time at the store buffer, instruction miss queue and load miss queues tend to degrade it. Hence, we clearly see a non-linear pattern where CPI-per-core is higher for N_T = 4 than for N_T = 8. For N_T = 8, we observe a 10% decrease in cache misses, which results in a lower CPI-per-core compared to N_T = 4. However, this is not the case when N_T is increased from 1 to 4. It is important to note that this behavior is application specific and hence reestablishes the non-linear co-dependencies between performance and the structure and behavior of the micro-architectural components. Figure 10, Figure 11 and Figure 12 show the power dissipation, CPI-per-core and average CPI-per-thread for N_T = 1, 4 and 8 respectively as the instruction cache size is scaled from 4 KB to 64 KB. Here also, CPI-per-thread and CPI-per-core improve with increasing I$ size as instruction misses decrease. Power dissipation, however, increases with increasing I$ size. Figure 13, Figure 14 and Figure 15 show the power dissipation, CPI-per-core and average CPI-per-thread for store buffer sizes of 4 to 16 for N_T = 1, 4 and 8 respectively. We observe a similar increase in power consumption with increasing store buffer size. Both CPI-per-core and CPI-per-thread improve. In all these figures, the co-variation of the core-level micro-architectural parameters D$ size, I$ size and SB size clearly demonstrates both the diminishing and the positive effects of N_T scaling. Power dissipation increases with N_T. CPI-per-thread increases, prohibitively affecting single thread performance due to the shared pipeline, whereas CPI-per-core decreases, showing an improvement in the throughput of a core as more instructions are executed in the core due to latency hiding.
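The two CPI metrics defined at the start of this section can be sketched as follows. This is an illustrative reading of the definitions, not CASPER code; the function names and the instruction counts in the example are hypothetical.

```python
from typing import List, Sequence


def cpi_per_core(cycles: int, committed: Sequence[int]) -> float:
    """Clock cycles divided by instructions committed across all hardware threads."""
    total = sum(committed)
    if total == 0:
        raise ValueError("no instructions committed")
    return cycles / total


def cpi_per_thread(cycles: int, committed: Sequence[int]) -> List[float]:
    """Clock cycles divided by the instructions committed in each hardware thread."""
    if any(n == 0 for n in committed):
        raise ValueError("every thread must commit at least one instruction")
    return [cycles / n for n in committed]


def average_cpi_per_thread(cycles: int, committed: Sequence[int]) -> float:
    """Mean of the per-thread CPI values."""
    values = cpi_per_thread(cycles, committed)
    return sum(values) / len(values)


# Hypothetical run: 4 threads share one pipeline for 1000 cycles.
cycles = 1000
committed = [250, 250, 250, 250]
core_cpi = cpi_per_core(cycles, committed)
thread_cpi = average_cpi_per_thread(cycles, committed)
```

In this made-up example the core commits one instruction per cycle while each thread sees a CPI four times higher, which mirrors the trade-off between single thread performance and per-core throughput discussed above.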
Figure 16, Figure 18 and Figure 20 show the power dissipation and overall CPI observed for N_T = 1, 4 and 8 respectively. Figure 17, Figure 19 and Figure 21 show the peak power dissipation versus overall packet bandwidth observed for N_T = 1, 4 and 8 respectively. Unlike the figures reporting core-level power dissipation and throughput, the overall power dissipation in these figures includes the cycle-accurate dynamic power consumption of the entire chip, including all the cores, the L2 cache and the crossbar interconnection. Interestingly, despite the consistent increase of peak power dissipation with an increasing number of cores in the chip, packet bandwidth does not scale with the number of cores due to contention in the shared L2 cache. The diminishing effects of non-optimality can be observed especially in the case of 128 cores: packet bandwidth non-intuitively decreases as the number of cores is scaled from 64 to 128. This further emphasizes the critical need for efficient and scalable micro-architectural power-aware design space exploration algorithms that can scan a wide range of possible design choices and find the optimal power-performance balance. In our case 32 cores is observed to be the optimal design since it shows the best power-performance balance. As shown in Figure 20, for packet type TYPE1 with 8 threads per core, we observed that both cache misses and pipeline stalls are reduced, minimizing the CPI per core. In addition, with 32 cores, the wait time in the L2 cache queues was also the lowest among the core counts studied. Hence, in our case study of packet type TYPE1, we found that with N_T = 8, the optimal number of cores was 32. Increasing the number of cores further diminishes throughput because of the non-optimal L2 cache micro-architecture, which is divided into only 4 banks. Altering the number of L2 cache banks will mitigate contention and help increase packet bandwidth.
In Figures 22 and 23, we show the power and throughput data (with a throughput budget constrained to 90% of peak throughput with any voltage and frequency scaling) for the Chipwide DVFS and MaxBIPS policies for packet type 3 (TYPE3), which is a typical representative of all other packet types. The baseline architecture is displayed in Table 15. Values on the X-axis correspond to the number of evaluation cycles, where one evaluation cycle is the time period between consecutive runs of the power management algorithms. Where not explicitly stated, one evaluation cycle corresponds to 1024 processor clock cycles in our simulations. In Figure 22, the X-axis represents the number of clock cycles and the Y-axis represents power (W). In Figure 23, the Y-axis represents throughput (in instructions per nanosecond, IPnS). As Figure 22 shows, the power consumption of MaxBIPS is much higher than that of Chipwide DVFS (the latter having the lower power dissipation of the two methods). However, the throughput of MaxBIPS is also higher than that of Chipwide DVFS. The percentage power saving in all the cores for Chipwide DVFS and MaxBIPS is shown in Table 16.
Figure 24 depicts the throughput per unit power (T/P) data for the two methods. Chipwide DVFS has the highest T/P values for the different packet types. Note that the high T/P value for Chipwide DVFS arises from the fact that power dissipation in this scheme is substantially lower than in other schemes, and not because the throughput is high. When power management is implemented by Chipwide DVFS, any increase in the throughput of a single core over a target threshold triggers chipwide reductions of the operating voltage (and hence frequency) in all cores, to save power. Hence, once the overall throughput exceeds the budget, all the cores have to adjust their power modes to a lower level. While this method reduces the overall power dissipation substantially, it also leads to excessive performance reductions in all cores, as shown in Figure 24. A modification of the Chipwide DVFS algorithm required for achieving high performance is to assign a lower bound on throughput. Figures 25 and 26 show the power and throughput characteristics of Chipwide DVFS (with a lower bound on the throughput budget constrained to 60% of peak throughput with all voltage-frequency levels) for packet type 3 (TYPE3). The power consumption and throughput of Chipwide DVFS are higher than those of MaxBIPS; this can be explained by the fact that, in order to guarantee the system performance, the lower bound on throughput does not allow Chipwide DVFS to scale all the cores to lower voltage-frequency levels. However, the throughput per unit power of Chipwide DVFS is lower than that of MaxBIPS, as Figure 27 demonstrates. The power saving in case of Chipwide DVFS with and without a lower bound on throughput is shown in Table 17.
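The Chipwide DVFS behavior described above, with and without a lower bound on throughput, can be sketched as a single decision taken once per evaluation cycle. This is a hedged illustration, not the CASPER hardware module: the function name, the three frequency scaling factors, the linear throughput prediction and the rule for stepping back up are all assumptions introduced here.

```python
from typing import Optional, Sequence

# Hypothetical frequency scaling factors; index 0 is full speed. Not from the paper.
FREQ_SCALE: Sequence[float] = (1.00, 0.95, 0.85)


def chipwide_dvfs_step(level: int, throughput: float, budget: float,
                       lower_bound: Optional[float] = None) -> int:
    """Return the DVFS level shared by all cores for the next evaluation cycle."""
    lowest = len(FREQ_SCALE) - 1
    if throughput > budget and level < lowest:
        # Assumption: throughput scales linearly with frequency.
        predicted = throughput * FREQ_SCALE[level + 1] / FREQ_SCALE[level]
        # Without a lower bound, every core is scaled down unconditionally.
        if lower_bound is None or predicted >= lower_bound:
            return level + 1
    elif lower_bound is not None and throughput < lower_bound and level > 0:
        # Assumption: step back up when throughput falls below the lower bound.
        return level - 1
    return level
```

Calling the function with `lower_bound=None` corresponds to the unbounded policy, while passing a bound (for example 60% of peak throughput) blocks scale-down steps that are predicted to violate it.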
Table 18 shows the power, throughput, and throughput per unit power in this case. In summary, experimental data show that when Chipwide DVFS is not enabled with a lower bound on throughput, MaxBIPS has the highest throughput. Although Chipwide DVFS gives the highest throughput per unit power, its throughput, on average, is lower than that of MaxBIPS, which can be a constraining factor in high throughput systems that require throughputs close to the budget. When Chipwide DVFS is lower-bounded to 60% of the peak throughput achievable by Chipwide DVFS, it produces the higher throughput and consumes the higher power of the two methods. This yields the lowest throughput per unit power for Chipwide DVFS, and MaxBIPS saves more power and achieves the highest throughput per unit power compared to the other two policies. Table 19 shows the relevant experimental results of the two policies with different packet types. Table 20 shows the average power, average throughput, average throughput per unit power, average energy and average latency (execution time) of the two power management policies while running about 7300 instructions for all the packet types (averaging is done over all packet types). Results show that, on average, Chipwide DVFS consumes 17.7% more energy than MaxBIPS and has 2.34 times its latency. Figures 28 and 29 show the dynamic and leakage power dissipation, along with the hardware implementation areas, for the Chipwide DVFS and MaxBIPS algorithms as the number of cores is scaled. The total power dissipation of the MaxBIPS DPM hardware for an 8 core processor is around 101 µW, which is small compared to the average 20% power saving achieved. This justifies the use of on-chip power management units, which enable substantial power saving while meeting the performance requirements of the packet processing application.
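The relationship between the reported energy and latency ratios can be made concrete with a short calculation. The sketch below assumes that energy equals average power multiplied by execution time; the source does not state how the averages in Table 20 are computed, so this is an assumption, and the function names are hypothetical.

```python
def throughput_per_power(throughput_ipns: float, power_w: float) -> float:
    """Throughput per unit power (T/P)."""
    return throughput_ipns / power_w


def energy_joules(avg_power_w: float, exec_time_s: float) -> float:
    """Assumed model: energy = average power x execution time."""
    return avg_power_w * exec_time_s


def implied_power_ratio(energy_ratio: float, latency_ratio: float) -> float:
    """Average power ratio implied by an energy ratio and a latency ratio."""
    return energy_ratio / latency_ratio


# Ratios quoted in the text: Chipwide DVFS relative to MaxBIPS.
energy_ratio = 1.177   # 17.7% more energy
latency_ratio = 2.34   # 2.34 times the latency
power_ratio = implied_power_ratio(energy_ratio, latency_ratio)
```

Under this assumption, Chipwide DVFS draws roughly half the average power of MaxBIPS yet still spends more energy, because it runs more than twice as long.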

Related Work
We first briefly survey existing general purpose processor simulators and then power-aware simulators. The authors of [15,38-43] study variations of power and throughput in heterogeneous architectures. B. C. Lee and D. M. Brooks minimize the overhead of micro-architectural design space exploration through statistical inference via regression models in [44]. The models are derived using fast simulations. The work in [45] differs from these in that it uses full-fledged simulations for predicting and comparing the performance and power of various architectures. A combination of analytic performance models and simulation-based performance models is used in [45] to guide design space exploration for sensor nodes. All these techniques rely on efficient processor simulators for architecture characterization.
Virtutech Simics [46] is a full-system scalable functional simulator for embedded systems. The released versions support microprocessors such as PowerPC, x86, ARM and MIPS. Simics is also capable of simulating any digital device and communication bus. The simulator is able to simulate anything from a simple CPU + memory, to a complex SoC, to a custom board, to a rack of multiple boards, or a network of many computer systems. Simics comes with a debugging toolset that includes reverse execution, tracing, fault-injection, checkpointing and other development tools. Similarly, Augmint [47] is an execution-driven multiprocessor simulator for Intel x86 architectures developed at the University of Illinois, Urbana-Champaign. It can simulate uniprocessors as well as multiprocessors. The inflexibility of Augmint arises from the fact that the user needs to modify the source code to customize the simulator to model a multiprocessor system. However, neither Simics nor Augmint is cycle-accurate, and both model processors that do not have open-sourced architectures or instruction sets; this limits the potential for their use by the research community. Another execution-driven simulator is RSIM [48], which models shared-memory multiprocessors that aggressively exploit instruction-level parallelism (ILP). It also models an aggressive coherent memory system and interconnects, including contention at all resources. However, throughput intensive applications which exploit task level parallelism are better implemented by the fine-grained multi-threaded cores that our proposed simulation framework models. Moreover, we plan to model simple in-order processor pipelines which enable thread schedulers to use small-latency scheduling, something vital for meeting real-time constraints. The General Execution-driven Multiprocessor Simulator (GEMS) [49] is an execution-driven simulator of SPARC-based multiprocessor systems. It relies on the functional processor simulator Simics and only provides cycle-accurate performance models when potential timing hazards are detected. GEMS Opal provides an out-of-order processor model. GEMS Ruby is a detailed memory system simulator. The GEMS Specification Language including Cache Coherence (SLICC) is designed to develop different memory hierarchies and cache coherence models. The advantages of our simulator over the GEMS platform include its ability to (i) carry out full-chip cycle-accurate simulation with guaranteed fidelity, which results in high confidence during broad micro-architecture explorations; and (ii) provide deep chip vision to the architect in terms of chip area requirement and run-time switching characteristics, energy consumption, and chip thermal profile.
SimFlex [50] is a simulator framework for large-scale multiprocessor systems. It includes (a) Flexus, a full-system simulation platform; and (b) SMARTS, a statistically derived model to reduce simulation time. It employs systematic sampling to measure only a very small portion of the entire application being simulated. A functional model is invoked between measurement periods, which greatly speeds up the overall simulation but results in a loss of accuracy and of flexibility for making fine micro-architectural changes, because any such change necessitates regeneration of the statistical functional models. SimFlex also includes an FPGA-based co-simulation platform called ProtoFlex. Our simulator can also be combined with an FPGA based emulation platform in the future, but this is beyond the scope of this work.
MPTLsim [51] is a uop-accurate, cycle-accurate, full-system simulator for multi-core designs based on the X86-64 ISA. MPTLsim extends PTLsim [52], a publicly available single core simulator, with a host of additional features to support hyperthreading within a core and multiple cores, with detailed models for caches, on-chip interconnections and the memory data flow. MPTLsim incorporates detailed simulation models for cache controllers and interconnections, and has built-in implementations of a number of cache coherency protocols. CASPER, in contrast, targets an open-sourced ISA and processor architecture which Sun Microsystems, Inc. has released under the OpenSPARC banner [4] for the research community.
NePSim2 [53] is an open source framework for analyzing and optimizing NP design and power dissipation at the architecture level. It uses a cycle-accurate simulator for Intel's multi-core IXP2xxx NPs, and incorporates an automatic verification framework for testing and validation, and a power estimation model for measuring the power consumption of the simulated NP. To the best of our knowledge, it is the only NP simulator available to the research community. NePSim2 has been evaluated with cryptographic benchmark applications along with a number of basic test cases. However, the simulator is not readily scalable to explore a wide variety of NP architectures.
Wattch [17], proposed by David Brooks et al., is a multi-core micro-architectural power estimation and simulation platform. Wattch enables users to estimate the power dissipation of superscalar out-of-order multi-core micro-architectures only. Out-of-order architectures consist of complex structures such as reservation stations, history-based branch predictors, common data-width buses and re-order buffers, and hence are not always power-efficient. For low-energy processors such as embedded processors and network processors, simple in-order processor pipelines are preferred due to their relatively low power consumption. McPAT [54], proposed by Norman Jouppi et al., is another multi-core micro-architectural power estimation tool. McPAT provides power estimation for both out-of-order and in-order micro-architectures. Although efficient in estimating power dissipation for a wide range of values of the architectural parameters, McPAT is not cycle-accurate and hence is incapable of capturing the dynamic interactions between the pipeline stages inside cores and other core-level micro-architectural structures, the shared memory structures and the interconnection network as all cores execute streams of instructions.

Conclusion
In this paper, CASPER, a cycle-accurate simulator for shared memory many-core processors, is presented. A variety of multi-threaded architectural parameters such as the number of cores, the number of threads per core, and cache sizes, to name a few, are tunable in the simulator. This allows the exploration of a vast many-core micro-architectural design space for throughput intensive high performance and embedded applications. Pre-characterized libraries containing scalable area, delay and power dissipation models of different hardware components are included in CASPER. This enables accurate estimation and monitoring of the dynamic and leakage power dissipation and the area of designs at the high level architecture exploration stage. Additional hardware controlled power management modules are designed in CASPER, which enable dynamic power saving. The power saving capabilities of two such dynamic power management algorithms, namely Chipwide DVFS and MaxBIPS, are discussed and their performance-power trade-offs are shown.

Figure 1. The shared memory processor model simulated in CASPER. N_C heterogeneous cores are connected to N_B banks of shared secondary cache via a crossbar interconnection network. Each core consists of pipeline stages S_0 to S_N, hardware threads T_0 to T_NT, L1 I/D caches and I/D miss queues.

Figure 2. Block diagram of the stages of an in-order FGMT pipeline.

Figure 3. Power dissipation transient for a single pipeline stage in a component. The area under the curve is the total energy consumption.
the total contribution of Stage1 is given by

$$\text{Power}_{\text{stage1}} = P_{\text{dyn+lkg}}\,(\text{INSTR}_I\ \text{from THR3}) + P_{\text{lkg}} + P_{\text{dyn+lkg}}\,(\text{INSTR}_J\ \text{from THR0}) + P_{\text{dyn+lkg}}\,(\text{INSTR}_K\ \text{from THR2}) + P_{\text{lkg}}$$

The shaded parts correspond to active states of the stage (dynamic + leakage power), while the dotted parts correspond to idle states of the stage (only leakage power). Note that different stages have different values of dynamic and leakage power dissipation.
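The accounting described here, where an active cycle contributes dynamic plus leakage power and an idle cycle contributes leakage power only, can be sketched for a single stage as follows. The function name, the per-cycle activity trace and the numeric power values are hypothetical and serve only to illustrate the idea that energy is the area under the power curve.

```python
from typing import Iterable


def stage_energy(activity: Iterable[bool], p_dyn: float, p_lkg: float,
                 cycle_time: float) -> float:
    """Energy of one pipeline stage over a trace of cycles.

    activity yields True for a cycle in which the stage holds an instruction
    (dynamic + leakage power) and False for an idle cycle (leakage only).
    """
    energy = 0.0
    for active in activity:
        power = p_dyn + p_lkg if active else p_lkg
        energy += power * cycle_time  # area under the power curve for this cycle
    return energy


# Pattern from the Stage1 expression: active, idle, active, active, idle.
trace = [True, False, True, True, False]
# Hypothetical values: 2.0 W dynamic, 0.5 W leakage, 1.0 s per cycle.
total = stage_energy(trace, p_dyn=2.0, p_lkg=0.5, cycle_time=1.0)
```

Since different stages have different dynamic and leakage values, a component-level estimate would call this once per stage with that stage's parameters and sum the results.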

Figure 4. Power profile of a pipelined component where multiple instructions exist in different stages. Dotted parts of the pipeline are in idle state and add to the leakage power dissipation. Shaded parts of the pipeline are active and contribute towards both dynamic and leakage power dissipation.

Figure 5. Dynamic voltage and frequency scaling (DVFS) for a multi-core processor.

Figure 6. Pictorial representation of IP packet header and payload processing in two packet instances of different types.

Figure 7. Power dissipation versus CPI-per-thread in a 1-core 1-thread 1 MB L2 cache processor where the data cache size is varied from 4 KB to 64 KB for packet type TYPE1.

Figure 8. Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 4-thread 1 MB L2 cache processor where the data cache size is varied from 4 KB to 64 KB for packet type TYPE1.

Figure 9. Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 8-thread 1 MB L2 cache processor where the data cache size is varied from 4 KB to 64 KB for packet type TYPE1.

Figure 11. Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 4-thread 1 MB L2 cache processor where the instruction cache size is varied from 4 KB to 64 KB for packet type TYPE1.

Figure 13. Power dissipation and CPI-per-core in a 1-core 1-thread 1 MB L2 cache processor where the store buffer size is varied from 4 to 16 for packet type TYPE1.

Figure 15. Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 8-thread 1 MB L2 cache processor where the store buffer size is varied from 4 to 16 for packet type TYPE1.

Figure 17. Power dissipation and packet bandwidth as the number of cores is scaled from 4 to 128. All the cores have N_T = 1 for packet type TYPE1.

Figure 19. Power dissipation and packet bandwidth as the number of cores is scaled from 4 to 128. All the cores have N_T = 4 for packet type TYPE1.

Figure 21. Power dissipation and packet bandwidth as the number of cores is scaled from 4 to 128. All the cores have N_T = 8 for packet type TYPE1.

Figure 24. Throughput per unit power data.
saving in case of Chipwide DVFS with and without lower bound on throughput is shown in Table

Figure 25. Power for Chipwide DVFS and MaxBIPS with a lower bound of throughput budget = 60% of peak throughput.

Figure 28. Dynamic power, leakage power and area of the Chipwide DVFS module in CASPER.

Table 1. Range and description of the core-level micro-architectural parameters.

Table 2. Range and description of chip-level micro-architectural parameters.

Table 3. Post-layout area, dynamic and leakage power of VHDL models.

Table 4. Regression model parameters of dynamic power of interconnection network.

Table 5. Regression model parameters of leakage power of interconnection network.
for each Core i {
    Get power dissipated by Core i in the last evaluation cycle;
    Get effective throughput of Core i in the last evaluation cycle;
    Sum up cumulative power dissipated by all cores in the last evaluation cycle;
    Sum up cumulative throughput of all cores in the last evaluation cycle;
}

for each Core i {
    Update every core's new dvfs level with values in Selected_combination;
}

Table 10. Packet processing functions in ENePBench.

Table 11. Packet types used in ENePBench.

Table 12. Performance targets for IP packet type.

Table 14. Baseline architecture for packet type TYPE1 to study the power-performance trade-offs in single-core designs.

Table 15. Baseline architecture used in the experiments for power management.

Table 18. Power, throughput, throughput per unit power of Chipwide DVFS with and without lower bound on throughput.

Table 19. Power, throughput, throughput per unit power of two policies for different packet types.

Table 20. Average power, average throughput, average throughput per unit power, average energy, and average execution time of the two discussed policies.