<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">jlpea</journal-id>
      <journal-title>Journal of Low Power Electronics and Applications</journal-title>
      <abbrev-journal-title abbrev-type="publisher">JLPEA</abbrev-journal-title>
      <abbrev-journal-title abbrev-type="pubmed">JLPEA</abbrev-journal-title>
      <issn pub-type="epub">2079-9268</issn>
      <publisher>
        <publisher-name>MDPI</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3390/jlpea2010030</article-id>
      <article-id pub-id-type="publisher-id">jlpea-02-00030</article-id>
      <article-categories>
        <subj-group>
          <subject>Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>CASPER: Embedding Power Estimation and Hardware-Controlled Power Management in a Cycle-Accurate Micro-Architecture Simulation Platform for Many-Core Multi-Threading Heterogeneous Processors</article-title>
      </title-group>
      
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Datta</surname>
            <given-names>Kushal</given-names>
          </name>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Mukherjee</surname>
            <given-names>Arindam</given-names>
          </name>
          <xref rid="c1-jlpea-02-00030" ref-type="corresp">*</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Cao</surname>
            <given-names>Guangyi</given-names>
          </name>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Tenneti</surname>
            <given-names>Rohith</given-names>
          </name>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Lakshmi</surname>
            <given-names>Vinay Vijendra Kumar</given-names>
          </name>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Ravindran</surname>
            <given-names>Arun</given-names>
          </name>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Joshi</surname>
            <given-names>Bharat S.</given-names>
          </name>
        </contrib>
      </contrib-group>
      <aff id="af1-jlpea-02-00030">ECE Department, UNC Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA; Email: <email>kdatta@uncc.edu</email> (K.D.); <email>gcao1@uncc.edu</email> (G.C.); <email>rtenneti@uncc.edu</email> (R.T.); <email>vvijendr@uncc.edu</email> (V.V.K.L.); <email>aravindr@uncc.edu</email> (A.R.); <email>bsjoshi@uncc.edu</email> (B.S.J.)</aff>
	  <author-notes>
        <corresp id="c1-jlpea-02-00030"><label>*</label> Author to whom correspondence should be addressed; Email: <email>amukherj@uncc.edu</email>; Tel.: +1-704-687-8417; Fax: +1-704-687-4762.</corresp>
      </author-notes>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>02</month>
        <year>2012</year>
      </pub-date>
      <pub-date pub-type="collection">
	  <month>03</month>
        <year>2012</year>
      </pub-date>
      <volume>2</volume>
      <issue>1</issue>
      <fpage>30</fpage>
      <lpage>68</lpage>
      <history>
        <date date-type="received">
          <day>20</day>
          <month>10</month>
          <year>2011</year>
        </date>
        <date date-type="rev-recd">
          <day>18</day>
          <month>01</month>
          <year>2012</year>
        </date>
        <date date-type="accepted">
          <day>23</day>
          <month>01</month>
          <year>2012</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2012 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
        <copyright-year>2012</copyright-year>
        <license xmlns:xlink="http://www.w3.org/1999/xlink" license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0/">
          <p>This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p>
        </license>
      </permissions>
      <abstract>
        <p>Despite the promising performance improvement observed in emerging many-core architectures in high performance processors, high power consumption prohibitively affects their use and marketability in low-energy sectors such as embedded processors, network processors and application specific instruction processors (ASIPs). While most chip architects design power-efficient processors by finding an optimal power-performance balance in their design, some use sophisticated on-chip autonomous power management units, which dynamically reduce the voltage or frequency of idle cores and hence extend battery life and reduce operating costs. For large scale designs of many-core processors, a holistic approach integrating both these techniques at different levels of abstraction can potentially achieve maximal power savings. In this paper we present CASPER, a robust instruction trace driven cycle-accurate many-core multi-threading micro-architecture simulation platform in which we have incorporated power estimation models for a wide variety of tunable many-core micro-architectural design parameters, thus enabling processor architects to explore a sufficiently large design space and achieve power-efficient designs. Additionally, CASPER is designed to accommodate cycle-accurate models of hardware controlled power management units, enabling architects to experiment with and evaluate different autonomous power-saving mechanisms and to study the run-time power-performance trade-offs in embedded many-core processors. We have implemented two such techniques in CASPER, <italic>Chipwide Dynamic Voltage and Frequency Scaling</italic> and <italic>Performance Aware Core-Specific Frequency Scaling</italic>, which show average power savings of 35.9% and 26.2%, respectively, on a baseline 4-core SPARC based architecture. These power savings account for the power consumption of the power management units themselves. The CASPER simulation platform also provides users with complete support for the SPARCV9 instruction set, enabling them to run a full operating system software stack and hence a wide variety of benchmarking applications.</p>
      </abstract>
      <kwd-group>
        <kwd>simulation</kwd>
        <kwd>processor architectures</kwd>
        <kwd>modeling</kwd>
        <kwd>power consumption</kwd>
        <kwd>performance evaluation and estimation</kwd>
        <kwd>dynamic power management unit</kwd>
        <kwd>hardware based power management</kwd>
        <kwd>power estimation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="intro">
      <title>1. Introduction</title>
      <p>Emerging instruction set based multi-core processors [<xref ref-type="bibr" rid="B1-jlpea-02-00030">1</xref>,<xref ref-type="bibr" rid="B2-jlpea-02-00030">2</xref>,<xref ref-type="bibr" rid="B3-jlpea-02-00030">3</xref>,<xref ref-type="bibr" rid="B4-jlpea-02-00030">4</xref>,<xref ref-type="bibr" rid="B5-jlpea-02-00030">5</xref>] are significantly larger and more complex than their dual and quad core predecessors [<xref ref-type="bibr" rid="B6-jlpea-02-00030">6</xref>]. Consisting of hundreds of cores on-chip, new heterogeneous <italic>many-cores</italic> are designed with bigger on-chip caches, complex interconnection topologies, and multiple customized IP cores for better performance. Large chip area, however, implies high leakage and dynamic power dissipation. Hence architects increasingly use on-chip power controllers that apply power-saving techniques such as power-gating, clock-gating and dynamic voltage and frequency scaling (DVFS) to minimize the overall power dissipation of their designs. Existing cycle-accurate processor simulators, which are extensively used for performance modeling and validation of processor micro-architectures [<xref ref-type="bibr" rid="B7-jlpea-02-00030">7</xref>,<xref ref-type="bibr" rid="B8-jlpea-02-00030">8</xref>], fall short in two respects. First, they do not accurately capture how the dimensions of, and dynamic interactions between, micro-architectural components (such as the number of cores, cache size and associativity, and interconnection network topology) affect the performance and power dissipation of simulated designs. Second, they do not accurately model the complex logic of the power controller. Both capabilities are essential for efficiently exploring the vast micro-architectural design space of heterogeneous many-core processors, which arises from the large number of design choices for the various micro-architectural parameters, and for arriving at designs with the right balance of power and performance.</p>
      <p>Popular contemporary cycle-accurate simulators such as MPTLSim and NepSim target superscalar architectures and network processor architectures, respectively, and fall short in covering important features such as hardware multi-threading, in-order instruction pipelines, custom IP cores and heterogeneity, which are some of the fundamental micro-architectural aspects of emerging many-core designs. Hardware multi-threading in the cores is used to exploit <italic>latency hiding</italic> [<xref ref-type="bibr" rid="B8-jlpea-02-00030">8</xref>,<xref ref-type="bibr" rid="B9-jlpea-02-00030">9</xref>]. Heterogeneous cores are used to achieve a better power-performance balance, for example in the Netronome NFP-32. State-of-the-art simulators such as Simics and M5, on the other hand, cover a wide range of micro-architectures, instruction set architectures and pre-existing processor architectures, but do not capture two key elements of processor research: (i) interfaces to control a large number of tunable micro-architectural parameters such as load miss queue size, store buffer size and branch prediction buffer size, to name a few; and (ii) cycle-accurate models of on-chip power management units which enable power gating, clock gating and DVFS. Finally, existing simulators often do not scale to hundreds of cores and do not support a complete operating system software stack, which restricts their usability for a wide range of applications.</p>
      <p>In this paper, we present the <bold>C</bold>hip multi-threading <bold>A</bold>rchitecture <bold>S</bold>imulator for <bold>P</bold>erformance, <bold>E</bold>nergy and a<bold>R</bold>ea (CASPER), a SPARCV9 instruction set architecture based, cycle-accurate, power-aware, heterogeneous multi-threading many-core processor micro-architecture simulation platform. The idea is to provide a simulation platform where the user can easily modify a wide range of tunable architectural parameters to evaluate the performance and estimate the pre-silicon leakage and dynamic power dissipation of their designs. The platform also provides interfaces through which users can design and develop cycle-accurate models of power management algorithms in CASPER and evaluate strategies to increase the energy efficiency of their designs. Our primary contributions include:</p>
      <list>
        <list-item>
          <p>(a) <bold>SPARCV9 ISA</bold>: CASPER is a simulation tool for cycle-accurate full system simulation of the 64-bit SPARCV9 instruction set architecture. Following the success of Oracle’s UltraSPARC T1, T2 and T3 processors and the open-sourcing of the T1 and T2 designs via the OpenSPARC project [<xref ref-type="bibr" rid="B10-jlpea-02-00030">10</xref>], there is increased interest in SPARC instruction set based processor designs, for example SimplyRISC [<xref ref-type="bibr" rid="B10-jlpea-02-00030">10</xref>], which makes this platform an important contribution.</p>
        </list-item>
        <list-item>
          <p>(b) <bold>In-Order Cores with Hardware Multi-Threading</bold>: A key shift in the design principle of many-core processors is the use of a large number of simple low power cores to exploit a high degree of parallelism, as seen in products such as Tilera [<xref ref-type="bibr" rid="B11-jlpea-02-00030">11</xref>], Intel Atom [<xref ref-type="bibr" rid="B12-jlpea-02-00030">12</xref>] and UltraSPARC T1 [<xref ref-type="bibr" rid="B13-jlpea-02-00030">13</xref>]. At the same time, hardware multi-threading is used in the cores to exploit latency hiding and increase overall throughput. However, single-thread performance critically depends upon the number of hardware threads in a core. Hence, CASPER is designed to simulate simple in-order multi-threaded cores parameterized in terms of the number of hardware threads per core.</p>
        </list-item>
        <list-item>
          <p>(c) <bold>Heterogeneous Cores</bold>: Exploiting heterogeneity enables designers to achieve optimal power-performance trade-offs in their designs. For example, for a write intensive application, which executes a large number of store instructions, a processor core with a deeper store buffer can be more energy efficient than a core designed with a large data cache [<xref ref-type="bibr" rid="B14-jlpea-02-00030">14</xref>,<xref ref-type="bibr" rid="B15-jlpea-02-00030">15</xref>]. This motivates us to design CASPER to simulate a set of heterogeneous cores in which each core can be structurally different from the others. In CASPER, a core can be optimized by tuning a large number of micro-architectural parameters, namely the number of hardware threads; the number of pipeline stages; the size, associativity and line size of the instruction and data caches (I$/D$); and the sizes of the instruction and data virtual-to-physical address translation buffers (I-TLB/D-TLB), branch prediction buffer, instruction miss queue, load miss queue and store buffer. Users can specify the structure of each core independently in CASPER.</p>
        </list-item>
        <list-item>
          <p>(d) <bold>Shared Memory</bold>: A low latency on-chip shared memory system, connected to the cores through an interconnection network, is used to optimize data sharing. Therefore, in addition to the per cycle behavior of the heterogeneous multi-threading cores, CASPER models the cycle by cycle functionality of chip level architectural features such as the number of shared memory banks, the control logic of the memory banks and the interconnection structure, in order to accurately capture the impact of all the micro-architectural features on overall processor performance and power.</p>
        </list-item>
        <list-item>
          <p>(e) <bold>Pre-Silicon Power Estimation</bold>: Although popular simulators do a good job of modeling performance through cycle accuracy and functional accuracy, they fall short in power estimation. As with performance, both dynamic and leakage power depend on the dimensions of, and dynamic interactions between, the micro-architectural features of a processor. In CASPER, hardware models of the architectural features are synthesized, placed and routed using technology files to derive dynamic and leakage power dissipation values. These values are used in a cycle-accurate manner during simulation. Hence, as technology files for new silicon technology generations become available, they can be used in CASPER to estimate the power dissipation of simulated designs. Currently, technology files from 90 nm to 32 nm are freely available [<xref ref-type="bibr" rid="B16-jlpea-02-00030">16</xref>] and can be used in CASPER. Power-aware simulation tools such as Wattch [<xref ref-type="bibr" rid="B17-jlpea-02-00030">17</xref>] also include both leakage and dynamic power dissipation models of the micro-architectural features of processors. We intend to compare the accuracy of our methodology with that of Wattch in future work.</p>
        </list-item>
        <list-item>
          <p>(f) <bold>Power Control Unit</bold>: Dynamic power management (DPM) in multi-core processors involves a set of techniques which perform power-efficient computations under real-time constraints to achieve system throughput goals while minimizing power. DPM is executed by an integrated power management unit (PMU), which is typically implemented in software, hardware or a combination thereof. The PMU monitors and manages the power and performance of each core by dynamically adjusting its operating voltage and frequency. Hardware-controlled power management eliminates the computation overhead that the processor incurs for software based power management while performing workload performance and power estimations. Hence hardware power management achieves a more accurate, real-time effect on workload performance than slower reacting software power management can. In CASPER, the PMU has a hierarchical structure. The local PMU exercises clock-gating at the stages of the instruction pipeline in the cores, while the global PMU enforces a power control policy in which DVFS and power-gating of a core are decided by analyzing its utilization and its wait times due to long latency memory accesses.</p>
        </list-item>
        <list-item>
          <p>(g) <bold>Operating Stack</bold>: The full SPARCV9 instruction set implemented in CASPER makes it more usable and programmable. The Solaris 5.10 operating system kernel is ported onto CASPER along with a complete libc software stack [<xref ref-type="bibr" rid="B18-jlpea-02-00030">18</xref>]. This enables users to run any application on a simulated processor. In this study, we have used ENePBench, a network packet processing benchmark, to evaluate many-core designs.</p>
        </list-item>
      </list>
      <p>CASPER is written in C++ and has been flexibly threaded to take advantage of a wide variety of multi-core machines. On an Oracle T1000 server, CASPER can simulate approximately 100K instructions per second per hardware thread. The rest of the paper is organized as follows. <xref ref-type="sec" rid="sec2-jlpea-02-00030">Section 2</xref> explains the processor model and the configurable parameters in CASPER. <xref ref-type="sec" rid="sec3-jlpea-02-00030">Section 3</xref> and <xref ref-type="sec" rid="sec4-jlpea-02-00030">Section 4</xref> explain the methodologies used to model performance and power/energy consumption in CASPER. <xref ref-type="sec" rid="sec5-jlpea-02-00030">Section 5</xref> explains the benchmarks used, and <xref ref-type="sec" rid="sec6-jlpea-02-00030">Section 6</xref> explains the outputs and capabilities of the simulation platform. <xref ref-type="sec" rid="sec7-jlpea-02-00030">Section 7</xref> discusses related work. Finally, we conclude and discuss our future work in Section 8.</p>
    </sec>
    <sec id="sec2-jlpea-02-00030">
      <title>2. Processor Micro-Architecture</title>
      <p>The processor model used in CASPER is shown in <xref ref-type="fig" rid="jlpea-02-00030-f001">Figure 1</xref>. Each core is organized as single-issue in-order with fine-grained multi-threading (FGMT) [<xref ref-type="bibr" rid="B7-jlpea-02-00030">7</xref>]. In-order FGMT cores utilize a single-issue six stage pipeline [<xref ref-type="bibr" rid="B8-jlpea-02-00030">8</xref>] shared between the hardware threads, enabling designers to: (i) achieve high per-core throughput through <italic>latency hiding</italic>; and (ii) minimize the power dissipation of a core by avoiding complex micro-architectural structures such as instruction issue queues, re-order buffers and history-based branch predictors typically used in superscalar and out-of-order micro-architectures. The six stage RISC pipeline extends the basic five-stage in-order instruction pipeline (Fetch, Decode, Execute, Memory and WriteBack). The sixth stage is a single-cycle thread switch scheduling stage which follows a round robin algorithm and selects an instruction from a hardware thread in the ready state to issue to the decode stage.</p>
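      <p>The round robin thread selection described above can be illustrated with a minimal sketch. It is written in Python for brevity (CASPER itself is written in C++), and the names <italic>ThreadState</italic> and <italic>select_thread</italic> are hypothetical and do not come from the CASPER source; the sketch only assumes that each hardware thread is either ready or waiting.</p>
      <preformat><![CDATA[
```python
from enum import Enum


class ThreadState(Enum):
    READY = "ready"
    WAITING = "waiting"  # e.g., stalled on a long latency operation


def select_thread(states, last_issued):
    """Return the index of the next READY thread after last_issued, or None.

    Scans at most len(states) threads, starting just after the thread that
    issued most recently, so every ready thread gets a turn.
    """
    n = len(states)
    for offset in range(1, n + 1):
        candidate = (last_issued + offset) % n
        if states[candidate] is ThreadState.READY:
            return candidate
    return None  # no thread is ready in this cycle


# Four hardware threads (hypothetical example); thread 1 is stalled.
states = [ThreadState.READY, ThreadState.WAITING,
          ThreadState.READY, ThreadState.READY]
issue_order = []
last = 3
for _ in range(6):
    last = select_thread(states, last)
    issue_order.append(last)
print(issue_order)  # [0, 2, 3, 0, 2, 3]
```
]]></preformat>
      <p>In this sketch the stalled thread is simply skipped, so the remaining ready threads share the issue slots evenly.</p>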
      <fig id="jlpea-02-00030-f001" position="anchor">
        <label>Figure 1</label>
        <caption>
          <p>The shared memory processor model simulated in CASPER. <italic>N<sub>C</sub></italic> heterogeneous cores are connected to <italic>N<sub>B</sub></italic> banks of shared secondary cache via a crossbar interconnection network. Each core consists of pipeline stages S0 to SN, hardware threads <italic>T<sub>0</sub></italic> to <italic>T<sub>NT</sub></italic>, L1 <italic>I/D</italic> caches and <italic>I/D</italic> miss queues.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g001.tif"/>
      </fig>
      <p>Each core contains <italic>N<sub>T</sub></italic> hardware threads. <italic>N<sub>T</sub></italic> is parameterized in CASPER and can differ from one core to another. The 64-bit pipeline in each core is divided into six stages: Instruction-Fetch (F-stage), Thread-Schedule (S-stage), Branch-and-Decode (D-stage), Execution (E-stage), Memory-Access (M-stage) and Write-back (W-stage), as shown in <xref ref-type="fig" rid="jlpea-02-00030-f002">Figure 2</xref>.</p>
      <p>The F-stage, implemented inside the Instruction Fetch Unit (IFU), includes the instruction address translation buffer (I-TLB), instruction cache (I$), missed instruction list (MIL) and the integer register file (IRF). The MIL is used to serialize I$ misses and send the corresponding miss packets from the core to the L2 cache. A similar structure called the Instruction Fetch Queue (IFQ) manages returning I$ miss packets. The sizes of the MIL and IFQ are parameterized in CASPER and are the same for all threads. The IRF contains a total of 160 registers, used to support the entire SPARCV9 instruction set. Each hardware thread privately owns its IRF, MIL and IFQ, whereas the I-TLB and I$ are shared. The S-stage, which is also part of the IFU, contains the thread scheduler and the thread finite state machine.</p>
	  <fig id="jlpea-02-00030-f002" position="anchor">
        <label>Figure 2</label>
        <caption>
          <p>Block diagram of the stages of an in-order FGMT pipeline.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g002.tif"/>
      </fig>
           
      <p>The D-stage includes a full SPARCV9 instruction set decoder as described in [<xref ref-type="bibr" rid="B19-jlpea-02-00030">19</xref>]. The Execution Unit (EXU), which contains a standard RISC 64-bit ALU, an integer multiplier and a divider, constitutes the E-stage of our pipeline. The Load Store Unit (LSU) is the top level module which includes the micro-architectural components of the M-stage and W-stage. It includes the data TLB (D-TLB), data cache (D$), address space identifier queue (ASIQ), load miss queue (LMQ) and store buffer (SB). The LMQ maintains D$ misses. The SB is used to serialize store instructions following the <italic>total store order</italic> (TSO) model. Stores are <italic>write through</italic>. Both the LMQ and SB are maintained separately for each hardware thread, while the D-TLB and D$ are shared. Special registers in SPARCV9 are accessed via the ASI queue, and ASI operations are categorized as long latency operations because these instructions are asynchronous to the pipeline operation [<xref ref-type="bibr" rid="B19-jlpea-02-00030">19</xref>]. Loads and stores are resolved in the L2 cache, and the returning packets are serialized and executed in the data fetch queue (DFQ), which is also part of the LSU. The Trap Logic Unit (TLU) of the SPARCV9 architecture used in CASPER is structurally similar to that of UltraSPARC T1 [<xref ref-type="bibr" rid="B20-jlpea-02-00030">20</xref>].</p>
      <p>In addition to hardware multi-threading, <italic>clock-gating</italic> [<xref ref-type="bibr" rid="B21-jlpea-02-00030">21</xref>] in the pipeline stages enables us to minimize power dissipation by eliminating the dynamic power of idle stages; active blocks, by contrast, consume both dynamic and leakage power. Long latency operations in a hardware thread are typically <italic>blocking</italic> in nature. In CASPER, long latency operations such as load misses and stores are <italic>non-blocking</italic>. We call this feature <italic>hardware scouting</italic>. This feature makes better use of deeper load miss queues than architectures such as UltraSPARC T1 [<xref ref-type="bibr" rid="B22-jlpea-02-00030">22</xref>,<xref ref-type="bibr" rid="B23-jlpea-02-00030">23</xref>], where the load miss queue contains only one entry. On average, this enhances the performance of a single thread by 2–5%. The complete set of tunable core-level micro-architectural parameters is shown in <xref ref-type="table" rid="jlpea-02-00030-t001">Table 1</xref>.</p>
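      <p>The effect of clock-gating on per-cycle power described above can be sketched as follows, under the simplifying assumption that a gated stage draws only leakage power while an active stage draws leakage plus dynamic power. The sketch is in Python for brevity; the class <italic>StagePower</italic>, the function <italic>cycle_power_mw</italic> and all numeric values are illustrative placeholders, not CASPER code or measured data.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePower:
    dynamic_mw: float  # drawn only while the stage is clocked and switching
    leakage_mw: float  # drawn whether or not the stage is clock-gated


def cycle_power_mw(stages, active):
    """Power for one cycle: leakage for every stage, dynamic for active ones."""
    total = 0.0
    for name, power in stages.items():
        total += power.leakage_mw
        if name in active:
            total += power.dynamic_mw
    return total


# Placeholder values for the six pipeline stages (not measurements).
stages = {
    "F": StagePower(4.0, 1.0),
    "S": StagePower(1.0, 0.25),
    "D": StagePower(2.0, 0.5),
    "E": StagePower(6.0, 1.5),
    "M": StagePower(5.0, 1.25),
    "W": StagePower(2.0, 0.5),
}

all_active = cycle_power_mw(stages, set(stages))
front_end_only = cycle_power_mw(stages, {"F", "S"})
print(all_active, front_end_only)  # 25.0 10.0
```
]]></preformat>
      <p>Summing such per-cycle values over a simulation run gives an energy estimate; in this example, gating the four back-end stages removes their dynamic contribution but leaves their leakage in place.</p>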
      <table-wrap id="jlpea-02-00030-t001" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t001_Table 1</object-id>
        <label>Table 1</label>
        <caption>
          <p>Range and description of the core-level micro-architectural parameters.</p>
        </caption>
        <table rules="all" style="border:solid thin">
          <thead>
            <tr>
              <th align="left" valign="middle">Name</th>
              <th align="left" valign="middle">Range</th>
              <th align="left" valign="middle">Increment</th>
              <th align="left" valign="middle">Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="left" valign="middle">1. <italic>N<sub>T</sub></italic></td>
              <td align="left" valign="middle">1 to 16</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Threads per core</td>
            </tr>
            <tr>
              <td align="left" valign="middle">2. MIL Size Per Thread</td>
              <td align="left" valign="middle">1 to 32</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Used to enqueue the I$ misses</td>
            </tr>
            <tr>
              <td align="left" valign="middle">3. IFQ Size Per Thread</td>
              <td align="left" valign="middle">1 to 32</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Used to process the returning I$ miss packets</td>
            </tr>
            <tr>
              <td align="left" valign="middle">4. ITLB Size</td>
              <td align="left" valign="middle">1 to 256</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Virtual to Physical address translation buffer for instruction addresses</td>
            </tr>
            <tr>
              <td align="left" valign="middle">5. L1 I<sub>Cache</sub> Associativity</td>
              <td align="left" valign="middle">2 to 32</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Set-associativity of I$</td>
            </tr>
            <tr>
              <td align="left" valign="middle">6. L1 I<sub>Cache</sub> Line Size</td>
              <td align="left" valign="middle">8 to 64</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Block size of I$</td>
            </tr>
            <tr>
              <td align="left" valign="middle">7. L1 I<sub>Cache</sub> Size</td>
              <td align="left" valign="middle">1 KB to 64 KB</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Total I$ size</td>
            </tr>
            <tr>
              <td align="left" valign="middle">8. DTLB Size</td>
              <td align="left" valign="middle">1 to 256</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Virtual to Physical address translation buffer for data addresses</td>
            </tr>
            <tr>
              <td align="left" valign="middle">9. L1 D<sub>Cache</sub> Associativity</td>
              <td align="left" valign="middle">2 to 8</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Set-associativity of D$</td>
            </tr>
            <tr>
              <td align="left" valign="middle">10. L1 D<sub>Cache</sub> Line Size</td>
              <td align="left" valign="middle">8 to 64</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Block size of D$</td>
            </tr>
            <tr>
              <td align="left" valign="middle">11. L1 D<sub>Cache</sub> Size</td>
              <td align="left" valign="middle">1 KB to 64 KB</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Total D$ size</td>
            </tr>
            <tr>
              <td align="left" valign="middle">12. Load Miss Queue (LMQ) Size Per Thread</td>
              <td align="left" valign="middle">1 to 32</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Used to enqueue all the D$ misses</td>
            </tr>
            <tr>
              <td align="left" valign="middle">13. SB Size Per Thread</td>
              <td align="left" valign="middle">1 to 64</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Used to serialize the store instructions following the TSO model [<xref ref-type="bibr" rid="B20-jlpea-02-00030">20</xref>]</td>
            </tr>
            <tr>
              <td align="left" valign="middle">14. DFQ Size Per Thread</td>
              <td align="left" valign="middle">1 to 32</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Used to enqueue all the packets returning from L2 cache</td>
            </tr>
            <tr>
              <td align="left" valign="middle">15. ASI Queue Size Per Thread</td>
              <td align="left" valign="middle">1 to 16</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Used to serialize all Address Space Identifier register reads/writes</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
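      <p>As an illustration of how the ranges in Table 1 constrain a core configuration, the following sketch checks a candidate configuration against them. It is written in Python for brevity; the parameter keys and the function <italic>validate_core</italic> are hypothetical names chosen for this example and are not CASPER's actual configuration interface. Only the numeric ranges and the power-of-2 increment are taken from the table.</p>
      <preformat><![CDATA[
```python
# (minimum, maximum) per parameter, copied from Table 1.
# Every parameter in the table steps by powers of 2. Key names are hypothetical.
CORE_PARAM_RANGES = {
    "threads_per_core": (1, 16),
    "mil_size": (1, 32),
    "ifq_size": (1, 32),
    "itlb_size": (1, 256),
    "icache_assoc": (2, 32),
    "icache_line_size": (8, 64),
    "icache_size_kb": (1, 64),
    "dtlb_size": (1, 256),
    "dcache_assoc": (2, 8),
    "dcache_line_size": (8, 64),
    "dcache_size_kb": (1, 64),
    "lmq_size": (1, 32),
    "sb_size": (1, 64),
    "dfq_size": (1, 32),
    "asi_queue_size": (1, 16),
}


def is_power_of_two(value):
    return value > 0 and (value & (value - 1)) == 0


def validate_core(config):
    """Return a list of problems; an empty list means the config is valid."""
    problems = []
    for name, (low, high) in CORE_PARAM_RANGES.items():
        if name not in config:
            problems.append(f"{name}: missing")
            continue
        value = config[name]
        if not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
        elif not is_power_of_two(value):
            problems.append(f"{name}: {value} is not a power of 2")
    return problems


# Smallest legal core, then a variant with two illegal settings.
small_core = {name: low for name, (low, _) in CORE_PARAM_RANGES.items()}
bad_core = dict(small_core, threads_per_core=3, sb_size=128)
print(validate_core(small_core))  # []
print(validate_core(bad_core))
```
]]></preformat>
      <p>Because each core is specified independently, a heterogeneous design would run such a check once per core rather than once per chip.</p>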
      <p>At the chip level, <italic>N<sub>C</sub></italic> cores are connected to the inclusive unified L2 cache via a crossbar interconnection network. The L2 cache is divided into <italic>N<sub>B</sub></italic> banks. <italic>N<sub>C</sub></italic> and <italic>N<sub>B</sub></italic> are parameterized in CASPER. Each L2 cache bank maintains a separate queue for each core, which holds the instruction miss, load miss and store packets sent from that core to the L2 cache. An arbiter inside each bank selects packets from the different queues in a round-robin fashion. The length of the queue is an important parameter as it affects the processing time of each L2 cache access. The complete set of tunable chip-level micro-architectural parameters is shown in <xref ref-type="table" rid="jlpea-02-00030-t002">Table 2</xref>.</p>
      <table-wrap id="jlpea-02-00030-t002" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t002_Table 2</object-id>
        <label>Table 2</label>
        <caption>
          <p>Range and description of chip-level micro-architectural parameters.</p>
        </caption>
        <table rules="all" style="border:solid thin">
          <thead>
            <tr align="center">
              <th valign="middle">Name</th>
              <th valign="middle">Range</th>
              <th valign="middle">Increment</th>
              <th valign="middle">Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="left" valign="middle">16. L2$<sub>QSize</sub></td>
              <td align="left" valign="middle">4 to 16</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">L2 cache input queue size per core</td>
            </tr>
            <tr>
              <td align="left" valign="middle">17. Size</td>
              <td align="left" valign="middle">4 to 512 MB</td>
              <td align="left" valign="middle">1 MB</td>
              <td align="left" valign="middle">Total L2 size</td>
            </tr>
            <tr>
              <td align="left" valign="middle">18. Associativity</td>
              <td align="left" valign="middle">8 to 64</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Set-associativity of L2 cache</td>
            </tr>
            <tr>
              <td align="left" valign="middle">19. Line Size</td>
              <td align="left" valign="middle">8 to 128</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Line size of L2 cache</td>
            </tr>
            <tr>
              <td align="left" valign="middle">20. <italic>N<sub>B</sub></italic>
              </td>
              <td align="left" valign="middle">4 to 128</td>
              <td align="left" valign="middle">Power of 2</td>
              <td align="left" valign="middle">Number of L2 banks</td>
            </tr>
            <tr>
              <td align="left" valign="middle">21. <italic>N<sub>C</sub></italic>
              </td>
              <td align="left" valign="middle">1 to 250</td>
              <td align="left" valign="middle">Increment of 1</td>
              <td align="left" valign="middle">Number of processing cores</td>
            </tr>
          </tbody>
        </table>
		</table-wrap>
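      <p>The per-core input queues and the round-robin arbiter of an L2 cache bank can be pictured with a minimal Python sketch. This is an illustration only, not CASPER code: the class name <monospace>L2BankArbiter</monospace> and its methods are hypothetical, and the behaviour of rejecting a packet when a queue is full is an assumption.</p>
      <preformat><![CDATA[
```python
from collections import deque


class L2BankArbiter:
    """Toy model of one L2 bank: one bounded input queue per core, drained round-robin."""

    def __init__(self, num_cores, queue_size):
        self.queue_size = queue_size      # plays the role of L2$QSize in Table 2
        self.queues = [deque() for _ in range(num_cores)]
        self.next_core = 0                # core whose queue is polled first

    def enqueue(self, core_id, packet):
        """Accept a packet from a core; return False if that core's queue is full (assumed)."""
        queue = self.queues[core_id]
        if len(queue) >= self.queue_size:
            return False
        queue.append(packet)
        return True

    def select(self):
        """Return (core_id, packet) in round-robin order, or None if every queue is empty."""
        num_cores = len(self.queues)
        for offset in range(num_cores):
            core_id = (self.next_core + offset) % num_cores
            if self.queues[core_id]:
                # Start the next search just after the core that was served.
                self.next_core = (core_id + 1) % num_cores
                return core_id, self.queues[core_id].popleft()
        return None
```
]]></preformat>
      <p>In this sketch a core with several pending packets cannot starve the others, because the search for the next packet always restarts after the core that was served last.</p>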
      <p>Coherency is maintained in the L2 cache using a <italic>directory-based</italic> cache coherency protocol. The L2 cache arrays contain the reverse directories of both the instruction and data L1 caches of all the cores. A read miss at the L2 cache fills an entire L2 cache line and the corresponding line in the originating L1 instruction or data cache. All subsequent reads of the same address from any other core result in an L2 cache hit and are served directly from the L2 cache. The L2 cache is inclusive and unified; hence it contains all the blocks that are present in the L1 instruction or data caches of the cores. Stores are always committed at the L2 cache first, following the <italic>write-through</italic> protocol, so the L2 cache always holds the most recent data. Once a write is committed in the L2 cache, all matching directory entries are invalidated; that is, the corresponding L1 instruction and data cache entries are invalidated using special L2-cache-to-core messages.</p>
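      <p>The read and write behaviour described above can be sketched in a few lines of Python. The sketch is a simplification under stated assumptions, not the CASPER model: the class <monospace>InclusiveL2</monospace>, its dictionary-based directory and the invalidation log are hypothetical, a store is assumed to allocate the line in the L2 cache, and evictions are ignored.</p>
      <preformat><![CDATA[
```python
class InclusiveL2:
    """Toy inclusive, write-through L2 with a reverse directory of L1 copies."""

    def __init__(self):
        self.data = {}            # line address -> value held in L2
        self.sharers = {}         # line address -> set of (core_id, "I" or "D")
        self.invalidations = []   # log of (core_id, cache, address) messages to cores

    def read(self, core_id, cache, addr, memory):
        """Serve an L1 miss; returns (value, hit_in_l2)."""
        hit = addr in self.data
        if not hit:
            self.data[addr] = memory[addr]      # L2 read miss: fill the L2 line
        # Record that this L1 cache now holds a copy of the line.
        self.sharers.setdefault(addr, set()).add((core_id, cache))
        return self.data[addr], hit

    def write(self, core_id, addr, value):
        """Commit a store in L2 first, then invalidate every matching L1 copy."""
        self.data[addr] = value                  # assumption: the store allocates in L2
        for sharer_core, sharer_cache in sorted(self.sharers.pop(addr, set())):
            self.invalidations.append((sharer_core, sharer_cache, addr))
```
]]></preformat>
      <p>After a write, the directory entry for the line is empty, so the next read from any core is served from the L2 cache and re-registers that core as a sharer.</p>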
      <sec>
        <title>2.1. Performance Measurement</title>
        <p>Counters are used in each core in CASPER to measure, every clock cycle, the number of completed instructions for each hardware thread individually (Instr<sub>THREAD</sub>) as well as for the entire core (Instr<sub>CORE</sub>). Counters are also attached to the L2 cache banks to monitor the load/store accesses from the cores. This enables the user to estimate the average wait time for loads/stores per core and per hardware thread in each core. The wait time comprises the time each load/store instruction spends in the LMQ or SB, respectively, the propagation time in the interconnection network, and the total L2 cache access time.</p>
      </sec>
      <sec>
        <title>2.2. Component-Level Power Dissipation Modeling</title>
        <p>To accurately model the area and power dissipation of the architectural components, we have (i) designed scalable hardware models of all pipelined and non-pipelined components of the processor in terms of the corresponding architectural parameters; and (ii) derived the power dissipation (dynamic + leakage) of the component models (written in VHDL) using the commercial synthesis tool Design Vision from Synopsys [<xref ref-type="bibr" rid="B24-jlpea-02-00030">24</xref>], which targets the Berkeley 45 nm Predictive Technology Model (PTM) technology library [<xref ref-type="bibr" rid="B25-jlpea-02-00030">25</xref>], and the placement and routing tool Encounter from Cadence [<xref ref-type="bibr" rid="B26-jlpea-02-00030">26</xref>]. The area and power dissipation values of the I$ and D$ are derived using Cacti 4.0 [<xref ref-type="bibr" rid="B27-jlpea-02-00030">27</xref>]. For the parameterized non-pipelined micro-architectural components in a core, such as the LMQ, SB, MIL, IFQ, DFQ and I/D-TLB, area and power are obtained at a 1 GHz clock and stored in lookup tables indexed by the value of the corresponding micro-architectural parameter. During simulation, the values from the lookup tables are used to calculate the power dissipation of the core by capturing the <italic>activity factor α</italic>(<italic>t</italic>) and integrating the product of power dissipation and <italic>α</italic>(<italic>t</italic>) over simulation time. The following equation is used to calculate the power dissipation of a pipeline stage:</p>
		<disp-formula>
		<italic>P<sub>stage</sub></italic>(<italic>t</italic>) = <italic>P<sub>leakage</sub></italic>(<italic>t</italic>) + <italic>αP<sub>dynamic</sub></italic>(<italic>t</italic>)
		<label>(1)</label>
		</disp-formula>

        
        <p>where <italic>α</italic> is the <italic>activity factor</italic> of that stage (<italic>α</italic> = 1 if that stage is <italic>active</italic>; <italic>α</italic> = 0 otherwise) which is reported by CASPER; <italic>P<sub>leakage</sub></italic> and <italic>P<sub>dynamic</sub></italic> are the pre-characterized leakage and dynamic power dissipations of the stage respectively. The pre-characterized values of area, leakage and dynamic power of core-level architectural blocks are shown in <xref ref-type="table" rid="jlpea-02-00030-t003">Table 3</xref>. The HDL models of all the core-level architectural blocks have been functionally validated using exhaustive test benches.</p>
        <table-wrap id="jlpea-02-00030-t003" position="anchor">
          <object-id pub-id-type="pii">jlpea-02-00030-t003_Table 3</object-id>
          <label>Table 3</label>
          <caption>
            <p>Post-Layout Area, Dynamic and Leakage Power of VHDL Models.</p>
          </caption>
          <table rules="all" style="border:solid thin">
<thead>
              <tr align="center">
                <th valign="middle">HDL Model</th>
                <th valign="middle">Area (mm<sup>2</sup>)</th>
                <th valign="middle">Dynamic Power (mW)</th>
                <th valign="middle">Leakage Power (μW)</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" valign="middle">RAM (16)</td>
                <td align="center" valign="middle">0.022</td>
                <td align="center" valign="middle">1.03</td>
                <td align="center" valign="middle">17.81</td>
              </tr>
              <tr>
                <td align="left" valign="middle">CAM (16)</td>
                <td align="center" valign="middle">0.066</td>
                <td align="center" valign="middle">3.51</td>
                <td align="center" valign="middle">67.70</td>
              </tr>
              <tr>
                <td align="left" valign="middle">FIFO (16) for 8 threads</td>
                <td align="center" valign="middle">0.3954</td>
                <td align="center" valign="middle">165</td>
                <td align="center" valign="middle">1200.00</td>
              </tr>
              <tr>
                <td align="left" valign="middle">TLB (64)</td>
                <td align="center" valign="middle">0.0178</td>
                <td align="center" valign="middle">21.11</td>
                <td align="center" valign="middle">92.60</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Cache (32 KB)</td>
                <td align="center" valign="middle">0.0149</td>
                <td align="center" valign="middle">28.3</td>
                <td align="center" valign="middle">-</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Integer Register File</td>
                <td align="center" valign="middle">0.5367</td>
                <td align="center" valign="middle">11.92</td>
                <td align="center" valign="middle">4913.7</td>
              </tr>
              <tr>
                <td align="left" valign="middle">IFU</td>
                <td align="center" valign="middle">0.0451</td>
                <td align="center" valign="middle">3280.1</td>
                <td align="center" valign="middle">378.39</td>
              </tr>
              <tr>
                <td align="left" valign="middle">EXU</td>
                <td align="center" valign="middle">0.0307</td>
                <td align="center" valign="middle">786.99</td>
                <td align="center" valign="middle">301.94</td>
              </tr>
              <tr>
                <td align="left" valign="middle">LSU</td>
                <td align="center" valign="middle">0.8712</td>
                <td align="center" valign="middle">5495.3</td>
                <td align="center" valign="middle">6848.30</td>
              </tr>
              <tr>
                <td align="left" valign="middle">TLU</td>
                <td align="center" valign="middle">0.064</td>
                <td align="center" valign="middle">1302.2</td>
                <td align="center" valign="middle">553.8458</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Multiplier</td>
                <td align="center" valign="middle">0.0324</td>
                <td align="center" valign="middle">23.74</td>
                <td align="center" valign="middle">383.88</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Crossbar–8 cores × 4 L2 banks</td>
                <td align="center" valign="middle">0.2585</td>
                <td align="center" valign="middle">50.92</td>
                <td align="center" valign="middle">1390</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The activity factor <italic>α</italic> is derived by tracking the switching of all the components in all the stages of the cores on a per-cycle basis. As a given instruction passes through the stages of the instruction pipeline inside a core, the simulator tracks (i) the intra-core components that are actively involved in the execution of that instruction; and (ii) the cycles during which that instruction uses any particular pipeline stage of a given component. Any component, or any stage inside a component, is assumed to be in one of two states: <italic>idle</italic> (not involved in the execution of an instruction) or <italic>active</italic> (processing an instruction). For example, in the case of a D$ load miss, the miss is identified in the M-stage. The load instruction is then added to the LMQ, and the W-stage is set to the <italic>idle</italic> state for the next clock cycle. A non-pipelined component is treated as a special case of a single-stage pipelined one. We consider only leakage power dissipation in the <italic>idle</italic> state and both leakage and dynamic power dissipation in the <italic>active</italic> state. <xref ref-type="fig" rid="jlpea-02-00030-f003">Figure 3</xref> shows the total power dissipation of a single representative pipeline stage in one component. Note that the total power reduces to just the leakage part in the absence of a valid instruction in that stage (<italic>idle</italic>), and the average dynamic power of the stage is added when an instruction is processed (<italic>active</italic>).</p>
        <fig id="jlpea-02-00030-f003" position="anchor">
          <label>Figure 3</label>
          <caption>
            <p>Power Dissipation transient for a single pipeline stage in a component. The area under the curve is the total Energy consumption.</p>
          </caption>
          <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g003.tif"/>
        </fig>
        <p>A pipeline stage of a component switches to the <italic>active</italic> state when it receives an <italic>instruction ready</italic> signal from the previous stage. In the absence of the <italic>instruction ready</italic> signal, the stage switches back to the <italic>idle</italic> state. Note that the <italic>instruction ready</italic> signal is used to clock-gate (disable the clock to all logic of) an entire component or a single pipeline stage inside the component to save dynamic power. Hence, we consider only leakage power dissipation in the absence of an active instruction. A stage holding an instruction that is waiting for a memory access, or that is stalled because of a prior long-latency operation, is assumed to be in the <italic>active</italic> state.</p>
        <p><xref ref-type="fig" rid="jlpea-02-00030-f004">Figure 4</xref> illustrates the methodology of power measurement in the pipelined components in CASPER as seen during 5 clock cycles. The blue dotted lines show the amount of power dissipated by the pipeline stage as an instruction from a particular hardware thread is executed. Time increases vertically. For example, for the 5 clock cycles as shown in <xref ref-type="fig" rid="jlpea-02-00030-f004">Figure 4</xref> the total contribution of Stage1 is given by</p>
        <disp-formula>
		<italic>Power<sub>stage</sub></italic><sub>1</sub> = <italic>P<sub>dyn+lkg</sub></italic>(<italic>due to instruction INSTR<sup>I</sup> from thread THR</italic>3) + <italic>P<sub>lkg</sub></italic> + <italic>P<sub>dyn+lkg</sub></italic>(<italic>due to instruction INSTR<sup>J</sup> from thread THR</italic>0) + <italic>P<sub>dyn+lkg</sub></italic>(<italic>due to instruction INSTR<sup>K</sup> from thread THR</italic>2) + <italic>P<sub>lkg</sub></italic>
		</disp-formula>
        <p>The shaded parts correspond to <italic>active</italic> states of the stage (dynamic + leakage power), while the dotted parts correspond to <italic>idle</italic> states of the stage (only leakage power). Note that different stages have different values of dynamic and leakage power dissipations.</p>
      <fig id="jlpea-02-00030-f004" position="anchor">
          <label>Figure 4</label>
          <caption>
            <p>Power profile of a pipelined component where multiple instructions exist in different stages. Dotted parts of the pipeline are in idle state and add to the leakage power dissipation. Shaded parts of the pipeline are active and contribute towards both dynamic and leakage power dissipations.</p>
          </caption>
          <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g004.tif"/>
        </fig>
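        <p>As a worked illustration of Equation (1), the following Python sketch accumulates the energy of a single stage over an activity trace. It is a sketch under stated assumptions rather than the CASPER implementation: the function names are hypothetical, the five-cycle trace follows the order of the terms in the Stage1 expression above (active, idle, active, active, idle), and the EXU values from Table 3 are applied to a single stage purely as example numbers.</p>
        <preformat><![CDATA[
```python
def stage_power_w(active, p_dynamic_w, p_leakage_w):
    """Equation (1): leakage is always paid, dynamic power only when the stage is active."""
    alpha = 1 if active else 0
    return p_leakage_w + alpha * p_dynamic_w


def stage_energy_j(activity_trace, p_dynamic_w, p_leakage_w, clock_hz=1e9):
    """Sum per-cycle power multiplied by the cycle time over an activity trace."""
    cycle_s = 1.0 / clock_hz
    return sum(stage_power_w(active, p_dynamic_w, p_leakage_w) * cycle_s
               for active in activity_trace)


# Five cycles: active, idle, active, active, idle (assumed ordering).
trace = [True, False, True, True, False]
# EXU figures from Table 3 (786.99 mW dynamic, 301.94 uW leakage), used as an example only.
energy = stage_energy_j(trace, p_dynamic_w=786.99e-3, p_leakage_w=301.94e-6)
```
]]></preformat>
        <p>With a 1 GHz clock, the three active cycles contribute three times the dynamic power and all five cycles contribute leakage, which corresponds to the area under the curve in Figure 3.</p>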
        
	  </sec>
      <sec>
        <title>2.3. Power Dissipation Modeling of L2 Cache and Interconnection Network</title>
        <p>The power and area of the L2 cache and the interconnection network, however, depend on the number of cores, which makes it prohibitively time-consuming to synthesize, place and route all possible combinations. Hence, we have used <italic>multiple linear regression</italic> [<xref ref-type="bibr" rid="B28-jlpea-02-00030">28</xref>] for this purpose. The training set required to derive the regression models of dynamic power and area of the L2 cache arrays is obtained by running Cacti 4.0 [<xref ref-type="bibr" rid="B27-jlpea-02-00030">27</xref>] for different configurations of L2 cache size, associativity and line size. The dynamic power dissipation of the L2 cache, measured in watts, is related to the size in megabytes, the associativity and the number of banks <italic>N<sub>B</sub></italic> as shown in Equation (2). The error of estimate found for this model ranges between 0.524 and 2.09. Errors are estimated by comparing the model-predicted and measured values of dynamic power dissipation for configurations of L2 cache size, associativity, line size and number of banks that differ from those used to derive the training set.</p>
        <disp-formula>
		L2<italic><sub>dynamic_power</sub></italic> = <italic>c</italic>0 + <italic>c</italic>1 × Size + <italic>c</italic>2 × Associativity - <italic>c</italic>3 × <italic>N<sub>B</sub></italic> 
		<label>(2)</label>
		</disp-formula>
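        <p>To make the fitting step concrete, the following Python sketch fits a model of the form of Equation (2) by ordinary least squares. It is an illustration only: the paper reports using SPSS for the crossbar models and does not list the L2 coefficients, so the function names are hypothetical and the training set below is generated from a made-up linear model, not taken from Cacti or from the paper. The fitted coefficient of <italic>N<sub>B</sub></italic> is signed, so it corresponds to −<italic>c</italic>3 in Equation (2).</p>
        <preformat><![CDATA[
```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) c = X^T y.

    Each row is (size_mb, associativity, num_banks); an intercept column is added.
    Returns [c0, c_size, c_assoc, c_banks] with signed coefficients.
    """
    xs = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(xs[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(x[i] * x[j] for x in xs) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(xs, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(coeffs, size_mb, associativity, num_banks):
    """Evaluate the fitted model for one L2 configuration."""
    c0, c_size, c_assoc, c_banks = coeffs
    return c0 + c_size * size_mb + c_assoc * associativity + c_banks * num_banks


# Placeholder training set, NOT data from the paper: targets come from a made-up model.
rows = [(4, 8, 4), (8, 16, 4), (16, 8, 8), (32, 32, 16), (64, 16, 32), (128, 64, 8)]
targets = [0.5 + 0.02 * s + 0.01 * a - 0.03 * b for s, a, b in rows]
coeffs = fit_linear(rows, targets)
```
]]></preformat>
        <p>Validation as described in the text would then compare <monospace>predict</monospace> against measured values for configurations that were held out of the training set.</p>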
        <p>Similarly, the dynamic and leakage power dissipation of a crossbar interconnection network, measured in milliwatts, are given by Equation (3) and Equation (4), respectively. The training set required to derive the regression models of dynamic and leakage power of the crossbar is obtained by synthesizing its hardware model, parameterized in terms of the number of L2 cache banks and the number of cores, using Synopsys Design Vision and Cadence Encounter. Note that dynamic and leakage power are related to the square of the number of cores (<italic>N<sub>C</sub></italic>) and number of cache banks (<italic>N<sub>B</sub></italic>), which means that the power dissipation scales super-linearly with the number of banks and cores. Thus, crossbar interconnects are not scalable. However, they provide the high bandwidth required in our many-core designs. The regression model parameters and 95% confidence intervals reported by the statistical tool SPSS [<xref ref-type="bibr" rid="B29-jlpea-02-00030">29</xref>] for dynamic and leakage power dissipation are shown in <xref ref-type="table" rid="jlpea-02-00030-t004">Table 4</xref> and <xref ref-type="table" rid="jlpea-02-00030-t005">Table 5</xref>, respectively. The confidence intervals show that the strength of the models is high. The models are further validated by comparing the model-predicted power values with values measured by synthesizing 5 more combinations of number of L2 cache banks and number of cores that are not in the training set. We found that in these two cases, the errors of estimate range between 0.7 and 10.64.</p>
        <disp-formula id="jlpea-02-00030-i001">
		<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-i001.tif"/>
		<label>(3)</label>
		</disp-formula>

        <disp-formula id="jlpea-02-00030-i002">
		<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-i002.tif"/>
		<label>(4)</label>
		</disp-formula>
      <table-wrap id="jlpea-02-00030-t004" position="anchor">
          <object-id pub-id-type="pii">jlpea-02-00030-t004_Table 4</object-id>
          <label>Table 4</label>
          <caption>
            <p>Regression model parameters of dynamic power of interconnection network.</p>
          </caption>
          <table rules="all" style="border:solid thin">
            <thead>
              <tr align="center">
                <th valign="middle"> </th>
                <th valign="middle"> </th>
                <th colspan="2" valign="middle">95% Confidence Interval</th>
              </tr>
              <tr align="center">
                <th valign="middle">Parameter</th>
                <th valign="middle">Standard Error</th>
                <th valign="middle">Lower Bound</th>
                <th valign="middle">Upper Bound</th>
              </tr>
            </thead>
            <tbody>
              <tr align="center">
                <td valign="middle"><italic>b</italic>0</td>
                <td valign="middle">0.003</td>
                <td valign="middle">0.191</td>
                <td valign="middle">0.203</td>
              </tr>
              <tr align="center">
                <td valign="middle"><italic>b</italic>1</td>
                <td valign="middle">0.198</td>
                <td valign="middle">1.628</td>
                <td valign="middle">2.501</td>
              </tr>
            </tbody>
          </table>
		  </table-wrap>
		  
		  
        <table-wrap id="jlpea-02-00030-t005" position="anchor">
          <object-id pub-id-type="pii">jlpea-02-00030-t005_Table 5</object-id>
          <label>Table 5</label>
          <caption>
            <p>Regression model parameters of leakage power of interconnection network.</p>
          </caption>
          <table rules="all" style="border:solid thin">
            <thead>
              <tr align="center">
                <th valign="middle"> </th>
                <th valign="middle"> </th>
                <th colspan="2" valign="middle">95% Confidence Interval</th>
              </tr>
              <tr align="center">
                <th valign="middle">Parameter</th>
                <th valign="middle">Standard Error</th>
                <th valign="middle">Lower Bound</th>
                <th valign="middle">Upper Bound</th>
              </tr>
            </thead>
            <tbody>
              <tr align="center">
                <td valign="middle"><italic>b</italic>0</td>
                <td valign="middle">0.001</td>
                <td valign="middle">0.016</td>
                <td valign="middle">0.017</td>
              </tr>
              <tr align="center">
                <td valign="middle"><italic>b</italic>1</td>
                <td valign="middle">0.018</td>
                <td valign="middle">0.180</td>
                <td valign="middle">0.255</td>
              </tr>
            </tbody>
          </table></table-wrap>
        
	  </sec>
      <sec>
        <title>2.4. Hardware Controlled Power Management in CASPER</title>
        <p>Two hardware-controlled power management algorithms, Chipwide DVFS and MaxBIPS, as proposed in [<xref ref-type="bibr" rid="B30-jlpea-02-00030">30</xref>], are implemented in CASPER. Note that both algorithms re-evaluate the voltage-frequency operating levels of the different cores once every evaluation cycle. Unless explicitly stated otherwise, one evaluation cycle corresponds to 1024 processor clock cycles in our simulations. The DVFS-based GPMU algorithms rely on the assumption that when a given core switches from power mode <italic>A</italic> (voltage_<italic>A</italic>, frequency_<italic>A</italic>) in time interval <italic>N</italic> to power mode <italic>B</italic> (voltage_<italic>B</italic>, frequency_<italic>B</italic>) in time interval <italic>N</italic> + 1, the power and throughput in time interval <italic>N</italic> + 1 can be predicted using Equation (1). Note that the system frequency needs to scale along with the voltage to ensure that the operating frequency meets the timing constraints of the circuit, whose delay changes linearly with the operating voltage [<xref ref-type="bibr" rid="B31-jlpea-02-00030">31</xref>]. The prediction also assumes that the workload characteristics do not change from one time interval to the next, and that there are no shared resource dependencies between tasks and cores. <xref ref-type="table" rid="jlpea-02-00030-t006">Table 6</xref> explains the dependencies of power and throughput on the voltage and frequency levels of the cores.</p>
        <table-wrap id="jlpea-02-00030-t006" position="anchor">
          <object-id pub-id-type="pii">jlpea-02-00030-t006_Table 6</object-id>
          <label>Table 6</label>
          <caption>
            <p>Relationship of power and throughput in time interval <italic>N</italic> and <italic>N</italic> + 1.</p>
          </caption>
          <table rules="all" style="border:solid thin">
            <thead>
              <tr align="center">
                <th valign="middle">Time Interval</th>
                <th valign="middle">
                  <italic>N</italic>                </th>
                <th valign="middle"><italic>N</italic> + 1</th>
              </tr>
            </thead>
            <tbody>
              <tr align="center">
                <td rowspan="2" valign="middle">Mode</td>
                <td rowspan="2" valign="middle">(<italic>v</italic>, <italic>f</italic>)</td>
                <td valign="middle">(<italic>v’</italic>, <italic>f’</italic>)</td>
              </tr>
              <tr align="center">
                <td valign="middle"><italic>f’</italic> = <italic>f</italic> (<italic>v’</italic>/<italic>v</italic>)</td>
              </tr>
              <tr align="center">
                <td valign="middle">Throughput</td>
                <td valign="middle">
                  <italic>T</italic>                </td>
                <td valign="middle"><italic>T’</italic> = <italic>T</italic> × 
				(<italic>f’</italic>/<italic>f</italic>)</td>
              </tr>
              <tr align="center">
                <td valign="middle">Dynamic Power</td>
                <td valign="middle">
                  <italic>P</italic>                </td>
                <td valign="middle"><italic>P’</italic> = <italic>P</italic> × (<italic>v’</italic>/<italic>v</italic>)<sup>2</sup> × (<italic>f’</italic>/<italic>f</italic>)</td>
              </tr>
            </tbody>
          </table>
		  </table-wrap>
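        <p>The relations in Table 6 translate directly into a short Python helper. This is a minimal sketch; the function name <monospace>predict_next_interval</monospace> and its argument names are hypothetical, and only dynamic power is scaled, as in the table.</p>
        <preformat><![CDATA[
```python
def predict_next_interval(throughput, dynamic_power, v, f, v_next):
    """Apply the Table 6 relations for a switch from mode (v, f) to mode (v', f')."""
    f_next = f * (v_next / v)                        # f' = f (v'/v)
    throughput_next = throughput * (f_next / f)      # T' = T (f'/f)
    power_next = dynamic_power * (v_next / v) ** 2 * (f_next / f)  # P' = P (v'/v)^2 (f'/f)
    return f_next, throughput_next, power_next
```
]]></preformat>
        <p>For example, lowering the voltage to 80% of its current value lowers the frequency and the predicted throughput to 80% as well, while the predicted dynamic power drops to 0.8<sup>3</sup> = 51.2% of its current value.</p>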
        <p>The key idea of DVFS in Chipwide DVFS is to scale the voltages and frequencies of a single core or the entire processor during run-time to achieve specific throughputs while minimizing power dissipation, or to maximize throughput under a power budget. Equation (5) shows the quadratic and linear dependences of dynamic or switching power dissipation on the supply voltage and frequency respectively:</p>
        <disp-formula id="jlpea-02-00030-i003">
		<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-i003.tif"/>
		<label>(5)</label>
		</disp-formula>
        <p>where <italic>α</italic> is the switching probability, C is the total transistor gate (or sink) capacitance of the entire module, <italic>V<sub>dd</sub></italic> is the supply voltage, and <italic>f</italic> is the clock frequency. Note that the system frequency needs to scale along with the voltage to satisfy the timing constraints of the circuit whose delay changes linearly with the operating voltage [<xref ref-type="bibr" rid="B30-jlpea-02-00030">30</xref>]. DVFS algorithms can be implemented at different levels such as the processor micro-architecture (hardware), the operating system scheduler, or the compiler [<xref ref-type="bibr" rid="B32-jlpea-02-00030">32</xref>]. <xref ref-type="fig" rid="jlpea-02-00030-f005">Figure 5</xref> shows a conceptual diagram implementing DVFS on a multi-core processor. Darker shaded regions represent cores operating at high voltage, while lighter shaded regions represent cores operating at low voltage. The unshaded cores are in sleep mode.</p>
        <fig id="jlpea-02-00030-f005" position="anchor">
          <label>Figure 5</label>
          <caption>
            <p>Dynamic voltage and frequency scaling (DVFS) for a multi-core processor.</p>
          </caption>
          <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g005.tif"/>
        </fig>
        <p>Chipwide DVFS is a global power management scheme that monitors the power consumption and performance of the entire chip, and enforces a uniform voltage-frequency operating point for all cores to minimize power dissipation under an overall throughput budget. This approach does not need any individual information about the power and performance of each core, and simply relies on throughput measurements of the entire chip to make power mode switching decisions. As a result, one high-performance core can push the entire chip over the throughput budget, thereby triggering DVFS across all cores on the chip. Scaling down the voltage and frequency of cores that are not exceeding their throughput budgets further reduces their throughputs. This may be undesirable, especially if these cores are running threads from different applications which run at different performance levels. The pseudo-code is shown in <xref ref-type="table" rid="jlpea-02-00030-t007">Table 7</xref>. Cumulative power dissipation is calculated by adding the power dissipation observed in the last evaluation cycle to the total power dissipation of <italic>Core<sub>i</sub></italic> from time <italic>T</italic> = 0. Similarly, cumulative throughput is the total number of instructions committed from time <italic>T</italic> = 0 until now, including the instructions committed in the last evaluation cycle. Also, in this case the current core DVFS level is the same across all the cores.</p>
        <table-wrap id="jlpea-02-00030-t007" position="anchor">
          <object-id pub-id-type="pii">jlpea-02-00030-t007_Table 7</object-id>
          <label>Table 7</label>
          <caption>
            <p>Pseudo Code of Chipwide DVFS.</p>
          </caption>
          <table>
            <tbody>
              <tr>
                <td align="left" valign="middle">/* this algorithm continuously executes once every evaluation cycle */</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Get_current_core_dvfs_level;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">For all <italic>Cores<sub>i</sub></italic> {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Get power dissipated by <italic>Core<sub>i</sub></italic> in the last evaluation cycle; </td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Get effective throughput of 
                  <italic>Core<sub>i</sub></italic> in the last evaluation cycle;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Sum up cumulative power dissipated by all cores in the last evaluation cycle;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Sum up cumulative throughput of all cores in the last evaluation cycle;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">}</td>
              </tr>
              <tr>
                <td align="left" valign="middle">If (Overall throughput of all cores &gt; throughput budget) {</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> if (current_core_dvfs_level &gt; lowest_dvfs_level) {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">      Lower down current_core_dvfs_level to next level;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  }</td>
              </tr>
              <tr>
                <td align="left" valign="middle">}</td>
              </tr>
              <tr>
                <td align="left" valign="middle">For all <italic>Cores<sub>i</sub></italic> {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">   Update every core’s new dvfs level;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">   }</td>
              </tr>
            </tbody>
          </table>
		  </table-wrap>
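        <p>One evaluation cycle of the pseudo-code in Table 7 can be sketched in Python as follows. The sketch makes assumptions that are not in the table: DVFS levels are modelled as integers with 0 as the lowest level, the per-core measurements are passed in as lists, and the function name is hypothetical. Like the pseudo-code, it only ever lowers the level.</p>
        <preformat><![CDATA[
```python
def chipwide_dvfs_step(core_throughputs, core_powers, current_level,
                       throughput_budget, lowest_level=0):
    """One evaluation cycle of the Table 7 loop.

    Returns (new level for every core, total power, total throughput).
    """
    # Sum the measurements of all cores for the last evaluation cycle.
    total_power = sum(core_powers)
    total_throughput = sum(core_throughputs)

    new_level = current_level
    # Step down one level only if the chip exceeds the throughput budget
    # and a lower level exists.
    if total_throughput > throughput_budget and current_level > lowest_level:
        new_level = current_level - 1

    # Every core receives the same level: the scheme is chip-wide.
    return [new_level] * len(core_throughputs), total_power, total_throughput
```
]]></preformat>
        <p>Calling the function with one fast and one slow core shows the drawback discussed above: the fast core alone can push the total over the budget, and the slow core is scaled down with it.</p>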
        <p>The MaxBIPS algorithm [<xref ref-type="bibr" rid="B30-jlpea-02-00030">30</xref>] monitors power consumption and performance at the global level and collects information about the throughput of the entire chip as well as the throughput contributions of the individual cores. The power mode of each core is then selected so as to minimize the power dissipation of the entire chip while maximizing system performance, subject to the given throughput budget. To do so, the algorithm evaluates all possible combinations of power modes across the cores, examining every voltage/frequency pair for each core, and chooses the combination that minimizes the overall power dissipation and maximizes the overall system performance while meeting the throughput budget. The cores are permitted to operate at different voltages and frequencies in the MaxBIPS algorithm. A linear scaling of frequency with voltage is assumed in MaxBIPS [<xref ref-type="bibr" rid="B30-jlpea-02-00030">30</xref>].</p>
        <p>Based on <xref ref-type="table" rid="jlpea-02-00030-t006">Table 6</xref>, the MaxBIPS algorithm predicts the power and throughput for all possible combinations of cores and voltage/frequency modes (<italic>vf</italic>_mode), or scaling factors, and selects the (core_<italic>i</italic>, <italic>vf</italic>_mode_<italic>j</italic>) assignment that minimizes power dissipation and maximizes throughput while meeting the required throughput budget. The pseudo-code of the MaxBIPS algorithm is shown in <xref ref-type="table" rid="jlpea-02-00030-t008">Table 8</xref>. The <italic>Power Mode Combination<sub>i</sub></italic> used in the MaxBIPS algorithm is a lookup table storing all the possible combinations of DVFS levels across the cores. For example, with 4 cores and 3 DVFS levels, as described in <xref ref-type="table" rid="jlpea-02-00030-t006">Table 6</xref>, the lookup table holds 3<sup>4</sup> = 81 combinations, together with the power consumption and throughput predicted for each combination from the values observed until the last evaluation cycle.</p>
        <table-wrap id="jlpea-02-00030-t008" position="anchor">
          <object-id pub-id-type="pii">jlpea-02-00030-t008_Table 8</object-id>
          <label>Table 8</label>
          <caption>
            <p>Pseudo-code of MaxBIPS DVFS.</p>
          </caption>
          <table>
            <tbody>
              <tr>
                <td align="left" valign="middle">/* this algorithm executes once every evaluation cycle */</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Define_power_mode_combinations;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Initialize Min_power;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Initialize Max_throughput;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Initialize Selected_combination;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">--voltage frequency (power mode) combinations for different cores</td>
              </tr>
              <tr>
                <td align="left" valign="middle">For all <italic>Cores<sub>i</sub></italic> {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  dvfsLevel = Get current DVFS level of <italic>Core<sub>i</sub></italic>;</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> Get power dissipated by <italic>Core<sub>i</sub></italic> in the last evaluation cycle; </td>
              </tr>
              <tr>
                <td align="left" valign="middle"> Get effective throughput of <italic>Core<sub>i</sub></italic> in the last evaluation cycle;</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> }</td>
              </tr>
              <tr>
                <td align="left" valign="middle">For all Power_Mode_Combinations<sub>j</sub> {</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> For all <italic>Cores<sub>k</sub></italic> {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  Calculate predicted throughput value of core <italic>k</italic> in combination_<italic>j</italic>;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">   --Using power_mode_combination, Equation (2)</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  Calculate predicted power value of core <italic>k</italic> in combination_<italic>j</italic>;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">   --Using power_mode_combinations, Equation (2)</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  Accumulate predicted throughputs of all cores in combination_<italic>j</italic>;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  Accumulate predicted power dissipations of all cores in combination_<italic>j</italic>;</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> } </td>
              </tr>
              <tr>
                <td align="left" valign="middle"> If (overall_predicted_throughput of all cores &lt;= throughput budget) {</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> If (Max_throughput &lt; overall_predicted_throughput of all cores) {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">     Max_throughput = overall_predicted_throughput of all cores;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Min_power = overall_predicted_power of all cores;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Selected_combination = <italic>j</italic>;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  }</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  If (Max_throughput == overall_predicted_throughput of all cores) {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">   If (Min_power &gt; overall_predicted_power of all cores) {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Min_power = overall_predicted_power of all cores;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">    Selected_combination = <italic>j</italic>;</td>
              </tr>
              <tr>
                <td align="left" valign="middle">   }</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  }</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> } </td>
              </tr>
              <tr>
                <td align="left" valign="middle"> }</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  For all <italic>Cores<sub>i</sub></italic> {</td>
              </tr>
              <tr>
                <td align="left" valign="middle">  Update every core’s new dvfs level with values in Selected_combination;</td>
              </tr>
              <tr>
                <td align="left" valign="middle"> }</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
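        <p>The exhaustive search in the pseudo-code above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the CASPER implementation: the prediction step stands in for Equation (2) and Table 6 with an assumed scaling rule (throughput proportional to frequency, dynamic power proportional to <italic>V</italic><sup>2</sup><italic>f</italic>), the voltage-frequency pairs are those listed in Table 9, and the function names and sample values are hypothetical.</p>
        <preformat><![CDATA[
```python
from itertools import product

# DVFS levels as (voltage in V, frequency in GHz), as listed in Table 9.
DVFS_LEVELS = [(0.85, 0.85), (1.7, 1.7), (1.7, 3.4)]


def predict(observed, old_level, new_level):
    """Rescale one core's observed (power, throughput) to a new DVFS level.

    Assumed model (a stand-in for Equation (2)): throughput scales with f,
    dynamic power scales with V^2 * f.
    """
    power, throughput = observed
    v0, f0 = DVFS_LEVELS[old_level]
    v1, f1 = DVFS_LEVELS[new_level]
    return power * (v1 ** 2 * f1) / (v0 ** 2 * f0), throughput * f1 / f0


def maxbips(observed, current_levels, throughput_budget):
    """Return (combination, power, throughput), or None if nothing fits."""
    best = None
    n_cores = len(observed)
    # Enumerate every assignment of a DVFS level to every core.
    for combo in product(range(len(DVFS_LEVELS)), repeat=n_cores):
        total_power = 0.0
        total_throughput = 0.0
        for core, level in enumerate(combo):
            p, t = predict(observed[core], current_levels[core], level)
            total_power += p
            total_throughput += t
        if total_throughput > throughput_budget:
            continue  # predicted throughput exceeds the budget
        # Prefer higher throughput; break ties with lower power.
        if (best is None
                or total_throughput > best[2]
                or (total_throughput == best[2] and total_power < best[1])):
            best = (combo, total_power, total_throughput)
    return best


# Hypothetical observations: (watts, BIPS) per core, both cores at level 1.
observed = [(2.0, 1.0), (2.0, 1.0)]
print(maxbips(observed, [1, 1], throughput_budget=3.2))
```
]]></preformat>
        <p>Because every combination is enumerated, the search space grows as the number of DVFS levels raised to the number of cores, which is why the pseudo-code keeps the combinations in a precomputed lookup table.</p>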
        <p>The three DVFS levels used in Chipwide DVFS and MaxBIPS DPM are shown in <xref ref-type="table" rid="jlpea-02-00030-t009">Table 9</xref>. These voltage-frequency pairs have been verified using the experimental setup of <xref ref-type="sec" rid="sec4-jlpea-02-00030">Section 4</xref>. Note that the performance predictions of the existing GPMU algorithms discussed in this section do not consider the bottlenecks caused by shared memory access between cores. All synchronization between the cores is resolved in the L2 cache. During the cycle-accurate simulation, the L2 cache accesses from the cores are resolved using arbitration logic and queues in the L2 cache controllers. On an L1 cache miss, packets are sent to the L2 cache, which brings the data back in as a load miss. An increased number of L1 cache misses hence means a longer wait time for the instructions that caused the misses, and effectively we observe the cycles per instruction (CPI) of the core to increase.</p>
        <table-wrap id="jlpea-02-00030-t009" position="anchor">
          <object-id pub-id-type="pii">jlpea-02-00030-t009_Table 9</object-id>
          <label>Table 9</label>
          <caption>
            <p>DVFS Levels used in Chipwide DVFS and MaxBIPS.</p>
          </caption>
          <table rules="all" style="border:solid thin">
            <thead>
              <tr align="center">
                <th valign="middle">DVFS Level ID</th>
                <th valign="middle">Voltage-Frequency Combination</th>
              </tr>
            </thead>
            <tbody>
              <tr align="center">
                <td valign="middle">DVFS_LEVEL_0</td>
                <td valign="middle">0.85 V, 0.85 GHz</td>
              </tr>
              <tr align="center">
                <td valign="middle">DVFS_LEVEL_1</td>
                <td valign="middle">1.7 V, 1.7 GHz</td>
              </tr>
              <tr align="center">
                <td valign="middle">DVFS_LEVEL_2</td>
                <td valign="middle">1.7 V, 3.4 GHz</td>
              </tr>
            </tbody>
          </table>
		  </table-wrap>
      </sec>
    </sec>
    <sec id="sec3-jlpea-02-00030">
      <title>3. Embedded Processor Benchmark (ENePBench)</title>
      <fig id="jlpea-02-00030-f006" position="anchor">
        <label>Figure 6</label>
        <caption>
          <p>Pictorial representation of IP packet header and payload processing in two packet instances of different types.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g006.tif"/>
      </fig>
      
      <p>To evaluate the performance and power dissipation of candidate designs, we have developed a benchmark suite called <italic>Embedded Network Packet Processing Benchmark</italic> (ENePBench), which emulates the IP packet processing tasks executed in a network router. The router workload varies with internet usage: a random number of IP packets arrive at random intervals. To meet a target bandwidth, the router has to: (i) process a required number of packets per second; and (ii) process individual packets within their latency constraints. The task flow is described in <xref ref-type="fig" rid="jlpea-02-00030-f006">Figure 6</xref>. Incoming IPv6 packets are scheduled on the processing cores of the NeP based on their packet types and priorities. Depending on the packet type, different header and payload processing functions are applied to the header and payload of the packet, respectively. Processed packets are either routed towards the outward queues (in the case of pass-through packets) or terminated.</p>
      <p>The packet processing functions of ENePBench are adapted from CommBench 0.5 [<xref ref-type="bibr" rid="B33-jlpea-02-00030">33</xref>]. The routing table lookup function (RTR), the packet fragmentation function (FRAG) and the traffic monitoring function (TCP) constitute the packet header functions. The packet payload processing functions include encryption (CAST), error detection (REED) and JPEG encoding and decoding, as shown in <xref ref-type="table" rid="jlpea-02-00030-t010">Table 10</xref>.</p>
      <table-wrap id="jlpea-02-00030-t010" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t010_Table 10</object-id>
        <label>Table 10</label>
        <caption>
          <p>Packet processing functions in ENePBench.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th valign="middle">Function Type</th>
              <th align="center" valign="middle">Function Name</th>
              <th valign="middle">Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td rowspan="3" align="left" valign="middle">Header Processing Functions</td>
              <td align="center" valign="middle">
                <bold>RTR</bold>
              </td>
              <td align="left" valign="middle">A Radix-Tree routing table lookup program </td>
            </tr>
            <tr>
              <td align="center" valign="middle">
                <bold>FRAG</bold>
              </td>
              <td align="left" valign="middle">An IP packet fragmentation code</td>
            </tr>
            <tr>
              <td align="center" valign="middle">
                <bold>TCP</bold>
              </td>
              <td align="left" valign="middle">A traffic monitoring application</td>
            </tr>
            <tr>
              <td rowspan="3" align="left" valign="middle">Payload Processing Functions</td>
              <td align="center" valign="middle">
                <bold>CAST</bold>
              </td>
              <td align="left" valign="middle">A 128 bit block cipher algorithm</td>
            </tr>
            <tr>
              <td align="center" valign="middle">
                <bold>REED</bold>
              </td>
              <td align="left" valign="middle">An implementation of Reed-Solomon Forward Error Correction scheme</td>
            </tr>
            <tr>
              <td align="center" valign="middle">
                <bold>JPEG</bold>
              </td>
              <td align="left" valign="middle">A lossy image data compression algorithm</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Packet Scheduler</td>
              <td align="center" valign="middle">
                <bold>DRR</bold>
              </td>
              <td align="left" valign="middle">Deficit Round Robin fair scheduling algorithm</td>
            </tr>
          </tbody>
        </table>
		</table-wrap>
      <p>Functionally, IP packets are further classified into five types, TYPE0 to TYPE4, as shown in <xref ref-type="table" rid="jlpea-02-00030-t011">Table 11</xref>. The headers of all packets belonging to packet types TYPE0 to TYPE4 are used to look up the IP routing table (RTR), manage packet fragmentation (FRAG) and monitor traffic (TCP). The payload processing, however, differs between packet types. Packet types TYPE0, TYPE1 and TYPE2 are compute bound and are processed with encryption and/or error detection functions. For packet types TYPE3 and TYPE4, the payloads are processed with the compute bound functions listed in <xref ref-type="table" rid="jlpea-02-00030-t011">Table 11</xref> as well as the data bound JPEG encoding/decoding functions.</p>
      <table-wrap id="jlpea-02-00030-t011" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t011_Table 11</object-id>
        <label>Table 11</label>
        <caption>
          <p>Packet Types used in ENePBench.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th valign="middle">Packet Type</th>
              <th valign="middle">Header Functions</th>
              <th valign="middle">Data Functions</th>
              <th valign="middle">Characteristic</th>
              <th valign="middle">Type of Service</th>
            </tr>
          </thead>
          <tbody>
            <tr align="center">
              <td valign="middle">TYPE0</td>
              <td valign="middle">RTR, FRAG, TCP</td>
              <td valign="middle">REED</td>
              <td valign="middle">Compute Bound</td>
              <td valign="middle">Real Time</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE1</td>
              <td valign="middle">RTR, FRAG, TCP</td>
              <td valign="middle">CAST</td>
              <td valign="middle">Compute Bound</td>
              <td valign="middle">Real Time</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE2</td>
              <td valign="middle">RTR, FRAG, TCP</td>
              <td valign="middle">CAST, REED</td>
              <td valign="middle">Compute Bound</td>
              <td valign="middle">Content-Delivery</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE3</td>
              <td valign="middle">RTR, FRAG, TCP</td>
              <td valign="middle">REED, JPEG</td>
              <td valign="middle">Data Bound</td>
              <td valign="middle">Content-Delivery</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE4</td>
              <td valign="middle">RTR, FRAG, TCP</td>
              <td valign="middle">CAST, REED, JPEG</td>
              <td valign="middle">Data Bound</td>
              <td valign="middle">Content-Delivery</td>
            </tr>
          </tbody>
        </table>
		</table-wrap>
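      <p>For readers who want to script against the classification in Table 11, the table can be expressed as a small lookup structure. The sketch below is illustrative only: the dictionary layout and the function name are assumptions, and the order in which the functions are applied is assumed to follow the order in which the table lists them.</p>
      <preformat><![CDATA[
```python
# Header functions shared by all packet types (Table 11).
HEADER_FUNCTIONS = ("RTR", "FRAG", "TCP")

# Payload functions, characteristic and type of service per packet type (Table 11).
PACKET_TYPES = {
    "TYPE0": {"payload": ("REED",), "bound": "compute", "service": "real-time"},
    "TYPE1": {"payload": ("CAST",), "bound": "compute", "service": "real-time"},
    "TYPE2": {"payload": ("CAST", "REED"), "bound": "compute",
              "service": "content-delivery"},
    "TYPE3": {"payload": ("REED", "JPEG"), "bound": "data",
              "service": "content-delivery"},
    "TYPE4": {"payload": ("CAST", "REED", "JPEG"), "bound": "data",
              "service": "content-delivery"},
}


def processing_chain(packet_type):
    """Return the header functions followed by the payload functions."""
    return HEADER_FUNCTIONS + PACKET_TYPES[packet_type]["payload"]


print(processing_chain("TYPE3"))
```
]]></preformat>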
      <p>The two broad categories of IP packets are hard real-time packets, termed <italic>real-time</italic> packets, and soft real-time packets, termed <italic>content-delivery</italic> packets. <italic>Real-time</italic> packets are assigned high priority, whereas <italic>content-delivery</italic> packets are processed with lower priority. <xref ref-type="table" rid="jlpea-02-00030-t012">Table 12</xref> lists the end-to-end transmission delays associated with each packet category [<xref ref-type="bibr" rid="B34-jlpea-02-00030">34</xref>]. The total propagation delay (source to destination) is less than 150 milliseconds (ms) for <italic>real-time</italic> packets and less than 10 s for <italic>content-delivery</italic> packets [<xref ref-type="bibr" rid="B34-jlpea-02-00030">34</xref>].</p>
      <table-wrap id="jlpea-02-00030-t012" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t012_Table 12</object-id>
        <label>Table 12</label>
        <caption>
          <p>Performance Targets for IP packet types.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th valign="middle">Application/Packet Type</th>
              <th valign="middle">Data Rate</th>
              <th valign="middle">Size</th>
              <th valign="middle">End-to-end Delay</th>
              <th valign="middle">Description</th>
            </tr>
          </thead>
          <tbody>
            <tr align="center">
              <td valign="middle">Audio</td>
              <td valign="middle">4–64 (KB/s)</td>
              <td valign="middle">&lt;1 KB</td>
              <td valign="middle">&lt;150 ms</td>
              <td valign="middle">Conversational Audio</td>
            </tr>
            <tr align="center">
              <td valign="middle">Video</td>
              <td valign="middle">16–384 (KB/s)</td>
              <td valign="middle">~10 KB</td>
              <td valign="middle">&lt;150 ms</td>
              <td valign="middle">Interactive video</td>
            </tr>
            <tr align="center">
              <td valign="middle">Data</td>
              <td valign="middle">-</td>
              <td valign="middle">~10 KB</td>
              <td valign="middle">&lt;250 ms</td>
              <td valign="middle">Bulk data</td>
            </tr>
            <tr align="center">
              <td valign="middle">Still Image</td>
              <td valign="middle">-</td>
              <td valign="middle">&lt;100 KB</td>
              <td valign="middle">&lt;10 s</td>
              <td valign="middle">Images/Movie clips</td>
            </tr>
          </tbody>
        </table>
		</table-wrap>
      <p>In practice, 10 to 15 hops are allowed per packet, which means that the <italic>worst case processing time</italic> per intermediate router is approximately 10 ms for real-time packets and 1000 ms for content-delivery packets [<xref ref-type="bibr" rid="B34-jlpea-02-00030">34</xref>]. For each packet, the <italic>worst case processing time</italic> in a router includes the wait time in the incoming packet queue, the packet header and payload processing time, and the wait time in the output queues [<xref ref-type="bibr" rid="B35-jlpea-02-00030">35</xref>]. Traditionally, schedulers in NePs snoop on the incoming packet queues and, upon packet arrival, generate <italic>interrupts</italic> to the processing cores. A <italic>context switch</italic> mechanism is subsequently used to dispatch packets to the individual cores for further processing. Current systems, however, use a switching mechanism to move packets directly from the incoming packet queues to the cores, avoiding expensive signal interrupts [<xref ref-type="bibr" rid="B36-jlpea-02-00030">36</xref>]. Hence, we do not include interrupt generation and context switch time when calculating the <italic>worst case processing time</italic>. Also, because propagation time is low in current high bandwidth optical fiber networks, we ignore the propagation time of packets through the network wires [<xref ref-type="bibr" rid="B37-jlpea-02-00030">37</xref>]. In our methodology, the individual cores are designed so that they can process packets within the <italic>worst case processing time</italic>.</p>
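      <p>The per-router figures can be checked with a short calculation. The sketch below assumes that the end-to-end delay bound is divided evenly across the hops, which is an assumption made for illustration; the function and variable names are hypothetical. Under that assumption, 150 ms over 15 hops gives 10 ms for real-time packets and 10 s over 10 hops gives 1000 ms for content-delivery packets, while 10 s over 15 hops would give roughly 667 ms.</p>
      <preformat><![CDATA[
```python
def per_hop_budget_ms(end_to_end_ms, hops):
    """Split an end-to-end delay bound evenly across the routers on the path."""
    return end_to_end_ms / hops


# End-to-end delay bounds in milliseconds for the two packet categories.
END_TO_END_MS = {"real-time": 150, "content-delivery": 10_000}

for category, bound in END_TO_END_MS.items():
    for hops in (10, 15):
        print(category, hops, round(per_hop_budget_ms(bound, hops), 1))
```
]]></preformat>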
    </sec>
    <sec id="sec4-jlpea-02-00030">
      <title>4. Verification of CASPER</title>
      <p>The functional correctness of candidate designs simulated in CASPER is verified using a set of diagnostic codes designed to test all possible instruction and data paths in the pipeline stages of a core. An additional set of diagnostic codes, written in SPARC V9 assembly, consists of random combinations of instructions chosen so that system events such as traps, store buffer full and others are also asserted. To further verify the accuracy of CASPER, we compared the total number of system events generated while processing 10 IP packets of ENePBench on a real UltraSPARC T1000 machine with an UltraSPARC T1 processor (T1) [<xref ref-type="bibr" rid="B19-jlpea-02-00030">19</xref>] against an exact UltraSPARC T1 prototype (T1_V) simulated in CASPER. The UltraSPARC T1 is the closest in-order CMT variant to the CMT designs modeled in CASPER and consists of 8 cores with 4 hardware threads per core. The simulated processor had the same number of cores, hardware threads per core, and L1 and L2 caches as T1. Our results are tabulated in <xref ref-type="table" rid="jlpea-02-00030-t013">Table 13</xref>. For each packet type, the T1 and T1_V columns of <xref ref-type="table" rid="jlpea-02-00030-t013">Table 13</xref> compare the number of instructions committed, store buffer full events, I$ misses and D$ misses, respectively. The last column shows that, on average, the error in the number of system events is less than 10%.</p>
      <table-wrap id="jlpea-02-00030-t013" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t013_Table 13</object-id>
        <label>Table 13</label>
        <caption>
          <p>Comparison between number of system events for 10 IP packets in (i) T1000 server with an UltraSPARC T1 processor and (ii) a T1 prototype simulated in CASPER.</p>
        </caption>
       <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th rowspan="2" valign="middle">Packet Type</th>
              <th rowspan="2" valign="middle">Clock Ticks (in 10<sup>6</sup>)</th>
              <th colspan="2" valign="middle">Instr_cnt (in 10<sup>6</sup>)</th>
              <th colspan="2" valign="middle">SB_full (in 10<sup>3</sup>)</th>
              <th colspan="2" valign="middle">IC_misses (in 10<sup>3</sup>)</th>
              <th colspan="2" valign="middle">DC_misses (in 10<sup>3</sup>)</th>
              <th rowspan="2" valign="middle">Avg. Error (%)</th>
            </tr>
            <tr>
              <th valign="middle">T1</th>
              <th valign="middle">T1_V</th>
              <th valign="middle">T1</th>
              <th valign="middle">T1_V</th>
              <th valign="middle">T1</th>
              <th valign="middle">T1_V</th>
              <th valign="middle">T1</th>
              <th valign="middle">T1_V</th>
      </tr>
          </thead>
          <tbody>
            <tr align="center">
              <td valign="middle">TYPE0</td>
              <td valign="middle">0.674</td>
              <td valign="middle">0.255</td>
              <td valign="middle">0.255</td>
              <td valign="middle">5.0</td>
              <td valign="middle">4.9</td>
              <td valign="middle">2.6</td>
              <td valign="middle">2.6</td>
              <td valign="middle">1.56</td>
              <td valign="middle">1.59</td>
              <td valign="middle">2.01</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE1</td>
              <td valign="middle">0.673</td>
              <td valign="middle">0.254</td>
              <td valign="middle">0.254</td>
              <td valign="middle">5.4</td>
              <td valign="middle">5.6</td>
              <td valign="middle">2.5</td>
              <td valign="middle">2.4</td>
              <td valign="middle">1.50</td>
              <td valign="middle">1.6</td>
              <td valign="middle">7.35</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE2</td>
              <td valign="middle">0.612</td>
              <td valign="middle">0.26</td>
              <td valign="middle">0.258</td>
              <td valign="middle">5.1</td>
              <td valign="middle">5.2</td>
              <td valign="middle">2.6</td>
              <td valign="middle">2.5</td>
              <td valign="middle">1.51</td>
              <td valign="middle">1.52</td>
              <td valign="middle">4.0</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE3</td>
              <td valign="middle">2.257</td>
              <td valign="middle">0.90</td>
              <td valign="middle">0.892</td>
              <td valign="middle">12.9</td>
              <td valign="middle">12.7</td>
              <td valign="middle">3.5</td>
              <td valign="middle">3.9</td>
              <td valign="middle">6.84</td>
              <td valign="middle">6.84</td>
              <td valign="middle">5.7</td>
            </tr>
            <tr align="center">
              <td valign="middle">TYPE4</td>
              <td valign="middle">2.259</td>
              <td valign="middle">0.94</td>
              <td valign="middle">0.896</td>
              <td valign="middle">18.9</td>
              <td valign="middle">17.1</td>
              <td valign="middle">3.5</td>
              <td valign="middle">3.6</td>
              <td valign="middle">6.89</td>
              <td valign="middle">6.89</td>
              <td valign="middle">9.5</td>
            </tr>
          </tbody>
        </table></table-wrap>
    </sec>
    <sec sec-type="results" id="sec5-jlpea-02-00030">
      <title>5. Results</title>
      <p>The power dissipation and throughput observed by varying the key micro-architectural components, namely the number of threads per core, the data and instruction cache sizes per core, the store buffer size per thread and the number of cores in the chip, are shown in <xref ref-type="fig" rid="jlpea-02-00030-f007">Figure 7</xref> to <xref ref-type="fig" rid="jlpea-02-00030-f021">Figure 21</xref>. Note that Hardware Power Management is not enabled for the experiments that generated the data shown in <xref ref-type="fig" rid="jlpea-02-00030-f007">Figure 7</xref> through <xref ref-type="fig" rid="jlpea-02-00030-f021">Figure 21</xref>. In each figure, power-performance trade-offs are shown by co-varying two micro-architectural parameters while the other parameters are held constant at the values of the baseline architecture shown in <xref ref-type="table" rid="jlpea-02-00030-t014">Table 14</xref>. Cycles per instruction per core, or CPI-per-core (lower is better), is measured as the total number of clock cycles during the runtime of a workload divided by the total number of instructions committed across all hardware threads during that time. Average cycles per instruction per thread, or CPI-per-thread, is measured as the number of clock cycles divided by the number of instructions committed in a hardware thread during the same time. All the data is based on the execution of the compute bound packet type 1 (TYPE1) described in <xref ref-type="table" rid="jlpea-02-00030-t011">Table 11</xref>.</p>
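      <p>The two CPI metrics defined above can be illustrated with a short sketch. The function names and the instruction counts are hypothetical, and averaging the per-thread values with an arithmetic mean is an assumption about how the average CPI-per-thread is formed.</p>
      <preformat><![CDATA[
```python
def cpi_per_core(clock_cycles, committed_per_thread):
    """Clock cycles divided by instructions committed across all threads."""
    return clock_cycles / sum(committed_per_thread)


def cpi_per_thread(clock_cycles, committed_per_thread):
    """Clock cycles divided by the instructions committed in each thread."""
    return [clock_cycles / n for n in committed_per_thread]


def average_cpi_per_thread(clock_cycles, committed_per_thread):
    """Arithmetic mean of the per-thread CPI values (assumed averaging)."""
    values = cpi_per_thread(clock_cycles, committed_per_thread)
    return sum(values) / len(values)


# Hypothetical run: 1,000,000 cycles on a core with 4 hardware threads.
cycles = 1_000_000
committed = [250_000, 200_000, 125_000, 100_000]
print(cpi_per_core(cycles, committed))
print(cpi_per_thread(cycles, committed))
print(average_cpi_per_thread(cycles, committed))
```
]]></preformat>
      <p>Because the core interleaves instructions from all of its threads, CPI-per-core is never higher than the CPI of any single thread over the same interval.</p>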
      <fig id="jlpea-02-00030-f007" position="anchor">
        <label>Figure 7</label>
        <caption>
          <p>Power dissipation <italic>versus</italic> CPI-per-thread in a 1-core 1-thread 1 MB L2 cache processor where the data cache size is varied from 4 KB to 64 KB for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g007.tif"/>
      </fig>
       <fig id="jlpea-02-00030-f008" position="anchor">
        <label>Figure 8</label>
        <caption>
          <p>Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 4-thread 1 MB L2 cache processor where the data cache size is varied from 4 KB to 64 KB for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g008.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f009" position="anchor">
        <label>Figure 9</label>
        <caption>
          <p>Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 8-thread 1 MB L2 cache processor where the data cache size is varied from 4 KB to 64 KB for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g009.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f010" position="anchor">
        <label>Figure 10</label>
        <caption>
          <p>Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 1-thread 1 MB L2 cache processor where the instruction cache size is varied from 4 KB to 64 KB for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g010.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f011" position="anchor">
        <label>Figure 11</label>
        <caption>
          <p>Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 4-thread 1 MB L2 cache processor where the instruction cache size is varied from 4 KB to 64 KB for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g011.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f012" position="anchor">
        <label>Figure 12</label>
        <caption>
          <p>Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 8-thread 1 MB L2 cache processor where the instruction cache size is varied from 4 KB to 64 KB for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g012.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f013" position="anchor">
        <label>Figure 13</label>
        <caption>
          <p>Power dissipation and CPI-per-core in a 1-core 1-thread 1 MB L2 cache processor where the store buffer size is varied from 4 to 16 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g013.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f014" position="anchor">
        <label>Figure 14</label>
        <caption>
          <p>Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 4-thread 1 MB L2 cache processor where the store buffer size is varied from 4 to 16 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g014.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f015" position="anchor">
        <label>Figure 15</label>
        <caption>
          <p>Power dissipation, CPI-per-core and CPI-per-thread in a 1-core 8-thread 1 MB L2 cache processor where the store buffer size is varied from 4 to 16 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g015.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f016" position="anchor">
        <label>Figure 16</label>
        <caption>
          <p>Power dissipation and overall CPI trade-offs as number of cores is scaled from 4 to 128. All the cores have <italic>N<sub>T</sub></italic> = 1 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g016.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f017" position="anchor">
        <label>Figure 17</label>
        <caption>
          <p>Power dissipation and packet bandwidth as number of cores is scaled from 4 to 128. All the cores have <italic>N<sub>T</sub></italic> = 1 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g017.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f018" position="anchor">
        <label>Figure 18</label>
        <caption>
          <p>Power dissipation and CPI as number of cores is scaled from 4 to 128. All the cores have <italic>N<sub>T</sub></italic> = 4 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g018.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f019" position="anchor">
        <label>Figure 19</label>
        <caption>
          <p>Power dissipation and packet bandwidth as number of cores is scaled from 4 to 128. All the cores have <italic>N<sub>T</sub></italic> = 4 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g019.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f020" position="anchor">
        <label>Figure 20</label>
        <caption>
          <p>Power dissipation and CPI as number of cores is scaled from 4 to 128. All the cores have <italic>N<sub>T</sub></italic> = 8 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g020.tif"/>
      </fig>
      
	  <fig id="jlpea-02-00030-f021" position="anchor">
        <label>Figure 21</label>
        <caption>
          <p>Power dissipation and packet bandwidth as number of cores is scaled from 4 to 128. All the cores have <italic>N<sub>T</sub></italic> = 8 for packet type TYPE1.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g021.tif"/>
      </fig>
      
	  <table-wrap id="jlpea-02-00030-t014" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t014_Table 14</object-id>
        <label>Table 14</label>
        <caption>
          <p>Baseline architecture for packet type TYPE1 to study the power-performance trade-offs in single-core designs.</p>
        </caption>
         <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th valign="middle">Field</th>
              <th align="center" valign="middle">Value</th>
    </tr>
          </thead>
          <tbody>
            <tr>
              <td align="left" valign="middle">Number of threads</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Data cache size</td>
              <td align="center" valign="middle">8 KB</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Data cache associativity</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Data cache line size</td>
              <td align="center" valign="middle">32</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Instruction cache size</td>
              <td align="center" valign="middle">16 KB</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Instruction cache associativity</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Instruction cache line size</td>
              <td align="center" valign="middle">32</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Store Buffer size</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Load Miss Queue size</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">ASI Queue size</td>
              <td align="center" valign="middle">2</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache size</td>
              <td align="center" valign="middle">4 MB</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache banks</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache associativity</td>
              <td align="center" valign="middle">16</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache line size</td>
              <td align="center" valign="middle">64</td>
            </tr>
          </tbody>
        </table>
		</table-wrap>
      <p><xref ref-type="fig" rid="jlpea-02-00030-f007">Figure 7</xref>, <xref ref-type="fig" rid="jlpea-02-00030-f008">Figure 8</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f009">Figure 9</xref> show the power dissipation, CPI-per-core and average CPI-per-thread for <italic>N<sub>T</sub></italic> = 1, 4 and 8, respectively, as the data cache size is scaled from 4 KB to 64 KB. In <xref ref-type="fig" rid="jlpea-02-00030-f007">Figure 7</xref>, the increase in D$ size reduces the data miss rate, and hence both CPI-per-thread and CPI-per-core improve. Power dissipation, however, increases with D$ size. The figures also demonstrate the trade-offs between performance and power when the number of threads is scaled from 1 to 8. Increasing the number of threads in a core slows each individual thread by as many cycles as there are threads, because of the round-robin small-latency thread scheduling scheme. CPI-per-core, however, does not depend linearly on the number of threads. While factors such as increased cache sharing, increased pipeline sharing and fewer pipeline stalls improve CPI-per-core with thread scaling, factors such as increased stall time at the store buffer, instruction miss queue and load miss queues tend to degrade it. Hence, we clearly see a non-linear pattern in which CPI-per-core is higher for <italic>N<sub>T</sub></italic> = 4 than for <italic>N<sub>T</sub></italic> = 8. For <italic>N<sub>T</sub></italic> = 8, we observe a 10% decrease in cache misses, which results in a lower CPI-per-core than for <italic>N<sub>T</sub></italic> = 4. However, this is not the case when <italic>N<sub>T</sub></italic> is increased from 1 to 4. It is important to note that this behavior is application specific and hence reestablishes the non-linear co-dependencies between performance and the structure and behavior of the micro-architectural components. 
<xref ref-type="fig" rid="jlpea-02-00030-f010">Figure 10</xref>, <xref ref-type="fig" rid="jlpea-02-00030-f011">Figure 11</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f012">Figure 12</xref> show the power dissipation, CPI-per-core and average CPI-per-thread for <italic>N<sub>T</sub></italic> = 1, 4 and 8, respectively, as the instruction cache size is scaled from 4 KB to 64 KB. Here also, CPI-per-thread and CPI-per-core improve with increasing I$ size as instruction misses decrease. Power dissipation, however, increases with I$ size. <xref ref-type="fig" rid="jlpea-02-00030-f013">Figure 13</xref>, <xref ref-type="fig" rid="jlpea-02-00030-f014">Figure 14</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f015">Figure 15</xref> show the power dissipation, CPI-per-core and average CPI-per-thread for store buffer sizes of 4 to 16 for <italic>N<sub>T</sub></italic> = 1, 4 and 8, respectively. We observe a similar increase in power consumption with increasing store buffer size, while both CPI-per-core and CPI-per-thread improve. In all these figures, the co-variance of the core-level micro-architectural parameters D$ size, I$ size and SB size clearly demonstrates both the negative and the positive effects of <italic>N<sub>T</sub></italic> scaling. Power dissipation increases with <italic>N<sub>T</sub></italic>. CPI-per-thread increases prohibitively, hurting single-thread performance because the pipeline is shared, whereas CPI-per-core decreases, showing improved throughput in a core as more instructions are executed thanks to latency hiding.</p>
     <p><xref ref-type="fig" rid="jlpea-02-00030-f016">Figure 16</xref>, <xref ref-type="fig" rid="jlpea-02-00030-f018">Figure 18</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f020">Figure 20</xref> show the power dissipation and overall CPI observed for <italic>N<sub>T</sub></italic> = 1, 4 and 8, respectively. <xref ref-type="fig" rid="jlpea-02-00030-f017">Figure 17</xref>, <xref ref-type="fig" rid="jlpea-02-00030-f019">Figure 19</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f021">Figure 21</xref> show the peak power dissipation <italic>versus</italic> overall packet bandwidth observed for <italic>N<sub>T</sub></italic> = 1, 4 and 8, respectively. Unlike the figures reporting core-level power dissipation and throughput, the overall power dissipation reported in these figures includes the cycle-accurate dynamic power consumption of the entire chip, including all the cores, the L2 cache and the crossbar interconnection. Interestingly, despite the consistent increase of peak power dissipation with the number of cores in the chip, packet bandwidth does not scale with the number of cores because of contention in the shared L2 cache. The adverse effects of non-optimality are most visible at 128 cores: packet bandwidth counter-intuitively decreases as the number of cores is scaled from 64 to 128. This further emphasizes the critical need for efficient and scalable micro-architectural power-aware design space exploration algorithms that can scan a wide range of possible design choices and find the optimal power-performance balance. In our case, 32 cores is observed to be the optimal design since it shows the best power-performance balance. As shown in <xref ref-type="fig" rid="jlpea-02-00030-f020">Figure 20</xref>, for packet type TYPE1 with 8 threads per core, we observed that both cache misses and pipeline stalls decrease, minimizing the CPI per core. 
In addition, with 32 cores, the wait time in the L2 cache queues was also the lowest among the core counts studied. Hence, in our case study of packet type TYPE1, we found that with <italic>N<sub>T</sub></italic> = 8, the optimal number of cores was 32. Increasing the number of cores degrades throughput because of the non-optimal L2 cache micro-architecture, which is divided into only 4 banks. Increasing the number of L2 cache banks should mitigate contention and help increase packet bandwidth.</p>
	<p>In <xref ref-type="fig" rid="jlpea-02-00030-f022">Figure 22</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f023">Figure 23</xref>, we show the power and throughput data (with a throughput budget constrained to 90% of peak throughput with any voltage and frequency scaling) for the Chipwide DVFS and MaxBIPS policies for packet type 3 (TYPE3), which is representative of all other packet types. The baseline architecture is displayed in <xref ref-type="table" rid="jlpea-02-00030-t015">Table 15</xref>. Values on the X-axis correspond to the number of evaluation cycles, where one evaluation cycle is the time period between consecutive runs of the power management algorithms. Where not explicitly stated, one evaluation cycle corresponds to 1024 processor clock cycles in our simulations. In <xref ref-type="fig" rid="jlpea-02-00030-f022">Figure 22</xref>, the X-axis represents the number of clock cycles and the Y-axis represents power (W). In <xref ref-type="fig" rid="jlpea-02-00030-f023">Figure 23</xref>, the Y-axis represents throughput in instructions per nanosecond (IPnS).</p>
      <fig id="jlpea-02-00030-f022" position="anchor">
        <label>Figure 22</label>
        <caption>
          <p>Power for Chipwide DVFS and MaxBIPS.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g022.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f023" position="anchor">
        <label>Figure 23</label>
        <caption>
          <p>Throughput for Chipwide DVFS and MaxBIPS.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g023.tif"/>
      </fig>
      
	  <table-wrap id="jlpea-02-00030-t015" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t015_Table 15</object-id>
        <label>Table 15</label>
        <caption>
          <p>Baseline architecture used in the experiments for power management.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th valign="middle">Field</th>
              <th align="center" valign="middle">Value</th>
    </tr>
          </thead>
          <tbody>
            <tr>
              <td align="left" valign="middle">Number of threads</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Data cache size</td>
              <td align="center" valign="middle">8 KB</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Data cache associativity</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Data cache line size</td>
              <td align="center" valign="middle">32</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Instruction cache size</td>
              <td align="center" valign="middle">16 KB</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Instruction cache associativity</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Instruction cache line size</td>
              <td align="center" valign="middle">32</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Store Buffer size</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Load Miss Queue size</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">ASI Queue size</td>
              <td align="center" valign="middle">2</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Number of Cores</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache size</td>
              <td align="center" valign="middle">4 MB</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache banks</td>
              <td align="center" valign="middle">4</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache associativity</td>
              <td align="center" valign="middle">16</td>
            </tr>
            <tr>
              <td align="left" valign="middle">L2 cache line size</td>
              <td align="center" valign="middle">64</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As <xref ref-type="fig" rid="jlpea-02-00030-f022">Figure 22</xref> shows, the power consumption of MaxBIPS is much higher than that of Chipwide DVFS, which has the lower power dissipation of the two methods. However, the throughput of MaxBIPS is also higher than that of Chipwide DVFS. The percentage power saving across all the cores for Chipwide DVFS and MaxBIPS is shown in <xref ref-type="table" rid="jlpea-02-00030-t016">Table 16</xref>.</p>
      <table-wrap id="jlpea-02-00030-t016" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t016_Table 16</object-id>
        <label>Table 16</label>
        <caption>
          <p>Percentage power-saving due to Chipwide DVFS and MaxBIPS.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th valign="middle"> </th>
              <th valign="middle">All Cores Running at 3.4 GHz</th>
              <th valign="middle">Chipwide DVFS</th>
              <th valign="middle">MaxBIPS</th>
            </tr>
          </thead>
          <tbody>
            <tr align="center">
              <td valign="middle">Power saving (%)</td>
              <td valign="middle">0.0</td>
              <td valign="middle">35.9</td>
              <td valign="middle">26.2</td>
            </tr>
          </tbody>
        </table></table-wrap>
      <p><xref ref-type="fig" rid="jlpea-02-00030-f024">Figure 24</xref> depicts the throughput per unit power (T/P) data for the two methods. Chipwide DVFS has the highest T/P values for the different packet types. Note that the high T/P value for Chipwide DVFS arises because power dissipation in this scheme is substantially lower than in the other scheme, and not because its throughput is high. When power management is implemented by Chipwide DVFS, any increase in the throughput of a single core over a target threshold triggers chipwide operating voltage (and hence frequency) reductions in all cores to save power. Hence, once the overall throughput exceeds the budget, all the cores have to adjust their power modes to a lower level. While this method reduces the overall power dissipation substantially, it also leads to excessive performance reductions in all cores, as shown in <xref ref-type="fig" rid="jlpea-02-00030-f024">Figure 24</xref>.</p>
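      <p>The chipwide step-down behavior described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the CASPER hardware implementation: the function name <monospace>chipwide_dvfs_step</monospace>, the list <monospace>VF_LEVELS_GHZ</monospace> and every frequency level below 3.4 GHz are hypothetical, and the sketch models only the step down that occurs when the overall throughput exceeds the budget.</p>
      <preformat>
```python
# Hypothetical voltage-frequency levels in GHz. Only the 3.4 GHz top level
# appears in the text; the lower levels are illustrative assumptions.
VF_LEVELS_GHZ = [3.4, 2.8, 2.2, 1.6]


def chipwide_dvfs_step(level, core_throughputs, budget):
    """Return the chipwide level index to use in the next evaluation cycle.

    level: index into VF_LEVELS_GHZ shared by every core (0 is fastest).
    core_throughputs: per-core throughput measured in this evaluation cycle.
    budget: overall throughput budget for the whole chip (same unit).
    """
    overall = sum(core_throughputs)
    lowest = len(VF_LEVELS_GHZ) - 1
    if overall > budget:
        # Budget exceeded: all cores move to the next lower level together.
        return min(level + 1, lowest)
    # Otherwise this simplified sketch keeps the current level.
    return level


# One evaluation cycle on a 4-core chip with an assumed budget of 10 IPnS.
next_level = chipwide_dvfs_step(0, [3.0, 3.0, 3.0, 3.0], 10.0)
print(VF_LEVELS_GHZ[next_level])
```
</preformat>
      <p>Because a single level index is shared by all cores, one busy core is enough to push the whole chip to a lower level, which is consistent with the chipwide performance reduction noted above.</p>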
      <fig id="jlpea-02-00030-f024" position="anchor">
        <label>Figure 24</label>
        <caption>
          <p>Throughput per unit power data.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g024.tif"/>
      </fig>
      <p>A modification of the Chipwide DVFS algorithm required for achieving high performance is to assign a lower bound on throughput. <xref ref-type="fig" rid="jlpea-02-00030-f025">Figure 25</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f026">Figure 26</xref> show the power and throughput characteristics of Chipwide DVFS (with the lower bound of the throughput budget constrained to 60% of peak throughput with all voltage-frequency levels) for packet type 3 (TYPE3). The power consumption and throughput of Chipwide DVFS are higher than those of MaxBIPS; this can be explained by the fact that, in order to guarantee system performance, the lower bound on throughput does not allow Chipwide DVFS to scale all the cores to lower voltage-frequency levels. However, the throughput per unit power of Chipwide DVFS is lower than that of MaxBIPS, as <xref ref-type="fig" rid="jlpea-02-00030-f027">Figure 27</xref> demonstrates. The percentage power saving for Chipwide DVFS with and without a lower bound on throughput is shown in <xref ref-type="table" rid="jlpea-02-00030-t017">Table 17</xref>. <xref ref-type="table" rid="jlpea-02-00030-t018">Table 18</xref> shows the power, throughput and throughput per unit power in this case.</p>
      <fig id="jlpea-02-00030-f025" position="anchor">
        <label>Figure 25</label>
        <caption>
          <p>Power for Chipwide DVFS and MaxBIPS with a lower bound of throughput budget = 60% peak throughput.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g025.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f026" position="anchor">
        <label>Figure 26</label>
        <caption>
          <p>Throughput for Chipwide DVFS and MaxBIPS with a lower bound of throughput budget = 60% peak throughput.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g026.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f027" position="anchor">
        <label>Figure 27</label>
        <caption>
          <p>Throughput per unit power data (Chipwide DVFS with lower bound throughput).</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g027.tif"/>
      </fig>
      
	  <table-wrap id="jlpea-02-00030-t017" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t017_Table 17</object-id>
        <label>Table 17</label>
        <caption>
          <p>Percentage power-saving due to Chipwide DVFS and MaxBIPS.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr align="center">
              <th valign="middle"> </th>
              <th valign="middle">All Cores Running at 3.4 GHz</th>
              <th valign="middle">Chipwide DVFS</th>
              <th valign="middle">MaxBIPS</th>
            </tr>
          </thead>
          <tbody>
            <tr align="center">
              <td valign="middle">Power saving (%)</td>
              <td valign="middle">0.0</td>
              <td valign="middle">25.9</td>
              <td valign="middle">18.6</td>
            </tr>
          </tbody>
        </table>
		</table-wrap>
      
	  <table-wrap id="jlpea-02-00030-t018" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t018_Table 18</object-id>
        <label>Table 18</label>
        <caption>
          <p>Power, throughput, throughput per unit power of Chipwide DVFS with and without lower bound on throughput.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr>
              <th rowspan="2" align="center" valign="middle"> </th>
              <th colspan="3" align="center" valign="middle">With Lower Bound 60% of Peak T</th>
              <th colspan="3" align="center" valign="middle">Without Lower Bound of Throughput</th>
            </tr>
            <tr>
              <th align="center" valign="middle">Power in one time interval (W)</th>
              <th align="center" valign="middle">Throughput in one time interval (IPnS)</th>
              <th align="center" valign="middle">Throughput per unit power (IPnS/W)</th>
              <th align="center" valign="middle">Power in one time interval (W)</th>
              <th align="center" valign="middle">Throughput in one time interval (IPnS)</th>
              <th align="center" valign="middle">Throughput per unit power (IPnS/W)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="center" valign="middle">Chipwide DVFS</td>
              <td align="center" valign="middle">0.252</td>
              <td align="center" valign="middle">0.332</td>
              <td align="center" valign="middle">1.318</td>
              <td align="center" valign="middle">0.105</td>
              <td align="center" valign="middle">0.184</td>
              <td align="center" valign="middle">1.752</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In summary, experimental data show that when Chipwide DVFS is not given a lower bound on throughput, MaxBIPS has the highest throughput. Although Chipwide DVFS gives the highest throughput per unit power, its throughput is, on average, lower than that of MaxBIPS, which can be a constraining factor in high-throughput systems that require throughputs close to the budget. When Chipwide DVFS is lower-bounded to 60% of the peak throughput achievable by Chipwide DVFS, it produces the higher throughput and the higher power consumption of the two methods. In that configuration Chipwide DVFS yields the lowest throughput per unit power, while MaxBIPS saves more power and achieves a higher throughput per unit power than lower-bounded Chipwide DVFS. <xref ref-type="table" rid="jlpea-02-00030-t019">Table 19</xref> shows the relevant experimental results of the two policies for different packet types.</p>
      <table-wrap id="jlpea-02-00030-t019" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t019_Table 19</object-id>
        <label>Table 19</label>
        <caption>
          <p>Power, throughput, throughput per unit power of two policies for different packet types.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr>
              <th rowspan="2" align="center" valign="middle"> </th>
              <th colspan="3" align="center" valign="middle">Chipwide DVFS Without Lower Bound</th>
              <th colspan="3" align="center" valign="middle">MaxBIPS</th>
            </tr>
            <tr>
              <th align="center" valign="middle">P (W)</th>
              <th align="center" valign="middle">T (IPnS)</th>
              <th align="center" valign="middle">T/P (IPnS/W)</th>
              <th align="center" valign="middle">P (W)</th>
              <th align="center" valign="middle">T (IPnS)</th>
              <th align="center" valign="middle">T/P (IPnS/W)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="center" valign="middle">TYPE0</td>
              <td align="center" valign="middle">3.72</td>
              <td align="center" valign="middle">6.61</td>
              <td align="center" valign="middle">1.78</td>
              <td align="center" valign="middle">7.53</td>
              <td align="center" valign="middle">10.62</td>
              <td align="center" valign="middle">1.41</td>
            </tr>
            <tr>
              <td align="center" valign="middle">TYPE1</td>
              <td align="center" valign="middle">3.72</td>
              <td align="center" valign="middle">6.65</td>
              <td align="center" valign="middle">1.79</td>
              <td align="center" valign="middle">7.54</td>
              <td align="center" valign="middle">10.65</td>
              <td align="center" valign="middle">1.41</td>
            </tr>
            <tr>
              <td align="center" valign="middle">TYPE2</td>
              <td align="center" valign="middle">3.93</td>
              <td align="center" valign="middle">6.65</td>
              <td align="center" valign="middle">1.69</td>
              <td align="center" valign="middle">7.83</td>
              <td align="center" valign="middle">10.65</td>
              <td align="center" valign="middle">1.36</td>
            </tr>
            <tr>
              <td align="center" valign="middle">TYPE3</td>
              <td align="center" valign="middle">3.72</td>
              <td align="center" valign="middle">6.64</td>
              <td align="center" valign="middle">1.78</td>
              <td align="center" valign="middle">7.54</td>
              <td align="center" valign="middle">10.64</td>
              <td align="center" valign="middle">1.41</td>
            </tr>
            <tr>
              <td align="center" valign="middle">TYPE4</td>
              <td align="center" valign="middle">3.72</td>
              <td align="center" valign="middle">6.64</td>
              <td align="center" valign="middle">1.78</td>
              <td align="center" valign="middle">7.54</td>
              <td align="center" valign="middle">10.63</td>
              <td align="center" valign="middle">1.41</td>
            </tr>
            <tr>
              <td align="center" valign="middle">Average</td>
              <td align="center" valign="middle">3.75</td>
              <td align="center" valign="middle">6.64</td>
              <td align="center" valign="middle">1.76</td>
              <td align="center" valign="middle">7.58</td>
              <td align="center" valign="middle">10.64</td>
              <td align="center" valign="middle">1.40</td>
            </tr>
          </tbody>
        </table>
       </table-wrap>
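      <p>As a quick consistency check, the T/P column of Table 19 can be recomputed from the P and T columns. The short Python sketch below does this for the per-packet-type rows; the dictionary and function names are illustrative choices, and the numbers are copied from the table.</p>
      <preformat>
```python
# (power in W, throughput in IPnS) per packet type, copied from Table 19.
CHIPWIDE = {
    "TYPE0": (3.72, 6.61),
    "TYPE1": (3.72, 6.65),
    "TYPE2": (3.93, 6.65),
    "TYPE3": (3.72, 6.64),
    "TYPE4": (3.72, 6.64),
}
MAXBIPS = {
    "TYPE0": (7.53, 10.62),
    "TYPE1": (7.54, 10.65),
    "TYPE2": (7.83, 10.65),
    "TYPE3": (7.54, 10.64),
    "TYPE4": (7.54, 10.63),
}


def throughput_per_watt(power_w, throughput_ipns):
    """Throughput per unit power in IPnS/W, rounded to two decimals."""
    return round(throughput_ipns / power_w, 2)


chipwide_tp = {name: throughput_per_watt(p, t) for name, (p, t) in CHIPWIDE.items()}
maxbips_tp = {name: throughput_per_watt(p, t) for name, (p, t) in MAXBIPS.items()}

for name in CHIPWIDE:
    print(name, chipwide_tp[name], maxbips_tp[name])
```
</preformat>
      <p>For every packet type the recomputed ratio is higher for Chipwide DVFS without a lower bound than for MaxBIPS, even though MaxBIPS delivers the higher absolute throughput.</p>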
      <p><xref ref-type="table" rid="jlpea-02-00030-t020">Table 20</xref> shows the average power, average throughput, average throughput per unit power, average energy and average latency (execution time) of the two power management policies while running about 7300 instructions for all the packet types (averaging is done over all packet types). The results show that, on average, Chipwide DVFS consumes 17.7% more energy than MaxBIPS and has 2.34 times its latency.</p>
      <table-wrap id="jlpea-02-00030-t020" position="anchor">
        <object-id pub-id-type="pii">jlpea-02-00030-t020_Table 20</object-id>
        <label>Table 20</label>
        <caption>
          <p>Average power, average throughput, and average throughput per unit power, average energy, and average execution time of two discussed policies.</p>
        </caption>
        <table rules="all" style="border:solid thin">
<thead>
            <tr>
              <th align="center" valign="middle"> </th>
              <th align="center" valign="middle">P_average (W)</th>
              <th align="center" valign="middle">T_average (IPnS)</th>
              <th align="center" valign="middle">T/P_average (IPnS/W)</th>
              <th align="center" valign="middle">Energy_average (nJ)</th>
              <th align="center" valign="middle">Average Latency (nS)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="center" valign="middle">Chipwide without lower bound</td>
              <td align="center" valign="middle">3.75</td>
              <td align="center" valign="middle">6.64</td>
              <td align="center" valign="middle">1.77</td>
              <td align="center" valign="middle">3.371</td>
              <td align="center" valign="middle">34,816</td>
            </tr>
            <tr>
              <td align="center" valign="middle">MaxBIPS</td>
              <td align="center" valign="middle">7.58</td>
              <td align="center" valign="middle">10.64</td>
              <td align="center" valign="middle">1.40</td>
              <td align="center" valign="middle">2.864</td>
              <td align="center" valign="middle">14,848</td>
            </tr>
          </tbody>
        </table>
		</table-wrap>
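      <p>The two summary figures quoted for Table 20 follow directly from its energy and latency columns. The Python sketch below reproduces that arithmetic; the variable names are illustrative, and the values are copied from the table.</p>
      <preformat>
```python
# Average energy (nJ) and average latency (ns) copied from Table 20.
ENERGY_NJ = {"chipwide": 3.371, "maxbips": 2.864}
LATENCY_NS = {"chipwide": 34816, "maxbips": 14848}

# Extra energy of Chipwide DVFS relative to MaxBIPS, in percent.
extra_energy_pct = (ENERGY_NJ["chipwide"] / ENERGY_NJ["maxbips"] - 1.0) * 100.0

# Latency of Chipwide DVFS as a multiple of the MaxBIPS latency.
latency_ratio = LATENCY_NS["chipwide"] / LATENCY_NS["maxbips"]

print(f"{extra_energy_pct:.1f}% more energy, {latency_ratio:.2f}x latency")
```
</preformat>
      <p>The printed values are 17.7% and 2.34, matching the figures stated in the text.</p>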
      <p><xref ref-type="fig" rid="jlpea-02-00030-f028">Figure 28</xref> and <xref ref-type="fig" rid="jlpea-02-00030-f029">Figure 29</xref> show the dynamic and leakage power dissipations, along with the hardware implementation areas for the Chipwide DVFS and MaxBIPS algorithms, as the number of cores is scaled. The total power dissipation of the MaxBIPS DPM hardware for an 8 core processor is around 101 µW, which is small compared to the average 20% power saving achieved. This justifies the use of on-chip power management units which enable substantial power saving while meeting the performance requirements of the packet processing application.</p>
      <fig id="jlpea-02-00030-f028" position="anchor">
        <label>Figure 28</label>
        <caption>
          <p>Dynamic power, leakage power and area of the Chipwide DVFS module in CASPER.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g028.tif"/>
      </fig>
      <fig id="jlpea-02-00030-f029" position="anchor">
        <label>Figure 29</label>
        <caption>
          <p>Dynamic power, leakage power and area of the MaxBIPS module in CASPER.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jlpea-02-00030-g029.tif"/>
      </fig>
    </sec>
    <sec id="sec6-jlpea-02-00030">
      <title>6. Related Work</title>
      <p>We first briefly survey existing general-purpose processor simulators and then power-aware simulators. The authors of [<xref ref-type="bibr" rid="B15-jlpea-02-00030">15</xref>,<xref ref-type="bibr" rid="B38-jlpea-02-00030">38</xref>,<xref ref-type="bibr" rid="B39-jlpea-02-00030">39</xref>,<xref ref-type="bibr" rid="B40-jlpea-02-00030">40</xref>,<xref ref-type="bibr" rid="B41-jlpea-02-00030">41</xref>,<xref ref-type="bibr" rid="B42-jlpea-02-00030">42</xref>,<xref ref-type="bibr" rid="B43-jlpea-02-00030">43</xref>] study variations of power and throughput in heterogeneous architectures. B. C. Lee and D. M. Brooks <italic>et al</italic>. minimize the overhead of micro-architectural design space exploration through statistical inference via regression models in [<xref ref-type="bibr" rid="B44-jlpea-02-00030">44</xref>]. The models are derived using fast simulations. The work in [<xref ref-type="bibr" rid="B45-jlpea-02-00030">45</xref>] differs from these in that it uses full-fledged simulations for predicting and comparing the performance and power of various architectures. A combination of analytic performance models and simulation-based performance models is used in [<xref ref-type="bibr" rid="B45-jlpea-02-00030">45</xref>] to guide design space exploration for sensor nodes. All these techniques rely on efficient processor simulators for architecture characterization.</p>
      <p>Virtutech Simics [<xref ref-type="bibr" rid="B46-jlpea-02-00030">46</xref>] is a full-system scalable functional simulator for embedded systems. The released versions support microprocessors such as PowerPC, x86, ARM and MIPS. Simics is also capable of simulating any digital device and communication bus. The simulator is able to simulate anything from a simple CPU + memory, to a complex SoC, to a custom board, to a rack of multiple boards, or a network of many computer systems. Simics provides a unique debugging toolset that includes reverse execution, tracing, fault-injection, checkpointing and other development tools. Similarly, Augmint [<xref ref-type="bibr" rid="B47-jlpea-02-00030">47</xref>] is an execution-driven multiprocessor simulator for Intel x86 architectures developed at the University of Illinois at Urbana-Champaign. It can simulate uniprocessors as well as multiprocessors. The inflexibility of Augmint arises from the fact that the user needs to modify the source code to customize the simulator to model a multiprocessor system. However, neither Simics nor Augmint is cycle-accurate, and both model processors which do not have open-sourced architectures or instruction sets; this limits the potential for their use by the research community. Another execution-driven simulator is RSIM [<xref ref-type="bibr" rid="B48-jlpea-02-00030">48</xref>], which models shared-memory multiprocessors that aggressively exploit instruction-level parallelism (ILP). It also models an aggressive coherent memory system and interconnects, including contention at all resources. However, throughput-intensive applications which exploit task-level parallelism are better implemented by the fine-grained multi-threaded cores that our proposed simulation framework models. Moreover, we plan to model simple in-order processor pipelines which enable low-latency thread scheduling, something vital for meeting real-time constraints.</p>
      <p>General Execution-driven Multiprocessor Simulator (GEMS) [<xref ref-type="bibr" rid="B49-jlpea-02-00030">49</xref>] is an execution-driven simulator of SPARC-based multiprocessor systems. It relies on the functional processor simulator Simics and only provides cycle-accurate performance models when potential timing hazards are detected. GEMS Opal provides an out-of-order processor model. GEMS Ruby is a detailed memory system simulator. The GEMS Specification Language including Cache Coherence (SLICC) is designed for developing different memory hierarchies and cache coherence models. The advantages of our simulator over the GEMS platform include its ability to (i) carry out full-chip cycle-accurate simulation with guaranteed fidelity, which results in high confidence during broad micro-architecture explorations; and (ii) provide <italic>deep chip vision</italic> to the architect in terms of chip area requirement and run-time switching characteristics, energy consumption, and chip thermal profile.</p>
      <p>SimFlex [<xref ref-type="bibr" rid="B50-jlpea-02-00030">50</xref>] is a simulator framework for large-scale multiprocessor systems. It includes (a) Flexus, a full-system simulation platform; and (b) SMARTS, a statistically derived model to reduce simulation time. It employs systematic sampling to measure only a very small portion of the entire application being simulated. A functional model is invoked between measurement periods, which greatly speeds up the overall simulation but results in a loss of accuracy and of flexibility for making fine micro-architectural changes, because any such change necessitates regeneration of the statistical functional models. SimFlex also includes an FPGA-based co-simulation platform called ProtoFlex. Our simulator can also be combined with an FPGA-based emulation platform in the future, but this is beyond the scope of this work.</p>
      <p>MPTLsim [<xref ref-type="bibr" rid="B51-jlpea-02-00030">51</xref>] is a uop-accurate, cycle-accurate, full-system simulator for multi-core designs based on the x86-64 ISA. MPTLsim extends PTLsim [<xref ref-type="bibr" rid="B52-jlpea-02-00030">52</xref>], a publicly available single-core simulator, with a host of additional features to support hyperthreading within a core and multiple cores, with detailed models for caches, on-chip interconnections and the memory data flow. MPTLsim incorporates detailed simulation models for cache controllers and interconnections, and has built-in implementations of a number of cache coherency protocols. In contrast, CASPER targets an open-sourced ISA and processor architecture which Sun Microsystems, Inc. has released under the OpenSPARC banner [<xref ref-type="bibr" rid="B4-jlpea-02-00030">4</xref>] for the research community.</p>
      <p>NePSim2 [<xref ref-type="bibr" rid="B53-jlpea-02-00030">53</xref>] is an open-source framework for analyzing and optimizing NP design and power dissipation at the architecture level. It uses a cycle-accurate simulator for Intel's multi-core IXP2xxx NPs, and incorporates an automatic verification framework for testing and validation, and a power estimation model for measuring the power consumption of the simulated NP. To the best of our knowledge, it is the only NP simulator available to the research community. NePSim2 has been evaluated with cryptographic benchmark applications along with a number of basic test cases. However, the simulator is not readily scalable to explore a wide variety of NP architectures.</p>
      <p>Wattch [<xref ref-type="bibr" rid="B17-jlpea-02-00030">17</xref>], proposed by David Brooks <italic>et al.</italic>, is a multi-core micro-architectural power estimation and simulation platform. Wattch enables users to estimate the power dissipation of superscalar out-of-order multi-core micro-architectures only. Out-of-order architectures consist of complex structures such as reservation stations, history-based branch predictors, common data-width buses and re-order buffers, and hence are not always power-efficient. In the case of low-energy processors such as embedded processors and network processors, simple in-order processor pipelines are preferred due to their relatively low power consumption. McPAT [<xref ref-type="bibr" rid="B54-jlpea-02-00030">54</xref>], proposed by Norman Jouppi, is another multi-core micro-architectural power estimation tool. McPAT provides power estimation of both out-of-order and in-order micro-architectures. Although efficient in estimating power dissipation for a wide range of values of the architectural parameters, McPAT is not cycle-accurate and hence is incapable of capturing the dynamic interactions between the pipeline stages inside cores and other core-level micro-architectural structures, the shared memory structures and the interconnection network as all cores execute streams of instructions.</p>
    </sec>
    <sec id="sec7-jlpea-02-00030">
      <title>7. Conclusion</title>
      <p>In this paper, CASPER, a cycle-accurate simulator for shared-memory many-core processors, is presented. A variety of multi-threaded architectural parameters, such as the number of cores, the number of threads per core, and cache sizes, are tunable in the simulator. This allows the exploration of a vast many-core micro-architectural design space for throughput-intensive high-performance and embedded applications. Pre-characterized libraries containing scalable area, delay and power dissipation models of different hardware components are included in CASPER. This enables accurate estimation and monitoring of the dynamic and leakage power dissipation and the area of designs at the high-level architecture exploration stage. Additional hardware-controlled power management modules are designed in CASPER, which enable dynamic power saving. The power saving capabilities of two such dynamic power management algorithms, namely Chipwide DVFS and MaxBIPS, are discussed and their performance-power trade-offs are shown.</p>
    </sec>
    
  </body>
  <back>
    <ack>
      <title>Acknowledgment</title>
      <p>We acknowledge the OpenSPARC Team from Sun Microsystems, Inc. for their collaboration concerning OpenSPARC models, SAM and SPARCV9 instructions. We also acknowledge them for evaluating CASPER and awarding it the first prize in the category “Best submission that makes a substantial contribution to the OpenSPARC community” as part of the OpenSPARC Community Innovation Award Contest in 2008.</p>
    </ack>
	<ref-list>
      <title>References</title>
      <ref id="B1-jlpea-02-00030">
        <label>1.</label>
        <citation citation-type="web">
          <article-title>Netronome Heterogeneous Reference Architecture</article-title>
          <publisher-name>Netronome Inc.</publisher-name>
          <year>2010</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.netronome.com/pages/heterogeneous-architecture" ext-link-type="uri">http://www.netronome.com/pages/heterogeneous-architecture</ext-link></comment>
        </citation>
      </ref>
      <ref id="B2-jlpea-02-00030">
        <label>2.</label>
        <citation citation-type="web">
          <collab>Cisco Inc.</collab>
          <article-title>The Cisco QuantumFlow Processor: Cisco’s Next Generation Network Processor</article-title>
          <year>2010</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.cisco.com/en/US/prod/collateral/routers/ps9343/solution_overview_c22-448936.html" ext-link-type="uri">http://www.cisco.com/en/US/prod/collateral/routers/ps9343/solution_overview_c22-448936.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B3-jlpea-02-00030">
        <label>3.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Lindholm</surname>
              <given-names>E.</given-names>
            </name>
            <name>
              <surname>Nickolls</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Oberman</surname>
              <given-names>S.</given-names>
            </name>
            <name>
              <surname>Montrym</surname>
              <given-names>J.</given-names>
            </name>
          </person-group>
          <article-title>NVIDIA Tesla: A Unified Graphics and Computing Architecture</article-title>
          <source>IEEE Micro</source>
          <year>2008</year>
          <volume>28</volume>
          <fpage>39</fpage>
          <lpage>55</lpage>
          <pub-id pub-id-type="doi">10.1109/MM.2008.31</pub-id>
        </citation>
      </ref>
      <ref id="B4-jlpea-02-00030">
        <label>4.</label>
        <citation citation-type="web">
          <article-title>OpenSPARC T1/T2</article-title>
          <year>2007</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.opensparc.net" ext-link-type="uri">http://www.opensparc.net</ext-link></comment>
        </citation>
      </ref>
      <ref id="B5-jlpea-02-00030">
        <label>5.</label>
        <citation citation-type="web">
          <article-title>Oracle’s SPARC T4-1, SPARC T4-2, SPARC T4-4, and SPARC T4-1B Server Architecture</article-title>
          <publisher-name>Oracle Corp</publisher-name>
          <year>2009</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.opensparc.net" ext-link-type="uri">http://www.opensparc.net</ext-link></comment>
        </citation>
      </ref>
      <ref id="B6-jlpea-02-00030">
        <label>6.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Stackhouse</surname>
              <given-names>B.</given-names>
            </name>
            <name>
              <surname>Bhimji</surname>
              <given-names>S.</given-names>
            </name>
            <name>
              <surname>Bostak</surname>
              <given-names>C.</given-names>
            </name>
            <name>
              <surname>Bradley</surname>
              <given-names>D.</given-names>
            </name>
            <name>
              <surname>Cherkauer</surname>
              <given-names>B.</given-names>
            </name>
            <name>
              <surname>Desai</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Francom</surname>
              <given-names>E.</given-names>
            </name>
            <name>
              <surname>Gowan</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Gronowski</surname>
              <given-names>P.</given-names>
            </name>
            <name>
              <surname>Krueger</surname>
              <given-names>D.</given-names>
            </name>
            <etal/>
          </person-group>
          <article-title>A 65 nm 2-Billion Transistor Quad-Core Itanium Processor</article-title>
          <source>IEEE J. Solid-State Circ.</source>
          <year>2009</year>
          <volume>44</volume>
          <fpage>18</fpage>
          <lpage>31</lpage>
          <pub-id pub-id-type="doi">10.1109/JSSC.2008.2007150</pub-id>
        </citation>
      </ref>
      <ref id="B7-jlpea-02-00030">
        <label>7.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Spracklen</surname>
              <given-names>L.</given-names>
            </name>
            <name>
              <surname>Abraham</surname>
              <given-names>S.G.</given-names>
            </name>
          </person-group>
          <article-title>Chip Multithreading: Opportunities and Challenges</article-title>
          <source>Proceedings of 11th International Symposium on High-Performance Computer Architecture (HPCA-11)</source>
          <conf-loc>San Francisco, CA, USA</conf-loc>
          <conf-date>12–16 February 2005</conf-date>
          <fpage>248</fpage>
          <lpage>252</lpage>
        </citation>
      </ref>
      <ref id="B8-jlpea-02-00030">
        <label>8.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Tullsen</surname>
              <given-names>D.M.</given-names>
            </name>
            <name>
              <surname>Eggers</surname>
              <given-names>S.J.</given-names>
            </name>
            <name>
              <surname>Levy</surname>
              <given-names>H.M.</given-names>
            </name>
          </person-group>
          <article-title>Simultaneous multithreading: Maximizing on-chip parallelism</article-title>
          <source>Proceedings of the 22nd International Symposium on Computer Architecture</source>
          <conf-loc>Santa Margherita Ligure, Italy</conf-loc>
          <conf-date>22–24 June 1995</conf-date>
          <fpage>392</fpage>
          <lpage>403</lpage>
        </citation>
      </ref>
      <ref id="B9-jlpea-02-00030">
        <label>9.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Eggers</surname>
              <given-names>S.J.</given-names>
            </name>
            <name>
              <surname>Emer</surname>
              <given-names>J.S.</given-names>
            </name>
            <name>
              <surname>Levy</surname>
              <given-names>H.M.</given-names>
            </name>
            <name>
              <surname>Lo</surname>
              <given-names>J.L.</given-names>
            </name>
            <name>
              <surname>Stamm</surname>
              <given-names>R.L.</given-names>
            </name>
            <name>
              <surname>Tullsen</surname>
              <given-names>D.M.</given-names>
            </name>
          </person-group>
          <article-title>Simultaneous multithreading: a platform for next-generation processors</article-title>
          <source>IEEE Micro</source>
          <year>1997</year>
          <volume>17</volume>
          <fpage>12</fpage>
          <lpage>19</lpage>
        </citation>
      </ref>
      <ref id="B10-jlpea-02-00030">
        <label>10.</label>
        <citation citation-type="web">
          <article-title>SimplyRISC S1 Core on FPGA</article-title>
          <year>2007</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.opensparc.net/projects/directory.html" ext-link-type="uri">http://www.opensparc.net/projects/directory.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B11-jlpea-02-00030">
        <label>11.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Bell</surname>
              <given-names>S.</given-names>
            </name>
            <name>
              <surname>Edwards</surname>
              <given-names>B.</given-names>
            </name>
            <name>
              <surname>Amann</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Conlin</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Joyce</surname>
              <given-names>K.</given-names>
            </name>
            <name>
              <surname>Leung</surname>
              <given-names>V.</given-names>
            </name>
            <name>
              <surname>MacKay</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Reif</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Bao</surname>
              <given-names>L.</given-names>
            </name>
            <name>
              <surname>Brown</surname>
              <given-names>J.</given-names>
            </name>
          </person-group>
          <article-title>TILE64—Processor: A 64-Core SoC with Mesh Interconnect</article-title>
          <source>Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC)</source>
          <conf-loc>San Francisco, CA, USA</conf-loc>
          <conf-date>3–7 February 2008</conf-date>
          <fpage>88</fpage>
          <lpage>89</lpage>
        </citation>
      </ref>
      <ref id="B12-jlpea-02-00030">
        <label>12.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Beavers</surname>
              <given-names>B.</given-names>
            </name>
          </person-group>
          <article-title>The story behind the Intel Atom processor success</article-title>
          <source>IEEE Des. Test Comput.</source>
          <year>2009</year>
          <volume>26</volume>
          <fpage>8</fpage>
          <lpage>13</lpage>
          <pub-id pub-id-type="doi">10.1109/MDT.2009.44</pub-id>
        </citation>
      </ref>
      <ref id="B13-jlpea-02-00030">
        <label>13.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Kongetira</surname>
              <given-names>P.</given-names>
            </name>
            <name>
              <surname>Aingaran</surname>
              <given-names>K.</given-names>
            </name>
            <name>
              <surname>Olukotun</surname>
              <given-names>K.</given-names>
            </name>
          </person-group>
          <article-title>Niagara: A 32-way multithreaded Sparc processor</article-title>
          <source>IEEE Micro</source>
          <year>2005</year>
          <volume>25</volume>
          <fpage>21</fpage>
          <lpage>29</lpage>
          <pub-id pub-id-type="doi">10.1109/MM.2005.35</pub-id>
        </citation>
      </ref>
      <ref id="B14-jlpea-02-00030">
        <label>14.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Kumar</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Farkas</surname>
              <given-names>K.I.</given-names>
            </name>
            <name>
              <surname>Jouppi</surname>
              <given-names>N.P.</given-names>
            </name>
            <name>
              <surname>Ranganathan</surname>
              <given-names>P.</given-names>
            </name>
            <name>
              <surname>Tullsen</surname>
              <given-names>D.M.</given-names>
            </name>
          </person-group>
          <article-title>Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction</article-title>
          <source>Proceedings of 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36)</source>
          <conf-loc>San Diego, CA, USA</conf-loc>
          <conf-date>3–5 December 2003</conf-date>
          <fpage>81</fpage>
          <lpage>92</lpage>
        </citation>
      </ref>
      <ref id="B15-jlpea-02-00030">
        <label>15.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Kumar</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Tullsen</surname>
              <given-names>D.M.</given-names>
            </name>
            <name>
              <surname>Ranganathan</surname>
              <given-names>P.</given-names>
            </name>
            <name>
              <surname>Jouppi</surname>
              <given-names>N.P.</given-names>
            </name>
            <name>
              <surname>Farkas</surname>
              <given-names>K.I.</given-names>
            </name>
          </person-group>
          <article-title>Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance</article-title>
          <source>Proceedings of the 31st International Symposium on Computer Architecture</source>
          <conf-loc>München, Germany</conf-loc>
          <conf-date>19–23 June 2004</conf-date>
          <fpage>64</fpage>
          <lpage>75</lpage>
        </citation>
      </ref>
      <ref id="B16-jlpea-02-00030">
        <label>16.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Zhao</surname>
              <given-names>W.</given-names>
            </name>
            <name>
              <surname>Li</surname>
              <given-names>X.</given-names>
            </name>
            <name>
              <surname>Nowak</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Cao</surname>
              <given-names>Y.</given-names>
            </name>
          </person-group>
          <article-title>Predictive Technology Modeling for 32nm Low Power Design</article-title>
          <source>Proceedings of 2007 International Semiconductor Device Research Symposium</source>
          <conf-loc>College Park, MD, USA</conf-loc>
          <conf-date>12–14 December 2007</conf-date>
          <fpage>1</fpage>
          <lpage>2</lpage>
        </citation>
      </ref>
      <ref id="B17-jlpea-02-00030">
        <label>17.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Brooks</surname>
              <given-names>D.</given-names>
            </name>
            <name>
              <surname>Tiwari</surname>
              <given-names>V.</given-names>
            </name>
            <name>
              <surname>Martonosi</surname>
              <given-names>M.</given-names>
            </name>
          </person-group>
          <article-title>Wattch: A framework for architectural-level power analysis and optimizations</article-title>
          <source>Proceedings of the 27th International Symposium on Computer Architecture</source>
          <conf-loc>Vancouver, Canada</conf-loc>
          <conf-date>14 June 2000</conf-date>
          <fpage>83</fpage>
          <lpage>94</lpage>
        </citation>
      </ref>
      <ref id="B18-jlpea-02-00030">
        <label>18.</label>
        <citation citation-type="web">
          <article-title>Oracle Corporation. Oracle Solaris Studio 11 Overview</article-title>
          <year>2011</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.oracle.com/technetwork/server-storage/solarisstudio/downloads/index.html" ext-link-type="uri">http://www.oracle.com/technetwork/server-storage/solarisstudio/downloads/index.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B19-jlpea-02-00030">
        <label>19.</label>
        <citation citation-type="web">
          <article-title>Sun Microsystems Inc. OpenSPARC T1 Micro-Architecture Specification</article-title>
          <year>2006</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.opensparc.net/opensparc-t1/index.html" ext-link-type="uri">http://www.opensparc.net/opensparc-t1/index.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B20-jlpea-02-00030">
        <label>20.</label>
        <citation citation-type="web">
          <collab>Sun Microsystems Inc.</collab>
          <article-title>UltraSPARC Architecture 2007, Privileged and Non-Privileged Instructions</article-title>
          <year>2007</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.opensparc.net/opensparc-t1/index.html" ext-link-type="uri">http://www.opensparc.net/opensparc-t1/index.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B21-jlpea-02-00030">
        <label>21.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Brooks</surname>
              <given-names>D.</given-names>
            </name>
            <name>
              <surname>Martonosi</surname>
              <given-names>M.</given-names>
            </name>
          </person-group>
          <article-title>Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance</article-title>
          <source>ACM Trans. Comput. Syst.</source>
          <year>2000</year>
          <volume>18</volume>
          <fpage>89</fpage>
          <lpage>126</lpage>
          <pub-id pub-id-type="doi">10.1145/350853.350856</pub-id>
        </citation>
      </ref>
      <ref id="B22-jlpea-02-00030">
        <label>22.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Leon</surname>
              <given-names>A.S.</given-names>
            </name>
            <name>
              <surname>Langley</surname>
              <given-names>B.</given-names>
            </name>
            <name>
              <surname>Shin</surname>
              <given-names>J.L.</given-names>
            </name>
          </person-group>
          <article-title>The UltraSPARC T1 Processor: CMT Reliability</article-title>
          <source>Proceedings of CICC '06 IEEE Custom Integrated Circuits Conference</source>
          <conf-loc>San Jose, CA, USA</conf-loc>
          <conf-date>10–13 September 2006</conf-date>
          <fpage>555</fpage>
          <lpage>562</lpage>
        </citation>
      </ref>
      <ref id="B23-jlpea-02-00030">
        <label>23.</label>
        <citation citation-type="web">
          <collab>Sun Microsystems Inc.</collab>
          <article-title>OpenSPARC T2 System-On-Chip (SOC) Microarchitecture Specification</article-title>
          <year>2008</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.opensparc.net/opensparc-t2/index.html" ext-link-type="uri">http://www.opensparc.net/opensparc-t2/index.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B24-jlpea-02-00030">
        <label>24.</label>
        <citation citation-type="web">
          <collab>Synopsys Inc.</collab>
          <article-title>DFT Compiler Datasheet</article-title>
          <year>2009</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online:<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.synopsys.com/tools/implementation/rtlsynthesis/pages/dftcompiler.aspx" ext-link-type="uri">http://www.synopsys.com/tools/implementation/rtlsynthesis/pages/dftcompiler.aspx</ext-link></comment>
        </citation>
      </ref>
      <ref id="B25-jlpea-02-00030">
        <label>25.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Zhao</surname>
              <given-names>W.</given-names>
            </name>
            <name>
              <surname>Cao</surname>
              <given-names>Y.</given-names>
            </name>
          </person-group>
          <article-title>New generation of predictive technology model for sub-45 nm design exploration</article-title>
          <source>ACM Trans. Comput. Syst.</source>
          <year>2007</year>
          <volume>3</volume>
          <fpage>585</fpage>
          <lpage>590</lpage>
        </citation>
      </ref>
      <ref id="B26-jlpea-02-00030">
        <label>26.</label>
        <citation citation-type="web">
          <article-title>Cadence Encounter</article-title>
          <year>2009</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.cadence.com/products/ld/rtl_compiler/" ext-link-type="uri">http://www.cadence.com/products/ld/rtl_compiler/</ext-link></comment>
        </citation>
      </ref>
      <ref id="B27-jlpea-02-00030">
        <label>27.</label>
        <citation citation-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Tarjan</surname>
              <given-names>D.</given-names>
            </name>
            <name>
              <surname>Thoziyoor</surname>
              <given-names>S.</given-names>
            </name>
            <name>
              <surname>Jouppi</surname>
              <given-names>N.P.</given-names>
            </name>
          </person-group>
          <source>CACTI 4.0</source>
          <publisher-name>HP Laboratories</publisher-name>
          <publisher-loc>Palo Alto, CA, USA</publisher-loc>
          <year>2006</year>
          <comment>June</comment>
        </citation>
      </ref>
      <ref id="B28-jlpea-02-00030">
        <label>28.</label>
        <citation citation-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Gifi</surname>
              <given-names>A.</given-names>
            </name>
          </person-group>
          <source>Nonlinear Multivariate Analysis</source>
          <publisher-name>John Wiley &amp; Sons</publisher-name>
          <publisher-loc>Hoboken, NJ, USA</publisher-loc>
          <year>1989</year>
        </citation>
      </ref>
      <ref id="B29-jlpea-02-00030">
        <label>29.</label>
        <citation citation-type="web">
          <article-title>SPSS Statistical Tool</article-title>
          <year>2011</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www-01.ibm.com/software/analytics/spss/" ext-link-type="uri">http://www-01.ibm.com/software/analytics/spss/</ext-link></comment>
        </citation>
      </ref>
      <ref id="B30-jlpea-02-00030">
        <label>30.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Isci</surname>
              <given-names>C.</given-names>
            </name>
            <name>
              <surname>Buyuktosunoglu</surname>
              <given-names>A.</given-names>
            </name>
            <name>
              <surname>Cher</surname>
              <given-names>C.-Y.</given-names>
            </name>
            <name>
              <surname>Bose</surname>
              <given-names>P.</given-names>
            </name>
            <name>
              <surname>Martonosi</surname>
              <given-names>M.</given-names>
            </name>
          </person-group>
          <article-title>An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget</article-title>
          <source>Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture</source>
          <conf-loc>Orlando, FL, USA</conf-loc>
          <conf-date>9–13 December 2006</conf-date>
          <fpage>347</fpage>
          <lpage>358</lpage>
        </citation>
      </ref>
      <ref id="B31-jlpea-02-00030">
        <label>31.</label>
        <citation citation-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Benini</surname>
              <given-names>L.</given-names>
            </name>
            <name>
              <surname>De Micheli</surname>
              <given-names>G.</given-names>
            </name>
          </person-group>
          <source>Dynamic Power Management: Design Techniques and CAD Tools</source>
          <publisher-name>Kluwer Academic</publisher-name>
          <publisher-loc>USA</publisher-loc>
          <year>1997</year>
        </citation>
      </ref>
      <ref id="B32-jlpea-02-00030">
        <label>32.</label>
        <citation citation-type="book">
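          <person-group person-group-type="author">
            <name>
              <surname>Datta</surname>
              <given-names>K.</given-names>
            </name>
            <name>
              <surname>Liu</surname>
              <given-names>Y.</given-names>
            </name>
            <name>
              <surname>Mukherjee</surname>
              <given-names>A.</given-names>
            </name>
            <name>
              <surname>Ravindran</surname>
              <given-names>A.</given-names>
            </name>
            <name>
              <surname>Joshi</surname>
              <given-names>B.</given-names>
            </name>
          </person-group>
          <article-title>Hardware Techniques for Autonomous Power Saving in Embedded Many-Core Processors</article-title>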
          <source>Multi-Core Embedded Systems</source>
          <person-group person-group-type="editor">
            <name>
              <surname>Kornaros</surname>
              <given-names>G.</given-names>
            </name>
          </person-group>
          <publisher-name>CRC Press and Taylor &amp; Francis Group</publisher-name>
          <publisher-loc>Boca Raton, FL, USA</publisher-loc>
          <year>2009</year>
          <volume>Chapter 10</volume>
        </citation>
      </ref>
      <ref id="B33-jlpea-02-00030">
        <label>33.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Wolf</surname>
              <given-names>T.</given-names>
            </name>
            <name>
              <surname>Franklin</surname>
              <given-names>M.</given-names>
            </name>
          </person-group>
          <article-title>CommBench - A Telecommunications Benchmark for Network Processors</article-title>
          <source>Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)</source>
          <conf-loc>Austin, TX, USA</conf-loc>
          <conf-date>24–25 April 2000</conf-date>
          <fpage>154</fpage>
          <lpage>162</lpage>
        </citation>
      </ref>
      <ref id="B34-jlpea-02-00030">
        <label>34.</label>
        <citation citation-type="web">
          <person-group person-group-type="author">
            <name>
              <surname>Tee</surname>
              <given-names>A.</given-names>
            </name>
            <name>
              <surname>Cleveland</surname>
              <given-names>J.R.</given-names>
            </name>
            <name>
              <surname>Chang</surname>
              <given-names>J.W.</given-names>
            </name>
          </person-group>
          <article-title>Implication of End-User QoS Requirements on PHY &amp; MAC</article-title>
          <source>Technical Report from IEEE 802 Executive Committee Study Group on Mobile Broadband Wireless Access; C802.20-03/106</source>
          <year>2003</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.ieee802.org/20/Contribs/C802.20-03-106.ppt" ext-link-type="uri">http://www.ieee802.org/20/Contribs/C802.20-03-106.ppt</ext-link></comment>
        </citation>
      </ref>
      <ref id="B35-jlpea-02-00030">
        <label>35.</label>
        <citation citation-type="web">
          <person-group person-group-type="author">
            <name>
              <surname>Rosewarne</surname>
              <given-names>C.</given-names>
            </name>
          </person-group>
          <source>Network Processors</source>
          <publisher-name>Calyptech Ltd.</publisher-name>
          <publisher-loc>Melbourne, Australia</publisher-loc>
          <year>2004</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.calyptech.com/resources/resources/#networkprocessor" ext-link-type="uri">http://www.calyptech.com/resources/resources/#networkprocessor</ext-link></comment>
        </citation>
      </ref>
      <ref id="B36-jlpea-02-00030">
        <label>36.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Regnier</surname>
              <given-names>G.</given-names>
            </name>
            <name>
              <surname>Minturn</surname>
              <given-names>D.</given-names>
            </name>
            <name>
              <surname>McAlpine</surname>
              <given-names>G.</given-names>
            </name>
            <name>
              <surname>Saletore</surname>
              <given-names>V.</given-names>
            </name>
            <name>
              <surname>Foong</surname>
              <given-names>A.</given-names>
            </name>
          </person-group>
          <article-title>ETA: Experience with an Intel<sup>®</sup> Xeon<sup>TM</sup> Processor as a Packet Processing Engine</article-title>
          <source>Proceedings of 11th Symposium on High Performance Interconnects</source>
          <conf-loc>Palo Alto, CA, USA</conf-loc>
          <conf-date>20–22 August 2003</conf-date>
          <fpage>76</fpage>
          <lpage>82</lpage>
        </citation>
      </ref>
      <ref id="B37-jlpea-02-00030">
        <label>37.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Roberts</surname>
              <given-names>L.G.</given-names>
            </name>
          </person-group>
          <article-title>A radical new router</article-title>
          <source>IEEE Spectrum</source>
          <year>2009</year>
          <volume>46</volume>
          <fpage>34</fpage>
          <lpage>39</lpage>
          <pub-id pub-id-type="doi">10.1109/MSPEC.2009.5109450</pub-id>
        </citation>
      </ref>
      <ref id="B38-jlpea-02-00030">
        <label>38.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Grochowski</surname>
              <given-names>E.</given-names>
            </name>
            <name>
              <surname>Ronen</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Shen</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Hong</surname>
              <given-names>W.</given-names>
            </name>
          </person-group>
          <article-title>Best of Both Latency and Throughput</article-title>
          <source>Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, ICCD 2004</source>
          <conf-loc>San Jose, CA, USA</conf-loc>
          <conf-date>11–13 October 2004</conf-date>
          <fpage>236</fpage>
          <lpage>243</lpage>
        </citation>
      </ref>
      <ref id="B39-jlpea-02-00030">
        <label>39.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Annavaram</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Grochowski</surname>
              <given-names>E.</given-names>
            </name>
            <name>
              <surname>Shen</surname>
              <given-names>J.P.</given-names>
            </name>
          </person-group>
          <article-title>Mitigating Amdahl’s Law through EPI Throttling</article-title>
          <source>Proceedings of the 32nd Annual International Symposium on Computer Architecture</source>
          <conf-loc>Madison, WI, USA</conf-loc>
          <conf-date>4–8 June 2005</conf-date>
          <fpage>298</fpage>
          <lpage>309</lpage>
        </citation>
      </ref>
      <ref id="B40-jlpea-02-00030">
        <label>40.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Kumar</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Farkas</surname>
              <given-names>K.I.</given-names>
            </name>
            <name>
              <surname>Jouppi</surname>
              <given-names>N.P.</given-names>
            </name>
            <name>
              <surname>Ranganathan</surname>
              <given-names>P.</given-names>
            </name>
            <name>
              <surname>Tullsen</surname>
              <given-names>D.M.</given-names>
            </name>
          </person-group>
          <article-title>Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction</article-title>
          <source>Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture</source>
          <conf-loc>San Diego, CA, USA</conf-loc>
          <conf-date>3–5 December 2003</conf-date>
          <fpage>81</fpage>
          <lpage>92</lpage>
        </citation>
      </ref>
      <ref id="B41-jlpea-02-00030">
        <label>41.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Balakrishnan</surname>
              <given-names>S.</given-names>
            </name>
            <name>
              <surname>Rajwar</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Upton</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Lai</surname>
              <given-names>K.</given-names>
            </name>
          </person-group>
          <article-title>The Impact of Performance Asymmetry in Emerging Multicore Architectures</article-title>
          <source>Proceedings of 32nd International Symposium on Computer Architecture, 2005. ISCA ’05</source>
          <conf-loc>Madison, WI, USA</conf-loc>
          <conf-date>4–8 June 2005</conf-date>
          <fpage>506</fpage>
          <lpage>517</lpage>
        </citation>
      </ref>
      <ref id="B42-jlpea-02-00030">
        <label>42.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Morad</surname>
              <given-names>T.Y.</given-names>
            </name>
            <name>
              <surname>Weiser</surname>
              <given-names>U.C.</given-names>
            </name>
            <name>
              <surname>Kolodny</surname>
              <given-names>A.</given-names>
            </name>
            <name>
              <surname>Valero</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Ayguade</surname>
              <given-names>E.</given-names>
            </name>
          </person-group>
          <article-title>Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors</article-title>
          <source>Comput. Archit. Lett.</source>
          <year>2006</year>
          <volume>5</volume>
          <fpage>14</fpage>
          <lpage>17</lpage>
          <pub-id pub-id-type="doi">10.1109/L-CA.2006.14</pub-id>
        </citation>
      </ref>
      <ref id="B43-jlpea-02-00030">
        <label>43.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Ye</surname>
              <given-names>W.</given-names>
            </name>
            <name>
              <surname>Vijaykrishnan</surname>
              <given-names>N.</given-names>
            </name>
            <name>
              <surname>Kandemir</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Irwin</surname>
              <given-names>M.J.</given-names>
            </name>
          </person-group>
          <article-title>The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool</article-title>
          <source>Proceedings of 37th Design Automation Conference</source>
          <conf-loc>Los Angeles, CA, USA</conf-loc>
          <conf-date>5–9 June 2000</conf-date>
          <fpage>340</fpage>
          <lpage>345</lpage>
        </citation>
      </ref>
      <ref id="B44-jlpea-02-00030">
        <label>44.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Lee</surname>
              <given-names>B.C.</given-names>
            </name>
            <name>
              <surname>Brooks</surname>
              <given-names>D.M.</given-names>
            </name>
          </person-group>
          <article-title>Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction</article-title>
          <source>Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems</source>
          <conf-loc>San Jose, CA, USA</conf-loc>
          <conf-date>21–25 October 2006</conf-date>
        </citation>
      </ref>
      <ref id="B45-jlpea-02-00030">
        <label>45.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Bakshi</surname>
              <given-names>A.</given-names>
            </name>
            <name>
              <surname>Ou</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Prasanna</surname>
              <given-names>V.K.</given-names>
            </name>
          </person-group>
          <article-title>Towards Automatic Synthesis of a Class of Application-Specific Sensor Networks</article-title>
          <source>Proceedings of the 2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems</source>
          <conf-loc>Grenoble, France</conf-loc>
          <conf-date>8–11 October 2002</conf-date>
        </citation>
      </ref>
      <ref id="B46-jlpea-02-00030">
        <label>46.</label>
        <citation citation-type="web">
          <collab>Virtutech</collab>
          <source>Virtutech Simics Multi-Processor Simulator Software</source>
          <year>2008</year>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.simics.net/forum/about.html" ext-link-type="uri">https://www.simics.net/forum/about.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B47-jlpea-02-00030">
        <label>47.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Herbert</surname>
              <given-names>S.</given-names>
            </name>
            <name>
              <surname>Marculescu</surname>
              <given-names>D.</given-names>
            </name>
          </person-group>
          <article-title>Analysis of dynamic voltage/frequency scaling in chip-multiprocessors</article-title>
          <source>Proceedings of the 2007 International Symposium on Low Power Electronics and Design</source>
          <conf-loc>Portland, OR, USA</conf-loc>
          <conf-date>27–29 August 2007</conf-date>
          <fpage>38</fpage>
          <lpage>43</lpage>
        </citation>
      </ref>
      <ref id="B48-jlpea-02-00030">
        <label>48.</label>
        <citation citation-type="web">
          <article-title>RSIM</article-title>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://rsim.cs.illinois.edu/rsim/dist.html" ext-link-type="uri">http://rsim.cs.illinois.edu/rsim/dist.html</ext-link></comment>
        </citation>
      </ref>
      <ref id="B49-jlpea-02-00030">
        <label>49.</label>
        <citation citation-type="web">
          <article-title>Multifacet GEMS</article-title>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.cs.wisc.edu/gems/" ext-link-type="uri">http://www.cs.wisc.edu/gems/</ext-link></comment>
        </citation>
      </ref>
      <ref id="B50-jlpea-02-00030">
        <label>50.</label>
        <citation citation-type="web">
          <article-title>SimFlex</article-title>
          <access-date>(accessed on 1 February 2012)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://si2.epfl.ch/~parsacom/projects/simflex/" ext-link-type="uri">http://si2.epfl.ch/~parsacom/projects/simflex/</ext-link></comment>
        </citation>
      </ref>
      <ref id="B51-jlpea-02-00030">
        <label>51.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Zeng</surname>
              <given-names>H.</given-names>
            </name>
            <name>
              <surname>Yourst</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Ghose</surname>
              <given-names>K.</given-names>
            </name>
            <name>
              <surname>Ponomarev</surname>
              <given-names>D.</given-names>
            </name>
          </person-group>
          <article-title>MPTLsim: A simulator for X86 multicore processors</article-title>
          <source>Proceedings of 46th ACM/IEEE Design Automation Conference (DAC ’09)</source>
          <conf-loc>San Francisco, CA, USA</conf-loc>
          <conf-date>26–31 July 2009</conf-date>
          <fpage>226</fpage>
          <lpage>231</lpage>
        </citation>
      </ref>
      <ref id="B52-jlpea-02-00030">
        <label>52.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Yourst</surname>
              <given-names>M.T.</given-names>
            </name>
          </person-group>
          <article-title>PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator</article-title>
          <source>Proceedings of IEEE International Symposium on Performance Analysis of Systems &amp; Software, 2007 (ISPASS 2007)</source>
          <conf-loc>San Jose, CA, USA</conf-loc>
          <conf-date>25–27 April 2007</conf-date>
          <fpage>23</fpage>
          <lpage>34</lpage>
        </citation>
      </ref>
      <ref id="B53-jlpea-02-00030">
        <label>53.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Luo</surname>
              <given-names>Y.</given-names>
            </name>
            <name>
              <surname>Yang</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Bhuyan</surname>
              <given-names>L.N.</given-names>
            </name>
            <name>
              <surname>Zhao</surname>
              <given-names>L.</given-names>
            </name>
          </person-group>
          <article-title>NePSim: A network processor simulator with a power evaluation framework</article-title>
          <source>IEEE Micro</source>
          <year>2004</year>
          <volume>24</volume>
          <fpage>34</fpage>
          <lpage>44</lpage>
        </citation>
      </ref>
      <ref id="B54-jlpea-02-00030">
        <label>54.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Li</surname>
              <given-names>S.</given-names>
            </name>
            <name>
              <surname>Ahn</surname>
              <given-names>J.H.</given-names>
            </name>
            <name>
              <surname>Strong</surname>
              <given-names>R.D.</given-names>
            </name>
            <name>
              <surname>Brockman</surname>
              <given-names>J.B.</given-names>
            </name>
            <name>
              <surname>Tullsen</surname>
              <given-names>D.M.</given-names>
            </name>
            <name>
              <surname>Jouppi</surname>
              <given-names>N.P.</given-names>
            </name>
          </person-group>
          <article-title>McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures</article-title>
          <source>Proceedings of 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-42</source>
          <conf-loc>New York, NY, USA</conf-loc>
          <conf-date>12–16 December 2009</conf-date>
          <fpage>469</fpage>
          <lpage>480</lpage>
        </citation>
      </ref>
    </ref-list>
  </back>
</article>
