A Technique for Improving Lifetime of Non-Volatile Caches Using Write-Minimization

While non-volatile memories (NVMs) provide high density and low leakage, they also have low write endurance. This, along with the write-variation introduced by the cache management policies, can lead to very small cache lifetime. In this paper, we propose ENLIVE, a technique for ENhancing the LIfetime of non-Volatile cachEs. Our technique uses a small SRAM (static random access memory) storage, called HotStore. ENLIVE detects frequently written blocks and transfers them to the HotStore so that they can be accessed with smaller latency and energy. This also reduces the number of writes to the NVM cache, which improves its lifetime. We present microarchitectural schemes for managing the HotStore. Simulations have been performed using an x86-64 simulator and benchmarks from the SPEC2006 suite. We observe that ENLIVE provides higher improvement in lifetime and better performance and energy efficiency than two state-of-the-art techniques for improving NVM cache lifetime. ENLIVE provides 8.47×, 14.67× and 15.79× improvement in lifetime for two, four and eight core systems, respectively. In addition, it works well for a range of system and algorithm parameters and incurs only small overhead.


Introduction
Recent trends of chip-miniaturization and CMOS (complementary metal-oxide semiconductor) scaling have led to a large increase in the number of cores on a chip [1]. To feed data to these cores and avoid off-chip accesses, the size of last level caches has significantly increased; for example, Intel's Xeon E7-8870 processor has a 30 MB last level cache (LLC) [1]. Conventionally, caches have been designed using SRAM (static random access memory), since it provides high performance and write endurance. However, SRAM also has low density and high leakage energy, which leads to increased energy consumption and temperature of the chip. With ongoing scaling of operating voltage, the critical charge required to flip the value stored in a memory cell has been decreasing [2], and this poses a serious concern for reliability of charge-based memories such as SRAM.
Non-volatile memories (NVMs) hold the promise of providing a low-leakage, high-density alternative to SRAM for designing on-chip caches [3-6]. NVMs such as spin transfer torque RAM (STT-RAM) and resistive RAM (ReRAM) have several attractive features, such as read latency comparable to that of SRAM, high density and CMOS compatibility [7,8]. Further, NVMs are expected to scale to much smaller feature sizes than the charge-based memories since they rely on resistivity rather than charge as the information carrier.
A crucial limitation of NVMs, however, is that their write endurance is orders of magnitude lower than that of SRAM and DRAM (dynamic random access memory). For example, while the write endurance of SRAM and DRAM is in the range of $10^{16}$, this value for ReRAM is only $10^{11}$ [9]. For STT-RAM, although a write-endurance value of $10^{15}$ has been predicted [10], the best write-endurance test so far shows only $4 \times 10^{12}$ [5,11]. Process variation may further reduce these values by an order of magnitude [12].
Further, existing cache management techniques have been designed for optimizing performance and they do not take the write endurance of the device into account. Hence, due to the temporal locality of program access, they may greatly increase the number of accesses to a few blocks of the cache. Due to failure of these blocks, the actual device lifetime may be much smaller than what is expected assuming uniform distribution of writes. Figure 1 shows an example of this, where the number of writes to different cache blocks is plotted for the SpPo (sphinx-povray) workload (more details of the experimental setup are provided in Section 4). Clearly, the write-variation across different cache blocks is very large: while the average write-count per block is only 47, the maximum write-count on a block is 72,349, which is about 1540 times the average.
Thus, limited endurance of NVMs, along with the large write-variation introduced by the existing cache management techniques, may cause the endurance limit to be reached in a very short duration of time.This leads to hard errors in the cache [2] and limits its lifetime significantly.This vulnerability can also be exploited by any attacker by running a code which writes repeatedly to a single block to make it reach its endurance limit (called repeated address attack) for causing device failure.Thus, effective architectural techniques are required for managing NVM caches for making them a universal memory solution.
In this paper, we present ENLIVE, a technique for ENhancing the LIfetime of non-Volatile cachEs. ENLIVE uses a small (e.g., 128 entry) SRAM storage, called HotStore. To reduce the number of writes to the cache, ENLIVE migrates the frequently written blocks into the HotStore, so that future accesses can be served from the HotStore (Section 2). This improves the performance and energy efficiency and also reduces the number of writes to the NVM cache, which translates into enhanced cache lifetime. We also discuss the architectural mechanism to manage the HotStore and show that the storage overhead of ENLIVE is less than 2% of the L2 cache (Section 3). In this paper, we assume a ReRAM cache; however, based on the explanation, ENLIVE can be easily applied to any NVM cache. In the remainder of this paper, for the sake of convenience, we use the words ReRAM and NVM interchangeably.
Microarchitectural simulations have been performed using Sniper simulator and workloads from SPEC2006 suite and HPC field (Section 4).In addition, ENLIVE has been compared with two recently proposed techniques for improving lifetime of NVM caches, namely PoLF (probabilistic line-flush) [5] and LastingNVCache [13] (refer Section 5.1).The results have shown that, compared to other techniques, ENLIVE provides larger improvement in cache lifetime and performance, with a smaller energy loss (Section 5.2).For two, four and eight core systems, the average improvement in lifetime on using ENLIVE are 8.47×, 14.67× and 15.79×, respectively.By comparison, the average lifetime improvement with LastingNVCache (which, on average, performs better than PoLF) for 2, 4 and 8 core systems is 6.81×, 8.76× and 11.87×, respectively.Additional results show that ENLIVE works well for a wide-range of system and algorithm parameters (Section 5.3).
The rest of the paper is organized as follows.Section 2 discusses the methodology.Section 3 presents the salient features of ENLIVE, evaluates its implementation overhead and discusses the application of ENLIVE for mitigating security threats in NVM caches.Section 4 discusses the experimental platform, workloads and evaluation metrics.Section 5 presents the experimental results.Section 6 discusses related work on NVM cache and lifetime improvement techniques.Finally, Section 7 concludes this paper.

Methodology
Notations: In this paper, the LLC is assumed to be an L2 cache.Let S, W, D and G denote the number of L2 sets, associativity, block-size and tag-size, respectively.In this paper, we take D = 512 bits (64 bytes) and G = 40 bits.Let β denote the ratio of the number of entries in the HotStore and the number of L2 sets.Thus, the number of entries in HotStore is βS.The typical values of β are 1/32, 1/16, 1/8 etc.

Key Idea
It is well-known that, due to the temporal locality of cache access, a few blocks see far more cache writes than the remaining blocks. This leads to two issues in NVM cache management. First, NVM write latency and energy are much higher than read latency and energy, respectively (refer to Table 1), and, thus, these writes degrade the performance and energy efficiency of the NVM cache. Second, and more importantly, frequent writes to a few blocks lead to failure of these blocks much earlier than what is expected assuming uniformly distributed writes.
ENLIVE works on the key idea that the frequently written blocks can be stored in a small SRAM storage, called HotStore.Thus, future writes to those blocks can be served from the HotStore, instead of from the NVM cache.Due to the small size of SRAM storage, along with smaller access latency of SRAM itself, these blocks can be accessed with small latency, which improves the performance by making the common-case fast.In addition, since the write-endurance of SRAM is very high, the SRAM storage absorbs most of the writes, and, thus, the writes to NVM cache are significantly reduced.This leads to improvement in the lifetime of the NVM cache.The overall organization of cache hierarchy with addition of HotStore is illustrated in Figure 2.
The HotStore only stores the data of a block and does not require a separate tag-field, since the tag directory of L2 cache itself is used.The benefit of this is that on a cache access, the tag-matching is done only on the L2 tag directory, as in the normal cache.Thus, from the point of view of processor-core, all the tags and corresponding cache blocks are always stored in the L2 cache, although internally, they may be stored in either the main NVM storage or the HotStore.Hence, a fully-associative search in the HotStore is not required.This fact enables use of a relatively large number of entries in the HotStore.From any set, only a single block can be stored in the HotStore and hence, when required, this block can be directly accessed as shown in Figure 3.In addition, when a block, which is evicted, is in the HotStore, the newly arrived data-item is placed in the HotStore itself.

An Example of How HotStore Works
Figure 3 shows an example of how HotStore works to illustrate the basic idea.Initially, the HotStore is empty and a set with set-index 38 has data in all of its ways.On a write-command to block "B", its write-counter exceeds the threshold (2), and, hence, it is inserted into HotStore.Next, "read C" command is issued, and, hence, the element "C" is read from L2 cache.Next, "write B" command is issued, and since B is stored in HotStore, it is written in HotStore and not in L2 cache.Next, "read B" command is issued and the data are supplied from HotStore and not from L2 cache.Section 2.3 explains the implementation and working of HotStore in full detail.
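The command sequence of this example can be replayed with a small model. The following is only a minimal sketch, not the hardware design: the class `ToySet`, its fields and the starting write-count of two for block "B" are assumptions made for illustration, and the model covers a single set whose HotStore entry is initially free.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

LAMBDA = 2  # promotion threshold used in the example


@dataclass
class ToySet:
    """One L2 set plus the single HotStore slot it may use (hypothetical model)."""
    write_count: Dict[str, int] = field(default_factory=dict)  # block -> writes seen
    in_hotstore: Optional[str] = None                           # at most one block per set

    def access(self, op: str, block: str) -> str:
        """Return the structure that serves the access: 'HotStore' or 'L2'."""
        if op == "write":
            self.write_count[block] = self.write_count.get(block, 0) + 1
            # Promote once the counter exceeds the threshold and the slot is free.
            if self.in_hotstore is None and self.write_count[block] > LAMBDA:
                self.in_hotstore = block
        return "HotStore" if self.in_hotstore == block else "L2"


# Assumed starting state: block B has already been written twice.
set38 = ToySet(write_count={"B": 2})
trace = [("write", "B"), ("read", "C"), ("write", "B"), ("read", "B")]
served_by = [set38.access(op, blk) for op, blk in trace]
```

With the assumed starting state, the first write to "B" triggers the promotion, the read of "C" is served by the L2 cache, and the later write and read of "B" are served by the HotStore, matching the sequence described in the text.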

Implementation and Optimization
To ensure that the HotStore latency is small, its size needs to be small.Due to the limited and small size of the storage, effective cache management is required to ensure that only the most frequently written blocks reside in the HotStore and the cold blocks are soon evicted.Algorithm 1 shows the procedure for handling a cache write and managing the HotStore.If the block is already in the HotStore (Lines 1-3), a read or write is performed from the HotStore.Regardless of whether the block is in HotStore or L2 cache, on a write, the corresponding write-counter in L2 cache is incremented and on any cache read/write access, the information in LRU replacement policy of L2 cache is updated (not shown in Algorithm 1).The insertion into and eviction from HotStore are handled as follows.
Algorithm 1: Algorithm for handling a write to an arbitrary block "blockB" and managing the HotStore

We use a threshold λ. A cache block which sees more than λ writes is a candidate for promotion to the HotStore (Lines 4-42), depending on other factors. A block with no more than λ writes is considered a cold block and need not be inserted in the HotStore (Lines 43-45). Due to temporal locality, only a few MRU blocks in a set are expected to see most of the accesses [14]. Hence, we allow only one block from a set to be stored in the HotStore at a time, which also simplifies the management of the HotStore. If free space exists in the HotStore, a hot block can be inserted (Lines 6-10); otherwise, a replacement candidate is searched (Line 13) as discussed below.

Eviction from HotStore
If the number of writes to the replacement candidate is less than that of the incoming block (Line 14), it is assumed that the incoming block is more write-intensive, and, hence, should get priority over the replacement candidate. In such a case, the replacement candidate is copied back to the NVM cache (Lines 16-19), and the new block is stored in the HotStore (Lines 21-23). A similar situation also appears if a block from the same set as the incoming block is already in the HotStore (Lines 29-42). Since only one block from a set can reside in the HotStore at any time, the write-counter values of the incoming and existing blocks are compared and the one with the larger value is kept in the HotStore. If the incoming block is less write-intensive, it is not inserted into the HotStore (Lines 24-26 and Lines 39-41).
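The insertion and eviction rules described above can be sketched as a small software model. This is a hedged illustration rather than a transcription of Algorithm 1: the class `HotStoreModel` and its fields are hypothetical, blocks are identified by a (set index, way) pair, the replacement candidate is chosen by fewest writes, and the write that triggers a promotion is assumed to be served by the HotStore.

```python
class HotStoreModel:
    """Toy model: tracks which (set, way) lives in the HotStore and counts NVM writes."""

    def __init__(self, capacity, lam=2):
        self.capacity = capacity   # number of HotStore entries (beta * S)
        self.lam = lam             # promotion threshold (lambda)
        self.resident = {}         # set index -> way currently held in the HotStore
        self.writes = {}           # (set index, way) -> saturating write counter
        self.nvm_writes = 0        # writes that reach the NVM data array

    def _count(self, set_index):
        """Write counter of the block that a set currently keeps in the HotStore."""
        return self.writes[(set_index, self.resident[set_index])]

    def write(self, set_index, way):
        key = (set_index, way)
        self.writes[key] = min(self.writes.get(key, 0) + 1, 255)  # 8-bit counter
        if self.resident.get(set_index) == way:
            return "hotstore"                     # block is already in the HotStore
        if self.writes[key] > self.lam:           # hot block: try to promote it
            if set_index in self.resident:
                victim = set_index                # one block per set: compare in place
            elif len(self.resident) < self.capacity:
                self.resident[set_index] = way    # free entry available
                return "hotstore"
            else:
                victim = min(self.resident, key=self._count)  # fewest writes
            if self._count(victim) < self.writes[key]:
                del self.resident[victim]
                self.nvm_writes += 1              # victim data copied back to NVM
                self.resident[set_index] = way
                return "hotstore"
        self.nvm_writes += 1                      # cold or losing block: write NVM
        return "nvm"


model = HotStoreModel(capacity=1)
first = [model.write(0, 0) for _ in range(4)]    # block (0, 0) becomes hot
second = [model.write(1, 0) for _ in range(5)]   # block (1, 0) must out-write it
```

In this run the single HotStore entry is first taken by block (0, 0); block (1, 0) displaces it only once its write counter becomes larger than that of the resident block.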

Replacement Policy for HotStore
To select a suitable replacement candidate, HotStore also uses a replacement policy.An effective replacement policy will keep only the hottest blocks in the HotStore and thus minimize frequent evictions from HotStore.We have experimented with two replacement policies.
1. LNW (least number of writes), which selects the block with the least number of writes to it as the replacement candidate. It uses the same write-counters as the L2 cache. We assume a four-cycle extra latency for this replacement policy, although it is noteworthy that the selection of a replacement candidate can be done off the critical path.
2. NRU (not recently used) [15], which is an approximation of the LRU (least recently used) replacement policy and is commonly used in commercial microprocessors [15]. NRU requires only one bit of storage for each HotStore entry, compared to LRU, which requires log(βS) bits for each entry.
HotStore is intended to keep the most heavily written blocks and hence, LNW aligns more closely with the purpose of the HotStore, although it also has higher implementation overhead.For this reason, we use LNW in our main results and show the results with NRU in the section on parameter sensitivity study (Section 5.3).Our results show that in general, LNW replacement policy provides higher improvement in lifetime compared to the NRU replacement policy.
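The difference between the two policies can be seen in how each picks a victim. The sketch below is illustrative only: the function names are hypothetical, and the NRU routine shows one common variant (pick the first entry whose reference bit is clear, and clear all bits when none is), which may differ in detail from the variant in [15].

```python
def lnw_victim(write_counts):
    """LNW: index of the HotStore entry whose block has the fewest writes."""
    return min(range(len(write_counts)), key=write_counts.__getitem__)


def nru_victim(ref_bits):
    """NRU (one common variant): first entry with reference bit 0.

    If every bit is set, all bits are cleared in place and entry 0 is chosen.
    """
    for index, bit in enumerate(ref_bits):
        if bit == 0:
            return index
    for index in range(len(ref_bits)):
        ref_bits[index] = 0
    return 0


# Entry 2 was referenced long ago but holds by far the most heavily written block.
counts = [40, 12, 200, 15]
bits = [1, 1, 0, 1]
lnw_choice = lnw_victim(counts)   # evicts the entry with 12 writes
nru_choice = nru_victim(bits)     # evicts the heavily written entry
```

In this made-up state, NRU evicts the entry with 200 writes because it only sees recency, whereas LNW keeps it, which is the behaviour the HotStore is meant to favour.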

Salient Features and Applications of ENLIVE
ENLIVE has several salient features.It does not increase the miss-rate, and, thus, unlike previous NVM cache lifetime improvement techniques (e.g., [4,5,16]) it does not cause extra writes to main memory.In addition, it does not require offline profiling or compiler analysis (unlike [17]) or use of large prediction tables (unlike [18]).In what follows, we first show that the implementation overhead of ENLIVE is small (Section 3.1).We then show that ENLIVE can be very effective in mitigating security threats to NVM caches (see Section 3.2), such as those posed by repeated address attack.

Overhead Assessment
ENLIVE uses the following extra storage.
(E1) The HotStore has βS entries, each of which uses D bits for data storage, one bit for valid/invalid information and log S bits for recording which set a particular entry belongs to (note that for all log values, the base is two).
(E2) For each cache set, log W bits of storage are required to record the way number of the block from the set which is in the HotStore, and log(βS) bits of storage are required to record the index in the HotStore where this block is stored.
(E3) We assume that for each block, eight bits are used for the write-counter. In a single "generation", a block is unlikely to get more than $2^8$ writes. If, for a workload, a block gets more than this number of writes, then we allow the counter to saturate; thus, overflow does not happen.
(E4) NRU requires βS bits of additional storage.
Thus, the storage overhead (Θ) of ENLIVE as a percentage of the L2 cache is as follows (the four terms of the numerator correspond to E1 to E4, and the last term applies only when NRU is used):

$$\Theta = \frac{\beta S\,(D + 1 + \log S) + S\,(\log W + \log(\beta S)) + 8SW + \beta S}{SW\,(D + G)} \times 100$$

Assuming a 16-way, 4 MB L2 cache and β = 1/16, the relative overhead (Θ) of ENLIVE is nearly 1.96% of the L2 cache, which is very small. By using a smaller value of β (e.g., 1/32), a designer can further reduce this overhead, although this also reduces the lifetime improvement obtained (refer to Section 5.3.3).
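The quoted figure can be checked by adding up the storage items (E1) to (E4) and dividing by the size of the L2 cache, counted here as S × W blocks of D data bits plus G tag bits each. The function below is a sketch under that assumption; its name and parameter names are hypothetical.

```python
from math import log2


def enlive_overhead_percent(cache_bytes, ways, beta,
                            block_bits=512, tag_bits=40, counter_bits=8):
    """Storage overhead of ENLIVE as a percentage of the L2 cache (data + tags)."""
    sets = cache_bytes * 8 // (block_bits * ways)         # S
    entries = int(beta * sets)                            # beta * S HotStore entries
    e1 = entries * (block_bits + 1 + log2(sets))          # data + valid bit + set id
    e2 = sets * (log2(ways) + log2(entries))              # per-set way and index
    e3 = counter_bits * sets * ways                       # per-block write counters
    e4 = entries                                          # one NRU bit per entry
    l2_bits = sets * ways * (block_bits + tag_bits)
    return 100 * (e1 + e2 + e3 + e4) / l2_bits


theta_default = enlive_overhead_percent(4 * 2**20, ways=16, beta=1 / 16)
theta_small = enlive_overhead_percent(4 * 2**20, ways=16, beta=1 / 32)
```

For the 16-way, 4 MB cache with β = 1/16 this evaluates to about 1.96%, and halving β lowers the overhead further, because the per-block write counters (E3) dominate the total and do not depend on β.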

Application of ENLIVE for Mitigating Security Threat
Due to their limited write-endurance, NVMs can be easily attacked and worn-out using repeated address attack (RAA).Using a simple RAA, a malicious attacker can write a cache block repeatedly which may lead to write endurance failure.Further, it is also conceivable that a greedy-user may run attack-like codes a few months before the warranty of his/her computing system expires, so as to get a new system before the warranty expiration period.An endurance-unaware cache can be an easy target of such attacks and hence, conventional cache management policies leave a serious security vulnerability for NVM caches with limited write endurance.To show the extent of vulnerability, we present the following example.
Assume that L1 is designed with SRAM and its associativity is $W_{L1}$. Assume an attack-like write-sequence which circularly writes to $W_{L1} + 1$ data blocks which are mapped to the same L1 set. Since this set can only hold $W_{L1}$ blocks, every L1 write will lead to writebacks in the same L2 cache set. Assume that due to this, after every 200 cycles, the same L2 block is written again. In this case, the time required to fail this block is:

$$T_{fail} = \frac{\text{write endurance} \times 200\ \text{cycles}}{\text{frequency}}$$

For a write endurance of $10^{11}$ and 2 GHz frequency, we obtain the time to fail as $(10^{11} \times 200)/(2 \times 10^{9}) = 10{,}000$ s, or only about 2.8 h. Clearly, in the absence of any policy for protection from attacks, the L2 cache can be made to fail in less than 3 h. In addition, due to process variation, the write endurance of the weakest block may be smaller than $10^{11}$.
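The arithmetic of this example is simple enough to script, which also makes it easy to try other endurance values. The helper name below is hypothetical; the inputs are the ones stated in the text (one write to the same L2 block every 200 cycles, a 2 GHz clock and an endurance of 10^11 writes).

```python
def time_to_fail_seconds(endurance_writes, cycles_between_writes, freq_hz):
    """Time until one block reaches its endurance limit under a repeated address attack."""
    return endurance_writes * cycles_between_writes / freq_hz


seconds = time_to_fail_seconds(1e11, 200, 2e9)
hours = seconds / 3600

# An order-of-magnitude loss from process variation shortens the time accordingly.
seconds_weak_block = time_to_fail_seconds(1e10, 200, 2e9)
```

The second call only illustrates the remark about process variation: with an endurance of 10^10 writes, the same attack would need about 1000 s.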
ENLIVE can be useful in mitigating such security threats.ENLIVE keeps the most heavily written blocks in an SRAM storage, which has very high write-endurance and thus, the writes targeted for an NVM block are channeled to the SRAM storage.Furthermore, since ENLIVE dynamically changes the block which is actually stored in HotStore, alternate or multiple write-attacks can also be tolerated since other blocks which see large number of writes will be stored in the HotStore.
Further, the threshold λ can be randomly varied within a predetermined range, which makes it difficult for attackers to predict which block will be stored in the HotStore. For this, note that a change in the value of λ does not affect program correctness but only changes the number of writes absorbed by the HotStore. Thus, assume that λ can be varied within the range of two to four. Initially, its value is two and thus, a block receiving more than two writes is promoted to the HotStore. Now, on changing λ to three, a block which sees more than three writes is promoted to the HotStore. Thus, some of the blocks which were previously promoted to the HotStore may not get promoted with this changed value of λ. However, the attacker is unaware of this fact, since only the cache controller is aware of the value of λ and can change it. Thus, even if the attacker knows that a HotStore is being used, he/she does not know which blocks are stored in the HotStore and hence, he/she cannot design an attack which only targets the blocks not stored in the HotStore. Finally, as we show in the results section, use of the HotStore can extend the lifetime of the cache by an order of magnitude and, in this amount of time, any abnormal attack can be easily identified.
Note that although intra-set wear-leveling techniques (e.g., [5,13,19-21]) can also be used for mitigating repeated address attacks on NVM, ENLIVE offers a distinct advantage over them. An intra-set wear-leveling technique only distributes the writes uniformly within a set and does not reduce the total number of writes to the set. Thus, although it can prolong the time taken for the first block in a set to fail, it cannot prolong the time required for all the blocks to fail. Hence, an attacker can continue to target different blocks in a set and finally, all the blocks in the set will fail in the same time as in the conventional cache, even if an intra-set wear-leveling technique is used. By comparison, ENLIVE performs write-minimization, since the SRAM HotStore absorbs a large fraction of the writes, and thus it can increase both the time required to make a single block fail and the time required to make all the blocks fail. Clearly, ENLIVE is more effective in mitigating security threats to NVM caches.

Simulation Platform
We perform simulations using Sniper x86-64 multi-core simulator [22].The processor frequency is 2 GHz.L1 I/D caches are four-way 32 KB caches with two-cycle latency and are private to each core.They are assumed to be designed using SRAM for performance reasons.L2 cache is shared among cores, and its parameters are obtained using DESTINY tool [23,24] and are shown in Table 1.Here, we have assumed 32 nm process, write energy-delay product (EDP) optimized cache design, 16-way associativity and sequential cache access.All caches use LRU, write-back, write-allocate policy and L2 cache is inclusive of L1 caches.The parameters for SRAM HotStore are also obtained using DESTINY and they are shown in Table 2.Note that the parameters for L2 and HotStore are obtained using "cache" and "RAM" as the design target in the DESTINY, respectively.Main memory latency is 220 cycles.Memory bandwidth for two, four and eight-core systems is 15, 25 and 35 GB/s, respectively.

Workloads
We use all 29 benchmarks from SPEC (Standard Performance Evaluation Corporation) CPU2006 suite with ref inputs and six benchmarks from HPC field (shown in italics in Table 3).Using these, we randomly create 18, nine and five multiprogrammed workloads for dual, quad and eight-core systems, such that each benchmark is used exactly once (except for completing the left-over group).These workloads are shown in Table 3.

Evaluation Metrics
Our baseline is an LRU-managed ReRAM L2 cache which does not use any technique for write-minimization. We model the energy of L2, main memory and HotStore. We ignore the energy overhead of counters, since it is several orders of magnitude smaller than the energy of (L2 + main memory + HotStore). The parameters for both the L2 cache and the HotStore are shown above, and the leakage power and dynamic energy of main memory are taken as 0.18 W and 70 nJ/access, respectively [14]. We show the results on the following metrics:
1. Relative cache lifetime, where the lifetime is defined as the inverse of the maximum number of writes on any cache block [4,13]
2. Weighted speedup [14] (called relative performance)
3. Percentage energy loss
4. Absolute increase in MPKI (miss-per-kilo-instructions)
To provide additional insights for ENLIVE, we also show the results on percentage decrease in WPKI (writes per kilo instructions) to the NVM cache and the number of writes that are served from the HotStore (termed nWriteToHotStore). The higher these values, the higher is the efficacy of the HotStore in avoiding writes to the NVM.
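Since lifetime is defined as the inverse of the maximum per-block write count, the relative lifetime reduces to a ratio of two maxima. The helpers below sketch this and the WPKI-based metric; the function names are hypothetical and the numbers passed to them are made-up inputs, not measured results.

```python
from statistics import geometric_mean


def relative_lifetime(max_writes_baseline, max_writes_technique):
    """Lifetime is 1 / (max writes on any block), so the ratio inverts."""
    return max_writes_baseline / max_writes_technique


def wpki(nvm_writes, instructions):
    """Writes to the NVM cache per kilo instructions."""
    return nvm_writes / (instructions / 1000)


def percent_decrease(baseline, technique):
    """Percentage decrease of a metric relative to the baseline."""
    return 100 * (baseline - technique) / baseline


# Illustrative inputs only.
lifetime_gain = relative_lifetime(8000, 1000)
wpki_base = wpki(500, 1_000_000)
wpki_drop = percent_decrease(20, 17)
# Relative lifetimes across workloads are averaged with the geometric mean.
mean_gain = geometric_mean([2.0, 8.0])
```

The geometric mean is used for ratio metrics such as relative lifetime, while metrics that can be zero or negative are averaged arithmetically.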
We fast-forward the benchmarks for 10 B instructions and simulate each workload until the slowest application executes nInst instructions, where nInst is 200 M for dual-core workloads and 150 M instructions for quad and eight-core workloads.This helps in keeping the simulation turnaround time manageable since we simulate a large number of workloads, system configurations (dual, quad and eight-core system), algorithm parameters and three techniques (viz.ENLIVE, PoLF and LastingNVCache) along with the baseline (refer Section 5.1 for a background on PoLF and LastingNVCache).The early-completing benchmarks are allowed to run but their IPC (instruction per cycle) is recorded only for the first nInst instructions, following well-established simulation methodology [14].Remaining metrics are computed for the entire execution, since they are system-wide metrics (while IPC is a per-core metric).
Note that since the last completing benchmark may take significantly longer time than the first one, the total number of instructions executed by the workloads is much more than nInst times the number of cores.Relative lifetime and weighted speedup values are averaged using geometric mean while the remaining metrics are averaged using arithmetic mean, since their values can be zero or negative.We have also computed fair speedup [14] and observed that their values are nearly same as weighted speedup and thus, ENLIVE does not cause unfairness.These results are omitted for brevity.

Results and Analysis
This section presents results on experimental evaluation of ENLIVE.We have compared ENLIVE with two recently proposed techniques for improving NVM cache lifetime named PoLF (Probabilistic line flush) [5] and LastingNVCache [13].We first present a background on these techniques.

Comparison with Other Techniques
In PoLF, after a fixed number of write hits (called flush threshold or FT) in the entire cache, a write-operation is skipped; instead, the data item is directly written-back to memory and the cache-block is invalidated, without updating the LRU-age information.In LastingNVCache, such a flushing operation is performed after a fixed number of write-hits to that particular block itself.While PoLF selects hot-data in probabilistic manner, LastingNVCache does not work by probabilistic manner, rather, it actually records the writes to each block.For both of them, lifetime improvement is achieved by the fact that based on LRU replacement policy, the hot data from flushed block will be loaded in another cold block leading to intra-set wear-leveling.The latency values of incrementing the write-counter and comparison with the threshold are taken as one and two cycles, respectively.
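The difference between the two flush triggers can be sketched with two small counters. This is a simplified illustration of the description above, not the implementation of either technique: the class names are hypothetical, only the flush decision on a write hit is modeled, and the per-block counter is assumed to restart after a flush.

```python
class PoLFModel:
    """One global counter: every FT-th write hit anywhere in the cache is flushed."""

    def __init__(self, flush_threshold):
        self.ft = flush_threshold
        self.hits = 0

    def on_write_hit(self, block):
        self.hits += 1
        if self.hits == self.ft:
            self.hits = 0
            return True        # skip the write, write back and invalidate the block
        return False


class LastingNVCacheModel:
    """Per-block counters: a block is flushed after FT write hits to that block."""

    def __init__(self, flush_threshold):
        self.ft = flush_threshold
        self.hits = {}

    def on_write_hit(self, block):
        count = self.hits.get(block, 0) + 1
        if count == self.ft:
            self.hits[block] = 0   # assumed: counting restarts after a flush
            return True
        self.hits[block] = count
        return False


polf = PoLFModel(flush_threshold=4)
polf_flushes = [polf.on_write_hit(b) for b in "ABAC"]
lasting = LastingNVCacheModel(flush_threshold=3)
lasting_flushes = [lasting.on_write_hit(b) for b in "ABACA"]
```

In the first trace the global counter happens to expire on block "C", which was written only once, illustrating why the probabilistic selection may miss the hot block; the per-block counters flush "A" only after its third write hit.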
Choice of flush threshold: Both PoLF and LastingNVCache work by data-invalidation, while ENLIVE works by data-movement between the cache and the HotStore. Thus, unlike the former two techniques, ENLIVE does not increase the accesses to main memory or cause large energy losses. Since large energy losses degrade system performance and reliability, and in order to achieve a fair and meaningful comparison, we choose the flush threshold values for PoLF and LastingNVCache in the following manner. We test with even values of FT and find, separately for dual, quad and eight-core systems, the FT which achieves the highest possible lifetime improvement while keeping the average energy loss across workloads less than 10%. Note that for every watt of power dissipated in a computing system, an additional 0.5 to 1 watt of power is consumed by the cooling system. In addition, very aggressive threshold values may significantly increase the writes to main memory, which itself may be designed using NVM, thus creating endurance and contention issues in main memory. For these reasons, larger energy loss may be unacceptable. The exact values of FT which are obtained are shown in Table 4. As we show in the results section, with smaller energy loss (less than 3% on average), ENLIVE provides larger lifetime improvement than both PoLF and LastingNVCache.

Results with Default Parameters
Figures 4-6 show the results, which are obtained using the following parameters, LNW replacement policy for HotStore, λ = 2, β = 1/16, 16-way associativity for L2, 4 MB L2 for two core system, 8 MB L2 for four core system and 16 MB L2 for eight core-system, respectively.This range of LLC sizes and number of cores to LLC size ratio are common in commercial processors [25,26].Per-workload figures for increase in MPKI and nWriteToHotStore are omitted for brevity; their average values are discussed below.We now analyze the results.

Results on Lifetime Improvement
For two, four and eight core systems, ENLIVE improves the cache lifetime by 8.47×, 14.67× and 15.79×, respectively.For several workloads, ENLIVE improves lifetime by more than 30×, e.g., SpPo, LqCoMcLs, OmXbXaGrMkMcSjHm, etc.The improvement in lifetime achieved with a workload also depends on the write-variation present in the original workload.If the write-variation is high, most writes happen to only a few blocks and when those blocks are stored in the HotStore, the number of writes to the NVM cache are significantly reduced.Conversely, for workloads with low write-variation, large reduction in writes to cache cannot be achieved by merely storing few blocks in HotStore since its size is fixed and small, and, hence, the improvement obtained in cache lifetime is small.The lifetime improvement for two, four and eight-core system with PoLF is 5.35×, 7.02× and 9.72×, respectively and for LastingNVCache, these values are 6.81×, 8.76× and 11.87×, respectively.
Clearly, LastingNVCache provides higher lifetime improvement than PoLF, although it is still less than that provided by ENLIVE.This can be easily understood from the fact that ENLIVE actually reduces the writes to NVM, while other techniques only uniformly distribute the writes to different cache blocks.

Results on MPKI, Performance and Energy
ENLIVE uses in-cache data-movement, and, hence, it does not increase MPKI. The average values of the increase in MPKI on using PoLF and LastingNVCache are shown in Table 5. PoLF and LastingNVCache cause a small performance loss, while ENLIVE provides a small improvement in performance. For 2 and 4-core systems, ENLIVE provides a small saving in energy, and for the 8-core system, it incurs 2% energy loss. By comparison, PoLF and LastingNVCache cause large energy loss (recall from Section 5.1 that an energy loss bound of 10% was used for them). Notice that our energy model includes the energy consumption of main memory, and, thus, while trying to improve cache lifetime, ENLIVE does not increase the energy consumption of main memory. Since most accesses are supplied from the HotStore, which is smaller and faster than the NVM L2 cache, ENLIVE slightly improves performance. However, since the HotStore is designed using SRAM, which has higher leakage energy, and since migrations to and from the HotStore also consume energy, the energy advantage of the HotStore is partly offset. Overall, it can be concluded that, unlike the other techniques, ENLIVE does not harm performance or energy efficiency.

Results on WPKI and nWriteToHotStore
PoLF and LastingNVCache only perform wear-leveling and hence, they do not reduce the WPKI.For ENLIVE, average values of percentage reduction in WPKI and nWriteToHotStore are shown in Table 6.Note that the HotStore stores only βS blocks, while the L2 cache stores WS, thus HotStore stores only (β/W) × 100 percentage of L2 blocks.For β = 1/16 and W =16, this equals 0.39% of L2 blocks.Thus, with less than 0.4% of extra data storage, ENLIVE can reduce the WPKI of cache by more than 10%.Figure 7 shows the WPKI value (i.e., write intensity) for baseline execution.From these values, the correlation between lifetime improvement achieved for an application and its write intensity can be seen.For example, HmWr and CaTo workloads have low WPKI (1.74 and 1.95, respectively) and hence, the lifetime improvement achieved for them with ENLIVE is also small, e.g., 2.69× and 2.60×, respectively.On the other hand, the WPKI of LqCoMcLs and BwSoSjH2 is 17.72 and 11.49, and due to this large write-intensity, the lifetime improvement achieved for them with ENLIVE is also large, e.g., 37.85× and 17.50×, respectively.
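The share of L2 blocks that the HotStore can hold follows directly from β and W, since βS entries are compared against W × S blocks. A one-line check, with a hypothetical function name:

```python
def hotstore_block_share_percent(beta, ways):
    """HotStore entries (beta * S) as a percentage of L2 blocks (W * S)."""
    return 100 * beta / ways


share_default = hotstore_block_share_percent(1 / 16, 16)
share_large = hotstore_block_share_percent(1 / 8, 16)
```

For β = 1/16 and W = 16 this gives 0.390625, i.e., the 0.39% quoted above; doubling β to 1/8 doubles the share.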

Comments on Pros and Cons of Each Technique
PoLF flushes a block in probabilistic manner and hence, it may not always select a hot block.For this reason, PoLF provides smaller lifetime improvement and also incurs higher performance and energy loss than other two techniques.In addition, for a workload with high write-intensity but low write-variation, it may lead to unnecessary flushing of data without achieving corresponding improvement in lifetime.By comparison, both LastingNVCache and ENLIVE use counters for measuring write-intensity and thus can detect the hot-block more accurately.
Both PoLF and LastingNVCache use data-invalidation and one of their common limitations is that in an effort to reduce write-variation in cache, they may increase the writes to main memory.Since the main memory itself may be designed using NVM, these techniques may cause endurance issues in the main memory.By contrast, ENLIVE uses in-cache data-movement, and, hence, it does not degrade performance or energy efficiency or cause endurance or contention issue in the main memory.
The advantage of PoLF is that it only uses a single global counter for recording the number of writes, and hence, it incurs the smallest implementation overhead. Both LastingNVCache and ENLIVE use per-block counters. Further, the advantage of PoLF and LastingNVCache over ENLIVE is that they do not use any extra storage and thus do not incur the area overhead of the SRAM HotStore. However, as shown above, the overhead of ENLIVE is still very small and it performs better than the other techniques on all metrics, and hence, its small overhead is easily justified.

Parameter Sensitivity Results
We now focus exclusively on ENLIVE and study its sensitivity to different parameters. Each time, only one parameter is changed from the default parameters, and the results are summarized in Table 7.

HotStore Replacement Policy
Compared to the LNW replacement policy, the NRU replacement policy gives smaller improvement in lifetime, which is expected since NRU accounts for only the recency of access and not the magnitude of writes.
For a small value of λ, ENLIVE aggressively promotes blocks to the HotStore, which works to reduce the number of writes to NVM. However, it also has the disadvantage of creating contention for the HotStore, leading to frequent eviction of blocks from the HotStore, which are written back to the L2 cache. Thus, the actual improvement in lifetime obtained depends on the trade-off between these two factors. Due to the interplay of these two factors, the lifetime improvement does not change monotonically with λ. Still, a value of λ = 2, 3 or 4 is found to be reasonable.
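The trade-off between promotion threshold and HotStore contention can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the microarchitecture evaluated in the paper: the class `HotStoreSketch` and its fields are hypothetical, the HotStore is modeled as a tiny fully associative store, LNW is read as "evict the entry with the least number of writes", and a block is promoted once its L2 write count reaches λ.

```python
class HotStoreSketch:
    """Toy model of threshold-based promotion with LNW-style eviction."""

    def __init__(self, capacity: int, lam: int):
        self.capacity = capacity  # number of HotStore entries (assumed)
        self.lam = lam            # promotion threshold (lambda)
        self.l2_count = {}        # block -> writes seen while in NVM L2
        self.hot = {}             # block -> write count of HotStore entry
        self.nvm_writes = 0       # writes that reach the NVM cache

    def write(self, block: str) -> None:
        if block in self.hot:
            # Write is absorbed by the SRAM HotStore.
            self.hot[block] += 1
            return
        # Write goes to the NVM L2 cache.
        self.nvm_writes += 1
        count = self.l2_count.get(block, 0) + 1
        self.l2_count[block] = count
        if count >= self.lam:
            self._promote(block)

    def _promote(self, block: str) -> None:
        if len(self.hot) >= self.capacity:
            # Assumed LNW policy: evict the entry with the fewest writes.
            victim = min(self.hot, key=self.hot.get)
            del self.hot[victim]
            self.nvm_writes += 1  # evicted block is written back to L2
        # Carry the write count over so new entries are not evicted at once.
        self.hot[block] = self.l2_count.pop(block)


hs = HotStoreSketch(capacity=1, lam=2)
for b in ["A"] * 10 + ["B"]:
    hs.write(b)
print(hs.nvm_writes, "NVM writes for 11 write requests")
```

In this toy model, a single hot block is absorbed after λ writes. With a one-entry HotStore and λ = 1, alternating writes to two blocks cause repeated evictions and write-backs, so the NVM can see more writes than it would without a HotStore, which is the contention effect described above.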

Number of HotStore Entries (Corresponding to β)
On increasing β to 1/8, the number of entries in the HotStore is increased and, with this, the improvement in lifetime also increases, since a higher number of hot blocks can be accommodated in the HotStore. The exact amount of improvement depends on the write-variation present in the workloads. A larger HotStore also dissipates higher leakage energy, which is reflected in increased energy loss.

L2 Cache Associativity
For a fixed cache size and fixed β value, on reducing the cache associativity by half, the number of L2 sets and the number of entries in the HotStore are both increased by a factor of two. With a 16-way cache, only one out of 16 ways can be inserted in the HotStore, while with an eight-way cache, one out of eight ways can be inserted, and hence, the contention for the HotStore is reduced. This increases the efficacy of the HotStore in capturing hot blocks, which is reflected in increased improvement in lifetime. The opposite is seen with a 32-way cache, since now only one out of 32 blocks can be inserted in the HotStore at any time, which reduces the improvement in lifetime.
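The scaling of sets and HotStore entries with associativity can be worked through numerically. The sketch below assumes a 4 MB cache with 64-byte blocks purely for illustration (these sizes and the helper names are assumptions, not the configuration stated in this section); β = 1/16 is taken from the text.

```python
def l2_sets(cache_bytes: int, ways: int, block_bytes: int = 64) -> int:
    """Number of sets S in a set-associative cache."""
    return cache_bytes // (ways * block_bytes)


def hotstore_entries(cache_bytes: int, ways: int, beta_inv: int = 16,
                     block_bytes: int = 64) -> int:
    """Number of HotStore entries, beta * S, with beta = 1 / beta_inv."""
    return l2_sets(cache_bytes, ways, block_bytes) // beta_inv


CACHE_BYTES = 4 * 1024 * 1024  # assumed 4 MB L2, for illustration only

for ways in (8, 16, 32):
    sets = l2_sets(CACHE_BYTES, ways)
    entries = hotstore_entries(CACHE_BYTES, ways)
    share = 100 * entries / (sets * ways)  # percent of L2 blocks covered
    print(f"{ways:2d}-way: S = {sets}, HotStore entries = {entries}, "
          f"coverage = {share:.2f}% of L2 blocks")
```

Halving the associativity doubles both S and the number of HotStore entries, so a larger share of the L2 blocks can reside in the HotStore at a time.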

L2 Cache Size
With increasing cache size, the cache hit-rate increases since applications have a fixed working set size. This also increases the write-variation since only a few blocks see repeated access. Hence, with increasing cache size, the effectiveness of ENLIVE in capturing hot blocks also increases, which is reflected in higher improvement in cache lifetime.
The results shown in this section confirm that ENLIVE works well for a wide range of parameters. In addition, ENLIVE provides tunable knobs for trading off acceptable implementation overhead against the desired improvement in lifetime.

Related Work
In this section, we discuss relevant research work.

SRAM Limitations and Emerging Technologies
As the number of cores on a chip increases and key applications become even more data intensive [27], the requirement of cache and memory capacity is also on the rise. In addition, since the power budget is limited, power consumption is now becoming the primary constraint in designing computer systems. Due to its low density and high leakage power, SRAM caches consume a significant fraction of the chip area and power budget [28]. These elevated levels of power consumption may increase the chip-design complexity and exhaust the maximum capability of conventional cooling technologies. Although it is possible to use architectural mechanisms to reduce the power consumption of SRAM caches [14,28], the savings provided by these techniques are not sufficient to meet the power budget targets of future computing systems.
To mitigate these challenges, researchers have explored several alternative low-leakage technologies, such as eDRAM [10,29], die-stacked DRAM [30] and NVM [5,21]. While eDRAM has high write-endurance, its main limitation is the requirement of refresh operations to maintain data integrity. Further, its retention period is in the range of tens of microseconds [29] and hence, a large fraction of the energy spent in eDRAM caches is in the form of refresh energy. Similarly, die-stacked DRAM caches also require refresh operations and may create thermal and reliability issues due to higher operating temperatures in die-stacked chips. As for NVMs, although their write latency and energy are higher than those of SRAM, it has been shown that their near-zero leakage and high capacity can generally compensate for the overhead of writes [29]. Thus, addressing the limited write-endurance of NVMs remains a crucial research challenge for enabling their widespread adoption. For this reason, we have proposed a technique for improving the lifetime of NVMs.

Techniques for Improving Lifetime of NVM Caches
Researchers have proposed several lifetime enhancement techniques for NVMs, which can be classified as write-minimization or wear-leveling. The write-minimization techniques aim to reduce the number of writes to NVMs [31,32], while the wear-leveling techniques aim to uniformly distribute the writes to different blocks in the NVM cache [4,5,13,16,19,20]. Based on their granularity, wear-leveling techniques can be further classified as inter-set level [4,5,16] and intra-set level [5,13,19,20]. ENLIVE reduces the writes to hot blocks and thus, in addition to write-minimization, it also implicitly performs wear-leveling.
Some techniques perform a read-before-write operation to identify the changed bits or use additional flags to record the changed data words [33]. Using this, redundant writes can be avoided, since only those bits or words whose values have changed need to be written to the NVM cache. However, read-before-write schemes are effective mainly for PCM, since its write latency/energy are significantly higher (e.g., six to ten times) than its read latency/energy. By comparison, for ReRAM and STT-RAM, the write latency/energy are only two to four times the read latency/energy and hence, the performance overhead of an extra read operation before the write operation becomes high. For this reason, read-before-write schemes are less suitable for caches, which are expected to be designed using STT-RAM and ReRAM, and not PCM.
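The read-before-write idea and its relative cost can be sketched as follows. This is an illustrative model only: the helper names are hypothetical, and the overhead estimate simply expresses the extra read as a fraction of the write cost, using the write-to-read ratios quoted in the text.

```python
def changed_bits(old: bytes, new: bytes) -> int:
    """Count bit positions that differ between the stored and incoming data.

    Only these bits would need to be written after a read-before-write.
    """
    if len(old) != len(new):
        raise ValueError("old and new data must have the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(old, new))


def extra_read_overhead(write_to_read_ratio: float) -> float:
    """Cost of the extra read, as a fraction of the cost of one write."""
    return 1.0 / write_to_read_ratio


old = bytes([0b10101010, 0x00])
new = bytes([0b10101011, 0x00])
print(changed_bits(old, new), "of", 8 * len(old), "bits need writing")

# Ratios quoted in the text: 6-10x for PCM, 2-4x for ReRAM and STT-RAM.
for ratio in (10, 6, 4, 2):
    print(f"write/read = {ratio}: extra read adds "
          f"{extra_read_overhead(ratio):.0%} of a write")
```

Under this simple model, the extra read adds roughly 10% to 17% to a write when the ratio is six to ten, but 25% to 50% when the ratio is two to four, which is consistent with the argument above.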
Similarly, some techniques use extra flag bits to denote certain data-patterns (e.g., all-zero); when the incoming data has such a pattern, the write can be avoided by setting the flag to ON and later reconstructing the data using the flag value [34]. ENLIVE performs write-minimization at the cache-access level and can be synergistically integrated with the above-mentioned bit-level write-minimization techniques to further reduce the writes to the NVM cache.
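The flag-based scheme for the all-zero pattern can be sketched as below. The class `FlaggedLine` and its fields are hypothetical names for illustration; the actual encoding used in [34] may differ.

```python
class FlaggedLine:
    """Toy cache line with a one-bit flag marking all-zero content."""

    def __init__(self, size: int = 64):
        self.data = bytes(size)   # contents of the NVM data array
        self.zero_flag = True     # ON means the line is logically all-zero
        self.array_writes = 0     # writes that reach the NVM data array

    def write(self, new: bytes) -> None:
        if len(new) != len(self.data):
            raise ValueError("write must cover the whole line")
        if not any(new):
            # All-zero data: set the flag and skip the data-array write.
            self.zero_flag = True
            return
        self.data = new
        self.zero_flag = False
        self.array_writes += 1

    def read(self) -> bytes:
        # Reconstruct the data from the flag when it is ON.
        return bytes(len(self.data)) if self.zero_flag else self.data


line = FlaggedLine(size=4)
line.write(b"\x01\x02\x03\x04")   # reaches the data array
line.write(bytes(4))              # absorbed by the flag
print(line.read(), line.array_writes)
```

In this sketch the stale non-zero data stays in the array after the all-zero write, and the flag alone determines what a read returns.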

Comparison with NVM-Only Cache Designs
Several techniques have been proposed for improving the lifetime of NVM-only caches, as shown above. However, since the write endurance of NVMs is orders of magnitude smaller than that of SRAMs, in the case of high write traffic, NVM-only caches may show small lifetime, since even after wear-leveling and write-minimization, the worst-case number of writes on a block may exceed the NVM write endurance. Further, as we have shown in Section 3.2, an NVM-only cache is much more vulnerable to a write-attack than an NVM cache with an SRAM HotStore (as used by the ENLIVE technique).
In Section 5.1, we have qualitatively and quantitatively compared ENLIVE with two recently proposed techniques for NVM caches. In what follows, we qualitatively compare ENLIVE with two other techniques which use additional SRAM storage.
Sun et al. [31] use write-buffers for hiding the long latencies of STT-RAM banks. A write-buffer temporarily buffers incoming writes to the L2 cache. It is well-known that, due to filtering by the first-level cache, the locality of cache access is significantly reduced at the last level cache [14] and hence, the write-buffer cannot fully capture the hot blocks and may be polluted by blocks which are written only once. Thus, its effectiveness in reducing the number of writes to the NVM cache is limited. By comparison, in the ENLIVE technique, only the hot blocks are moved to the HotStore and they are later served from there, which reduces the number of writes to the NVM cache. In addition, ENLIVE aims at improving the lifetime of NVM caches and not the performance, which is the focus of the work by Sun et al. [31].
Ahn et al. [32] observe that, in many applications, computed values vary within small ranges and hence, the upper bits of data are not changed as frequently as the lower bits. Based on this, they propose the use of a small lower-bit cache (LBC) designed with SRAM, which is placed between the L1 cache and the STT-RAM L2 cache. The LBC aims to coalesce write-requests and also hides frequent value changes from the L2 cache. The LBC stores the lower half of every word in cache blocks written by the L1 data cache. On a hit in the LBC, the data to be provided to the L1 cache are a combination of the upper bits from the L2 cache and the lower bits from the LBC, since the lower bits in L2 may not be up-to-date. However, in cases when the upper bits change frequently, this technique will incur large overhead. In addition, for the reasons mentioned above, an LBC may not capture the hot blocks.
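The way an LBC hit combines the two halves of a word can be illustrated with simple bit masking. This sketch assumes 32-bit words split into 16-bit halves; the word width and function names are assumptions for illustration and are not taken from [32].

```python
WORD_BITS = 32                      # assumed word width
HALF_BITS = WORD_BITS // 2
LOW_MASK = (1 << HALF_BITS) - 1     # selects the lower half of a word
HIGH_MASK = LOW_MASK << HALF_BITS   # selects the upper half of a word


def combine_on_lbc_hit(l2_word: int, lbc_low: int) -> int:
    """Upper bits come from L2, lower bits come from the LBC."""
    return (l2_word & HIGH_MASK) | (lbc_low & LOW_MASK)


def upper_bits_changed(old_word: int, new_word: int) -> bool:
    """True if a write changes the upper half and thus cannot stay in the LBC."""
    return (old_word & HIGH_MASK) != (new_word & HIGH_MASK)


# The lower half held in L2 is stale; the LBC holds the current lower half.
word = combine_on_lbc_hit(0x12340000, 0xBEEF)
print(hex(word))
print(upper_bits_changed(0x12340000, 0x1234BEEF))  # lower-bit change only
print(upper_bits_changed(0x12340000, 0x12350000))  # upper bits changed
```

When the upper half changes frequently, more writes must also update the L2 copy, which corresponds to the overhead noted above.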

Comparison with SRAM-NVM Hybrid Cache Designs
Some researchers propose way-based SRAM-NVM hybrid caches [31,35,36], where a few ways are designed using SRAM and the remaining ways are designed using NVM. The management policies for these caches aim to keep the write-intensive blocks in SRAM to improve the lifetime of the cache by reducing the number of writes to NVM. The limitation of these hybrid-cache designs, however, is that designing a few ways using SRAM and the others using NVM in the same cache may increase the design complexity and cost. By comparison, in ENLIVE, the SRAM HotStore can be designed separately from the cache.
In addition, in a hybrid cache with (say) one SRAM way, the number of SRAM blocks becomes equal to the number of LLC sets. This may lead to wastage of SRAM space, since not all the LLC sets are expected to store hot data. In contrast, the number of HotStore entries is much smaller than the number of LLC sets (e.g., 1/16 times the number of LLC sets). ENLIVE allows adapting the size of the HotStore based on the write-variation present in the target workloads or the desired improvement in lifetime. The limitation of ENLIVE, however, is that the data in the SRAM HotStore are also stored in the NVM LLC; by comparison, hybrid caches in general do not incur such redundancy. However, note that due to the small size of the HotStore, the redundancy is also small.

Techniques for Improving Lifetime of NVM Main Memory
Some researchers propose hybrid DRAM-PCM main memory designs which aim to leverage the high performance and write-endurance of DRAM along with the high density and low leakage power of PCM [37]. However, several key differences in operational mechanisms between caches and main memory make the solutions proposed for main memory inadequate for caches. In addition to inter-set write-variation, caches also show intra-set write-variation, which presents both opportunities and challenges. In addition, the techniques proposed for main memory typically have high latency overhead, which is acceptable at the main memory level but is unacceptable at the level of on-chip caches.

Conclusions
Under ongoing core-scaling, NVMs offer a practical means of scaling cache capacity in an energy-efficient manner. NVMs are potential candidates for replacing SRAM in the design of last level caches due to their high density and low leakage power consumption. However, their limited write endurance and high write latency and energy present a crucial bottleneck to their widespread use. In this paper, we presented ENLIVE, a technique for improving the lifetime of NVM caches by minimizing the number of writes. Microarchitectural simulations show that ENLIVE outperforms two recently proposed techniques and works well for different system configurations.
Our future work will focus on synergistically integrating ENLIVE with write-reduction mechanisms (such as cache compression and bit-level write reduction) and wear-leveling mechanisms to improve the cache lifetime even further. To aggressively improve application performance with NVM caches, we also plan to study the use of prefetching [38] in NVM caches while ensuring that extra writes do not degrade NVM lifetime. In recent years, the size of GPU caches has been increasing. Since power consumption is becoming a first-order design constraint in GPUs as well, our future efforts will focus on studying the use of NVMs for designing GPU caches. Furthermore, recent research has highlighted sources of soft-errors in NVMs [2], and a part of our future work will focus on mitigating these errors in NVM caches.

Figure 1. Number of writes on cache blocks for SpPo (sphinx-povray) workload. The top figure shows the full y-range (number of writes), while the bottom figure shows only the range [0:200]. Notice that the write-variation is very large: while the average write-count per block is only 47, the maximum write-count is 72,349.

Figure 2. Illustration of the cache hierarchy with addition of HotStore.

Table 3. Workloads used in the experiments.

Table 5. Results on miss per kilo instruction (MPKI) increase with PoLF and LastingNVCache.

Table 6. Results on reduction in write per kilo instruction (WPKI) and nWriteToHotStore with ENLIVE.

Table 7. Parameter Sensitivity Results. Default parameters are shown in Section 5.2.