A Survey of Cache Bypassing Techniques
Abstract
1. Introduction
2. Background and Motivation
2.1. Preliminaries
2.2. Support for Cache Bypassing in Commercial Processors
2.3. Promises of Cache Bypassing
2.3.1. Performance and Energy Benefits
2.3.2. Benefits in NVM and DRAM Caches
2.3.3. Benefits in GPUs
2.4. Challenges in Using Cache Bypassing
2.4.1. Implementation Overhead
2.4.2. Memory Bandwidth and Performance Overhead
2.4.3. Challenges in GPUs
2.4.4. Challenges in Inclusive Caches
3. Key Ideas and Classification of CBTs
3.1. Main Ideas of CBTs
- Criterion for making bypass decisions:
- Some techniques keep a counter for every data block; to make a bypassing decision or obtain feedback, they compare the counters of incoming and existing data to see which is accessed first or more frequently [39,43,54,56,64,65]. Thus, these and a few other techniques [35,66] use a learning approach in which parameter values (e.g., thresholds) are continuously updated based on the correctness of bypassing decisions.
- Some techniques predict reuse behavior of a line based on its behavior in its previous generation (i.e., last residency in cache) [20,23,40,49,66]. Other techniques infer reuse behavior of a line from that of another line adjacent to it in memory address space, since adjacent lines show similar access properties [54]. Similarly, the reuse pattern of a block in one cache (e.g., L2) can guide bypassing decisions for this block in another cache (e.g., L3) [21,45].
- Classifying accesses/warps for guiding bypassing: Some works classify accesses, misses or warps into different categories and selectively bypass certain categories. Ahn et al. [13] classify writes into dead-on-arrival fills, dead-value fills and closing writes (refer to Section 5.1). Wang et al. [53] classify LLC write accesses into core-writes (writes to the LLC through a higher-level write-through cache, or evictions of dirty data from a higher-level writeback cache), writes due to prefetch misses and writes due to demand misses (refer to Section 6.1). Similarly, Chaudhuri et al. [45] classify cache blocks based on the number of reuses a block has seen, its state at the time of eviction from L2, etc. In the work of Wang et al. [67], LLC blocks that are frequently written back to memory within an access interval are termed frequent writeback blocks, and the remaining blocks (dirty or clean) are termed infrequent writeback blocks. Collins et al. [10] classify misses into conflict and capacity (which includes compulsory) misses. Tyson et al. [61] classify misses based on whether they fetch useful or dead-on-arrival data. For GPUs, Wang et al. [68] classify warps into locality warps and thrashing warps depending on the reuse pattern they show. Liang et al. [69] classify access patterns as partial sharing, full sharing (few or all threads share the same data, respectively) and streaming.
- Cache hierarchy organization: Some CBTs work by reorganizing the cache and/or the cache hierarchy (refer to Section 4.5). Malkowski et al. [70] split the L1 cache into a regular and a bypass cache. B. Wang et al. [68] assume a logical division of the cache into a locality region and a thrashing region for storing data with different characteristics, and Z. Wang et al. [67] logically divide each cache set into a frequent writeback list and an infrequent writeback list. Das et al. [55] divide a large wire-delay-dominated cache into multiple sublevels based on the distance of cache banks from the processor; for example, three sublevels may consist of the nearest 4, next 4 and furthest 8 banks, respectively. Gonzalez et al. [71] divide the data cache into a spatial cache and a temporal cache, which exploit spatial and temporal locality, respectively. Xu and Li [46] study page mapping in systems with a main cache (8 KB) and a mini cache (512 B), where a page can be mapped to either of them or bypassed. Etsion and Feitelson [36] propose replacing a 32 KB 4-way cache with a 16 KB direct-mapped cache (for storing frequently reused data) and a 2 K filter (for storing transient data).
- Use of bypass buffers: Some works use a buffer/table to store both tags and data [9,10,36,49,65,66,72] or only tags [43,56] of the bypassed blocks. Access to the cache is avoided for blocks found in these buffers, and with effective bypassing algorithms, the size of these buffers is expected to be small [43,49]. The bypassed blocks stored in the buffer may be moved to the main cache only if they show temporal reuse [9,49,73,74]. Chou et al. [37] buffer the tags of recently accessed adjacent DRAM cache lines. On a miss in the last-level SRAM cache, the request is first searched in this buffer, and a hit avoids the need for a miss probe in the DRAM cache.
- Granularity: Most techniques make predictions at the granularity of a block of size 64 B or 128 B. Stacked-DRAM cache designs may use a 64 B block size [37] to reduce cache pollution or a 4 KB block size [30,48] to reduce metadata overhead. By comparison, Alves et al. [23] predict when a sub-block (8 B) is dead, while Johnson and Hwu [65] make predictions at the level of a macroblock (1 KB), which consists of multiple adjacent blocks. Lee et al. [48] also discuss bypassing at the superpage (2 MB to 1 GB) level (refer to Section 6.2). Khairy et al. [58] disable the entire cache so that all data bypass it (refer to Section 4.7). A larger granularity lowers the metadata overhead at the cost of less accurate information about reuse patterns.
- Use of compiler: Many CBTs rely on a compiler [8,38,46,51,52,57,63,69], while most other CBTs work based on runtime information alone (refer to Section 4.6). The compiler can identify thread-sharing behavior [69], communication patterns [52,63], reuse counts [8,38,46] and reuse distances [51,57]. This information can be used by the compiler itself (e.g., for intelligent instruction scheduling [57]) or by hardware for making bypassing decisions.
- Co-management policies: In addition to bypassing, information about cache accesses or dead blocks has been used for other optimizations such as power-gating [23,75], prefetching [10,50] and intelligent replacement decisions [14,23,50,76]. For example, data can be prefetched into dead blocks, and during replacement, dead blocks can be evicted first. The energy overhead of CBTs (e.g., due to predictors) can be offset by using the dynamic voltage/frequency scaling (DVFS) technique [70].
- Other features: While most CBTs work with any cache replacement policy, some CBTs assume a specific replacement policy (e.g., the LRU policy [8]).
- Probabilistic bypassing: To avoid the overhead of maintaining full metadata, many CBTs use a probabilistic bypassing approach [36,37,43,56] (refer to Section 4.4).
- Set sampling: Several key characteristics (e.g., miss rate) of a set-associative cache can be estimated by evaluating only a few of its sets. This strategy, known as set sampling, has been used for reducing the overhead of cache profiling [13,21,37,43,45,51,56,67,76,77]. It has also been shown that keeping only a few bits of each tag is sufficient for achieving reasonable accuracy [10,76] (refer to Section 4.8).
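The counter-based idea (first item above) can be sketched as a toy simulation. The class name, the 3-bit saturating counters and the compare-against-victim rule are illustrative assumptions for exposition, not a reconstruction of any specific cited scheme:

```python
# Toy sketch of counter-based cache bypassing (illustrative only; the
# thresholds and structures do not follow any particular cited technique).
# Each block address has a saturating reuse counter learned over time. On a
# miss, the incoming block is bypassed if its learned reuse count is lower
# than that of the resident block it would evict.

class BypassingCache:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.lines = {}     # set index -> resident block address
        self.reuse = {}     # block address -> learned reuse counter (0..7)
        self.hits = self.misses = self.bypasses = 0

    def access(self, addr):
        s = addr % self.num_sets
        if self.lines.get(s) == addr:       # hit: reward the block
            self.hits += 1
            self.reuse[addr] = min(self.reuse.get(addr, 0) + 1, 7)
            return "hit"
        self.misses += 1
        victim = self.lines.get(s)
        if victim is not None and self.reuse.get(addr, 0) < self.reuse.get(victim, 0):
            self.bypasses += 1              # keep the likely-reused victim
            return "bypass"
        if victim is not None:              # penalize the evicted block
            self.reuse[victim] = max(self.reuse.get(victim, 0) - 1, 0)
        self.lines[s] = addr
        return "fill"
```

Here a streaming block that has never shown reuse is bypassed in favor of a resident block with a higher learned reuse count, which is the essence of the counter-comparison policies discussed above.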
3.2. A Classification of CBTs
4. Working Strategies of CBTs for CPUs
4.1. CBTs Based on Reuse-Count
4.2. CBTs Based on Reuse-Distance
4.3. CBTs Based on Cache Miss Behavior
4.4. Probabilistic CBTs
4.5. CBTs Involving Cache Hierarchy Reorganization or Bypass Buffer
4.6. CBTs Involving Software/Compiler Level Management
4.7. Use of Different Bypassing Granularities in CBTs
4.8. Strategies for Reducing Overhead of CBTs
5. CBTs for Different Hierarchies and Evaluation Using Different Platforms
5.1. CBTs for Inclusive Cache Hierarchy
5.2. CBTs for Exclusive Cache Hierarchy
5.3. Evaluation on Real Processor
5.4. Evaluation Using Analytical Models
6. CBTs for Specific Memory Technologies
6.1. Bypassing in Context of NVM Cache or Main Memory
6.2. Bypassing in Die-Stacked DRAM Caches
7. CBTs for GPUs and CPU-GPU Heterogeneous Systems
- In CPU-GPU systems, requests from GPUs can be bypassed by leveraging the latency tolerance of GPU accesses (Table 4).
- Several techniques perform bypassing primarily based on reuse characteristics (or utility) of a block (Table 4). For example, these techniques may bypass streaming or thrashing blocks.
- Under the GPU’s lock-step execution model, using different caching/bypassing decisions for different threads of a warp would create differences in their latencies, and hence all the threads would be stalled until the last memory request completes. By making an identical caching/bypassing decision for all threads, and by caching only a few warps at a time, these memory divergence issues can be avoided (Table 4). Accordingly, some techniques seek to cache a warp fully rather than partially [11,32,34,38,59,60,68]. Other techniques cache/bypass two warps together or individually [69] or perform request reordering [34,64]. Thus, these techniques perform bypassing together with a thread management scheme.
- Some techniques perform bypassing when the resources (e.g., MSHR) for servicing a miss cannot be allocated (Table 4).
- For several GPU applications, the cores show symmetric behavior; hence, by comparatively evaluating different policies on just a few cores, the optimal policy can be selected for all cores. This strategy, referred to as core sampling, has been used by several CBTs to reduce their metadata overheads (Table 4). Li et al. [47] use core sampling to ascertain the cache friendliness of an application: one core uses their bypassing scheme, another core uses the default caching scheme, and the best scheme is found by comparing their miss rates. Mekkat et al. [77] determine the impact of bypassing on GPU performance by using two different bypassing thresholds on two different cores. Chen et al. [11] estimate the ‘protecting distance’ on a few cores and use this value for the remaining cores.
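The warp-level all-or-nothing decision, combined with caching only a few warps at a time, can be sketched as follows; the per-thread reuse flags, the threshold and the warp cap are hypothetical parameters for illustration, not taken from any cited technique:

```python
# Toy sketch of warp-level (all-or-nothing) cache/bypass decisions
# (illustrative; the locality score and limits are assumptions). Under
# lock-step execution, per-thread decisions would stall the whole warp on
# its slowest request, so one decision is made for the entire warp.

def warp_decision(per_thread_reuse, threshold=0.5):
    """per_thread_reuse: list of 0/1 flags, 1 if that thread's line is
    predicted to be reused. Returns 'cache' or 'bypass' for the whole warp."""
    frac = sum(per_thread_reuse) / len(per_thread_reuse)
    return "cache" if frac >= threshold else "bypass"

def schedule(warps, max_cached_warps=2, threshold=0.5):
    """Cache at most a few warps at a time; all remaining warps bypass,
    which limits thrashing of the small GPU cache."""
    decisions = {}
    cached = 0
    for wid, reuse_flags in warps.items():
        if warp_decision(reuse_flags, threshold) == "cache" and cached < max_cached_warps:
            cached += 1
            decisions[wid] = "cache"
        else:
            decisions[wid] = "bypass"
    return decisions
```

For instance, with three warps showing good locality but a cap of two cached warps, the third warp bypasses even though its locality score passes the threshold.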
7.1. CBTs Based on Reuse Characteristics
7.2. CBTs Based on Memory Divergence Properties
7.3. CBTs for CPU-GPU Heterogeneous Systems
8. Future Challenges and Conclusions
Acknowledgments
Conflicts of Interest
References
- Fluhr, E.J.; Friedrich, J.; Dreps, D.; Zyuban, V.; Still, G.; Gonzalez, C.; Hall, A.; Hogenmiller, D.; Malgioglio, F.; Nett, R.; et al. 5.1 POWER8TM: A 12-core server-class processor in 22 nm SOI with 7.6 Tb/s off-chip bandwidth. In Proceedings of the International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 96–97.
- Kurd, N.; Chowdhury, M.; Burton, E.; Thomas, T.P.; Mozak, C.; Boswell, B.; Lal, M.; Deval, A.; Douglas, J.; Elassal, M.; et al. 5.9 Haswell: A family of IA 22 nm processors. In Proceedings of the International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 112–113.
- NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. 2009. Available online: http://goo.gl/X2AI0b (accessed on 27 April 2016).
- NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110/210. 2014. Available online: http://goo.gl/qOSWW1 (accessed on 27 April 2016).
- Harris, M. 5 Things You Should Know about the New Maxwell GPU Architecture. 2014. Available online: http://goo.gl/8NV82n (accessed on 27 April 2016).
- Mittal, S. A survey of techniques for managing and leveraging caches in GPUs. J. Circuits Syst. Comput. 2014, 23, 229–236.
- Huangfu, Y.; Zhang, W. Real-Time GPU Computing: Cache or No Cache? In Proceedings of the International Symposium on Real-Time Distributed Computing (ISORC), Auckland, New Zealand, 13–17 April 2015; pp. 182–189.
- Chi, C.H.; Dietz, H. Improving cache performance by selective cache bypass. In Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, Kailua-Kona, HI, USA, 3–6 January 1989; Volume 1, pp. 277–285.
- John, L.K.; Subramanian, A. Design and performance evaluation of a cache assist to implement selective caching. In Proceedings of the International Conference on Computer Design, Austin, TX, USA, 12–15 October 1997; pp. 510–518.
- Collins, J.D.; Tullsen, D.M. Hardware identification of cache conflict misses. In Proceedings of the International Symposium on Microarchitecture, Haifa, Israel, 16–18 November 1999; pp. 126–135.
- Chen, X.; Chang, L.W.; Rodrigues, C.I.; Lv, J.; Wang, Z.; Hwu, W.M. Adaptive cache management for energy-efficient GPU computing. In Proceedings of the 47th International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014; pp. 343–355.
- Zhang, C.; Sun, G.; Li, P.; Wang, T.; Niu, D.; Chen, Y. SBAC: A statistics based cache bypassing method for asymmetric-access caches. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), La Jolla, CA, USA, 11–13 August 2014; pp. 345–350.
- Ahn, J.; Yoo, S.; Choi, K. DASCA: Dead write prediction assisted STT-RAM cache architecture. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, USA, 15–19 February 2014; pp. 25–36.
- Duong, N.; Zhao, D.; Kim, T.; Cammarota, R.; Valero, M.; Veidenbaum, A.V. Improving cache management policies using dynamic reuse distances. In Proceedings of the 45th International Symposium on Microarchitecture, Vancouver, BC, Canada, 1–5 December 2012; pp. 389–400.
- Mittal, S. A Survey of Architectural Techniques For Improving Cache Power Efficiency. Sustain. Comput. Inform. Syst. 2014, 4, 33–43.
- Belady, L.A. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J. 1966, 5, 78–101.
- Atkins, M. Performance and the i860 microprocessor. IEEE Micro 1991, 11, 24–27.
- Intel Corporation. Intel 64 and IA-32 Architectures, Software Developer’s Manual, Instruction Set Reference, A-Z; Intel Corporation: Santa Clara, CA, USA, 2011; Volume 2.
- NVIDIA Corporation. Parallel Thread Execution ISA Version 4.2; NVIDIA Corporation: Santa Clara, CA, USA, 2015.
- Kharbutli, M.; Solihin, Y. Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput. 2008, 57, 433–447.
- Gaur, J.; Chaudhuri, M.; Subramoney, S. Bypass and insertion algorithms for exclusive last-level caches. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA), San Jose, CA, USA, 4–8 June 2011; pp. 81–92.
- Mittal, S.; Zhang, Z.; Vetter, J. FlexiWay: A Cache Energy Saving Technique Using Fine-grained Cache Reconfiguration. In Proceedings of the 31st IEEE International Conference on Computer Design (ICCD), Asheville, NC, USA, 6–9 October 2013.
- Alves, M.; Khubaib, K.; Ebrahimi, E.; Narasiman, V.; Villavieja, C.; Navaux, P.O.A.; Patt, Y.N. Energy savings via dead sub-block prediction. In Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), New York, NY, USA, 24–26 October 2012; pp. 51–58.
- Mittal, S.; Zhang, Z. EnCache: A Dynamic Profiling Based Reconfiguration Technique for Improving Cache Energy Efficiency. J. Circuits Syst. Comput. 2014, 23, 1450147.
- Mittal, S.; Vetter, J.S.; Li, D. A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches. IEEE Trans. Parallel Distrib. Syst. 2015, 26, 1524–1537.
- Mittal, S. A Survey of Power Management Techniques for Phase Change Memory. Int. J. Comput. Aided Eng. Technol. 2014.
- Mittal, S.; Poremba, M.; Vetter, J.; Xie, Y. Exploring Design Space of 3D NVM and eDRAM Caches Using DESTINY Tool; Technical Report ORNL/TM-2014/636; Oak Ridge National Laboratory: Oak Ridge, TN, USA, 2014.
- Mittal, S.; Vetter, J.S. A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems. IEEE Trans. Parallel Distrib. Syst. 2016, 27, 1537–1550.
- Wang, J.; Dong, X.; Xie, Y. OAP: An obstruction-aware cache management policy for STT-RAM last-level caches. In Proceedings of the Conference on Design, Automation and Test in Europe, Grenoble, France, 18–22 March 2013; pp. 847–852.
- Mittal, S.; Vetter, J. A Survey of Techniques for Architecting DRAM Caches. IEEE Trans. Parallel Distrib. Syst. 2015.
- AMD. AMD Graphics Cores Next (GCN) Architecture. 2012. Available online: https://goo.gl/NjNcDY (accessed on 27 April 2016).
- Li, A.; van den Braak, G.J.; Kumar, A.; Corporaal, H. Adaptive and Transparent Cache Bypassing for GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Austin, TX, USA, 15–20 November 2015.
- Hagedoorn, H. Core i7 5775C Processor Review: Desktop Broadwell—The Broadwell-H Architecture. 2015. Available online: http://goo.gl/1QFwja (accessed on 27 April 2016).
- Jia, W.; Shaw, K.; Martonosi, M. MRPB: Memory request prioritization for massively parallel processors. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, USA, 15–19 February 2014; pp. 272–283.
- Tian, Y.; Puthoor, S.; Greathouse, J.L.; Beckmann, B.M.; Jiménez, D.A. Adaptive GPU cache bypassing. In Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, San Francisco, CA, USA, 7 February 2015; pp. 25–35.
- Etsion, Y.; Feitelson, D.G. Exploiting core working sets to filter the L1 cache with random sampling. IEEE Trans. Comput. 2012, 61, 1535–1550.
- Chou, C.; Jaleel, A.; Qureshi, M.K. BEAR: Techniques for Mitigating Bandwidth Bloat in Gigascale DRAM Caches. In Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 13–17 June 2015.
- Xie, X.; Liang, Y.; Wang, Y.; Sun, G.; Wang, T. Coordinated static and dynamic cache bypassing for GPUs. In Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, USA, 7–11 February 2015; pp. 76–88.
- Li, L.; Tong, D.; Xie, Z.; Lu, J.; Cheng, X. Optimal bypass monitor for high performance last-level caches. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, MN, USA, 19–23 September 2012; pp. 315–324.
- Kharbutli, M.; Jarrah, M.; Jararweh, Y. SCIP: Selective cache insertion and bypassing to improve the performance of last-level caches. In Proceedings of the IEEE Conference on Applied Electrical Engineering and Computing Technologies (AEECT), Amman, Jordan, 3–5 December 2013; pp. 1–6.
- Wang, P.H.; Liu, G.H.; Yeh, J.C.; Chen, T.M.; Huang, H.Y.; Yang, C.L.; Liu, S.L.; Greensky, J. Full system simulation framework for integrated CPU/GPU architecture. In Proceedings of the International Symposium on VLSI Design, Automation and Test (VLSI-DAT), Hsinchu, Taiwan, 28–30 April 2014; pp. 1–4.
- Mittal, S.; Vetter, J. A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Comput. Surv. 2015, 47, 69:1–69:35.
- Gupta, S.; Gao, H.; Zhou, H. Adaptive cache bypassing for inclusive last level caches. In Proceedings of the International Symposium on Parallel & Distributed Processing (IPDPS), Cambridge, MA, USA, 20–24 May 2013; pp. 1243–1253.
- Kim, M.K.; Choi, J.H.; Kwak, J.W.; Jhang, S.T.; Jhon, C.S. Bypassing method for STT-RAM based inclusive last-level cache. In Proceedings of the Conference on Research in Adaptive and Convergent Systems, Prague, Czech Republic, 9–12 October 2015; pp. 424–429.
- Chaudhuri, M.; Gaur, J.; Bashyam, N.; Subramoney, S.; Nuzman, J. Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, MN, USA, 19–23 September 2012; pp. 293–304.
- Xu, R.; Li, Z. Using cache mapping to improve memory performance handheld devices. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, USA, 10–12 March 2004; pp. 106–114.
- Li, C.; Song, S.L.; Dai, H.; Sidelnik, A.; Hari, S.K.S.; Zhou, H. Locality-Driven Dynamic GPU Cache Bypassing. In Proceedings of the International Conference on Supercomputing (ICS), Newport Beach, CA, USA, 8–11 June 2015.
- Lee, Y.; Kim, J.; Jang, H.; Yang, H.; Kim, J.; Jeong, J.; Lee, J.W. A fully associative, tagless DRAM cache. In Proceedings of the International Symposium on Computer Architecture, Portland, OR, USA, 13–17 June 2015; pp. 211–222.
- Xiang, L.; Chen, T.; Shi, Q.; Hu, W. Less reused filter: Improving L2 cache performance via filtering less reused lines. In Proceedings of the 23rd International conference on Supercomputing, Yorktown Heights, NY, USA, 8–12 June 2009; pp. 68–79.
- Liu, H.; Ferdman, M.; Huh, J.; Burger, D. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proceedings of the International Symposium on Microarchitecture, Como, Italy, 8–12 November 2008; pp. 222–233.
- Feng, M.; Tian, C.; Gupta, R. Enhancing LRU replacement via phantom associativity. In Proceedings of the 16th Workshop on Interaction between Compilers and Computer Architectures (INTERACT), New Orleans, LA, USA, 25 February 2012; pp. 9–16.
- Park, J.; Yoo, R.M.; Khudia, D.S.; Hughes, C.J.; Kim, D. Location-aware cache management for many-core processors with deep cache hierarchy. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 17–22 November 2013; p. 20.
- Wang, Z.; Jiménez, D.A.; Xu, C.; Sun, G.; Xie, Y. Adaptive placement and migration policy for an STT-RAM-based hybrid cache. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, USA, 15–19 February 2014; pp. 13–24.
- Yu, B.; Ma, J.; Chen, T.; Wu, M. Global Priority Table for Last-Level Caches. In Proceedings of the International Conference on Dependable, Autonomic and Secure Computing (DASC), Sydney, Australia, 12–14 December 2011; pp. 279–285.
- Das, S.; Aamodt, T.M.; Dally, W.J. SLIP: Reducing wire energy in the memory hierarchy. In Proceedings of the International Symposium on Computer Architecture, Portland, OR, USA, 13–17 June 2015; pp. 349–361.
- Gao, H.; Wilkerson, C. A dueling segmented LRU replacement algorithm with adaptive bypassing. In Proceedings of the JILP Worshop on Computer Architecture Competitions: Cache Replacement Championship (JWAC), Saint-Malo, France, 20 June 2010.
- Wu, Y.; Rakvic, R.; Chen, L.L.; Miao, C.C.; Chrysos, G.; Fang, J. Compiler managed micro-cache bypassing for high performance EPIC processors. In Proceedings of the 35th Annual IEEE International Symposium on Microarchitecture, Istanbul, Turkey, 18–22 November 2002; pp. 134–145.
- Khairy, M.; Zahran, M.; Wassal, A.G. Efficient utilization of GPGPU cache hierarchy. In Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, San Francisco, CA, USA, 7 February 2015; pp. 36–47.
- Zheng, Z.; Wang, Z.; Lipasti, M. Adaptive cache and concurrency allocation on GPGPUs. IEEE Comput. Archit. Lett. 2015, 14, 90–93.
- Ausavarungnirun, R.; Ghose, S.; Kayiran, O.; Loh, G.H.; Das, C.R.; Kandemir, M.T.; Mutlu, O. Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance. In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT), San Francisco, CA, USA, 18–21 October 2015.
- Tyson, G.; Farrens, M.; Matthews, J.; Pleszkun, A.R. A modified approach to data cache management. In Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, MI, USA, 29 November–1 December 1995; pp. 93–103.
- Dai, H.; Gupta, S.; Li, C.; Kartsaklis, C.; Mantor, M.; Zhou, H. A Model-Driven Approach to Warp/Thread-Block Level GPU Cache Bypassing. In Proceedings of the Design Automation Conference (DAC), Austin, TX, USA, 5–9 June 2016.
- Choi, H.; Ahn, J.; Sung, W. Reducing off-chip memory traffic by selective cache management scheme in GPGPUs. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, London, UK, 3 March 2012; pp. 110–119.
- Mu, S.; Deng, Y.; Chen, Y.; Li, H.; Pan, J.; Zhang, W.; Wang, Z. Orchestrating cache management and memory scheduling for GPGPU applications. IEEE Trans. Very Large Scale Integr. Syst. 2014, 22, 1803–1814.
- Johnson, T.L.; Hwu, W.M.W. Run-time adaptive cache hierarchy management via reference analysis. In Proceedings of the International Symposium on Computer Architecture, Denver, CO, USA, 1–4 June 1997; Volume 25, pp. 315–326.
- Jalminger, J.; Stenström, P. A novel approach to cache block reuse prediction. In Proceedings of the 42nd International Conference on Parallel Processing, Kaohsiung, Taiwan, 6–9 October 2003; pp. 294–302.
- Wang, Z.; Shan, S.; Cao, T.; Gu, J.; Xu, Y.; Mu, S.; Xie, Y.; Jiménez, D.A. WADE: Writeback-aware dynamic cache management for NVM-based main memory system. ACM Trans. Archit. Code Optim. 2013, 10, 51:1–51:21.
- Wang, B.; Yu, W.; Sun, X.H.; Wang, X. DaCache: Memory Divergence-Aware GPU Cache Management. In Proceedings of the 29th International Conference on Supercomputing, Newport Beach, CA, USA, 8–11 June 2015; pp. 89–98.
- Liang, Y.; Xie, X.; Sun, G.; Chen, D. An Efficient Compiler Framework for Cache Bypassing on GPUs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, 18–21 November 2013.
- Malkowski, K.; Link, G.; Raghavan, P.; Irwin, M.J. Load miss prediction-exploiting power performance trade-offs. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), Long Beach, CA, USA, 26–30 March 2007; pp. 1–8.
- González, A.; Aliagas, C.; Valero, M. A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality. In Proceedings of the 9th International Conference on Supercomputing, Barcelona, Spain, 3–7 July 1995; pp. 338–347.
- Mittal, S.; Vetter, J. A Technique For Improving Lifetime of Non-volatile Caches using Write-minimization. J. Low Power Electron. Appl. 2016, 6, 1.
- Chan, K.K.; Hay, C.C.; Keller, J.R.; Kurpanek, G.P.; Schumacher, F.X.; Zheng, J. Design of the HP PA 7200 CPU. HP J. 1996.
- Karlsson, M.; Hagersten, E. Timestamp-based selective cache allocation. In High Performance Memory Systems; Springer: New York, NY, USA, 2004; pp. 43–59.
- Lee, J.; Woo, D.H.; Kim, H.; Azimi, M. GREEN Cache: Exploiting the Disciplined Memory Model of OpenCL on GPUs. IEEE Trans. Comput. 2015, 64, 3167–3180.
- Khan, S.; Tian, Y.; Jiménez, D. Sampling dead block prediction for last-level caches. In Proceedings of the International Symposium on Microarchitecture (MICRO), Atlanta, GA, USA, 4–8 December 2010; pp. 175–186.
- Mekkat, V.; Holey, A.; Yew, P.C.; Zhai, A. Managing shared last-level cache in a heterogeneous multicore processor. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Edinburgh, UK, 7–11 September 2013; pp. 225–234.
- Mittal, S. A Survey Of Techniques for Cache Locking. ACM Trans. Des. Autom. Electron. Syst. 2016, 21, 49:1–49:24.
- Mittal, S. A Survey of Recent Prefetching Techniques for Processor Caches. ACM Comput. Surv. 2016.
- Mittal, S.; Cao, Y.; Zhang, Z. MASTER: A multicore cache energy saving technique using dynamic cache reconfiguration. IEEE Trans. Very Large Scale Integr. Syst. 2014, 22, 1653–1665.
- Kampe, M.; Stenstrom, P.; Dubois, M. Self-correcting LRU replacement policies. In Proceedings of the 1st Conference on Computing Frontiers, Ischia, Italy, 14–16 April 2004; pp. 181–191.
- Ma, J.; Meng, J.; Chen, T.; Shi, Q.; Wu, M.; Liu, L. Improve LLC Bypassing Performance by Memory Controller Improvements in Heterogeneous Multicore System. In Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Hong Kong, 9–11 December 2014; pp. 82–89.
- Dai, H.; Kartsaklis, C.; Li, C.; Janjusic, T.; Zhou, H. RACB: Resource Aware Cache Bypass on GPUs. In Proceedings of the International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW), Paris, France, 22–24 October 2014; pp. 24–29.
- Lesage, B.; Hardy, D.; Puaut, I. Shared Data Caches Conflicts Reduction for WCET Computation in Multi-Core Architectures. In Proceedings of the 18th International Conference on Real-Time and Network Systems, Toulouse, France, 4–5 November 2010; p. 2283.
- Hardy, D.; Piquet, T.; Puaut, I. Using bypass to tighten WCET estimates for multi-core processors with shared instruction caches. In Proceedings of the 34th IEEE Real-Time Systems Symposium (RTSS), Washington, DC, USA, 1–4 December 2009; pp. 68–77.
- Jaleel, A.; Theobald, K.B.; Steely, S.C., Jr.; Emer, J. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th International Symposium on Computer Architecture, Saint-Malo, France, 19–23 June 2010; pp. 60–71.
- Intel Corporation. Intel StrongARM SA-1110 Microprocessor Developer’s Manual; Intel Corporation: Santa Clara, CA, USA, 2000.
- Xie, X.; Liang, Y.; Sun, G.; Chen, D. An efficient compiler framework for cache bypassing on GPUs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, 18–21 November 2013; pp. 516–523.
- Mittal, S. A survey of architectural techniques for managing process variation. ACM Comput. Surv. 2016, 48, Article No. 54.
- Mittal, S. A survey of techniques for approximate computing. ACM Comput. Surv. 2016, 48, Article No. 54.
| Classification | References |
|---|---|
| **Study/optimization objective** | |
| Performance | [7,9,10,11,12,13,14,20,21,29,32,34,35,36,37,38,39,40,41,43,44,45,46,47,48,49,50,51,52,53,54,55,57,58,59,60,61,62,63,64,66,67,68,69,70,74,75,76,77,81,82,83,84] |
| Energy | [12,13,23,35,36,44,46,47,52,53,55,60,67,70,75,77,83] |
| Predictability | [7,84,85] |
| **Level in cache hierarchy** | |
| First-level cache | [7,10,11,23,32,34,35,36,38,46,47,52,58,59,61,62,65,66,68,69,74,75,81,83] |
| Mid/last-level cache | [12,13,14,20,21,23,32,37,39,40,41,43,44,45,48,49,50,51,52,53,54,55,56,58,60,63,64,65,67,70,75,76,77,82,83,84] |
| Micro-cache | [57] |

| Classification | References |
|---|---|
| **Nature of cache hierarchy** | |
| Inclusive | [13,43,44] |
| Exclusive | [21,45] |
| Non-inclusive | Most others |
| **Evaluation platform** | |
| Real hardware | [32,46,69,73] |
| Analytical performance models | [12,29] |
| Simulator | Nearly all others |

| Classification | References |
|---|---|
| Bypassing NVM cache | [12,13,29,44,53] |
| Bypassing cache for reducing accesses to NVM memory | [67] |
| Bypassing DRAM cache | [37,48] |

| Classification | References |
|---|---|
| GPU | [7,11,32,34,35,38,47,58,59,60,62,63,64,68,75,83,88] |
| GPU in CPU-GPU system | [41,77,82] |
| CPU | Nearly all others |
| **Key idea/feature** | |
| Bypassing based on reuse behavior | [7,34,35,38,47,58,59,63,75] |
| Bypassing based on memory divergence properties | [11,32,34,38,59,60,62,64,68,69] |
| Bypassing when resources are scarce | [34,83] |
| Use of core sampling | [11,47,77] |
© 2016 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
Mittal, S. A Survey of Cache Bypassing Techniques. J. Low Power Electron. Appl. 2016, 6, 5. https://doi.org/10.3390/jlpea6020005