Transparent Control Flow Transfer between CPU and Accelerators for HPC
Abstract
1. Introduction
- Non-intrusive profiling of the target application;
- Automatic configuration of the appropriate accelerators;
- Transparent transfer of the control flow at run time.
2. Control Flow Transfer Mechanism
2.1. Overview
2.2. Implementation of Control Flow Transfer
    uint8_t trap_instruction[8] = {0xCC, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90};
    uint8_t old_instruction[8];
    uint64_t break_addr = get_break_addr(i);
    // read 8 bytes from the tracee's memory into old_instruction using ptrace
    peek_text(tracee_pid, (void *)break_addr, &old_instruction, 8);
    // replace a 5-byte instruction with an INT3 (0xCC) followed by NOPs (0x90)
    // keep the bytes that follow the replaced instruction
    trap_instruction[5] = old_instruction[5];
    trap_instruction[6] = old_instruction[6];
    trap_instruction[7] = old_instruction[7];
    // write the new 8 bytes into the tracee's memory using ptrace
    poke_text(tracee_pid, (void *)break_addr, &trap_instruction, 8);
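The byte-patching step in the listing above can be separated from the ptrace reads and writes. The following is a minimal sketch of that step alone, assuming an x86-64 target and an 8-byte word read from the call site. The names `Word`, `make_trap_word`, `kInt3`, and `kNop` are hypothetical and do not appear in the original listing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the original listing). It builds the
// 8-byte word that overwrites a call site: the first `insn_len` bytes
// become INT3 followed by NOPs, and the remaining bytes are copied from
// the original word so the instructions after the call stay untouched.
using Word = std::array<std::uint8_t, 8>;

constexpr std::uint8_t kInt3 = 0xCC;  // x86 breakpoint opcode
constexpr std::uint8_t kNop = 0x90;   // x86 one-byte NOP

Word make_trap_word(const Word& old_word, std::size_t insn_len) {
    assert(insn_len >= 1 && insn_len <= old_word.size());
    Word patched = old_word;  // start from the original bytes
    patched[0] = kInt3;       // trap at the first byte of the replaced call
    for (std::size_t i = 1; i < insn_len; ++i) {
        patched[i] = kNop;    // pad the rest of the replaced instruction
    }
    return patched;           // bytes [insn_len, 8) are preserved
}
```

Keeping a copy of the original word, as the listing does with `old_instruction`, lets the same write be used later to restore the call site.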

    void transfer_control_aes(uint8_t key[], uint8_t iv[], uint8_t data[], uint32_t length) {
        OPAE_SVC_WRAPPER fpga(AFU_ACCEL_UUID); // Find and connect to the accelerator
        CSR_MGR csrs(fpga); // Connect the CSR manager
        // allocate shared memory buffer
        auto data_handle = fpga.attachBuffer(data, getpagesize() * ((length / 4096) + 1));
        // write fixed-size data into the CSRs, including the size and address of the arrays
        csrs.writeCSR(0, (uint64_t)data); // Address of src
        csrs.writeCSR(1, (uint64_t)data); // Address of dest
        csrs.writeCSR(2, (uint64_t)length); // data length
        csrs.writeCSR(4, (uint64_t)*(uint64_t *)iv); // IV 0
        csrs.writeCSR(5, (uint64_t)*(uint64_t *)(iv + 8)); // IV 1
        csrs.writeCSR(6, (uint64_t)*(uint64_t *)key); // key 0
        csrs.writeCSR(7, (uint64_t)*(uint64_t *)(key + 8)); // key 1
        csrs.writeCSR(8, (uint64_t)*(uint64_t *)(key + 16)); // key 2
        csrs.writeCSR(9, (uint64_t)*(uint64_t *)(key + 24)); // key 3
        csrs.writeCSR(3, (uint64_t)1); // Run signal
        while (0 == csrs.readCSR(0)) _mm_pause(); // spin wait
        return;
    }
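The transfer code above reads the IV and key as 64-bit words through pointer casts and sizes the shared buffer as `(length / 4096) + 1` pages. The sketch below shows two small alternatives, under stated assumptions: `load_u64` and `pages_needed` are hypothetical helpers, not part of OPAE or of the original listing. The first copies the bytes with `std::memcpy`, which gives the same value as the cast on the same machine without requiring 8-byte alignment. The second uses ceiling division, so a length that is an exact multiple of the page size does not reserve an extra page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the 8 bytes starting at `p` into a 64-bit
// word in host byte order, without assuming that `p` is 8-byte aligned.
std::uint64_t load_u64(const std::uint8_t* p) {
    std::uint64_t word = 0;
    std::memcpy(&word, p, sizeof word);
    return word;
}

// Hypothetical helper: number of pages needed to hold `length` bytes,
// computed with ceiling division. Note that it returns 0 for length 0,
// whereas the formula in the listing always reserves at least one page.
std::size_t pages_needed(std::size_t length, std::size_t page_size) {
    assert(page_size > 0);
    return (length + page_size - 1) / page_size;
}
```

With these helpers, a CSR write such as the one for "key 1" could pass `load_u64(key + 8)` as its value. Whether the accelerator expects exactly this word layout is determined by the hardware design and is not verified here.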
2.3. Proof-Of-Concept Implementation
Listing 3. Contents of a sample configuration file loaded by the manager at runtime.
3. Experimental Results
- Software—original application (not subject to acceleration).
- FPGA Shared Library Interposing—the application runs with an alternative accelerated shared library.
- FPGA Proposed Mechanism—implemented framework without an acceleration threshold.
- FPGA + Software Proposed Mechanism—implemented framework with an acceleration threshold.
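The last two configurations differ only in whether an acceleration threshold is applied. The following is a minimal sketch of such a decision, assuming the threshold is compared against the size of the workload in bytes. The criterion used in the experiments is not stated in this list, so the policy and the names `ExecPath` and `choose_path` are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Where a given invocation of the hot spot function is executed.
enum class ExecPath { Software, Accelerator };

// Hypothetical policy (an assumption, not the authors' implementation):
// invocations whose workload is at least `threshold_bytes` are sent to the
// accelerator; smaller ones stay in software. A threshold of 0 sends every
// invocation to the accelerator, which models the "no threshold" case.
ExecPath choose_path(std::size_t workload_bytes, std::size_t threshold_bytes) {
    return workload_bytes >= threshold_bytes ? ExecPath::Accelerator
                                             : ExecPath::Software;
}
```

The reasoning behind such a policy is that small workloads may not amortize the fixed cost of transferring control and data to the accelerator.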
3.1. AES Encryption Case
3.2. Matrix Multiplication Case
4. Discussion
- Architecture/platform dependency: Using ptrace to modify and control another process depends heavily on the hardware architecture and on the specific Application Binary Interface (ABI). However, dedicated FPGA-based accelerators are themselves platform-specific, so the impact of this drawback may be limited in practice.
- Debugging is hindered: Because a process can be traced by only one process at a time, a ptrace-based debugger cannot attach to the target while the manager is controlling it. However, accelerators are likely to be deployed only after the software-only application has been tested and debugged, so this drawback also has limited scope.
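The single-tracer restriction can be checked before attaching. On Linux, `/proc/<pid>/status` contains a `TracerPid:` line giving the PID of the current tracer, or 0 if there is none. The sketch below parses that field from the file's text. The function name `parse_tracer_pid` is hypothetical, and the check is an illustration rather than part of the described framework.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns the PID found on the "TracerPid:" line of `status_text` (the
// format used by /proc/<pid>/status on Linux). The result is 0 when the
// process is not being traced, and -1 when the field is absent or cannot
// be parsed as a number.
long parse_tracer_pid(const std::string& status_text) {
    const std::string key = "TracerPid:";
    std::istringstream lines(status_text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, key.size(), key) == 0) {
            std::istringstream value(line.substr(key.size()));
            long pid = -1;
            if (value >> pid) {
                return pid;
            }
            return -1;  // field present but not numeric
        }
    }
    return -1;  // field not found
}
```

A manager or debugger could read the status file of the target into a string, call this function, and treat any positive result as a sign that another tracer is already attached.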
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
ABI | Application Binary Interface |
CSR | Control & Status Register |
HPC | High-Performance Computing |
OPAE | Open Programmable Acceleration Engine |
SLI | Shared Library Interposing |
Parameter | Description |
---|---|
functionAddr | Address of the hot spot function |
functionCalls | Array containing addresses where the hot spot function is called |
targetName | Name of the target process's executable |
functionName | Name of the hot spot function to be accelerated |
functionArgs | Array describing each argument of the hot spot function |
accLibString | Name of the shared library containing the transfer code |
accLibPath | Path to the shared library containing the transfer code |
accHeaderPath | Path to the header file of the shared library |
accFunctionName | Name of the function containing the transfer code |
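The parameters in the table can be pictured as a single record held by the manager. The following is a minimal sketch of such a record. The field names follow the table, but the field types, the struct name `ManagerConfig`, and the `is_usable` check are assumptions made for illustration, since the file format itself is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory form of the configuration parameters.
// Field names mirror the table; all types are assumptions.
struct ManagerConfig {
    std::uint64_t functionAddr = 0;            // address of the hot spot function
    std::vector<std::uint64_t> functionCalls;  // addresses of its call sites
    std::string targetName;                    // target executable name
    std::string functionName;                  // hot spot function name
    std::vector<std::string> functionArgs;     // one descriptor per argument
    std::string accLibString;                  // shared library name
    std::string accLibPath;                    // shared library path
    std::string accHeaderPath;                 // header file path
    std::string accFunctionName;               // transfer function name
};

// Minimal sanity check. Which fields are mandatory is an assumption:
// here, the manager needs at least one call site to patch and enough
// information to locate the transfer function.
bool is_usable(const ManagerConfig& cfg) {
    return cfg.functionAddr != 0 && !cfg.functionCalls.empty() &&
           !cfg.targetName.empty() && !cfg.accLibPath.empty() &&
           !cfg.accFunctionName.empty();
}
```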
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Granhão, D.; Canas Ferreira, J. Transparent Control Flow Transfer between CPU and Accelerators for HPC. Electronics 2021, 10, 406. https://doi.org/10.3390/electronics10040406