Ditching the Queue: Optimizing Coprocessor Utilization with Out-of-Order CPUs on Compact Systems on Chip
Abstract
1. Introduction
- It defines a general strategy to comprehensively evaluate the benefits of OoO instruction execution in the context of tightly coupled coprocessors using a variable-latency module and an automatic generator of test applications with different instruction compositions and instruction dependency patterns.
- It demonstrates the effectiveness of an existing open-source OoO CPU in covering the latency of long-latency coprocessors in a wide selection of workloads.
2. Background and Related Works
2.1. Tightly Coupled Coprocessors
2.2. Out-of-Order Central Processing Units
- LEN5’s modular microarchitecture facilitates the straightforward deployment of custom instruction set extensions and coprocessors. This modularity greatly simplified the integration of the CLC in the system, whereas the centralized execution control scheme used by the available OoO cores would have required significant modifications.
- Compared to other OoO cores, LEN5’s architecture prioritizes scalability over performance, resulting in a more area-efficient design. Conversely, BOOM and Alibaba’s cores are optimized for superscalar instruction execution, featuring wider issue windows and multiple execution units for each instruction class. While these features yield superior IPC when executing sequences of scalar instructions, they are less advantageous for the purposes of this work. When handling long-latency accelerated instructions, maximizing the number of scalar instructions executed per cycle could result in the CPU Execution Units (EUs) idling while awaiting the completion of offloaded instructions, thus not justifying the additional area and power overhead.
- The base variant of LEN5 targets bare-metal applications without a cache hierarchy, thereby offering a simpler interface with the host system bus. This interface is compatible with common bus protocols used in low-power Microcontroller Units (MCUs), such as the OBI bus of the X-HEEP platform [13] selected for the experiments. Adapting the interface of the other available cores would have required additional effort.
3. Out-of-Order Central Processing Unit Microarchitecture
- It allows instructions to enter the execution engine of the core regardless of their readiness for execution. This approach maximizes the chances of identifying instructions that are independent of the previous ones, so that they can be scheduled and executed immediately, irrespective of the original program order.
- It facilitates the parallel execution of multiple instructions (a superscalar design) so that if one instruction requires a prolonged time to complete, another can be dispatched to a different EU and executed concurrently.
- It ensures the prompt retirement of completed instructions, potentially out of program order, to allow new instructions to enter the execution engine, thereby keeping the EUs productive.
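The effect of the first two properties can be illustrated with a toy cycle-count model. The sketch below is not a model of LEN5: it assumes at most one instruction starts per cycle, unlimited EUs, and no limit on the instruction window, and the function name `total_cycles` and the program encoding are hypothetical. It only contrasts an issue policy that stalls on the oldest waiting instruction with one that may start any instruction whose operands are ready.

```python
def total_cycles(program, out_of_order):
    """Return the cycle at which the last result becomes available.

    program: non-empty list of (latency, deps) tuples, where deps holds the
    indices of earlier instructions whose results are needed (RAW dependencies).
    At most one instruction starts per cycle (illustrative assumption).
    """
    finish = {}                       # index -> cycle at which the result is ready
    waiting = list(range(len(program)))
    cycle = 0
    while waiting:
        # In-order issue may only consider the oldest waiting instruction;
        # out-of-order issue may pick the oldest *ready* one.
        candidates = waiting if out_of_order else waiting[:1]
        for i in candidates:
            latency, deps = program[i]
            if all(d in finish and finish[d] <= cycle for d in deps):
                finish[i] = cycle + latency
                waiting.remove(i)
                break                 # one instruction started this cycle
        cycle += 1
    return max(finish.values())


# A 10-cycle offloaded instruction, one instruction that depends on it,
# and three independent single-cycle instructions.
program = [(10, []), (1, [0]), (1, []), (1, []), (1, [])]
in_order_cycles = total_cycles(program, out_of_order=False)
ooo_cycles = total_cycles(program, out_of_order=True)
```

In this example the out-of-order policy executes the three independent instructions while the long-latency one is still in flight, so only the dependent instruction waits for it.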
3.1. Out-of-Order Instruction Execution
- Translating incoming instructions into commands for the associated RS and EU.
- Allocating an available ROB entry to buffer the result of the instruction once it completes. From this moment, the instruction is uniquely tagged with the index of the assigned ROB entry.
- Fetching the operands for the instruction from the Register File (RF) or the ROB, if available.
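The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than LEN5's actual logic: the class and field names are hypothetical, and the `rename` map that tracks the newest in-flight producer of each register is an assumption introduced so that the sketch knows whether to read an operand from the RF or from the ROB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest: int                     # destination register index
    value: Optional[int] = None   # None until the EU writes the result back


@dataclass
class Operand:
    value: Optional[int] = None   # operand value, if already available
    tag: Optional[int] = None     # otherwise, ROB index of the producer to wait on


class IssueStage:
    def __init__(self, num_regs=32, rob_size=4):
        self.rf = [0] * num_regs
        self.rob = [None] * rob_size
        self.tail = 0
        self.rename = {}          # register -> ROB tag of newest producer (assumed structure)

    def fetch_operand(self, reg):
        tag = self.rename.get(reg)
        if tag is None:
            return Operand(value=self.rf[reg])    # no in-flight producer: read the RF
        entry = self.rob[tag]
        if entry.value is not None:
            return Operand(value=entry.value)     # result already buffered in the ROB
        return Operand(tag=tag)                   # not available yet: wait on the tag

    def issue(self, op, dest, src1, src2):
        if self.rob[self.tail] is not None:
            return None                           # no free ROB entry: stall
        # Read the sources before renaming dest, in case dest is also a source.
        operands = (self.fetch_operand(src1), self.fetch_operand(src2))
        tag = self.tail
        self.rob[tag] = RobEntry(dest=dest)
        self.rename[dest] = tag
        self.tail = (self.tail + 1) % len(self.rob)
        # Command forwarded to the RS of the EU that handles `op`.
        return {"op": op, "tag": tag, "operands": operands}
```

An operand that comes back with only a `tag` would be captured later by the RS, once the instruction holding that ROB entry completes.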
3.2. Out-of-Order Instruction Commit
- Its execution is complete, making the instruction result available in the ROB.
- It is no longer speculative, meaning all previous branch predictions have been validated.
- It did not trigger any exceptions.
- There are no newer instructions eligible for commit that would write to the same destination register (WAW hazard). If such an instruction exists, the older instruction is simply retired without updating the RF.
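The commit conditions listed above can be expressed as a small check. The following sketch assumes a ROB modeled as a Python list ordered from oldest to newest; the names `RobEntry`, `eligible`, and `commit` are hypothetical, and the sketch covers only the four listed conditions, not any further constraint the real commit logic may apply.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    dest: int                  # destination register index
    value: int = 0             # result, meaningful once done is True
    done: bool = False         # execution complete, result available in the ROB
    speculative: bool = False  # an earlier branch prediction is still unresolved
    exception: bool = False    # the instruction raised an exception


def eligible(entry):
    """First three conditions: complete, non-speculative, exception-free."""
    return entry.done and not entry.speculative and not entry.exception


def commit(rob, index, rf):
    """Try to retire rob[index]; return True if the entry was retired."""
    entry = rob[index]
    if not eligible(entry):
        return False
    # WAW check: a newer eligible instruction targets the same register,
    # so this (older) result must not reach the RF.
    overwritten = any(eligible(e) and e.dest == entry.dest for e in rob[index + 1:])
    if not overwritten:
        rf[entry.dest] = entry.value
    del rob[index]             # retire the entry in either case
    return True
```

Retiring the older instruction without the RF write is safe in this sketch because the newer value is the one the architectural state must eventually hold.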
4. Experimental Results
4.1. Experimental Setup
4.1.1. System Configuration
4.1.2. Configurable-Latency Coprocessor Architecture
- The new custom instructions are incorporated into the main decoder, specifying the expected control signals for the CLC, the necessary source operands, and the result type so that the CPU can correctly manage dependencies and commit.
- A new four-entry RS is adapted from the ALU's RS and added to LEN5’s backend with negligible impact on the overall area.
- The CLC is connected to the dedicated RS. Dynamic synchronization between the CLC and the CPU is inherently achieved by the system-wide valid-ready handshake protocol.
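To illustrate how a valid-ready handshake synchronizes a variable-latency unit with its producer and consumer, the sketch below models one such unit at cycle granularity. It is a behavioral toy, not the CLC's RTL: the class name, the single-operation capacity, and the pass-through "computation" are all assumptions made for the example.

```python
class ConfigurableLatencyUnit:
    """Toy unit with valid-ready interfaces; holds one operation at a time."""

    def __init__(self, latency):
        assert latency >= 1
        self.latency = latency
        self.busy = False
        self.remaining = 0
        self.result = None

    @property
    def ready(self):
        """Upstream ready: a new operation can be accepted."""
        return not self.busy

    @property
    def valid(self):
        """Downstream valid: a result is waiting to be taken."""
        return self.busy and self.remaining == 0

    def step(self, in_valid, in_data, out_ready):
        """Advance one clock cycle; return (accepted, delivered_result_or_None)."""
        accept = in_valid and self.ready      # input handshake
        deliver = self.valid and out_ready    # output handshake
        out = self.result if deliver else None
        if deliver:
            self.busy = False
        elif self.busy and self.remaining > 0:
            self.remaining -= 1               # still computing
        if accept:
            self.busy = True
            self.remaining = self.latency - 1
            self.result = in_data             # placeholder for the real operation
        return accept, out


def run(unit, items, cycles):
    """Feed items as fast as the unit accepts them; log (cycle, result) pairs."""
    pending = list(items)
    received = []
    for cycle in range(cycles):
        accepted, out = unit.step(bool(pending), pending[0] if pending else None, True)
        if accepted:
            pending.pop(0)
        if out is not None:
            received.append((cycle, out))
    return received


trace = run(ConfigurableLatencyUnit(latency=3), [7, 8], cycles=10)
```

No side channel tells the producer how long an operation takes: it simply holds its request until `ready` is high, and the consumer waits for `valid`, which is why changing the latency requires no change on either side.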
4.1.3. Configurable Test Applications
Listing 1. Example of assembly code with dummy instructions.
4.2. Instructions per Cycle Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
ALU | Arithmetic Logic Unit |
ANN | Artificial Neural Network |
BTB | Branch Target Buffer |
BU | Branch Unit |
CDB | Common Data Bus |
CGRA | Coarse-Grained Reconfigurable Array |
CLC | Configurable-Latency Coprocessor |
CPU | Central Processing Unit |
DLP | Data-Level Parallelism |
EU | Execution Unit |
FPGA | Field-Programmable Gate Array |
FPU | Floating-Point Unit |
FSM | Finite-State Machine |
GPR | General-Purpose Register |
ILP | Instruction-Level Parallelism |
IoT | Internet of Things |
IPC | Instructions Per Cycle |
ISA | Instruction Set Architecture |
MCU | Microcontroller Unit |
NIB | Number of Instructions per Block |
OoO | Out-of-Order |
PQC | Post-Quantum Cryptography |
LB | Load Buffer |
LSU | Load–Store Unit |
RAS | Return Address Stack |
RAW | Read-After-Write |
RISC | Reduced Instruction Set Computer |
RF | Register File |
ROB | ReOrder Buffer |
RS | Reservation Station |
RTL | Register Transfer Level |
SB | Store Buffer |
SIMD | Single-Instruction Multiple-Data |
SoC | System-on-Chip |
WAR | Write-After-Read |
WAW | Write-After-Write |
References
1. Daoud, H.; Bayoumi, M.A. Efficient Epileptic Seizure Prediction Based on Deep Learning. IEEE Trans. Biomed. Circuits Syst. 2019, 13, 804–813.
2. Hoshino, S.; Kubota, Y. Mobile Robot Motion Planning through Obstacle State Classifier. In Proceedings of the 2023 62nd Annual Conference of the Society of Instrument and Control Engineers (SICE), Tsu, Japan, 6–9 September 2023; pp. 120–126.
3. Wang, Y.; Jiang, J.; Li, S.; Li, R.; Xu, S.; Wang, J.; Li, K. Decision-Making Driven by Driver Intelligence and Environment Reasoning for High-Level Autonomous Vehicles: A Survey. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10362–10381.
4. Galuzzi, C.; Bertels, K. The Instruction-Set Extension Problem: A Survey. ACM Trans. Reconfigurable Technol. Syst. 2011, 4, 1–28.
5. Gautschi, M.; Schiavone, P.D.; Traber, A.; Loi, I.; Pullini, A.; Rossi, D.; Flamand, E.; Gürkaynak, F.K.; Benini, L. Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2700–2713.
6. Mach, S.; Schuiki, F.; Zaruba, F.; Benini, L. Fpnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 29, 774–787.
7. Garofalo, A.; Tagliavini, G.; Conti, F.; Rossi, D.; Benini, L. XpulpNN: Accelerating Quantized Neural Networks on RISC-V Processors Through ISA Extensions. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 186–191.
8. Fritzmann, T.; Sigl, G.; Sepúlveda, M.J. RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography. IACR Cryptol. ePrint Arch. 2020, 2020, 446.
9. Perotti, M.; Cavalcante, M.; Wistoff, N.; Andri, R.; Cavigelli, L.; Benini, L. A “New Ara” for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design. In Proceedings of the 2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP), Gothenburg, Sweden, 12–14 July 2022; pp. 43–51.
10. Zhao, J.; Korpan, B.; Gonzalez, A.; Asanovic, K. SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine. In Proceedings of the Fourth Workshop on Computer Architecture Research with RISC-V, Virtual, 29 May 2020.
11. Chen, C.; Xiang, X.; Liu, C.; Shang, Y.; Guo, R.; Liu, D.; Lu, Y.; Hao, Z.; Luo, J.; Chen, Z.; et al. Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension: Industrial Product. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 30 May–3 June 2020; pp. 52–64.
12. Caon, M.; Petrolo, V.; Mirigaldi, M.; Guella, F.; Masera, G.; Martina, M. Seeing Beyond the Order: A LEN5 to Sharpen Edge Microprocessors with Dynamic Scheduling. In CF ’24: Proceedings of the 20th ACM International Conference on Computing Frontiers, Ischia, Italy, 7–9 May 2024; Association for Computing Machinery: New York, NY, USA, 2024.
13. Machetti, S.; Schiavone, P.D.; Müller, T.C.; Peón-Quirós, M.; Atienza, D. X-HEEP: An Open-Source, Configurable and Extendible RISC-V Microcontroller for the Exploration of Ultra-Low-Power Edge Accelerators. arXiv 2024, arXiv:2401.05548.
14. Tomasulo, R.M. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. Dev. 1967, 11, 25–33.
15. Hennessy, J.L.; Patterson, D.A. Computer Architecture: A Quantitative Approach; Morgan Kaufmann: Burlington, MA, USA, 2017.
16. Alves, R.; Ros, A.; Black-Schaffer, D.; Kaxiras, S. Filter caching for free: The untapped potential of the store-buffer. In ISCA’19: Proceedings of the 46th International Symposium on Computer Architecture, Phoenix, AZ, USA, 22–26 June 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 436–448.
17. OpenHW Group. OpenHW Group Specification: Core-V eXtension Interface (CV-X-IF). 2023. Available online: https://docs.openhwgroup.org/projects/openhw-group-core-v-xif/en/latest/ (accessed on 28 May 2024).
| | LEN5 Max Perf | CV32E40X |
|---|---|---|
| Clk Freq. [ | 438 | 360 |
| Area [1 × 10³ | 423 | 49 |
| Area [ a | 294 | 34 |
| Relative System Area b | 1.12 | 1.00 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Caon, M.; Masera, G.; Martina, M. Ditching the Queue: Optimizing Coprocessor Utilization with Out-of-Order CPUs on Compact Systems on Chip. Electronics 2024, 13, 3018. https://doi.org/10.3390/electronics13153018