Much research focuses on many-core processors, which integrate a large number of cores. Their area, energy consumption, and performance tend to be proportional to the number of cores, so in-order (IO) execution is preferable for its area and energy efficiency. However, widening a two-way IO core to three-way offers very little improvement, because data dependencies limit the benefit of the extra issue width. In addition, changing the core from IO to out-of-order (OoO) execution to improve instructions per cycle (IPC) significantly increases area and energy consumption. Combining IO execution with arithmetic logic unit (ALU) cascading is an effective way to alleviate this problem. However, ALU cascading is implemented with complex bypass circuits, because it requires connections from the outputs of all ALUs to the inputs of all ALUs. The hardware complexity of these bypass circuits increases area, energy consumption, and delay. In this study, we propose a mechanism that limits the number of forwarding paths and allocates instructions to ALUs in accordance with the limited paths. This mechanism scales down the bypass circuits and thereby reduces the hardware complexity. Our evaluation results show that, compared with the conventional mechanism, the proposed mechanism reduces area by 38.7%, energy by 41.1%, and delay by 23.2%, with very little IPC loss on average.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.