A Latency-Optimized Network-on-Chip with Rapid Bypass Channels
Abstract
1. Introduction
- We propose combining the bypass request with the flits instead of sending the request ahead of the packet.
- We present a rapid bypass circuit, which shows that the bypass request and the flit can be transmitted simultaneously within an acceptable timing constraint.
- We evaluate the impact of the startup delay. The results show that, for a single-cycle multi-hop network, a lower startup latency can reduce the transmission latency even when fewer nodes are bypassed.
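The trade-off named in the last contribution can be illustrated with a toy latency model. This is only a sketch under stated assumptions, not the evaluation used in the paper: it assumes a packet covers its route in segments of at most `hpc` hops, and that every segment costs `startup` cycles before one traversal cycle. The function name, its parameters, and the numbers below are hypothetical.

```python
import math


def segment_latency(hops: int, hpc: int, startup: int) -> int:
    """Cycles needed to cover `hops` hops in a toy single-cycle multi-hop model.

    Assumption: each segment spans at most `hpc` hops in one cycle and needs
    `startup` extra cycles before the traversal can begin.
    """
    if hops <= 0 or hpc <= 0 or startup < 0:
        raise ValueError("hops and hpc must be positive, startup non-negative")
    segments = math.ceil(hops / hpc)
    return segments * (startup + 1)


# Hypothetical scheme A: request sent one cycle ahead, up to 8 hops per cycle.
# Hypothetical scheme B: request travels with the flit, up to 4 hops per cycle.
for hops in (4, 12):
    ahead = segment_latency(hops, hpc=8, startup=1)
    with_flit = segment_latency(hops, hpc=4, startup=0)
    print(hops, ahead, with_flit)
```

In this model, scheme B bypasses fewer nodes per cycle yet finishes sooner for both route lengths, because it never pays the startup cycle. Real designs depend on contention and timing closure, which the model ignores.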
2. Related Work
3. Background and Motivation
4. Network-on-Chip with Rapid Bypass Channels
4.1. Architecture
- Packet-based flow control. The wormhole mechanism is widely used in NoCs because of its flexibility [30,31,32], and previous SMART-based NoCs also support it. However, wormhole flow control allows the flits of a packet to be blocked and stored in different routers, which makes it difficult to establish a bypass channel when the bypass request arrives at the same time as the flit. We therefore use packet-based flow control instead. In our proposed NoC, all flits must be ready before the packet leaves the source node, so the transmission of a packet's flits is never interrupted. Virtual cut-through flow control is also used to reduce latency [33]. A packet transfer between routers can span several consecutive cycles. In this way, a bypass channel stays established until the entire packet has been transmitted.
- Static virtual channel. In a VC-based NoC, the virtual channel must be decided before the packet is delivered to the next router. Allocating a VC adds a large delay to the critical path, which is unacceptable for flit bypass when the bypass request arrives together with the flit. We therefore use a static VC instead of allocating it dynamically. The VC of a packet is assigned at the source node and stays constant during the transmission. In this way, no VC allocation is needed in any router, which greatly reduces the transmission delay.
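The two design choices above can be sketched in a few lines. This is a minimal illustration, not the paper's hardware: the `Packet` fields, the helper names, and the modulo rule used to pick a VC at the source are all hypothetical, since the text only states that the VC is fixed at the source and that all flits must be ready before launch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the VC cannot be changed after creation
class Packet:
    packet_id: int
    dest: int
    num_flits: int
    vc: int  # chosen once at the source node


def assign_static_vc(packet_id: int, num_vcs: int) -> int:
    """Hypothetical source-side policy; the paper does not specify one."""
    return packet_id % num_vcs


def can_launch(packet: Packet, flits_ready: int) -> bool:
    """Packet-based flow control: launch only when every flit is ready."""
    return flits_ready >= packet.num_flits


def next_hop_vc(packet: Packet) -> int:
    """No per-router VC allocation: the next router reuses the packet's VC."""
    return packet.vc


pkt = Packet(packet_id=5, dest=4, num_flits=4, vc=assign_static_vc(5, num_vcs=2))
print(can_launch(pkt, flits_ready=3), can_launch(pkt, flits_ready=4), next_hop_vc(pkt))
```

Because `next_hop_vc` is a plain field read, nothing on the forwarding path depends on an allocator, which is the point of the static VC scheme described above.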
- Packet 0 (P0) is a new packet that will be transmitted to N4 in the following cycles. N0 is the source node of P0. The whole packet is now stored in the local buffer of N0.
- Packet 1 (P1) is an existing packet that comes from N2. Before reaching its destination (N5), P1 stopped at N3 because of resource contention in the previous cycle. All flits of P1 are stored in the input buffer of N3.
- Packet 2 (P2) is also an existing packet in the network. Its source and destination nodes are N4 and N6, respectively. Unlike P0 and P1, some of its flits have already arrived at N6. Thus, the output ports of N4 and N5 are occupied by P2.
- Data path for Packet 0. N0 launches a flit of P0 toward N4. The corresponding bypass request is sent to N1, N2 and N3 in sequence through the bypass request link. N1 and N2 are bypassed because all buffers in both nodes are idle. P0 stops at N3 because P1 is already waiting for transmission in the input buffer.
- Data path for Packet 1. N3 launches a flit of P1 toward N5. However, the flit and the bypass request stop at N4 because of the output port contention caused by P2. In this example, N4 starts transmitting P1 as soon as all flits of P2 have left the node.
- Data path for Packet 2. At N4, a flit of P2 is launched toward N6. It can arrive at the destination node in the same clock cycle without any conflict.
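The three data paths above can be reproduced with a small snapshot model. The sketch below is a simplification under stated assumptions: it looks at one cycle only, treats the network as a single row of routers N0 to N6, and lets a flit bypass a router only if that router's input buffer is empty and its output port is not held by another packet. The `Router` class and `stop_node` function are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    input_buffer: Optional[str] = None  # packet waiting in the input buffer
    output_owner: Optional[str] = None  # packet currently holding the output port


def stop_node(routers: List[Router], packet: str, src: int, dst: int) -> int:
    """Return the node where a flit launched from `src` ends up this cycle."""
    for node in range(src + 1, dst):
        router = routers[node]
        buffer_busy = router.input_buffer is not None
        port_busy = router.output_owner not in (None, packet)
        if buffer_busy or port_busy:
            return node  # contention: the flit is buffered here
    return dst  # every intermediate router was bypassed


# State taken from the example: P1 waits at N3, P2 holds the ports of N4 and N5.
routers = [Router() for _ in range(7)]
routers[3].input_buffer = "P1"
routers[4].output_owner = "P2"
routers[5].output_owner = "P2"

print(stop_node(routers, "P0", src=0, dst=4))  # P0 stops at N3
print(stop_node(routers, "P1", src=3, dst=5))  # P1 stops at N4
print(stop_node(routers, "P2", src=4, dst=6))  # P2 reaches N6
```

The model deliberately ignores timing, arbitration order, and the fact that P1 leaves N3 in the same cycle in which P0 arrives; it only mirrors the stop conditions named in the example.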
4.2. Implementation
5. Wire Overhead
6. Results and Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- [1] Petrisko, D.; Gilani, F.; Wyse, M.; Jung, D.C.; Davidson, S.; Gao, P.; Zhao, C.; Azad, Z.; Canakci, S.; Veluri, B.; et al. BlackParrot: An Agile Open-Source RISC-V Multicore for Accelerator SoCs. IEEE Micro 2020, 40, 93–102.
- [2] Jung, D.C.; Davidson, S.; Zhao, C.; Richmond, D.; Taylor, M.B. Ruche Networks: Wire-Maximal, No-Fuss NoCs: Special Session Paper. In Proceedings of the 2020 14th IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Hamburg, Germany, 24–25 September 2020; pp. 1–8.
- [3] Howard, J.; Dighe, S.; Hoskote, Y.; Vangal, S.; Finan, D.; Ruhl, G.; Jenkins, D.; Wilson, H.; Borkar, N.; Schrom, G.; et al. A 48-Core IA-32 message-passing processor with DVFS in 45 nm CMOS. In Proceedings of the 2010 IEEE International Solid-State Circuits Conference—(ISSCC), San Francisco, CA, USA, 7–11 February 2010; pp. 108–109.
- [4] Johnson, D.; Johnson, M.; Kelm, J.; Tuohy, W.; Lumetta, S.; Patel, S. Rigel: A 1,024-Core Single-Chip Accelerator Architecture. IEEE Micro 2011, 31, 30–41.
- [5] Davidson, S.; Xie, S.; Torng, C.; Al-Hawai, K.; Rovinski, A.; Ajayi, T.; Vega, L.; Zhao, C.; Zhao, R.; Dai, S.; et al. The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips. IEEE Micro 2018, 38, 30–41.
- [6] Passas, G.; Katevenis, M.; Pnevmatikatos, D. Crossbar NoCs Are Scalable Beyond 100 Nodes. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2012, 31, 573–585.
- [7] Sewell, K.; Dreslinski, R.G.; Manville, T.; Satpathy, S.; Pinckney, N.; Blake, G.; Cieslak, M.; Das, R.; Wenisch, T.F.; Sylvester, D.; et al. Swizzle-Switch Networks for Many-Core Systems. IEEE J. Emerg. Sel. Top. Circuits Syst. 2012, 2, 278–294.
- [8] Kao, Y.; Yang, M.; Artan, N.S.; Chao, H.J. CNoC: High-Radix Clos Network-on-Chip. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2011, 30, 1897–1910.
- [9] Abeyratne, N.; Das, R.; Li, Q.; Sewell, K.; Giridhar, B.; Dreslinski, R.G.; Blaauw, D.; Mudge, T. Scaling towards kilo-core processors with asymmetric high-radix topologies. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Shenzhen, China, 23–27 February 2013; pp. 496–507.
- [10] Besta, M.; Hassan, S.M.; Yalamanchili, S.; Ausavarungnirun, R.; Mutlu, O.; Hoefler, T. Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability; Association for Computing Machinery: New York, NY, USA, 2018.
- [11] Christy, R.; Riches, S.; Kottekkat, S.; Gopinath, P.; Sawant, K.; Kona, A.; Harrison, R. 8.3 A 3GHz ARM Neoverse N1 CPU in 7nm FinFET for Infrastructure Applications. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference—(ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 148–150.
- [12] McKeown, M.; Fu, Y.; Nguyen, T.; Zhou, Y.; Balkind, J.; Lavrov, A.; Shahrad, M.; Payne, S.; Wentzlaff, D. Piton: A Manycore Processor for Multitenant Clouds. IEEE Micro 2017, 37, 70–80.
- [13] McKeown, M.; Lavrov, A.; Shahrad, M.; Jackson, P.J.; Fu, Y.; Balkind, J.; Nguyen, T.M.; Lim, K.; Zhou, Y.; Wentzlaff, D. Power and Energy Characterization of an Open Source 25-Core Manycore Processor. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, Austria, 24–28 February 2018; pp. 762–775.
- [14] Mullins, R.; West, A.; Moore, S. Low-latency virtual-channel routers for on-chip networks. In Proceedings of the 31st Annual International Symposium on Computer Architecture, Munich, Germany, 23 June 2004; pp. 188–197.
- [15] Matsutani, H.; Koibuchi, M.; Amano, H.; Yoshinaga, T. Prediction router: Yet another low latency on-chip router architecture. In Proceedings of the 2009 IEEE 15th International Symposium on High Performance Computer Architecture, Raleigh, NC, USA, 14–18 February 2009; pp. 367–378.
- [16] Kumar, A.; Kundu, P.; Singh, A.P.; Peh, L.S.; Jha, N.K. A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In Proceedings of the International Conference on Computer Design, Lake Tahoe, CA, USA, 7–10 October 2007.
- [17] Kumar, A.; Peh, L.; Kundu, P.; Jha, N.K. Toward Ideal On-Chip Communication Using Express Virtual Channels. IEEE Micro 2008, 28, 80–90.
- [18] Perez, I.; Vallejo, E.; Beivide, R. Efficient Router Bypass via Hybrid Flow Control. In Proceedings of the 2018 11th International Workshop on Network on Chip Architectures (NoCArc), Fukuoka, Japan, 20 October 2018; pp. 1–6.
- [19] Krishna, T.; Chen, C.O.; Kwon, W.C.; Peh, L. Breaking the on-chip latency barrier using SMART. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Shenzhen, China, 23–27 February 2013; pp. 378–389.
- [20] Krishna, T.; Chen, C.O.; Kwon, W.; Peh, L. Smart: Single-Cycle Multihop Traversals over a Shared Network on Chip. IEEE Micro 2014, 34, 43–56.
- [21] Chen, C.O.; Park, S.; Krishna, T.; Subramanian, S.; Chandrakasan, A.P.; Peh, L. SMART: A single-cycle reconfigurable NoC for SoC applications. In Proceedings of the 2013 Design, Automation Test in Europe Conference Exhibition (DATE), Grenoble, France, 18–22 March 2013; pp. 338–343.
- [22] Krishna, T.; Chen, C.O.; Park, S.; Kwon, W.; Subramanian, S.; Chandrakasan, A.P.; Peh, L. Single-Cycle Multihop Asynchronous Repeated Traversal: A SMART Future for Reconfigurable On-Chip Networks. Computer 2013, 46, 48–55.
- [23] Kwon, H.; Krishna, T. OpenSMART: Single-cycle multi-hop NoC generator in BSV and Chisel. In Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Santa Rosa, CA, USA, 24–25 April 2017; pp. 195–204.
- [24] Yang, L.; Liu, W.; Chen, P.; Guan, N.; Li, M. Task mapping on SMART NoC: Contention matters, not the distance. In Proceedings of the 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 18–22 June 2017; pp. 1–6.
- [25] Pérez, I.; Vallejo, E.; Beivide, R. SMART++: Reducing Cost and Improving Efficiency of Multi-Hop Bypass in NoC Routers. In Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip, New York, NY, USA, 17–18 October 2019; NOCS ’19. Association for Computing Machinery: New York, NY, USA, 2019.
- [26] Zong, W.; Xu, Q. DOART: A low-power and low-latency Network-on-Chip. In Proceedings of the 2016 IEEE 34th International Conference on Computer Design (ICCD), Scottsdale, AZ, USA, 2–5 October 2016; pp. 352–355.
- [27] Chen, X.; Jha, N.K. Reducing Wire and Energy Overheads of the SMART NoC Using a Setup Request Network. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2016, 24, 3013–3026.
- [28] Krishna, T.; Postman, J.; Edmonds, C.; Peh, L.; Chiang, P. SWIFT: A SWing-reduced interconnect for a Token-based Network-on-Chip in 90nm CMOS. In Proceedings of the 2010 IEEE International Conference on Computer Design, Amsterdam, The Netherlands, 3–6 October 2010; pp. 439–446.
- [29] Park, S.; Krishna, T.; Chen, C.; Daya, B.; Chandrakasan, A.; Peh, L. Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45 nm SOI. In Proceedings of the DAC Design Automation Conference 2012, San Francisco, CA, USA, 3–7 June 2012; pp. 398–405.
- [30] Dally, W.J.; Seitz, C.L. The torus routing chip. Distrib. Comput. 1986, 1, 187–196.
- [31] Duato, J. A new theory of deadlock-free adaptive multicast routing in wormhole networks. In Proceedings of the 1993 5th IEEE Symposium on Parallel and Distributed Processing, Dallas, TX, USA, 1–4 December 1993; pp. 64–71.
- [32] Lin, X.; McKinley, P.K.; Ni, L.M. The message flow model for routing in wormhole-routed networks. IEEE Trans. Parallel Distrib. Syst. 1995, 6, 755–760.
- [33] Kermani, P.; Kleinrock, L. Virtual cut-through: A new computer communication switching technique. Comput. Netw. 1979, 3, 267–286.
- [34] Agarwal, N.; Krishna, T.; Peh, L.; Jha, N.K. GARNET: A detailed on-chip network model inside a full-system simulator. In Proceedings of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software, Boston, MA, USA, 26–28 April 2009; pp. 33–42.
- [35] Binkert, N.; Beckmann, B.; Black, G.; Reinhardt, S.K.; Saidi, A.; Basu, A.; Hestness, J.; Hower, D.R.; Krishna, T.; Sardashti, S.; et al. The Gem5 Simulator. ACM SIGARCH Comput. Archit. News 2011, 39, 1–7.
- [36] Gem5: The gem5 Simulator System. Available online: http://www.gem5.org/ (accessed on 22 May 2021).
- [37] Lowe-Power, J.; Ahmad, A.M.; Akram, A.; Alian, M.; Amslinger, R.; Andreozzi, M.; Armejach, A.; Asmussen, N.; Beckmann, B.; Bharadwaj, S.; et al. The gem5 Simulator: Version 20.0+. arXiv 2020, arXiv:2007.03152.
Router | Input Buffer along Dimension | Local/Turn Input Buffer | Output Buffer |
---|---|---|---|
N0 | Packet 0 | / | / |
N1 | / | / | / |
N2 | / | / | / |
N3 | Packet 1 | / | / |
N4 | / | Packet 2 | Packet 2 |
N5 | / | / | Packet 2 |
N6 | / | / | / |
Scheme | Signal | Width | Description |
---|---|---|---|
SSR in [19,27] | hop_num | log (1 + HPC) | Number of hops. |
SSR in [19,27] | vnet_id | log (N) | Virtual network ID. |
SSR in [19,27] | inject_router_id | log (HPC) | Source router ID. |
SSR in [19,27] | head_flit_flag | 1 | Header flit flag. |
SSR in [19,27] | eject_flag | 1 | Arrive at the destination. |
SSR in [19,27] | eject_port_id | log (N) | Ejection port at destination. |
Pre-SSR in [27] | source_id | log (HPC) | Source router ID. |
Rapid bypass in this work | hop_left | log (HPC) | Hops left over. |
Rapid bypass in this work | vc | N | Virtual channel ID. |
Comparison | Metric | Uniform Random | Bit Complement | Tornado | Transpose | Average |
---|---|---|---|---|---|---|
This Work vs. Baseline | Injection Rate | 0.02∼0.38 | 0.02∼0.18 | 0.02∼0.24 | 0.02∼0.14 | / |
This Work vs. Baseline | Baseline (cycles) | 14.38 | 18.84 | 11.29 | 13.24 | 14.44 |
This Work vs. Baseline | This Work (cycles) | 5.86 | 6.25 | 3.91 | 4.93 | 5.24 |
This Work vs. Baseline | Reduction (cycles) | 8.52 | 12.59 | 7.38 | 8.30 | 9.20 |
This Work vs. Baseline | Reduction (%) | 59.25% | 66.81% | 65.38% | 62.72% | 63.54% |
This Work vs. SMART | Injection Rate | 0.02∼0.44 | 0.02∼0.22 | 0.02∼0.24 | 0.02∼0.14 | / |
This Work vs. SMART | SMART (cycles) | 10.00 | 10.34 | 5.41 | 7.16 | 8.23 |
This Work vs. SMART | This Work (cycles) | 7.24 | 7.25 | 3.91 | 4.93 | 5.83 |
This Work vs. SMART | Reduction (cycles) | 2.76 | 3.10 | 1.51 | 2.23 | 2.40 |
This Work vs. SMART | Reduction (%) | 27.62% | 29.94% | 27.81% | 31.10% | 29.12% |
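The reduction rows in the table above follow from the latency rows by simple arithmetic. The sketch below recomputes the baseline comparison from the rounded table entries. It assumes the percentage is the cycle reduction divided by the reference latency, and the dictionary and function names are hypothetical. Since the inputs are already rounded to two decimals, the recomputed percentages may differ from the published ones by a few hundredths of a percentage point.

```python
# Average latencies (cycles) copied from the "This Work vs. Baseline" rows.
baseline = {"Uniform Random": 14.38, "Bit Complement": 18.84,
            "Tornado": 11.29, "Transpose": 13.24}
this_work = {"Uniform Random": 5.86, "Bit Complement": 6.25,
             "Tornado": 3.91, "Transpose": 4.93}


def reduction(reference: float, new: float) -> tuple:
    """Return (cycles saved, percentage saved relative to the reference)."""
    cycles = reference - new
    return cycles, 100.0 * cycles / reference


for pattern in baseline:
    cycles, percent = reduction(baseline[pattern], this_work[pattern])
    print(f"{pattern}: {cycles:.2f} cycles, {percent:.2f}%")

# The "Average" column is the mean over the four traffic patterns.
mean_baseline = sum(baseline.values()) / len(baseline)
mean_this_work = sum(this_work.values()) / len(this_work)
print(f"Average: {mean_baseline:.2f} -> {mean_this_work:.2f} cycles")
```

The same `reduction` function can be applied to the SMART rows by swapping in the corresponding table entries.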
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ma, W.; Gao, X.; Gao, Y.; Yu, N. A Latency-Optimized Network-on-Chip with Rapid Bypass Channels. Micromachines 2021, 12, 621. https://doi.org/10.3390/mi12060621