# A Survey of Network-Based Hardware Accelerators

## Abstract


## 1. Introduction

- sorting networks;
- searching networks;
- counting networks.

## 2. Networks for Data Processing

#### 2.1. Sorting Networks

[…] $\mathrm{N}^{2}$) in terms of the number of comparisons), the bubble-sorting network inherits the same characteristic.

- (1) The number of comparators grows faster than linearly with the number of inputs N in all the network types (quadratically for the bubble and even–odd transition networks). For any practically interesting value of N (at least several thousand), the required hardware resources become prohibitive.
- (2) The even–odd merge and bitonic merge networks are not regular, so it is not easy to design parameterizable circuits that can be generated for any value of N. Even–odd transition networks are the most regular and are therefore easily parameterizable for any value of N.
- (3) The networks are “hardwired” in the sense that even if all the data are already sorted at an intermediate level, they still have to propagate through the remaining comparators before the results can be read from the outputs. This situation is illustrated in Figure 1 with a “sorted” label.
- (4) Even–odd transition networks are a good example of an iterative circuit whose main module is composed of two comparator lines (two levels: green and red in Figure 1c). The full network is just a replication of N/2 equal modules. Therefore, the function of this N/2-module iterative circuit can be performed by a sequential circuit that uses just one copy of the module but requires N/2 clock cycles to obtain the result. This is a very good example of a space/time trade-off in digital design: the number of comparators is drastically reduced (which greatly alleviates the resource problem) at the cost of only a slight increase in the processing time (basically, register setup and propagation times multiplied by N/2).
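The space/time trade-off in item (4) can be illustrated with a small software model. The following is only a behavioural sketch under stated assumptions, not the circuit itself: the function names (`compare_exchange`, `module`, `even_odd_transition_sort`) are hypothetical, and each call to `module` stands for one clock cycle of the sequential circuit that reuses a single two-level module.

```python
def compare_exchange(data, i, j):
    """One comparator: the smaller item goes to index i, the larger to index j."""
    if data[i] > data[j]:
        data[i], data[j] = data[j], data[i]


def module(data):
    """One iterative module: an even comparator line followed by an odd one."""
    n = len(data)
    for i in range(0, n - 1, 2):  # first level: pairs (0,1), (2,3), ...
        compare_exchange(data, i, i + 1)
    for i in range(1, n - 1, 2):  # second level: pairs (1,2), (3,4), ...
        compare_exchange(data, i, i + 1)


def even_odd_transition_sort(items):
    """Reuse the single module for ceil(N/2) simulated clock cycles."""
    data = list(items)
    cycles = (len(data) + 1) // 2  # N/2 for even N
    for _ in range(cycles):
        module(data)  # in hardware: outputs are registered and fed back
    return data


print(even_odd_transition_sort([5, 3, 8, 1, 9, 2, 7, 4]))
# prints [1, 2, 3, 4, 5, 7, 8, 9]
```

For even N the module contains N − 1 comparators, so the sequential version needs N − 1 comparators instead of the N × (N − 1)/2 of the fully unrolled network. The loop always runs all cycles, which mirrors the "hardwired" behaviour described in item (3).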

#### 2.2. Searching Networks

#### 2.3. Counting Networks

[…] $2^{\mathrm{i}}$-bit binary vectors. For example, fragment 1 calculates the Hamming weights in four ($\mathrm{M}/2^{1}$) 2-bit vectors, which are 11, 11, 01, and 10; fragment 2 calculates the Hamming weights in two ($\mathrm{M}/2^{2}$) 4-bit vectors: 1111 and 0110; and, finally, the last fragment, 3, calculates the Hamming weight in a single ($\mathrm{M}/2^{3}$) 8-bit vector: 11110110, producing the result $0110_{2}$ (6 in decimal). The total number of levels (including data-independent components) in the network is L(M) = 6. For a general value of $\mathrm{M} = 2^{\mathrm{f}}$, the counting network has the following parameters:

- the number of components (half-adders or XOR gates) at any fragment i = 1, 2, …, f is $\mathrm{C}_{\mathrm{i}} = \frac{\mathrm{M}}{2^{\mathrm{i}}} \times \frac{\mathrm{i} \times (\mathrm{i}+1)}{2}$;
- the total number of components in the network is C(M) = $\sum_{\mathrm{i}=1}^{\mathrm{f}} \mathrm{C}_{\mathrm{i}}$;
- the number of levels is $\mathrm{L}(\mathrm{M} = 2^{\mathrm{f}}) = \frac{\mathrm{f} \times (\mathrm{f}+1)}{2}$.
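The parameters above can be checked numerically for the 8-bit example. The sketch below is an illustration only: the function names are hypothetical, and `fragment_weights` is a purely behavioural model of what each fragment outputs (the Hamming weights of consecutive groups of 2^i bits), not a gate-level description of the half-adder/XOR structure.

```python
def fragment_components(M, i):
    """C_i = (M / 2**i) * (i * (i + 1) / 2) components in fragment i."""
    return (M // 2**i) * (i * (i + 1) // 2)


def counting_network_parameters(M):
    """Return f, the per-fragment counts C_i, the total C(M) and the levels L(M)."""
    f = M.bit_length() - 1
    if M < 2 or 2**f != M:
        raise ValueError("M must be a power of two, M = 2**f with f >= 1")
    per_fragment = [fragment_components(M, i) for i in range(1, f + 1)]
    return {"f": f, "C_i": per_fragment, "C": sum(per_fragment),
            "L": f * (f + 1) // 2}


def fragment_weights(bits, i):
    """Hamming weights of consecutive 2**i-bit groups (behavioural model only)."""
    width = 2**i
    return [bits[k:k + width].count("1") for k in range(0, len(bits), width)]


print(counting_network_parameters(8))
# prints {'f': 3, 'C_i': [4, 6, 6], 'C': 16, 'L': 6}
for i in (1, 2, 3):
    print(i, fragment_weights("11110110", i))
# prints 1 [2, 2, 1, 1], then 2 [4, 2], then 3 [6]
```

The value L = 6 agrees with the number of levels stated for M = 8, and the last fragment yields 6, the Hamming weight of 11110110.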

## 3. Implementations of Parallel Networks in Reconfigurable Hardware

#### 3.1. Implementations of Sorting Networks

[…] $2^{\mathrm{p}=6}$ = 64 data items. The overall conclusion is that only small data subsets can be sorted in parallel in an FPGA-based accelerator, so it makes sense to merge the sorted subsets in software. When the subsets sorted by the accelerator are merged in software, the speedup decreases as the data set grows. The explanation is that the share of the work done in parallel in the accelerator, relative to the work done in software by the CPU (Central Processing Unit), decreases. Although no high speedups over a CPU have been achieved, the works [27,28] provide an important step towards incorporating the capabilities of FPGAs into parallel data processing engines.
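The hardware/software split described above can be sketched as follows. This is a minimal software model under stated assumptions, not the design of [27,28]: `accelerator_sort` is a hypothetical stand-in for the FPGA sorting network (here simply Python's `sorted`), the block size of 64 is taken from the text, and the merge uses the standard-library `heapq.merge`. The helper `merge_passes` assumes a plain two-way merge and is included only to show how the software share grows with the data set size.

```python
import heapq

BLOCK = 64  # 2**6 items sorted at once by the accelerator, as in the text


def accelerator_sort(block):
    """Hypothetical stand-in for the hardware sorting network."""
    return sorted(block)


def hybrid_sort(items, block=BLOCK):
    """Sort fixed-size blocks 'in hardware', then merge the runs in software."""
    runs = [accelerator_sort(items[k:k + block])
            for k in range(0, len(items), block)]
    return list(heapq.merge(*runs))  # software-side k-way merge


def merge_passes(n, block=BLOCK):
    """Two-way merge passes needed in software: ceil(log2(number of runs))."""
    runs = -(-n // block)  # ceiling division
    return max(runs - 1, 0).bit_length()


data = list(range(1000, 0, -1))
assert hybrid_sort(data) == sorted(data)
print([merge_passes(n) for n in (64, 1024, 65536)])
# prints [0, 4, 10]
```

Each item passes through the accelerator once, whatever the data set size, while the number of software merge passes keeps growing. This is consistent with the decreasing speedup noted above.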

^{19}, M = 16.

^{M}-bit long streams.

#### 3.2. Implementations of Searching Networks

#### 3.3. Implementations of Counting Networks

[…] $2^{\mathrm{n}}$. It is also suggested how to deal with k >> $2^{\mathrm{n}}$. The proposed designs are not based on sorting networks, which leads to more modest hardware resource requirements. This is because, in contrast to sorting networks, the number of basic components used in counting networks is incrementally reduced as data move from left to right (see Figure 3). Thus, although the number of levels is the same as in the best sorting networks, the incrementally reduced complexity at each level means that counting networks can be employed for a significantly greater value of M than sorting networks (within the same target hardware constraints). The authors report the results of experiments on two FPGA/PSoC-based prototyping boards: the Atlys with the Xilinx Spartan-6 FPGA and the ZedBoard with the Xilinx Zynq, which includes an Artix-7 FPGA. The results show that the proposed counting networks outperform the previous parallel counter-based designs. It is also illustrated that counting networks can be mapped efficiently to DSP slices, which are available as standard components in modern FPGAs.

## 4. Advantages and Drawbacks

## 5. Conclusions

## Funding

## Conflicts of Interest

## References

1. Oak Ridge National Laboratory. SUMMIT Oak Ridge National Laboratory’s 200 Petaflop Supercomputer. Available online: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/ (accessed on 8 January 2022).
2. Fu, H.; Liao, J.; Yang, J.; Wang, L.; Song, Z.; Huang, X.; Yang, C.; Xue, W.; Liu, F.; Qiao, F.; et al. The Sunway TaihuLight supercomputer: System and applications. Sci. China Inf. Sci. **2019**, 59, 072001.
3. Fujitsu. Supercomputer Fugaku Specifications. Available online: https://www.fujitsu.com/global/about/innovation/fugaku/specifications/ (accessed on 8 January 2022).
4. Kuchcinski, K. Constraint programming in embedded systems design: Considered helpful. Microprocess. Microsyst. **2019**, 69, 24–34.
5. Rodríguez, A.; Valverde, J.; Portilla, J.; Otero, A.; Riesgo, T.; De la Torre, E. FPGA-Based High-Performance Embedded Systems for Adaptive Edge Computing in Cyber-Physical Systems: The ARTICo3 Framework. Sensors **2018**, 18, 1877.
6. Alaei, M.; Yazdanpanah, F. A high-performance FPGA-based multicrossbar prioritized network-on-chip. Concurr. Comput. Pract. Exp. **2021**, 33, e6055.
7. Podobas, A.; Zohouri, H.R.; Maruyama, N.; Matsuoka, S. Evaluating high-level design strategies on FPGAs for high-performance computing. In Proceedings of the 2017 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, 4–8 September 2017; pp. 1–4.
8. Streit, F.; Wituschek, S.; Pschyklenk, M.; Becher, A.; Lechner, M.; Wildermann, S.; Pitz, I.; Merklein, M.; Teich, J. Data acquisition and control at the edge: A hardware/software-reconfigurable approach. Prod. Eng. **2020**, 14, 365–371.
9. Vanderbauwhede, W.; Benkrid, K. (Eds.) High-Performance Computing Using FPGAs; Springer: Berlin/Heidelberg, Germany, 2013.
10. Zohouri, H.R. High Performance Computing with FPGAs and OpenCL. Ph.D. Thesis, Tokyo Institute of Technology, Tokyo, Japan, 2018. Available online: https://arxiv.org/ftp/arxiv/papers/1810/1810.09773.pdf (accessed on 8 January 2022).
11. Xiong, Q. FPGA Acceleration of High Performance Computing Communication Middleware. Ph.D. Thesis, Boston University, Boston, MA, USA, 2019. Available online: https://open.bu.edu/handle/2144/38211 (accessed on 1 March 2022).
12. Huang, S.; Lin, M.; Yu, F.; Chen, R.; Zhang, L.; Zhu, Y. Real-time high definition license plate localization and recognition accelerator for IoT endpoint system on chip. J. Appl. Sci. Eng. **2022**, 25, 1–11.
13. Cho, H.; Lee, J.; Lee, J. FARNN: FPGA-GPU Hybrid Acceleration Platform for Recurrent Neural Networks. IEEE Trans. Parallel Distrib. Syst. **2022**, 33, 1725–1738.
14. Papadopoulos, L.; Soudris, D.; Kessler, C.; Ernstsson, A.; Ahlqvist, J.; Vasilas, N.; Papadopoulos, A.I.; Seferlis, P.; Prouveur, C.; Haefele, M.; et al. EXA2PRO: A Framework for High Development Productivity on Heterogeneous Computing Systems. IEEE Trans. Parallel Distrib. Syst. **2022**, 33, 792–804.
15. Xu, Q.; Varadarajan, S.; Chakrabarti, C.; Karam, L.J. A distributed canny edge detector: Algorithm and FPGA implementation. IEEE Trans. Image Process. **2015**, 23, 2944–2960.
16. Nguyen, D.T.; Nguyen, T.N.; Kim, H.; Lee, H.-J. A high-throughput and power-efficient FPGA implementation of yolo CNN for object detection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2019**, 27, 1861–1873.
17. Mittal, S. A survey of FPGA-based accelerators for convolutional neural networks. Neural Comput. Appl. **2020**, 32, 1109–1139.
18. Liu, Z.; Dou, Y.; Jiang, J.; Xu, J.; Li, S.; Zhou, Y.; Xu, Y. Throughput-optimized FPGA accelerator for deep convolutional neural networks. ACM Trans. Reconfig. Technol. Syst. **2017**, 10, 1–23.
19. Sugie, T.; Akamatsu, T.; Nishitsuji, T.; Hirayama, R.; Masuda, N.; Nakayama, H.; Ichihashi, Y.; Shiraki, A.; Oikawa, M.; Takada, N.; et al. High-performance parallel computing for next-generation holographic imaging. Nat. Electron. **2018**, 1, 254–259.
20. George, A.D.; Wilson, C.M. Onboard Processing with Hybrid and Reconfigurable Computing on Small Satellites. Proc. IEEE **2018**, 106, 458–470.
21. Seng, K.P.; Lee, P.J.; Ang, L.M. Embedded intelligence on FPGA: Survey, applications and challenges. Electronics **2021**, 10, 895.
22. Wan, Z.; Yu, B.; Li, T.Y.; Tang, J.; Zhu, Y.; Wang, Y.; Raychowdhury, A.; Liu, S. A Survey of FPGA-Based Robotic Computing. IEEE Circuits Syst. Mag. **2021**, 21, 48–74.
23. Knuth, D.E. The Art of Computer Programming. Sorting and Searching, 3rd ed.; Addison-Wesley: Boston, MA, USA, 2011.
24. Wey, C.L.; Shieh, M.D.; Lin, S.Y. Algorithms of Finding the First Two Minimum Values and Their Hardware Implementation. IEEE Trans. Circuits Syst. I Regul. Pap. **2008**, 55, 3430–3437.
25. Skliarova, I.; Sklyarov, V. FPGA-Based Hardware Accelerators; Springer: Cham, Switzerland, 2019.
26. Sklyarov, V.; Skliarova, I. Design and implementation of counting networks. Comput. J. **2015**, 97, 557–577.
27. Mueller, R.; Teubner, J.; Alonso, G. Sorting Networks on FPGAs. Int. J. Very Large Data Bases **2012**, 21, 1–23.
28. Mueller, R. Data Stream Processing on Embedded Devices. Ph.D. Thesis, ETH, Zurich, Switzerland, 2010.
29. Zuluaga, M.; Milder, P.; Puschel, M. Computer Generation of Streaming Sorting Networks. In Proceedings of the 49th Design Automation Conference, San Francisco, CA, USA, 3–7 June 2012; pp. 1245–1253.
30. Sklyarov, V.; Skliarova, I. Fast Regular Circuits for Network-based Parallel Data Processing. Adv. Electr. Comput. Eng. **2013**, 13, 47–50.
31. Sklyarov, V.; Skliarova, I. High-performance implementation of regular and easily scalable sorting networks on an FPGA. Microprocess. Microsyst. **2014**, 38, 470–484.
32. Sklyarov, V.; Skliarova, I.; Rjabov, A.; Sudnitson, A. Fast Iterative Circuits and RAM-based Mergers to Accelerate Data Sort in Software/Hardware Systems. Proc. Est. Acad. Sci. **2017**, 66, 323–335.
33. Najafi, M.H.; Lilja, D.J.; Riedel, M.D.; Bazargan, K. Low-Cost Sorting Network Circuits Using Unary Processing. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2018**, 26, 1471–1480.
34. Norollah, A.; Derafshi, D.; Beitollahi, H.; Fazeli, M. RTHS: A Low-Cost High-Performance Real-Time Hardware Sorter, Using a Multidimensional Sorting Algorithm. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2019**, 27, 1601–1613.
35. Srivastava, A.; Chen, R.; Prasanna, V.K.; Chelmis, C. A hybrid design for high performance large-scale sorting on FPGA. In Proceedings of the 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Riviera Maya, Mexico, 7–9 December 2015; pp. 1–6.
36. Ricco, M.; Mathe, L.; Monmasson, E.; Teodorescu, R. FPGA-Based Implementation of MMC Control Based on Sorting Networks. Energies **2018**, 11, 2394.
37. Mendoza, I.L.; Pizano Escalante, J.L.; González, J.C.; Longoria Gándara, O.H. Implementation of a parameterizable sorting network for spatial modulation detection on FPGA. In Proceedings of the 2019 IEEE Colombian Conference on Communications and Computing (COLCOM), Barranquilla, Colombia, 5–7 June 2019; pp. 1–6.
38. Ayoubi, R.; Istambouli, S.; Abbas, A.W.; Akkad, G. Hardware Architecture For A Shift-Based Parallel Odd-Even Transposition Sorting Network. In Proceedings of the 2019 Fourth International Conference on Advances in Computational Tools for Engineering Applications (ACTEA), Beirut, Lebanon, 3–5 July 2019; pp. 1–6.
39. Chen, R.; Siriyal, S.; Prasanna, V. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 240–249.
40. Farmahini-Farahani, A. Modular Design of High-Throughput, Low-Latency Sorting Units. Master’s Thesis, University of Wisconsin–Madison, Madison, WI, USA, 2012.
41. Tzimpragos, G.; Kachris, C.; Soudris, D.; Tomkos, I. A Low-Latency Algorithm and FPGA Design for the Min-Search of LDPC Decoders. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium Workshop (IPDPSW’2014), Phoenix, AZ, USA, 19–23 May 2014.
42. Skliarova, I. Accelerating Population Count with a Hardware Co-Processor for MicroBlaze. J. Low Power Electron. Appl. **2021**, 11, 20.
43. Pedroni, V. Compact Hamming-comparator-based rank order filter for digital VLSI and FPGA implementations. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’2004), Vancouver, BC, Canada, 23–26 May 2004; pp. 585–588.
44. Piestrak, S.J. Efficient Hamming weight comparators of binary vectors. Electron Lett. **2007**, 43, 611–612.
45. Parhami, B. Efficient Hamming weight comparators for binary vectors based on accumulative and up/down parallel counters. IEEE Trans. Circuits Syst. II Express Briefs **2009**, 56, 167–171.
46. Sklyarov, V.; Skliarova, I. Digital Hamming weight and distance analyzers for binary vectors and matrices. Int. J. Innov. Comput. Inf. Control **2013**, 9, 4825–4849.
47. Sklyarov, V.; Skliarova, I.; Silva, J. On-chip reconfigurable hardware accelerators for popcount computations. Int. J. Reconfig. Comput. **2016**, 2016, 8972065.
48. Pilz, S.; Porrmann, F.; Kaiser, M.; Hagemeyer, J.; Hogan, J.M.; Rückert, U. Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. Algorithms **2020**, 13, 47.
49. Umuroglu, Y.; Conficconi, D.; Rasnayake, L.K.; Preußer, T.B.; Själander, M. Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing. ACM Trans. Reconfig. Technol. Syst. **2019**, 12, 1–24.
50. Rasoulinezhad, S.; Zhou, H.; Wang, L.; Boland, D.; Leong, P.H.W. LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 26–28 February 2020; pp. 161–171.
51. Kobayashi, R.; Kenji, K. A High Performance FPGA-Based Sorting Accelerator with a Data Compression Mechanism. IEICE Trans. Inf. Syst. **2017**, 100, 1003–1015.

**Figure 1.** Sorting networks for ascending sort of eight data items: bubble network (**a**); even–odd merge network (**b**); even–odd transition network (**c**); and bitonic merge network (**d**). When a comparator swaps input data items, these are underlined and shown in italics.

**Figure 2.** An example of a searching network finding the maximum and the minimum values in a set of 8 input data (**a**); a network of comparators for finding the two largest data items (**b**).

**Figure 3.** Basic components of a counting network (**a**); example of a counting network calculating the Hamming weight of an 8-bit binary vector (**b**).

**Figure 4.** Network of comparators with a feedback register for finding the minimum and the maximum data items.

**Figure 5.** Network of comparators with memory for finding two largest data items: 150 (highlighted with yellow color) and 138 (highlighted with orange color).

| Sorting Network | Number of comparators C (N = 2^{p}) | Number of levels L (N = 2^{p}) |
|---|---|---|
| Bubble | N × (N − 1)/2 | 2 × N − 3 |
| Even–odd merge | (p^{2} − p + 4) × 2^{p−2} − 1 | p × (p + 1)/2 |
| Bitonic merge | (p^{2} + p) × 2^{p−2} | p × (p + 1)/2 |
| Even–odd transition | N × (N − 1)/2 | N |
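As a quick numerical check of the expressions in the table, the sketch below evaluates C and L for a given p. It is an illustration with hypothetical names (`network_costs`), and it assumes the bitonic merge count is (p² + p) × 2^(p−2), which gives 24 comparators for N = 8.

```python
def network_costs(p):
    """Return {network: (comparators C, levels L)} for N = 2**p inputs, p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    n = 2**p
    merge_levels = p * (p + 1) // 2
    return {
        "bubble": (n * (n - 1) // 2, 2 * n - 3),
        # x * 2**(p-2) is written as x * 2**p // 4 to stay in integers for p = 1
        "even-odd merge": ((p * p - p + 4) * 2**p // 4 - 1, merge_levels),
        "bitonic merge": ((p * p + p) * 2**p // 4, merge_levels),
        "even-odd transition": (n * (n - 1) // 2, n),
    }


for name, (c, l) in network_costs(3).items():
    print(f"{name}: C = {c}, L = {l}")
# prints:
# bubble: C = 28, L = 13
# even-odd merge: C = 19, L = 6
# bitonic merge: C = 24, L = 6
# even-odd transition: C = 28, L = 8
```

For p = 10 (N = 1024) the same function gives 523,776 comparators for the bubble and even–odd transition networks against 24,063 and 28,160 for the two merge networks, which shows how quickly the resource demand grows.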


© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Skliarova, I.
A Survey of Network-Based Hardware Accelerators. *Electronics* **2022**, *11*, 1029.
https://doi.org/10.3390/electronics11071029
