Accelerating Population Count with a Hardware Co-Processor for MicroBlaze
Abstract
1. Introduction
- Analysis and relative comparison of population count computations in software running on the MicroBlaze processor;
- Analysis and relative comparison of parallel dedicated accelerators for population count computation in hardware;
- A hardware/software co-design technique implemented and tested in a low-cost Xilinx Artix-7 FPGA;
- The results of experiments and comparisons demonstrating the increase in throughput of hardware-accelerated computations compared to the best software alternative.
2. Background
2.1. FPGA and MicroBlaze
2.2. Population Count
- Binarized Neural Networks (BNNs) are reduced-precision neural networks whose weights and activations are restricted to single-bit values [11,12,13]. One of the computations executed in BNNs is the multiplication of a binarized vector of input neurons by a binarized weight matrix. Such an operation can be done using a variant of population count [11]. The parameter N tends to be large (64–1200), as it equals the number of input neurons for a fully connected layer, and the product of the size of the convolution filter in one dimension and the number of input channels for a convolution layer [12]. An example of using BNNs in a robot design for agricultural cyber-physical systems is reported in [14].
- Cryptographic applications: in [15], the population count operation is used to identify pairs of vectors that are likely to lie at a small angle to each other. In [16], the Hamming weight is computed in an attack that recovers the private key from the public key for a public-key encryption scheme based on Mersenne numbers. The Hamming distance (the number of mismatches between a pair of vectors, that is, the population count of their bitwise XOR) needs to be determined to prevent intrusion and detect anomalies in the CPSs reviewed in [17].
- Telecommunications: error detection/correction in a communication channel relying on Hamming weight calculation is reported in [18].
- Cheminformatics: a high-performance molecular similarity search system is described in [19]; it performs similarity searches over bitstring fingerprints and combines fast population count evaluation with pruning algorithms. A fingerprint for chemical similarity is a description of a molecule such that the similarity between two descriptions gives some idea of the similarity between two molecules [19]. The most widely used fingerprints have a length ranging from 166 to 2048 bits. The most popular way to compare two fingerprints is to calculate their Tanimoto similarity, which can be reduced to one population count evaluation per comparison [19]. Millions of fingerprints have to be processed for most corporate compound collections.
- Bioinformatics: in [20], a tool is proposed to remove duplicated and near-duplicated reads from next-generation sequencing datasets.
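The applications listed above reduce to a few population count patterns. The following C fragment is a minimal sketch of three of them, not code from the cited works: the function names are hypothetical, vectors are assumed to be packed into 32-bit words, and the BNN variant assumes the common encoding in which bit 1 stands for +1 and bit 0 for −1.

```c
#include <stddef.h>
#include <stdint.h>

/* Population count of one 32-bit word (any of the known methods would do here). */
static unsigned popcount32(uint32_t w)
{
    unsigned c = 0;
    while (w) {          /* clear the lowest set bit until none is left */
        w &= w - 1;
        c++;
    }
    return c;
}

/* Population count of a vector packed into nwords 32-bit words. */
unsigned vector_popcount(const uint32_t *v, size_t nwords)
{
    unsigned c = 0;
    for (size_t i = 0; i < nwords; i++)
        c += popcount32(v[i]);
    return c;
}

/* Hamming distance: number of mismatching positions = popcount(a XOR b). */
unsigned hamming_distance(const uint32_t *a, const uint32_t *b, size_t nwords)
{
    unsigned d = 0;
    for (size_t i = 0; i < nwords; i++)
        d += popcount32(a[i] ^ b[i]);
    return d;
}

/* Tanimoto similarity |A and B| / |A or B|. With the popcounts pa and pb of
 * both fingerprints precomputed, only popcount(a AND b) is evaluated per
 * comparison. Returning 1.0 for two empty fingerprints is an arbitrary
 * choice made for this sketch. */
double tanimoto(const uint32_t *a, unsigned pa,
                const uint32_t *b, unsigned pb, size_t nwords)
{
    unsigned common = 0;
    for (size_t i = 0; i < nwords; i++)
        common += popcount32(a[i] & b[i]);
    unsigned uni = pa + pb - common;
    return uni ? (double)common / (double)uni : 1.0;
}

/* Binarized dot product over N = 32 * nwords elements (assumed encoding:
 * bit 1 = +1, bit 0 = -1): dot = 2 * popcount(XNOR(x, w)) - N. */
int bnn_dot(const uint32_t *x, const uint32_t *w, size_t nwords)
{
    unsigned matches = 0;
    for (size_t i = 0; i < nwords; i++)
        matches += popcount32(~(x[i] ^ w[i]));
    return 2 * (int)matches - (int)(32 * nwords);
}
```

In a fingerprint search, `vector_popcount` would typically be called once per stored fingerprint, so that each subsequent comparison costs a single pass of `popcount32` over the ANDed words.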
3. Related Work
3.1. Software Implementations of Population Count
3.2. Hardware Implementations of Population Count
- Parallel counters from [24] are tree-based circuits that are built from full-adders.
- The designs from [25] are based on sorting networks, which have known limitations; in particular, the occupied resources grow considerably as the number of source data items increases.
- The designs in [28] are based on digital signal processing (DSP) slices embedded in the FPGA, organized as a tree of DSP adders.
- LUT-based circuits [29] are very competitive, but they scale poorly and consume many resources for large values of N.
- A combination of counting networks and LUT- and DSP-based circuits is proposed in [22]. It is observed that LUT-based circuits and counting networks are the fastest solutions for short sub-vectors (up to 128 bits). The result is produced as a combinational sum of the accumulated population counts of the sub-vectors, which can be computed either in DSP slices or in a circuit built from logic slices.
4. Proposed Hardware Accelerators for Population Count
4.1. Hardware Population Count Accelerators Architectures
- A1: trivial counting of the set bits;
- A2: using 8:4 tables and adders;
- A3: using a double layer of 8:4 tables;
- A4: counting the set bits in parallel with 16- and 32-bit chunks processed through multiplication;
- A5: counting the set bits in parallel without multiplication, as in Section 3.1;
- A6: dedicated LUT-based circuit from [29].
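As an illustration of the table-based idea behind A2, the following C fragment is a behavioural sketch only, not the hardware description used in this work. It assumes that an 8:4 table maps an 8-bit input to its 4-bit population count (0 to 8) and that a 32-bit word is handled by four lookups followed by a small adder tree; the names `table_8to4`, `init_table_8to4` and `popcount32_tables` are hypothetical.

```c
#include <stdint.h>

/* 8:4 table: 256 entries, each holding the number of set bits (0..8)
 * of its 8-bit index. In hardware this would be a small ROM. */
static uint8_t table_8to4[256];

/* Fill the table once, using count(i) = (i & 1) + count(i >> 1). */
void init_table_8to4(void)
{
    table_8to4[0] = 0;
    for (unsigned i = 1; i < 256; i++)
        table_8to4[i] = (uint8_t)((i & 1u) + table_8to4[i >> 1]);
}

/* Four parallel lookups (one per byte) followed by a two-level adder tree.
 * init_table_8to4() must have been called beforehand. */
unsigned popcount32_tables(uint32_t w)
{
    unsigned b0 = table_8to4[w & 0xFFu];
    unsigned b1 = table_8to4[(w >> 8) & 0xFFu];
    unsigned b2 = table_8to4[(w >> 16) & 0xFFu];
    unsigned b3 = table_8to4[(w >> 24) & 0xFFu];
    return (b0 + b1) + (b2 + b3);
}
```

The four lookups are independent of each other, which is what makes this scheme attractive for a parallel circuit: only the final additions lie on the critical path.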
4.2. System Architecture
- A 32-bit MicroBlaze processor optimized for performance with instruction and data caches disabled, the debug module enabled, and the peripheral AXI data interface enabled. The processor has two interfaces for memory accesses: local memory bus and AXI4 for peripheral access.
- A mixed-mode clock manager (MMCM) that generates the 100 MHz clock for the design from the signal of the crystal clock oscillator available on the Nexys-4 board.
- MicroBlaze local memory configured to 128 KB and connected to the MicroBlaze instance through the local memory bus core, which provides a fast connection to the on-chip block RAM storing instructions and data.
- MicroBlaze debug module interfacing with the JTAG port of the FPGA to provide support for software debugging tools.
- MicroBlaze concat for concatenating bus signals to be used in the interrupt controller.
- AXI interrupt controller supporting interrupts from the AXI timer, the UARTLite module and the DMA. It combines the three interrupt inputs from these devices into a single interrupt output to the MicroBlaze.
- AXI timer—hardware timer for measuring execution times. The timer counter is configured to 32 bits.
- Reset module that provides customized resets for the entire system, including the processor, the interconnect, the DMA, and peripherals.
- AXI interconnect with two slave and six master interfaces. The interconnect core connects AXI memory-mapped master devices to one or more memory-mapped slave devices. The two slave ports of the interconnect are connected to the MicroBlaze and the DMA controller. The six master ports are linked with the interrupt controller, DMA controller, UARTLite module, population count accelerator, external memory controller, and timer.
- UARTLite module implementing AXI4-Lite slave interface for interacting with the FPGA through UART from the host PC. A serial port terminal is used to get and print the results.
- AXI External Memory Controller (EMC)—controller for the onboard external cellular 16 MB RAM which is used to store source data for processing.
- AXI DMA controller configured with a 32-bit data width and a buffer length register width of 24 bits (allowing transfers of up to 2^24 − 1 bytes), providing high-bandwidth direct memory access between the 16 MB cellular RAM (through the memory controller) and the population count accelerator module.
- AXI4-Stream data FIFO with 4096 depth for buffering AXI4-Stream data.
- The population count accelerator receiving stream data for processing through DMA from the cellular 16 MB RAM and providing memory-mapped interface to the MicroBlaze for reading the result (sum of the population counts of x 32-bit binary vectors, which is equivalent to processing bits).
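Two of the configuration values listed above translate directly into limits that software has to respect: the 24-bit buffer length register bounds a single DMA transfer, and the 32-bit timer clocked at 100 MHz bounds the longest interval that can be measured without overflow. The following C helpers are a minimal sketch of this arithmetic; the function names are hypothetical, and the timer is assumed to be a free-running up-counter read before and after the measured code.

```c
#include <stdint.h>

/* Largest transfer length (in bytes) that fits in a buffer length
 * register of the given width (valid for widths up to 32 bits). */
uint32_t dma_max_transfer_bytes(unsigned length_reg_bits)
{
    return (uint32_t)((1ull << length_reg_bits) - 1ull);
}

/* Elapsed time in microseconds between two readings of a 32-bit
 * up-counter. The unsigned subtraction stays correct across a single
 * wrap-around of the counter. */
double ticks_to_us(uint32_t start, uint32_t end, double clock_mhz)
{
    return (double)(uint32_t)(end - start) / clock_mhz;
}

/* Time in seconds after which a 32-bit counter wraps around. */
double timer_wrap_seconds(double clock_mhz)
{
    return 4294967296.0 / (clock_mhz * 1e6);
}
```

With the values given above, `dma_max_transfer_bytes(24)` yields 16,777,215 bytes, one byte short of the 16 MB external memory, and `timer_wrap_seconds(100.0)` is roughly 42.9 s, far longer than a single population count run is expected to take.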
4.3. Experimental Setup
- trivial accumulation of the values of each of the 2^17 bits in a loop;
- B. Kernighan’s method;
- counting the set bits in parallel as illustrated in Section 3.1;
- counting the set bits in parallel with 16- and 32-bit chunks processed through multiplication;
- using LUTs 8:4 and adding up the intermediate results.
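For reference, the first four software methods listed above can be sketched in C as follows. These are the widely known formulations (as collected, for example, in [21]); they are given as an assumption about the general form of each method, not as the exact code measured in this work, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Trivial method: test every bit position in a loop. */
unsigned popcount_trivial(uint32_t w)
{
    unsigned c = 0;
    for (int i = 0; i < 32; i++)
        c += (w >> i) & 1u;
    return c;
}

/* B. Kernighan's method: each iteration clears the lowest set bit,
 * so the loop runs once per set bit. */
unsigned popcount_kernighan(uint32_t w)
{
    unsigned c = 0;
    for (; w; c++)
        w &= w - 1;
    return c;
}

/* Parallel counting: sum adjacent 1-, 2-, 4-, 8- and 16-bit fields. */
unsigned popcount_parallel(uint32_t w)
{
    w = (w & 0x55555555u) + ((w >> 1) & 0x55555555u);
    w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    w = (w & 0x0F0F0F0Fu) + ((w >> 4) & 0x0F0F0F0Fu);
    w = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);
    w = (w & 0x0000FFFFu) + (w >> 16);
    return w;
}

/* Parallel counting with a final multiplication: after the per-byte
 * counts are formed, multiplying by 0x01010101 adds the four bytes
 * into the most significant byte. */
unsigned popcount_multiply(uint32_t w)
{
    w = w - ((w >> 1) & 0x55555555u);
    w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    w = (w + (w >> 4)) & 0x0F0F0F0Fu;
    return (uint32_t)(w * 0x01010101u) >> 24;
}

/* Sum of the population counts of nwords 32-bit vectors, using the
 * method passed as a function pointer. */
uint32_t popcount_buffer(const uint32_t *data, size_t nwords,
                         unsigned (*method)(uint32_t))
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += method(data[i]);
    return sum;
}
```

Passing the method as a function pointer keeps the measurement loop identical for all variants, at the cost of an indirect call per word; a benchmark that aims at the best achievable software time would more likely inline each method.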
5. Discussion of the Results
6. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
- Kim, K.D.; Kumar, P.R. An overview and some challenges in cyber-physical systems. J. Indian Inst. Sci. 2013, 93, 341–352.
- Mosterman, P.J.; Zander, J. Cyber-physical systems challenges: A needs analysis for collaborating embedded software systems. Softw. Syst. Model 2016, 15, 5–16.
- Rodríguez, A.; Valverde, J.; Portilla, J.; Otero, A.; Riesgo, T.; de la Torre, E. FPGA-Based High-Performance Embedded Systems for Adaptive Edge Computing in Cyber-Physical Systems: The ARTICo3 Framework. Sensors 2018, 18, 1877.
- Qasaimeh, M.; Denolf, K.; Vissers, J.L.K.; Zambreno, J.; Jones, P.H. Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels. In Proceedings of the 2019 IEEE International Conference on Embedded Software and Systems (ICESS), Las Vegas, NV, USA, 2–3 June 2019; pp. 1–8.
- Hong, T.; Kang, Y.; Chung, J. InSight: An FPGA-Based Neuromorphic Computing System for Deep Neural Networks. J. Low Power Electron. Appl. 2020, 10, 36.
- Spagnolo, F.; Perri, S.; Frustaci, F.; Corsonello, P. Energy-Efficient Architecture for CNNs Inference on Heterogeneous FPGA. J. Low Power Electron. Appl. 2020, 10, 1.
- Sarwar, I.; Turvani, G.; Casu, M.R.; Tobon, J.A.; Vipiana, F.; Scapaticci, R.; Crocco, L. Low-Cost Low-Power Acceleration of a Microwave Imaging Algorithm for Brain Stroke Monitoring. J. Low Power Electron. Appl. 2018, 8, 43.
- Intel Corp. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A–Z. 2016. Available online: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf (accessed on 14 March 2021).
- Arm, Lda., Arm Armv8-A A32/T32 Instruction Set Architecture. Available online: https://developer.arm.com/documentation/ddi0597/2020-12/SIMD-FP-Instructions/VCNT—Vector-Count-Set-Bits-?lang=en (accessed on 14 March 2021).
- Xilinx, Inc. MicroBlaze Processor Reference Guide. UG081 (v9.0). 2008. Available online: https://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf (accessed on 14 March 2021).
- Nurvitadhi, E.; Sheffield, D.; Sim, J.; Mishra, A.; Venkatesh, G.; Marr, D. Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 77–84.
- Kim, J.H.; Lee, J.; Anderson, J.H. FPGA Architecture Enhancements for Efficient BNN Implementation. In Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT), Naha, Japan, 10–14 December 2018; pp. 214–221.
- Agrawal, A.; Jaiswal, A.; Roy, D.; Han, B.; Srinivasan, G.; Ankit, A.; Roy, K. Xcel-RAM: Accelerating Binary Neural Networks in High-Throughput SRAM Compute Arrays. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 3064–3076.
- Huang, C.H.; Chen, P.J.; Lin, Y.J.; Chen, B.W.; Zheng, J.X. A robot-based intelligent management design for agricultural cyber-physical systems. Comput. Electron. Agric. 2021, 181.
- Schanck, J. Improving Post-Quantum Cryptography through Cryptanalysis. Ph.D. Thesis, University of Waterloo, Waterloo, ON, Canada, 2020. Available online: https://uwspace.uwaterloo.ca/bitstream/handle/10012/16060/Schanck_John.pdf?sequence=3&isAllowed=y (accessed on 14 March 2021).
- Coron, J.S.; Gini, A. Improved cryptanalysis of the AJPS Mersenne based cryptosystem. J. Math. Cryptol. 2020, 14, 218–223.
- Mitchell, R.; Chen, I.R. A Survey of Intrusion Detection Techniques for Cyber-Physical Systems. ACM Comput. Surv. 2014, 55.
- John, I.N.; Kamaku, P.W.; Macharia, D.K.; Mutua, N.M. Error Detection and Correction Using Hamming and Cyclic Codes in a Communication Channel. Pure Appl. Math. J. 2016, 5, 220–231.
- Dalke, A. The chemfp project. J. Cheminform. 2019, 11, 76.
- Gonzalez-Domínguez, J.; Schmidt, B. ParDRe: Faster parallel duplicated reads removal tool for sequencing studies. Bioinformatics 2016, 32, 1562–1564.
- Anderson, S.E. Bit Twiddling Hacks. Available online: http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable (accessed on 14 March 2021).
- Sklyarov, V.; Skliarova, I.; Silva, J. On-chip reconfigurable hardware accelerators for popcount computations. Int. J. Reconfig. Comput. 2016, 2016, 8972065.
- Sklyarov, V.; Skliarova, I. Hamming Weight Counters and Comparators based on Embedded DSP Blocks for Implementation in FPGA. Adv. Electr. Comput. Eng. 2014, 14, 63–68.
- Parhami, B. Efficient Hamming weight comparators for binary vectors based on accumulative and up/down parallel counters. IEEE Trans. Circuits Syst. II Express Briefs 2009, 56, 167–171.
- Piestrak, S.J. Efficient Hamming weight comparators of binary vectors. Electron. Lett. 2007, 43, 611–612.
- Sklyarov, V.; Skliarova, I. Design and implementation of counting networks. Computing 2015, 97, 557–577.
- El-Qawasmeh, E. Beating the Popcount. Int. J. Inf. Technol. 2003, 9, 1–18. Available online: http://intjit.org/cms/journal/volume/9/1/91_1.pdf (accessed on 14 March 2021).
- Sklyarov, V.; Skliarova, I. Multi-core DSP-based vector set bits counters/comparators. J. Signal. Process. Syst. 2015, 80, 309–322.
- Sklyarov, V.; Skliarova, I.; Barkalov, A.; Titarenko, L. Synthesis and Optimization of FPGA-Based Systems; Springer: Berlin, Germany, 2014.
- Pilz, S.; Porrmann, F.; Kaiser, M.; Hagemeyer, J.; Hogan, J.M.; Rückert, U. Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. Algorithms 2020, 13, 47.
- Umuroglu, Y.; Conficconi, D.; Rasnayake, L.K.; Preußer, T.B.; Själander, M. Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing. ACM Trans. Reconfig. Technol. Syst. 2019, 12, 1–24.
- Rasoulinezhad, S.; Siddhartha; Zhou, H.; Wang, L.; Boland, D.; Leong, P.H.W. LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 26–28 February 2020; pp. 161–171.
- Preußer, T.B. Generic and Universal Parallel Matrix Summation with a Flexible Compression Goal for Xilinx FPGAs. In Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, 4–8 September 2017; pp. 1–7.
- Xilinx, Inc. 7 Series FPGAs Data Sheet: Overview. 2020. Available online: https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf (accessed on 21 March 2021).
- Digilent, Nexys 4 Reference Manual. Available online: https://reference.digilentinc.com/reference/programmable-logic/nexys-4/reference-manual (accessed on 21 March 2021).
Architecture | Worst Slack (ns) | Max Freq (MHz) | LUT | FF | BRAM | DSP | Total on-Chip Power (W) |
---|---|---|---|---|---|---|---|
A1 | 1.28 | 115 | 4663 | 4739 | 37 | 3 | 0.294 |
A2 | 0.77 | 108 | 4661 | 4707 | 38.5 | 3 | 0.294 |
A3 | 0.682 | 107 | 4658 | 4707 | 38.5 | 3 | 0.293 |
A4 | −5.308 | 65 | 4718 | 4739 | 37 | 6 | 0.293 |
A5 | −0.902 | 92 | 4708 | 4739 | 37 | 3 | 0.293 |
A6 | 0.415 | 104 | 4663 | 4739 | 37 | 3 | 0.294 |
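The table does not state how the maximum frequency was derived. The listed values are, however, consistent with the usual estimate obtained from the worst slack, f_max = 1 / (T − slack), where T = 10 ns is the period of the 100 MHz system clock. The following C helper is a sketch of that estimate under this assumption; the function name is hypothetical.

```c
/* Estimated maximum clock frequency in MHz for a design constrained to
 * target_period_ns and reported with the given worst slack (both in ns).
 * A negative slack means the constraint is not met, which pushes the
 * estimate below the target frequency. */
double fmax_mhz(double target_period_ns, double worst_slack_ns)
{
    return 1000.0 / (target_period_ns - worst_slack_ns);
}
```

For A1, `fmax_mhz(10.0, 1.28)` gives about 114.7 MHz, and for A4, `fmax_mhz(10.0, -5.308)` gives about 65.3 MHz, both matching the rounded table entries. The negative slack of A4 and A5 indicates that these two architectures do not meet timing at 100 MHz.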
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Skliarova, I. Accelerating Population Count with a Hardware Co-Processor for MicroBlaze. J. Low Power Electron. Appl. 2021, 11, 20. https://doi.org/10.3390/jlpea11020020