Reconfigurable Binary Neural Network Accelerator with Adaptive Parallelism Scheme

Abstract: Binary neural networks (BNNs) have attracted significant interest for the implementation of deep neural networks (DNNs) on resource-constrained edge devices, and various BNN accelerator architectures have been proposed to achieve higher efficiency. BNN accelerators can be divided into two categories: streaming and layer accelerators. Although streaming accelerators designed for a specific BNN network topology provide high throughput, they are infeasible for various sensor applications in edge AI because of their complexity and inflexibility. In contrast, layer accelerators with reasonable resources can support various network topologies, but they operate with the same parallelism for all the layers of the BNN, which degrades throughput performance at certain layers. To overcome this problem, we propose a BNN accelerator with adaptive parallelism that offers high throughput performance in all layers. The proposed accelerator analyzes target layer parameters and operates with optimal parallelism using reasonable resources. In addition, this architecture is able to fully compute all types of BNN layers thanks to its reconfigurability, and it can achieve a higher area–speed efficiency than existing accelerators. In performance evaluation using state-of-the-art BNN topologies, the designed BNN accelerator achieved an area–speed efficiency 9.69 times higher than previous FPGA implementations and 24% higher than existing VLSI implementations for BNNs.

BNN accelerators can be divided into two categories: streaming and layer accelerators. Streaming accelerators are designed for all or most layers in a target network [22][23][24][25][26][27][28][29][30][31][32][33]. Since optimized hardware is implemented for each layer, this type of architecture usually offers reasonable latency. However, streaming accelerators require more resources than layer accelerators, which are implemented for a specific target layer. In addition, they can only support the network topology that is targeted prior to implementation. In other words, an accelerator that is optimized and implemented for a specific application must be structurally changed and redesigned before it can be used in other applications. Therefore, BNNs with a streaming architecture cannot always satisfy resource constraints and are thus infeasible for various sensor applications in edge AI because of their complexity and inflexibility [20].
On the other hand, layer accelerators are designed to handle a target layer in BNN topologies [34][35][36][37][38]. Therefore, these architectures require fewer resources than streaming accelerators, which are designed for all or most of the layers of a target network topology. In addition, since layer accelerators need to cope with various types of layers, that is, layers with different input feature map (fmap) sizes and different numbers of in/out channels, most layer accelerators have been designed with reconfigurable architectures that can handle various network topologies. Therefore, this type of architecture can support various sensor applications and is well suited for resource-constrained applications.
Conventional layer accelerator designs usually fall under one of two categories according to how data scheduling is implemented: filter-level parallelism and fmap-level parallelism. Accelerators with filter-level parallelism are designed with a line-buffer architecture that stores the previous row of the input fmap for sliding operations [34][35][36]. Such architectures offer efficient data reuse and higher convolution throughput. They have been used widely in convolution-based image processing applications. However, typical network topologies consist of layers having varied structures ranging from an initial convolutional (CONV) layer to deep fully connected (FC) layers, and each layer can have a different architecture. In general, the number of in/out channels gradually increases in the deeper layers, whereas the fmap size becomes smaller. Therefore, accelerators with filter-level parallelism, which is optimized for CONV operations, cannot attain higher utilization and throughput in deep layers that are similar to FC layers. In addition, these accelerators require more resources than other types of architectures to store previous row data and integer-scale intermediate values of the popcount operation.
On the other hand, several accelerators that introduce fmap-level parallelism have been proposed [37,38]. Accelerators with fmap-level parallelism load and process data on a per-channel basis, and they achieve 100% utilization and throughput at deeper layers whose number of input channels exceeds a certain value. In addition, these architectures require fewer resources than those with filter-level parallelism because they can rapidly binarize intermediate results via efficient data scheduling [37]. However, the throughput of these accelerators is excessively degraded in the initial layers, which have a small number of channels.
In this paper, we propose a reconfigurable BNN accelerator with adaptive parallelism that offers high throughput performance in all layers. The proposed accelerator analyzes target layer parameters and adaptively applies parallelism schemes, which consist of fmap-level parallelism combined with an advanced method. This architecture can perform BNN operations with a reasonable amount of resources because the supported parallelism schemes are based on the mechanisms of fmap-level parallelism. The design and implementation results of the proposed architecture are also presented. The remainder of this paper is organized as follows. Section 2 briefly reviews BNNs and related works. Section 3 describes the proposed adaptive parallelism scheme and presents performance evaluation results. Section 4 discusses the hardware architecture and implementation results. Finally, Section 5 concludes the paper.

Binary Neural Network
The basic computation at each layer of a convolutional neural network (CNN) can be expressed as:

$$ y_j = f\left( b_j + \sum_{i=1}^{c_{in}} W_{j,i} \otimes x_i \right), \quad j = 1, \ldots, c_{out} \qquad (1) $$

where y, x, W, and b are the output fmaps, input fmaps, weights, and biases, respectively, and i and j index the input and output fmaps. $c_{in}$ and $c_{out}$ are the numbers of in/out fmaps (in/out channels) in high-dimensional convolutions, as shown in Figure 1. The output fmaps, which have a size of $c_{out} \times h_{out} \times w_{out}$, are obtained through a convolution operation between the input fmaps and the filters, which are the sets of weights. f is a nonlinear activation function, which is typically applied after each CONV layer or FC layer. These functions are usually conventional nonlinear functions, such as the rectified linear unit (ReLU), sigmoid, and hyperbolic tangent [6]. In addition, various optional layers can be found in the typical network topologies of CNNs, such as the pooling and normalization layers. Pooling reduces the dimensionality of the fmaps and is applied to each fmap separately. In batch normalization (BN), the output of the activation function is further scaled and shifted as follows:

$$ \mathrm{BN}(k) = \gamma \, \frac{k - \mu}{\sigma} + \beta \qquad (2) $$

where γ, β, µ, and σ are learned parameters and k is the input of the BN function.
In BNNs, f in (1) can be expressed as sign(BN(k)), which is the binarization of the BN output [20]. Since the values of y, x, W, and b are +1 or −1, ⊗ can be reduced to bitwise operations. If the binary values obtained from these operations are encoded with +1 as a one-valued bit and −1 as a zero-valued bit, a multiplication operation is equivalent to an XNOR logical operation on the binary values. The sum of XNOR operations can also be calculated via a simple popcount, that is, by counting the number of bits set to one. In addition, the costly calculation of sign(BN(k)) can be reorganized into:

$$ \mathrm{sign}(\mathrm{BN}(k)) = \begin{cases} +1, & k + \tau \geq 0 \\ -1, & \text{otherwise} \end{cases} \qquad (3) $$

for γ > 0. Multiplication and division in (2) can thus be replaced with a simple comparison with a threshold τ ≐ σβ/γ − µ, defined for convenience.
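The two simplifications described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard BN form BN(k) = γ(k − µ)/σ + β with γ > 0, and all function names are hypothetical. It shows that a ±1 dot product equals 2·popcount − n after XNOR, and that the sign of the BN output matches a single comparison using τ = σβ/γ − µ.

```python
def encode(values):
    """Map +1 to a one-valued bit and -1 to a zero-valued bit."""
    return [1 if v == 1 else 0 for v in values]


def xnor_popcount(x_bits, w_bits):
    """Count positions where XNOR of the two bits is one (the bits are equal)."""
    return sum(1 for xb, wb in zip(x_bits, w_bits) if xb == wb)


def dot_from_popcount(x_bits, w_bits):
    """Recover the +1/-1 dot product: matches minus mismatches = 2 * popcount - n."""
    return 2 * xnor_popcount(x_bits, w_bits) - len(x_bits)


def sign_bn(k, gamma, beta, mu, sigma):
    """Reference path: binarize BN(k), assuming BN(k) = gamma * (k - mu) / sigma + beta."""
    return 1 if gamma * (k - mu) / sigma + beta >= 0 else -1


def sign_threshold(k, gamma, beta, mu, sigma):
    """Same decision with one comparison, tau = sigma * beta / gamma - mu (gamma > 0 assumed)."""
    tau = sigma * beta / gamma - mu
    return 1 if k + tau >= 0 else -1


x = [1, -1, 1, 1]
w = [1, 1, -1, 1]
direct = sum(a * b for a, b in zip(x, w))          # ordinary +1/-1 dot product
bitwise = dot_from_popcount(encode(x), encode(w))  # XNOR + popcount path
print(direct, bitwise)
```

In hardware the threshold would be precomputed per output channel, so only the comparison remains at inference time.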
In the training phase, the pooling layers in BNNs are generally placed before the activation layer because this ordering yields higher accuracy. However, in the inference of BNNs, changing the order of these layers causes no loss of accuracy [20]. Since the max pooling of bits encoded as one or zero can be achieved using simple OR logical operations, placing the pooling layer after the activation layer is more efficient in the inference phase.
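As a small illustration of why pooling after binarization is cheap, the sketch below implements max pooling on a map of 0/1 values using only OR. The 2 × 2 window with a stride of two is taken from the topologies evaluated later in the paper; the function name and the list-of-lists layout are assumptions made for this example.

```python
def max_pool_2x2_or(bits):
    """2 x 2 max pooling with stride two on a 2D list of 0/1 values, using only OR."""
    height, width = len(bits), len(bits[0])
    return [
        [
            bits[r][c] | bits[r][c + 1] | bits[r + 1][c] | bits[r + 1][c + 1]
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


fmap = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]
pooled = max_pool_2x2_or(fmap)
print(pooled)  # [[1, 0], [1, 1]]
```

For binary inputs, the OR of a window equals its maximum, so no comparator is needed.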
Streaming architectures with advanced design methods have also been proposed [27][28][29][30]. SimBNN [27] analyzes data similarities to significantly reduce the operational complexity, and ReBNet [28] uses residual binarization for efficient hardware implementation in ResNet architectures. LP-BNN [29] adopts layer parallelism and supports nearly perfect load balancing for various types of BNNs. In [30], a fully digital ASIC architecture with computation tightly coupled to the memory for aggressive data reuse was proposed. In addition, several streaming accelerators [31][32][33] apply improved learning schemes for BNNs. Streaming architectures can offer high throughput performance because optimized hardware is implemented for each layer. However, this type of architecture cannot always satisfy resource constraints and is thus infeasible for various sensor applications in edge AI because of its complexity and inflexibility [20].
On the other hand, layer accelerators are designed to handle a target layer in BNN topologies [34][35][36][37][38]. Conventional layer accelerator designs usually fall under one of two categories according to how data scheduling is implemented: filter-level parallelism and fmap-level parallelism. Accelerators with filter-level parallelism are designed with a line-buffer architecture that stores the previous row of the input fmap for sliding operations [34][35][36]. In [34], a batch normalization-free technique was proposed, which uses a simple comparison instead of complex operations. FBNA [35] focuses on binarizing the first layer in BNNs and reduces the processing time for the first layer. In [36], a BNN on FPGA that uses only on-chip memories was proposed. Although accelerators with filter-level parallelism offer efficient data reuse and higher throughput for CONV layers, they cannot attain higher utilization and throughput in deep layers such as FC layers. In addition, these accelerators require more resources than other types of architectures to store the previous row of the input fmap and integer-scale intermediate values of the popcount operation.
In contrast, several accelerators that introduce fmap-level parallelism have been proposed [37,38]. Accelerators with fmap-level parallelism load and process data on a per-channel basis, and they achieve 100% utilization and throughput at deeper layers whose number of input channels exceeds a certain value. In [37], the XNOR neural engine (XNE) was proposed, which adopts fmap-level parallelism and a tightly coupled shared memory paradigm. In [38], a design automation scheme for BNNs with fmap-level parallelism was proposed for near-sensor processing. Accelerators with fmap-level parallelism require fewer resources than those with filter-level parallelism because they can rapidly binarize intermediate results via efficient data scheduling [37]. However, the throughput of these accelerators is excessively degraded in the initial layers, which have a small number of channels.

Proposed Parallelism Scheme of the BNN Accelerator
The proposed BNN accelerator adaptively applies two parallelism schemes according to the architecture of the target layer to offer high throughput in all layers. Our adaptive parallelism scheme consists of a combination of fmap-level parallelism and an advanced method that can efficiently cope with certain CONV layers that fmap-level parallelism cannot handle with 100% utilization. To minimize resource requirements, the proposed BNN accelerator is designed to operate with two parallelism schemes while using a data path similar to that of conventional architectures with fmap-level parallelism. In addition, the proposed BNN accelerator is designed to make efficient use of the memory bandwidth allowed in various SoC platforms by using the design parameter $P_{hw}$, which is the maximum number of operations that can be processed in one clock cycle. In other words, if $P_{hw}$ is set according to the memory bandwidth allowed in each SoC, our architecture can load $P_{hw}$-size data in one clock cycle.
The first parallelism scheme (i.e., the advanced method) shown in Figure 2 is implemented for the initial layers, which have a small number of channels. The data scheduling of this scheme loads and processes $ks \times P_{cin}$ data bits per cycle, where ks is the kernel size and $P_{cin}$ is the number of fmaps that can be loaded and processed in one clock cycle. Figure 3 shows how CONV layers are reorganized in the first parallelism mode, where $i_c$ and $o_c$ denote the numbers of input and output channels, respectively. The loops are ordered differently from conventional BNN loops, with the kernel width loop and the input fmap loop brought to the innermost position. These inner loops are computed in the proposed BNN accelerator in one clock cycle. The in/out fmap loops are each split into an inner loop (computing units on processing elements; $P_{cin}$ for input and $P_{hw}$ for output fmaps) and an outer loop (cycling iterations on the network control unit; $c_{in}/P_{cin}$ for input and $c_{out}/P_{hw}$ for output fmaps). This scheme can efficiently cope with certain CONV layers in which utilization is drastically degraded when using fmap-level parallelism, as detailed in Section 3.2.2.

The second parallelism scheme used in the proposed BNN accelerator is described in Figures 4 and 5. Similar to fmap-level parallelism, this scheme loads and processes data on a per-channel basis, and it can achieve 100% utilization at deeper layers whose number of input channels exceeds a certain value. An overview of this parallelism scheme is shown in Figure 5. The kernel width loop is moved outward, and the hardwired inner loop in the accelerator focuses only on the fmap channels. The parallelism parameter $P_{hw}$ defines the number of simultaneous XNOR operations in the processing elements (PEs) per cycle. In other words, $P_{hw}$ input fmap channels can be loaded and processed in one clock cycle. Therefore, the data scheduling of this scheme handles more channels per cycle than other schemes in deeper layers.
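The difference between the two loop orders can be sketched for a single output value. This is an illustrative model rather than the authors' RTL: it assumes that one cycle of the first scheme covers one kernel row (ks positions) times up to `p_cin` channels, that one cycle of the second scheme covers one kernel position times up to `p_hw` channels, and that inputs are stored as `x[ky][kx][c]` bit lists. All names are hypothetical.

```python
import random


def xnor_popcount(x_bits, w_bits):
    """Count positions where the two bits are equal (XNOR is one)."""
    return sum(1 for xb, wb in zip(x_bits, w_bits) if xb == wb)


def first_scheme(x, w, ks, c_in, p_cin):
    """Kernel width and input channels are innermost: ks * p_cin bits per cycle."""
    acc = cycles = 0
    for ky in range(ks):                      # outer loop over kernel rows
        for c0 in range(0, c_in, p_cin):      # outer loop over channel groups
            chans = range(c0, min(c0 + p_cin, c_in))
            xs = [x[ky][kx][c] for kx in range(ks) for c in chans]
            ws = [w[ky][kx][c] for kx in range(ks) for c in chans]
            acc += xnor_popcount(xs, ws)      # inner loops handled in one cycle
            cycles += 1
    return acc, cycles


def second_scheme(x, w, ks, c_in, p_hw):
    """Only input channels are innermost: up to p_hw channel bits per cycle."""
    acc = cycles = 0
    for ky in range(ks):
        for kx in range(ks):                  # kernel width loop moved outward
            for c0 in range(0, c_in, p_hw):
                c1 = min(c0 + p_hw, c_in)
                acc += xnor_popcount(x[ky][kx][c0:c1], w[ky][kx][c0:c1])
                cycles += 1
    return acc, cycles


def random_window(ks, c_in, rng):
    """Random ks x ks x c_in window of bits, used only to exercise the sketch."""
    return [[[rng.randint(0, 1) for _ in range(c_in)] for _ in range(ks)] for _ in range(ks)]


rng = random.Random(0)
x = random_window(3, 128, rng)
w = random_window(3, 128, rng)
acc1, cycles1 = first_scheme(x, w, ks=3, c_in=128, p_cin=64)
acc2, cycles2 = second_scheme(x, w, ks=3, c_in=128, p_hw=256)
print(acc1 == acc2, cycles1, cycles2)  # True 6 9
```

Under these assumptions, both orders accumulate the same popcount, but a layer with 128 input channels and a 3 × 3 kernel takes six cycles per output value in the first scheme and nine in the second on a 256-wide datapath.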
The proposed BNN accelerator executes one of the two parallelism schemes depending on the target layer, where $m_t$ denotes the parallelism mode at the current target layer. The accelerator performs each parallelism scheme according to the value of $m_t$, as shown in Table 1. Consider a processor designed with $P_{hw} = 256$. If it targets a layer with $c_{in} = 128$ and ks = 3, the first parallelism scheme is used, and the accelerator attains 75% utilization. This is an improvement of 25 percentage points over the 50% attained via fmap-level parallelism. In contrast, if $c_{in}$ is 256 or more, the proposed accelerator operates based on the second parallelism scheme, which can achieve 100% utilization.

| Mode | Parallelism Scheme |
| --- | --- |
| $m_t = 0$ | First scheme introduced in Figure 2 |
| $m_t = 1$ | Second scheme shown in Figure 4 |
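The utilization figures quoted for $P_{hw} = 256$ can be reproduced with a short calculation. The sketch below rests on two assumptions that are not stated in the text: the mode is chosen by comparing $c_{in}$ with $P_{hw}$ (inferred from the worked example only), and $P_{cin}$ is the largest power of two for which $ks \times P_{cin}$ fits within $P_{hw}$. The function names are hypothetical.

```python
import math


def select_mode(c_in, p_hw):
    """Assumed rule: mode 0 (first scheme) when c_in < p_hw, otherwise mode 1."""
    return 0 if c_in < p_hw else 1


def assumed_p_cin(ks, p_hw):
    """Assumption: largest power of two such that ks * p_cin <= p_hw."""
    p_cin = 1
    while ks * p_cin * 2 <= p_hw:
        p_cin *= 2
    return p_cin


def utilization_first(c_in, ks, p_hw):
    """Useful bits of one kernel row divided by the bits offered over the cycles used."""
    p_cin = min(assumed_p_cin(ks, p_hw), c_in)
    cycles = math.ceil(c_in / p_cin)
    return (ks * c_in) / (cycles * p_hw)


def utilization_fmap(c_in, p_hw):
    """Useful channel bits of one kernel position divided by the bits offered."""
    cycles = math.ceil(c_in / p_hw)
    return c_in / (cycles * p_hw)


print(select_mode(128, 256), utilization_first(128, 3, 256), utilization_fmap(128, 256))
# 0 0.75 0.5
print(select_mode(256, 256), utilization_fmap(256, 256))
# 1 1.0
```

With these assumptions, $P_{cin}$ evaluates to 64 for ks = 3, so each cycle uses 192 of the 256 lanes, which matches the 75% stated above.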

Target Network Topologies
To evaluate the efficiency of the proposed accelerator, common BNN topologies were used to make performance comparisons [20], as listed in Table 2. The architectures are described layer by layer using our own notation. Here, 2C128 and 2FC1024 refer to two CONV layers with 128 output channels and two FC layers with 1024 output channels, respectively. All max pooling (MP) layers have a size of 2 × 2 and a stride of two. The first topology is the original BNN described by Courbariaux et al. [16] and is referred to as BNN-Cifar10 in this paper. It is a variation of the VGG-11 topology with six CONV layers and three FC layers, as shown in Figure 6, and was used on the CIFAR-10 dataset. The CIFAR-10 dataset consists of 60,000 images of 32 × 32 pixels in 10 different classes: six kinds of animals and four kinds of vehicles. BNN-Cifar10 delivers state-of-the-art accuracy on the CIFAR-10 dataset. The second topology used for evaluation is the well-known VGG-16 model for the ImageNet dataset. The ImageNet dataset consists of 1.2 million images and contains 1000 different classes.

Performance Evaluation Results
We used operations per cycle (ops/cycle) to compare the throughput performance of our accelerator with that of existing BNN architectures. This metric refers to the number of operations the accelerator can process per clock cycle, and we counted XNOR and popcount as separate operations [37]. In addition, ops/cycle also reflects the memory access efficiency because the number of operations that can be processed in each cycle is related to the amount of data that can be loaded from memory in each cycle. To evaluate performance, we compared efficient BNN layer accelerators targeted at low-power microcontrollers [34][35][36][37][38]. To make a fair comparison, a similar number of operators (XNOR gates and adders for popcounts) was assumed for each accelerator. In addition, the first layer, whose fmaps are normally floating-point image data, was excluded from the comparison.
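The ops/cycle metric can be made concrete with a simplified cycle model. The following sketch counts XNOR and popcount as separate operations, as stated above, and assumes a plain fmap-level schedule that spends one cycle per kernel position and per group of up to `p_hw` input channels, with no memory stalls. The layer dimensions in the usage lines are hypothetical and are not taken from Table 2.

```python
import math


def layer_ops(c_in, c_out, ks, h_out, w_out):
    """Binary operations in one CONV layer; the factor 2 counts XNOR and popcount separately."""
    return 2 * c_in * c_out * ks * ks * h_out * w_out


def fmap_level_cycles(c_in, c_out, ks, h_out, w_out, p_hw):
    """Simplified model: one cycle per kernel position and channel group, per output value."""
    return c_out * h_out * w_out * ks * ks * math.ceil(c_in / p_hw)


def latency_ms(cycles, freq_mhz):
    """Convert a cycle count to milliseconds at the given clock frequency in MHz."""
    return cycles / (freq_mhz * 1e3)


ops = layer_ops(c_in=128, c_out=128, ks=3, h_out=32, w_out=32)
cycles = fmap_level_cycles(c_in=128, c_out=128, ks=3, h_out=32, w_out=32, p_hw=256)
print(ops / cycles)  # 256.0
```

In this model a fully utilized 256-wide datapath would reach 512 ops/cycle, so the 256 ops/cycle obtained here corresponds to the 50% utilization of fmap-level parallelism on a 128-channel layer.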
First, our architecture was assumed to have $P_{hw} = 256$, and the design in [34] was assumed to have a similar number of operators (252). The design in [34] adopts filter-level parallelism, which significantly degrades throughput performance in deeper layers, as shown in Figure 7. Although the work in [34] showed excellent peak performance (Gops/s, giga operations per second), it used more resources than the proposed accelerator, as detailed in Section 4.2.1, and its throughput decreased in other layers because it applied an inflexible parallelism scheme. The proposed parallelism can provide higher throughput performance in almost all layers (C4∼FC3), as shown in Figure 7. In addition, when comparing at 150 MHz, which was the operating frequency of the work in [34], the proposed accelerator operates with a lower latency of 1.57 ms for BNN-Cifar10 and of 30.92 ms for VGG-16 compared with [34].

On the other hand, the XNOR neural engine (XNE) [37] uses fmap-level parallelism to achieve 100% efficiency in the deeper layers, but its performance decreases sharply in the upper CONV layers, as shown in Figure 8. Our architecture and XNE were assumed to have $P_{hw} = 256$ to make a fair comparison. On average, the proposed accelerator achieves 51.2 ops/cycle more than XNE for VGG-16 and 32 ops/cycle more for BNN-Cifar10. In addition, it can process with a lower latency of 1.96 ms for BNN-Cifar10 and 120.42 ms for VGG-16 compared with XNE when operating at a clock frequency of 300 MHz, which is the operating frequency of [37].

Hardware Architecture
Figure 9 shows the block diagram of the proposed BNN accelerator, including a network control unit (NCU), processing elements (PEs), a popcount unit (PCU), an accumulator, an activation unit, and a pooling unit. The PEs are composed of XNOR gates, AND gates, and a sub-popcount unit for processing 8 bit words. The feature vector that is input from the system is stored in a feature register for reuse.
This feature vector is multiplied by the weight stream using the $P_{hw}$ XNOR gates in the PEs. To allow the proposed processor to operate with a smaller number of input features than $P_{hw}$, the product vector is masked by sum filter coefficients in an array of AND gates. A popcount unit based on a simple combinational circuit is used to count the number of "1s" in the unmasked vector. Each PE has a sub-popcount unit for 8 bit words, from which a 4 bit output that ranges from zero to eight is extracted. These output components are merged in the main popcount unit. The final count is accumulated with the output of the current register in the accumulator. The accumulator consists of an adder and $P_{hw}$ registers, and each register stores one output computed in a full accumulation cycle.
The accumulated values are binarized in the activation unit with the thresholding phase defined in (3). This simple comparison involves batch normalization and binarization, as mentioned in Section 2.1. The max pooling of binarized values is computed using OR logical operations in the pooling unit when a target layer has a pooling phase. Otherwise, the binarized values bypass the pooling unit for flexibility.
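The datapath described above (XNOR, AND masking, 8 bit sub-popcounts, merged popcount, accumulation, and thresholding) can be modeled behaviorally in a few lines. This is a sketch under assumptions, not the authors' Verilog: the `threshold` argument is a hypothetical value expressed directly in the popcount domain, and the mask is modeled as a plain bit vector that disables unused lanes.

```python
def pe_cycle(feature_bits, weight_bits, mask_bits):
    """One cycle: XNOR gates, AND masking, 8 bit sub-popcounts, then the merged popcount."""
    product = [1 - (f ^ w) for f, w in zip(feature_bits, weight_bits)]       # XNOR gates
    masked = [p & m for p, m in zip(product, mask_bits)]                     # AND gates
    sub_counts = [sum(masked[i:i + 8]) for i in range(0, len(masked), 8)]    # each in 0..8
    return sum(sub_counts)                                                   # main popcount unit


def accumulate_and_binarize(cycle_inputs, threshold):
    """Accumulate popcounts over several cycles, then binarize with one comparison."""
    acc = 0
    for feature_bits, weight_bits, mask_bits in cycle_inputs:
        acc += pe_cycle(feature_bits, weight_bits, mask_bits)
    return 1 if acc >= threshold else 0


features = [1] * 16
weights_a = [1] * 8 + [0] * 8       # only the first 8 lanes match the features
weights_b = [1] * 16                # all lanes match
mask_all = [1] * 16
mask_12 = [1] * 12 + [0] * 4        # only 12 lanes carry valid input features

print(pe_cycle(features, weights_a, mask_all))  # 8
print(pe_cycle(features, weights_b, mask_12))   # 12
```

Splitting the count into 8 bit groups mirrors the sub-popcount units in the PEs, and each group result fits in 4 bits because it never exceeds eight.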
The NCU is designed to execute the proposed parallelism schemes. It has R/W registers for loading the parameters of the target layer (e.g., $c_{out}$, $h_{out}$, $w_{out}$, ks, etc.), a simple finite-state machine (FSM) to iterate the loop execution, and address calculators. The main FSM orchestrates the operation of the proposed accelerator and communicates with the microcontroller unit (MCU). By analyzing the layer parameters and selecting an appropriate parallelism scheme, the NCU controls the data loading phase and generates the memory addresses. It loads the data until the current operation is completed, and the intermediate results of each iteration are stored in the accumulator. After these phases are completed, the threshold values from (3) are loaded and compared with the accumulated values in the activation unit. Then, the final outputs are streamed out to the bus interface.

FPGA Implementation Results
The proposed BNN accelerator was designed using the Verilog hardware description language (HDL) and implemented on an XCZU7EV FPGA targeting the Xilinx ZYNQ Ultrascale+ ZCU-104 evaluation board. The proposed architecture with $P_{hw} = 256$ requires 1050 configurable logic blocks (CLBs), 4816 CLB LUTs, and 2 DSPs, as shown in Table 3. In addition, the accelerator with the proposed parallelism schemes can operate at a frequency of 371.6 MHz. To evaluate the efficiency of our architecture, we compared it with previous BNN implementations targeted at FPGA devices [34,35]. As shown in Table 3, although the proposed accelerator exhibits a lower peak performance than previous designs, it offers higher efficiency in terms of area (Gops/s/KLUT, Gops/s/KCLB) and power (Gops/s/W). The accelerators with filter-level parallelism [34,35] require more resources than the proposed accelerator because they need to store the previous row data and integer-scale intermediate values of the popcount operation.

The proposed BNN accelerator was also synthesized to gate-level circuits using a 40 nm CMOS standard cell library. The key features of the proposed BNN accelerator are summarized in Table 4. The proposed architecture requires 23.37 K logic gates with a total die size of 0.016 mm². This system can operate at a frequency of 450 MHz, and real-time processing is possible. To evaluate the efficiency of our architecture, we compared it with XNE [37] and XNORBIN [30], which represent the VLSI implementation results for BNNs. To make a fair comparison, our architecture and XNE were assumed to have $P_{hw} = 128$, and we normalized the area with respect to the process technology, Tech, expressed in nanometers [39]. As shown in Table 4, the proposed accelerator exhibited an average area-speed efficiency (Gops/s/$A_{norm}$) 24% higher than XNE with fmap-level parallelism when operating at 300 MHz.
In addition, our architecture can offer an area-speed efficiency 3.77 times higher than XNORBIN, which represents the streaming accelerators.

Conclusions
In this paper, we proposed a reconfigurable BNN accelerator with an adaptive parallelism scheme. Existing BNN layer accelerators suffer from throughput performance degradation at certain layers in BNN topologies because of their constant data scheduling schemes. On the other hand, streaming architectures, which provide parallelism schemes optimized for each layer, require a large amount of resources and are thus not suitable for edge devices. To overcome such problems, the proposed accelerator analyzes target layer parameters and performs operations with optimal parallelism using reasonable resources. The proposed parallelism schemes consist of fmap-level parallelism combined with an advanced method, which can efficiently cope with certain layers that fmap-level parallelism cannot handle with 100% utilization. In our performance evaluation for state-of-the-art BNN topologies, accelerators with the proposed parallelism schemes showed good throughput performance in terms of ops/cycle and low latency compared with existing BNN accelerators. The proposed processor was implemented with 1050 CLBs, 4816 CLB LUTs, and two DSPs on a Xilinx XCZU7EV FPGA device. It offered an area-speed efficiency 9.69 times higher than previous FPGA implementations for BNNs. In addition, it had a logic gate count of 23.37 K with a die size of 0.016 mm², and it offered an area-speed efficiency 24% higher than existing VLSI implementations when operating at 300 MHz.
Author Contributions: J.C. designed the accelerator, performed the simulation and experiment, and wrote the paper. Y.J. (Yongchul Jung) and S.L. implemented the processor and revised the manuscript. Y.J. (Yunho Jung) conceived of and led the research, analyzed the experimental results, and wrote the paper. All authors read and agreed to the published version of the manuscript.
Funding: This research received no external funding.