An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

Abstract: Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition and speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs have been proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which can reduce the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with a low power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture well. The accelerator is implemented in TSMC 40 nm technology with a core size of 0.17 mm². It achieves 7.03 TOPS/W energy efficiency and 4.14 TOPS/mm² area efficiency at 100.1 mW, which makes it a promising design for embedded devices.


Introduction
Convolutional neural networks (CNNs) have been widely applied in a variety of domains [1][2][3], achieving great performance in many tasks including image recognition [4], speech processing [5], and natural language processing [6]. As CNN models have evolved, larger and deeper structures have promised improving prediction accuracy [7]. However, the number of parameters also increases dramatically, resulting in unacceptable power dissipation and latency, which hinders the implementation of Internet-of-Things (IoT) applications like intelligent security systems.
These problems stimulate research on both algorithms and hardware designs in pursuit of low power and high throughput. In terms of the former, one approach is to compress the model by pruning the redundant connections, resulting in a sparse neural network [8]. Nevertheless, it also bears supplementary loads, including the pruning, Huffman coding, and decoding. Another easier way is to simply train low bit-width CNN models, in which each weight and activation can be represented with a few bits. For example, due to binarized weights and activations, Binarized Neural Networks (BNNs) [9] and XNOR-Net [10] use bitwise operations instead of most of the complicated arithmetic operations, and can still achieve rather competitive performance. In addition, most of the parameters and data can be kept in on-chip memory, enabling greater performance and reducing external memory accesses. Nevertheless, such an aggressive strategy cannot guarantee accuracy. To compensate for this, DoReFa-Net [11] adopts low bit-width activations and weights to achieve higher recognition accuracy. These delicate models are more suitable for mobile and embedded systems. Moreover, the hardware design of CNNs has also drawn much attention in recent years. Currently, Graphics Processing Units (GPUs) and the TPU [12] have been the mainstay for DNN processing. However, the power consumption of GPUs, which are general-purpose compute engines, is relatively high for embedded systems. The power consumption of another popular platform, the TPU, is low, but its resource utilization is not satisfactory when processing most of the benchmarks.
Generally, there are three kinds of methods for mapping layer computation onto the computing units in hardware. First, previous research [13][14][15][16][17][18] maps CNN models onto an accelerator with only one computing unit, which processes the layers iteratively. This approach is called "one size fits all". However, the fixed dimensions of one computing unit cannot be compatible with all the layers of different dimensions, which leads to resource inefficiency, especially in Fully Connected (FCN) layers [19]. Some recent works [20][21][22][23][24] focus on a parallel streaming architecture, which partitions a system into several independent tasks and runs them on parallel hardware [25]. In general, the partitioning includes the task level and the data level: tasks are separated into sequential and parallel modules, where sequential modules process different tasks and parallel modules execute the same task on different data. Based on this architecture, many accelerators have been proposed. Yang et al. [20] adopt a second mapping approach called "one to one", which means that each layer is processed by an individually optimized computing unit. Thus it can achieve high resource utilization. Nevertheless, this approach requires more on-chip memory resources and control logic resources (i.e., configuration and communication logic). Shen et al. [22] and Venieris et al. [23] both present a resource partitioning methodology for mapping CNNs onto FPGAs, which can be regarded as a trade-off approach between "one size fits all" and "one to one". These works can only construct an optimized framework for one specific CNN model at a time, and the framework cannot be applied flexibly to other models. In these works, a layer is the smallest unit of granularity during the partitioning. The computational workload balance is usually the primary concern, and the processing order may not follow the layer sequence of a CNN model, which probably causes too many data accesses to the external memory.
Recently, there has been an increasing amount of research [17,18,21,26,27] focusing on accelerators targeting low bit-width CNNs. Wang et al. [17], Venkatesh et al. [26], and Andri et al. [18] all propose their own architectures for deep binarized CNNs, working in the "one size fits all" manner. In these works, the energy efficiency is better than that of conventional CNN accelerators, since the complicated multiplications and additions are replaced by simple operations. However, these three works do not take the acceleration of all kinds of layers into account. Umuroglu et al. build the FINN [27] framework and its second generation, FINN-R [28], for fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture, which works in the "one to one" manner. Li et al. [21] design an FPGA accelerator that specifically targets the same kind of low bit-width CNNs as ours. This accelerator consists of several computing units, each processing a group of layers, which is a trade-off approach. However, the processing time of each unit is not well balanced, which reduces the throughput due to large pipeline stalls. Moreover, these designs usually ignore the layers other than Convolutional (CONV) layers and FCN layers, and need extra hardware to support operations like batch normalization, activation, quantification, and pooling, which brings extra overheads.
To overcome these problems, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture, accompanied by different optimization strategies, aiming at reducing the power consumption and area overhead and improving the throughput. Our major contributions are summarized as follows.

•
We propose a novel coarse grain task partitioning (CGTP) strategy to minimize the processing time of each computing unit based on the parallel streaming architecture, which can improve the throughput. Besides, multi-pattern dataflows designed for different sizes of CNN models can be applied to each computing unit according to the configuration context.

•
We propose an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit, which supports two modes: AQ (processing activation and quantification) and AQP (processing activation, quantification, and pooling). This means that the AQP unit can process the "possible" max-pooling layer (it does not exist after every CONV layer) without any overhead. Besides, the low power property is also exploited by the staged blocking strategy in the AQP unit.

•
The proposed architecture is implemented and evaluated with TSMC 40 nm technology with a core size of 0.17 mm². It can achieve over 7.03 TOPS/W energy efficiency and about 4.14 TOPS/mm² area efficiency at 100.1 mW.
The rest of this paper is organized as follows. In Section 2, we introduce the background of CNNs and low bit-width CNNs. In Section 3, the efficient parallel streaming architecture and the design details are presented. We show the implementation results and the comparison with other recent works in Section 4. Finally, Section 5 provides a summary.

Convolutional Neural Networks
A CNN model essentially consists of cascaded layers, including the CONV layers, FCN layers, activation layers, pooling layers, and batch normalization layers. The CONV layers and FCN layers are the two most critical parts. The CONV layers bear the most intensive computation; they apply filters on the input feature maps to extract embedded visual characteristics and generate the output feature maps. For instance, a CONV layer receives C input feature maps and outputs E output feature maps, which correspond to the filtered results of E 3-D filters. Each 3-D filter operates on the C input feature maps to generate one output feature map. The FCN layers are usually stacked behind the CONV layers as classifiers and mainly perform matrix multiplication. In FCN layers, the features can be represented as a vector, and each node of this vector is connected to all nodes of the previous layer, which explains the huge number of parameters and the requirement for high bandwidth. Figure 1a shows the computation loops of a single CONV or FCN layer; the descriptions of the parameters can be seen in Figure 1b. However, the CONV and FCN layers are combinations of linear functions, which cannot generate new feature information. To tackle this problem, the CONV and FCN layers are usually followed by activation layers, whose non-linearity brings in new characteristics. Popular activation functions include the hyperbolic tangent function and the rectified linear unit. Besides, the pooling layers, which execute a non-linear down-sampling function, also bring some new changes to the features. Generally, max-pooling (MaxP) is the most popular choice, which outputs the maximum point from a small subregion of the feature maps [29]. The batch normalization layers take effect during training, where they can accelerate convergence and avoid the overfitting problem. With batch normalization layers [30], models can obtain the same accuracy with 14 times fewer training steps. Batch normalization alleviates internal covariate shift by normalizing the inputs of layers.
We take a CONV-based layer block as an example. It includes a CONV layer, a batch normalization layer, an activation layer, and a MaxP layer. A typical process is y^l = MaxP(Quant(Act(BN(a^l ∗ W^l + b^l)))), where l is the index of layers, a represents the input feature maps, W the weights, b the bias, and y denotes the output feature maps.

Low Bit-Width CNNs
The low bit-width CNNs are similar to the conventional CNN models in terms of structure. DoReFa-Net [11] is a method to train neural networks that have low bit-width weights and activations with low bit-width parameter gradients. While weights and activations can be deterministically quantized, gradients are stochastically quantized. An AlexNet trained with the DoReFa-Net method, with 1-bit weights and 2-bit activations, can achieve 0.498 top-1 accuracy on the ImageNet validation set, higher than BNN (0.279) and XNOR-Net (0.442). In this paper, the benchmarks and the algorithm refer to DoReFa-Net. In the DoReFa-Net model [11], weights and activations are trained as k-bit and m-bit fixed-point integers, respectively. As illustrated in Figure 2, within a CONV-based layer block, after the convolution operations there are four more steps to take: batch normalization, activation, quantification, and optional max-pooling. The data of the output feature maps after convolution are normalized by the trained statistical mean µ and variance σ. In addition, the results are processed by the scaling factor γ and the shifting parameter β. The common simplified batch normalization can be regarded as a linear function.
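The uniform k-bit quantification used by DoReFa-Net for activations can be sketched as follows. This is a minimal Python sketch of the standard DoReFa-style quantizer; the function name `quantize_k` is ours, not from the paper.

```python
def quantize_k(x: float, k: int) -> float:
    """Uniformly quantize x in [0, 1] to a k-bit code (DoReFa-style)."""
    levels = (1 << k) - 1          # 2^k - 1 quantization steps
    return round(x * levels) / levels

# With k = 2, activations take values in {0, 1/3, 2/3, 1}
assert quantize_k(0.25, 2) == 1/3
assert quantize_k(0.9, 2) == 1.0
```

With k = 2 this yields exactly the four activation values the accelerator operates on.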

The activation operation restricts the outputs to the range from zero to one. In this low bit-width CNN model, one of the activation functions shown in Figure 3 can be chosen to add the non-linearity, thus x_o = min(abs(x_i), 1), or x_o = tf.clip_by_value(x_i, 0, 1), (7) where tf.clip_by_value(x, min, max) is a function defined in TensorFlow that filters out the values beyond [min, max]. After that, the quantification function quantizes the real-number outputs from the activation layers to a few certain values.

Efficient Parallel Streaming Architecture for Low Bit-Width CNNs
First, we describe the top architecture of this accelerator, as shown in Figure 4. Second, we describe the computing unit, the main component of this design, to introduce the novel dataflows and the CGTP strategy. Then the novel reconfigurable three-stage AQP unit is proposed, which is another important part of the computing unit. Finally, we describe the novel interleaving bank scheduling scheme.

Top Architecture
Figure 4 shows the overall architecture of the CNN accelerator. It has been proved in [11], based on extensive experiments, that low bit-width neural networks can achieve competitive classification accuracy compared to their 32-bit precision counterparts. Besides, the properties of low bit-width provide more exploration space in terms of low power and high efficiency. In our work, we choose 2-bit and 1-bit as the data formats of activations and weights, respectively.
The accelerator is mainly composed of computing units, a memory module, a controller, and a configuration module. The computing units consist of three heterogeneous parts: two low bit-width convolutional computing (LCONV) units and a low bit-width fully connected computing (LFCN) unit, for processing CONV-based layer blocks and FCN-based layer blocks, respectively. Each unit contains different subelements. For example, an LCONV unit is composed of a 3D systolic-like low bit-width processing element (LPE) array and 16 activation-quantification-pooling (AQP) units. Compared to the LCONV units, the LFCN unit is composed of a 2D systolic-like LPE array and one AQP unit. The dimensions of the computing arrays in the LCONV units and the LFCN unit are 2 × 13 × 4 × 4 LPEs and 9 × 4 LPEs, respectively, for a total of 452 LPEs. The memory module includes a single-port SRAM and an SRAM controller, which are both separated into a weight part and an input feature data part.
The overall execution is managed by the controller. The instructions, which can be categorized into execution and configuration, are fetched from the external memory via the Advanced eXtensible Interface (AXI) bus and decoded in the controller. The execution commands are responsible for initiating the execution. The configuration contexts, such as stride, number of channels, initial address, and so on, are transferred to the configuration module. After receiving the configuration contexts, the configuration module reconfigures the data path, and the buffers supply data and parameters to the computing units. These computing units are respectively controlled by separate configuration contexts and finite state controllers in the configuration module.
This parallel streaming architecture is very suitable for CNNs: (1) CNNs have cascaded layers, which can be executed as sequential modules; (2) the multiple loops of intensive computation within a layer can be partitioned and executed in parallel conveniently; (3) a CNN accelerator should handle large batches of classification tasks based on the application requirements.

Merging Bit-Wise Convolution and Batch Normalization into One Step
We combine the bit-wise convolution operation with the batch normalization, on account of their common linear characteristics.In this way, we can operate the CONV layers and batch normalization layers together without any latency, additional consumption or silicon area overhead.
The computation expression of the convolution is demonstrated in Figure 1. The batch normalization stacked behind transforms the expression linearly by the factors γ, σ, β, and µ, which can be formulated as (4). Merging the similar terms yields a single linear function p · x + q, where p equals γ/σ, q equals β − µγ/σ, and these two parameters can be pre-computed offline. As shown in Figure 5, the results of the combination of CONV or FCN layers and batch normalization layers can be obtained by multiplying each product item by p and adding q to each bias item.
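The folding described above can be sketched in a few lines. This is a minimal NumPy sketch under the paper's definitions of p and q; the function name `fold_bn` and the scalar per-layer parameters are our simplifying assumptions.

```python
import numpy as np

def fold_bn(weights, bias, gamma, sigma, beta, mu):
    """Fold batch normalization into the preceding CONV/FCN layer.

    BN(z) = gamma * (z - mu) / sigma + beta  ==  p * z + q,
    with p = gamma / sigma and q = beta - mu * gamma / sigma.
    Scaling the weights by p and shifting the bias gives an
    equivalent single linear layer, computed entirely offline.
    """
    p = gamma / sigma
    q = beta - mu * gamma / sigma
    return weights * p, bias * p + q

# Sanity check: the folded layer matches conv followed by BN
w, b = np.array([1.0, -2.0, 0.5]), 0.25
g, s, bt, m = 1.5, 2.0, 0.1, 0.3
x = np.array([0.2, 0.4, -0.1])
z = np.dot(w, x) + b                       # conv output
bn = g * (z - m) / s + bt                  # conv + BN
wf, bf = fold_bn(w, b, g, s, bt, m)
assert np.isclose(np.dot(wf, x) + bf, bn)  # folded layer, same result
```

In the real accelerator p and q are applied per output channel, but the algebra is identical.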

3D Systolic-Like Array
In the original bit-wise convolution, the product of a 1-bit weight and a 2-bit activation has eight possible values. According to the optimized combination, we only need to change the pre-computed values stored in the local buffer and modify the bias values before the parameters are sent to the weight buffers from the external memory. These additional operations are finished offline and incur no overhead in hardware. Systolic arrays date back to the 1980s; however, history-aware architects can still gain a competitive edge from them [31]. In this work, we adopt the systolic array as our basic computing architecture. With the systolic dataflow, data can be continuously streamed along two dimensions of an array and processed in a pipelined fashion. When accelerating the CONV layers, we further improve data reuse by extending a conventional 2D array to a 3D tiled array for the intensive computation.
Figure 4 shows that a fixed number of rows are clustered to form a tile. Each tile has three groups of IO ports at the left, upper, and bottom edges. The input feature data are loaded at the left edge and horizontally shifted to the LPEs inside the tile on every cycle, while weights are broadcast to the LPEs located in the same column. The outputs are generated locally in each LPE and shifted to the edge of the array in the same direction as the input feature data. This dataflow is known as the output stationary dataflow defined in [19].
As shown in Figure 4, we introduce the LPE unit as the basic processing element, which executes local multiply-accumulations. Besides, it also supports zero-skipping. An LPE unit mainly consists of a product table and an adder. The table can be updated by the same approach used for transferring data in the array. The LPE can process one pair of input feature data and weight per cycle. In these trained models, the probability that zeros occur among the values of activations is generally more than 30%. Multiplications with operand zero are ineffectual, and the zero products have no impact on the partial sum. In our design, multiplications with zero operands are eliminated to reduce the power consumption.
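The LPE behavior can be modeled as a table lookup plus an add, with the zero-skip guard in front. This is a behavioral Python sketch, not the RTL; the table contents below (1-bit weights mapping to ±1 times the 2-bit activation, pre-scaled offline) are our illustrative assumption.

```python
def lpe_mac(acc, activation, product_table, weight):
    """One low bit-width processing element (LPE) step (behavioral sketch).

    The 1-bit weight and 2-bit activation code index a pre-computed
    product table; multiplications with a zero activation are skipped
    entirely, so the partial sum is returned unchanged and no lookup
    or addition fires (no switching activity).
    """
    if activation == 0:              # zero-skipping
        return acc
    return acc + product_table[weight][activation]

# Hypothetical table: weight bit 0 -> -1, weight bit 1 -> +1,
# activation codes 1..3 -> values 1/3, 2/3, 1 (BN factor folded in offline).
table = {0: {1: -1/3, 2: -2/3, 3: -1.0},
         1: {1:  1/3, 2:  2/3, 3:  1.0}}
acc = 0.0
for a, w in [(3, 1), (0, 0), (1, 0)]:    # (activation code, weight bit)
    acc = lpe_mac(acc, a, table, w)
assert abs(acc - (1.0 - 1/3)) < 1e-9     # the zero activation contributed nothing
```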

Multi-Pattern Dataflow
From an overall perspective, the input feature data are processed by the three computing units one by one. In the LCONV units, the computing begins with feeding the input feature data and weights to the first 3D tiled systolic array LCONV0 for convolutional operations using the output stationary dataflow. In this way, inside a tile, as shown in Figure 6, the LPEs of the same column generate output points located in one output feature map, whereas the output points computed in the LPEs of different columns are located in different output feature maps. Each tile computes independently, but we can decide the manner in which the data and weights are sent to the different tiles of the computing unit, depending on the configuration contexts, in order to improve the array utilization for different sizes of CNNs. As depicted in Figure 7, one way is that tiles share one set of input feature maps but with different sets of filters (OIDF); another is that tiles load different input feature maps with only one set of filters (DIOF). If the number of filters is much larger than P × Q, the former dataflow is better. However, if the number of filters is much less than P × Q, the second way should be selected to cater to the less-filter CONV layers.
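The dataflow choice above reduces to a simple comparison per layer. The sketch below captures that heuristic; the function name and the exact tie-breaking rule (choosing OIDF when the filter count reaches P × Q) are our assumptions, since the paper only states "much larger" and "much less".

```python
def choose_dataflow(num_filters: int, P: int, Q: int) -> str:
    """Pick a tile-level dataflow for a CONV layer (sketch of the heuristic).

    OIDF: tiles share One set of Input feature maps, Different Filters --
          preferable when there are many more filters than P*Q.
    DIOF: tiles take Different Input feature maps, One set of Filters --
          preferable for CONV layers with few filters.
    """
    return "OIDF" if num_filters >= P * Q else "DIOF"

assert choose_dataflow(256, 4, 4) == "OIDF"   # filter-rich layer
assert choose_dataflow(8, 4, 4) == "DIOF"     # few-filter layer
```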
This architecture is flexible enough to be used with any size of images and any size of convolution kernel (e.g., 1 × 1, 3 × 3, 7 × 7, 11 × 11, ...), because the dataflow inside a tile gives priority to the data along the channel dimension of the input feature maps. However, the stride does have a limitation: it cannot be over 4, due to the bandwidth limitation of the SRAM.

Coarse Grain Task Partitioning Strategy
Tasks can be partitioned and allocated to several computing units in coarse or fine granularity. Coarse grain indicates that the partition works at the inter-layer level, while fine grain refers to partitioning at the intra-layer level. Compared to fine granularity, coarse granularity partitioning can reduce the communication overheads between computing units and avoid much control and configuration logic. Our design regards the processing of one layer as the minimum task, which cannot be further divided.
If the tasks are not well partitioned, they will introduce a large load imbalance: the computing units holding easy tasks will be compelled to wait for the units bearing heavy tasks, so the slowest computing unit dominates the throughput. To address this problem, the coarse grain task partitioning (CGTP) strategy is proposed to keep the loads of the two LCONV units as balanced as possible. The CONV and FCN layers contribute almost all the computation and storage of CNNs, whereas other layers can be omitted. The FCN layers can be regarded as CONV layers when the sizes of the input feature maps and filters are the same. Due to the limited hardware resources, it is necessary to unroll the loops and parallelize the low bit-width CNNs onto our existing hardware with a certain number of iterations [32]. Suppose the size of the computing array is W (row) × P (column) × Q (tile); accordingly, the parallelism should be W × P × Q. The computing cycles of a single CONV layer are calculated for two cases, the OIDF dataflow and the DIOF dataflow, and the computing cycles of the FCN layers are calculated analogously. In our design, the layers of a CNN model are separated into three groups, which correspond to three tasks. Each group of layers is mapped onto one computing unit. In this work, we allocate an independent computing unit, LFCN0, to the FCN layers due to their low parallelism. Thus the problem is simplified into dividing the CONV layers into two groups. Hence, we design Algorithm 1 to obtain two groups with roughly equal runtime. First, the whole processing time of the CNN model is obtained in the first loop. Second, during each pass of the second loop, the processing time of each CONV layer is added iteratively, storing the current and the last accumulated time; a series of comparisons among these values then decides which group the current CONV layer should join.
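The core of the partitioning idea can be sketched as a greedy split of the ordered layer list. This is our condensed Python rendering of the idea behind Algorithm 1, not a transcription of it; `layer_cycles` stands for the per-layer cycle counts under whichever dataflow each layer uses.

```python
def cgtp_partition(layer_cycles):
    """Split consecutive CONV layers into two groups with balanced runtime.

    Accumulate per-layer processing times in order and cut at the point
    closest to half of the total, so the two LCONV units in the streaming
    pipeline see roughly equal loads (sketch of the CGTP idea).
    """
    total = sum(layer_cycles)
    acc = 0
    for split, t in enumerate(layer_cycles, start=1):
        prev = acc
        acc += t
        if acc >= total / 2:
            # keep whichever cut point is closer to a perfect balance
            if abs(prev - total / 2) < abs(acc - total / 2):
                split -= 1
            return layer_cycles[:split], layer_cycles[split:]
    return layer_cycles, []

g1, g2 = cgtp_partition([10, 30, 25, 20, 15])
assert sum(g1) == 40 and sum(g2) == 60   # best consecutive split of 100 cycles
```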
The layers of activation, quantification, and optional pooling can also be adjusted by some transformations to simplify the executing process. For example, Equation (6) shows that the outputs from CONV or FCN layers are activated by taking the minimum between the absolute value of the inputs and 1, which restricts the value to [0, 1] to meet the input range requirement of the quantification step. As for the quantification, it is demonstrated in (8) that the outputs are quantized to four possible values a(0, 1/3, 2/3, 1) in the case of 2-bit quantification (k = 2). It can be implemented by a series of comparisons between the outputs of the activation layers and the thresholds th(1/6, 1/2, 5/6). For example, a certain output neuron equal to 1/4 is bigger than 1/6 but smaller than 1/2; therefore, it will be quantized to 1/3. We make some adjustments to the original quantification function (8). The modified function, which mainly replaces the function round(), is shown below: f(x) = a1, x > th3; a2, th2 < x ≤ th3; a3, th1 < x ≤ th2; a4, x ≤ th1, (14) where x represents the absolute value of the output point from CONV or FCN layers.
The modified quantification function is no longer limited by the input range produced by the activation, because it already takes the inputs outside the range [0, 1] into account. It achieves the same effect as the previous activation and quantification functions. Besides, this modified quantification function also applies to the second activation function (7), where x denotes the output point from CONV or FCN layers.

Architecture of AQP Unit and Dataflow
The MaxP layers have a computing pattern similar to that of the modified quantification layers mentioned above. Moreover, the max-pooling operation does not exist in every CONV-based layer block, so it is not efficient to allocate an independent piece of hardware for it. Hence, we propose an efficient reconfigurable three-stage AQP unit, which can be reconfigured into two modes: the activation-quantification-pooling (AQP) mode and the activation-quantification (AQ) mode. Thanks to this design, the max-pooling operation can be incorporated into the process of activation and quantification at the hardware level, which means that processing max-pooling operations brings no extra overhead. Besides, the design is further optimized to be more energy-saving with the staged blocking strategy. Moreover, the three-stage pipeline structure can reduce the data path delay and support flexible window sizes.
As shown in Figure 8a, its function is controlled by a 1-bit configuration word Config, where 1 denotes the AQP function and 0 represents the AQ function. The AQP unit is composed of three stages; each stage primarily comprises a comparator and some registers. The three-stage structure enables it to work in a pipelined way, which allows supporting different sizes of the max-pooling window. As depicted in Figure 8, the AQP module is designed as a separate part of the computing units, and mainly consists of the AQP units and the internal buffer. When Config is set to AQP mode, it fetches the output feature data stored in the Ping-Pong internal buffers, which ensures that the convolutional and AQP operations will not interfere with each other and can be processed simultaneously. Furthermore, the convolutional operation is far more computation-intensive than the max-pooling operation; therefore, the convolution can be processed in a non-stop way. When the dataflow in the 3D systolic-like array is OIDF, it generates a portion of the data located in 16 output feature maps simultaneously over a period of continuous time. Each of the 16 output feature maps has its own bank group to guarantee that the data points of each output feature map can be accessed at the same time. When the dataflow in the array is DIOF, the array produces part of the data located in 4 output feature maps for a continuous period of time. In this way, every four AQP units process the data of one output feature map. When Config is set to AQ mode, the data stream into the AQP units directly without waiting in the internal buffers. There are 16 AQP units in each LCONV unit, and only one in the LFCN computing array.

Working Process of AQP Unit under Two Modes and Staged Blocking Strategy
The comparator outputs a positive signal when the operand above is bigger than the operand below. Figure 9a shows the details of the processing in AQP mode, with Config set to 1. For example, assume that (a, b, c, d) are four pixels of a subregion from one feature map, with the value relationship among these pixels and the thresholds as follows. These four pixels enter the AQP unit one by one. At cycle 1, the pixel a is compared with th2; since a is bigger than th2, the comparison signal CP1 turns to 1 and is stored in the register EN1. Therefore, th2 is chosen as the quantification result of a and stored in the register R1 of stage 1. At cycle 2, the pixel b enters stage 1 of the AQP unit; since the register EN1 has already been set to 1 in the previous cycle, whatever the relationship between b and th2, the comparison signal CP1 is kept at 1 by MUX11. Therefore, the register R1 remains unchanged. At the same time, th2 in the register R1 passes to stage 2 and is compared with th1. Although th2 is larger than th1, the output of MUX22 still remains 0, controlled by the EN1 passed from stage 1; therefore, the output is kept at th2. In this way, the register R2 keeps the first comparison result th2, and so on. Once the signal Output_EN turns to 1, the result stored in the register R3 is sent out. This means that if the input is bigger than the threshold at a certain stage, the register EN turns to 1 to shut down the remaining operations in both the horizontal (the remaining pixels streaming in) and the vertical (the remaining stages processing the current pixel) directions within the subregion, reducing high-low or low-high transitions.
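The net effect of the staged comparisons is that one pooling window yields the quantified value of its maximum, while the sticky EN flags suppress redundant comparator activity. The following is a behavioral Python model of that idea, not a cycle-accurate description of Figure 9; the threshold ordering and code values follow the 2-bit scheme described earlier.

```python
def aqp_window(pixels, thresholds=(5/6, 1/2, 1/6), codes=(1.0, 2/3, 1/3, 0.0)):
    """Behavioral model of the three-stage AQP unit in AQP mode (sketch).

    Pixels of one max-pooling window stream in one per cycle. Each stage
    compares against one threshold, and a sticky EN flag freezes that
    stage once its threshold has been passed, blocking further switching
    (the staged blocking strategy). The final result equals quantifying
    the maximum of the window.
    """
    en = [False, False, False]      # per-stage blocking flags
    result = codes[3]               # default: lowest code
    for x in pixels:
        for stage, th in enumerate(thresholds):
            if en[stage]:           # stage already latched: skip the rest
                break
            if x > th:
                en[stage] = True
                result = codes[stage]
                break
    return result

window = [0.2, 0.55, 0.1, 0.4]
assert aqp_window(window) == 2/3    # max is 0.55, between 1/2 and 5/6
```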
Figure 9b depicts the details of the processing in AQ mode, where the configuration word Config is set to 0. This means that the comparison signal CP is no longer influenced by the EN of the local stage at the previous cycle, but it is still controlled by the EN of the previous stage at the previous cycle; that is, the set of EN registers holds back the remaining operations vertically. In one sense, the register EN works like a gate to block useless operations. Due to the reduced switching, a significant reduction in power consumption can be achieved.

Interleaving Bank Scheduling Scheme
In order to guarantee conflict-free data accesses and improve memory resource efficiency, an interleaving bank scheduling scheme is proposed. Algorithm 2 depicts the details, and Figure 10 shows one of the situations of the algorithm. We can observe that the multi-bank memory module can be divided at two levels: 1. Frame level: bank group 0 and bank group 1 load the input feature maps of different frames from the external memory alternately; this means that all even-numbered bank groups are configured to provide and receive the data of one frame, and all odd-numbered bank groups support another frame. 2. Computing unit level: each computing unit corresponds to a specific set of bank groups; for example, LCONV0 and LCONV1 connect to bank groups 0-3 and bank groups 2-5, respectively, and LFCN0 links to bank groups 4-7.
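The two-level division can be sketched as a small mapping function. This is our illustrative Python rendering of the scheme, under the assumption that each unit touches only the groups of its window whose parity matches the current frame; the function name and return format are ours.

```python
def bank_groups_for(unit: str, frame: int):
    """Bank groups a computing unit uses for a given frame (sketch).

    Frame level: even-numbered bank groups serve even frames and
    odd-numbered groups serve odd frames. Computing unit level: each
    unit owns a fixed, overlapping window of groups so consecutive
    units hand data off without conflicts (LCONV0 -> groups 0-3,
    LCONV1 -> 2-5, LFCN0 -> 4-7, as in the text).
    """
    windows = {"LCONV0": range(0, 4),
               "LCONV1": range(2, 6),
               "LFCN0":  range(4, 8)}
    parity = frame % 2
    return [g for g in windows[unit] if g % 2 == parity]

assert bank_groups_for("LCONV0", frame=0) == [0, 2]
assert bank_groups_for("LFCN0", frame=1) == [5, 7]
```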

Algorithm 2 Interleaving Bank Scheduling Scheme
1: G_i represents the number of layers of each group; i and j are the group index and the layer index within this group, respectively; B_u denotes a bank group; F is the number of frames, and v indicates the frame index;
2: // frame level
3: // computing units level
6: Read data from B_s, write data to B_t
If the data of the input feature maps exceed the capacity of the SRAM, the feature maps are divided according to their rows instead of the channels. Within a bank group (e.g., B0-7), a Ping-Pong strategy works to transfer the intermediate outputs to the external memory when the SRAM capacity is not enough for a certain network. Thus, there is actually no limitation on the input size, because the input feature maps can always be divided if they are too large. For example, when processing a small network like S-Net, which is derived from DoReFa-Net, there is no need to cache multiple times. Besides, the time for transferring data from DRAM to SRAM can be overlapped with computation.
Our design has two single-port SRAMs, one for the data buffer and the other for the weight buffer. The data buffer is composed of 14 banks and the weight buffer of 6 banks. In the data buffer, the 14 banks are partitioned into four groups: 4-bank, 4-bank, 4-bank, and 2-bank. The first three are designed to support the CONV layers under the streaming architecture in a Ping-Pong manner. The last one supports the FCN layers; considering that this is the last step of CNN processing, 2 banks are enough. In the weight buffer, the 6 banks are partitioned into three 2-bank groups, each providing weights to one of the three computing units (two LCONV units and one LFCN unit). The capacity of the data buffer is 11.8 KB. The total bandwidth of each bank group is set to 104 bits, which can support a maximum stride of 4. The weight buffer is 30.7 KB, and the pooling buffer is 1.31 KB, which is sufficient to store the intermediate feature data.
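As a quick sanity check on the organization above, the bank counts and group sizes can be tabulated; the numbers are taken from the text, while the activations-per-access figure is derived here from the stated 2-bit activation width and is not quoted from the paper.

```python
# Buffer organization as stated in the text (sizes in banks / KB / bits).
DATA_BANKS   = [4, 4, 4, 2]        # three CONV groups + one FCN group
WEIGHT_BANKS = [2, 2, 2]           # one 2-bank group per computing unit
DATA_BUFFER_KB = 11.8
GROUP_BANDWIDTH_BITS = 104         # per bank group, supports stride up to 4

assert sum(DATA_BANKS) == 14       # 14-bank data buffer
assert sum(WEIGHT_BANKS) == 6      # 6-bank weight buffer

# With 2-bit activations, one 104-bit access fetches 52 activations
# (a derived figure, assuming the full group width carries activations):
print(GROUP_BANDWIDTH_BITS // 2)
```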

Evaluation
In the evaluations, four networks, AlexNet, VGG-16, D-Net, and S-Net (SVHN-based CNN models), are chosen as benchmarks. These models are derived from DoReFa-Net proposed in [11], whose weights and activations are represented with 1 bit and 2 bits, respectively. First, the combined performance of different task partitioning strategies and dataflows is evaluated on the four benchmarks. Then a comparison with state-of-the-art designs is presented.

Computation Complexity
The total number of operations counts both additions and multiplications, following Diannao [33]. For a CONV layer with M output channels, C input channels, kernel size R, and output size E, the number of operations can be calculated as OPs = 2 · M · C · R^2 · E^2, summed over all layers. Due to the optimization above, the batch normalization operations are hidden in the convolutional operations and require no extra calculation.
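The counting convention can be made concrete as follows. The dimension names (M, C, R, E) follow the layer parameters used elsewhere in the paper; the helper functions themselves are a sketch, not the paper's tooling.

```python
# Operation counts with multiplications and additions both included,
# following the Diannao convention referenced above.

def conv_ops(M, C, R, E):
    # Each of the M*E*E outputs needs C*R*R multiplies and C*R*R adds.
    return 2 * M * C * R * R * E * E

def fcn_ops(M, C):
    # A fully connected layer is M dot-products of length C.
    return 2 * M * C

# e.g., a 3x3 layer with 64 input / 128 output channels on a 32x32 output map:
print(conv_ops(128, 64, 3, 32))   # -> 150994944
```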

Performance
The reachable peak throughput of the system can be calculated theoretically as TP_peak = Σ_{u=1}^{l} OP_u · f, where l denotes the number of computing units, OP_u is the number of operations unit u completes per cycle, and f represents the frequency. However, the computing units cannot be kept at full load all the time, mainly because of irregularly shaped networks. Therefore, we calculate the effective throughput as TP_eff = (Σ_{j=1}^{s} OPs_j) / T, where s represents the number of layers within each group after the task partitioning, and T is the interval (i.e., the processing time of the slowest computing unit). Besides, two further metrics evaluate efficiency: the energy efficiency is TP_eff / Power, and the area efficiency is TP_eff / Area. In addition, the utilization of the arithmetic units is noteworthy, measured as TP_eff / TP_peak.
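These metrics are straightforward to compute; a minimal sketch follows, using the symbols from the text (l units, frequency f, interval T). The function names and the per-unit op counts are assumptions for illustration, while the example numbers (703.4 GOPS, 100.1 mW, 0.17 mm²) are the figures reported in this paper.

```python
# Performance metrics of the streaming accelerator, as defined above.

def peak_throughput(ops_per_cycle_per_unit, l, f_hz):
    """Theoretical throughput with every one of the l units fully loaded."""
    return l * ops_per_cycle_per_unit * f_hz

def effective_throughput(layer_ops, T_seconds):
    """Total operations of the layers in a group divided by the interval T
    (processing time of the slowest computing unit)."""
    return sum(layer_ops) / T_seconds

def energy_efficiency(tp_ops, power_w):
    return tp_ops / power_w          # OPS per watt

def area_efficiency(tp_ops, area_mm2):
    return tp_ops / area_mm2         # OPS per mm^2

# The reported AlexNet throughput at the reported core power and area:
print(energy_efficiency(703.4e9, 0.1001) / 1e12)  # ~7.03 TOPS/W
print(area_efficiency(703.4e9, 0.17) / 1e12)      # ~4.14 TOPS/mm^2
```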

Performance Comparison
We first designed a cycle-accurate behavioral model of the whole system in SystemC, including the accelerator, the AXI interface, the DMA, and the DRAM, to simulate and verify the data path, the control path, and the configuration path. The design was then coded at register transfer level (RTL) and synthesized with TSMC 40 nm technology using Synopsys Design Compiler. The peak performance is 723.2 GOPS, and the core power is 100.1 mW at an 800 MHz frequency. According to DoReFa-Net [11], the series of CNNs with 1-bit weights and 2-bit activations also has competitive accuracy.

Analysis on CGTP Strategy
The recent work of [21] partitions the PE array into several parts, but only one block of the PE array is allocated to process the CONV layers. In contrast, we devote two processing units to the CONV layers and use the CGTP strategy to balance the computing load and minimize the execution time. Without the CGTP strategy, the processing time of the CONV layers is much longer than that of the FCN layers, and this imbalance leads to long pipeline stalls and low throughput. We set the approach of [21] as a baseline and compare the performance and array utilization between it and this work on the four benchmarks.
Figure 11 illustrates the comparison based on the D-Net model, whose architecture is shown in Table 1. The blue-slash grids indicate an example of processing a complete model in our design, which spans three short intervals, whereas the red-slash grids represent that of [21], which spans two long intervals. The white grids show the suspension of computing units, caused by their different processing times. The idle time is less in our design, so the time to process one frame is shorter. Besides, data dependencies between neighboring layers are allowed, which removes some unnecessary data movement. As shown in Figure 12, the throughput and the array utilization are improved by nearly 2× on average thanks to the CGTP strategy.
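The load-balancing idea behind CGTP can be sketched as choosing the cut point that minimizes the pipeline interval. This is a simplified sketch with placeholder per-layer times; Algorithm 1 in the paper additionally accounts for the dataflow mode and the LCONV dimensions (P, Q, W).

```python
# Split the CONV layers into two consecutive groups so that the slower
# group (which sets the pipeline interval T) finishes as fast as possible.

def partition(layer_times):
    """Return (cut, interval): layers [0, cut) go to LCONV0 and
    [cut, end) to LCONV1, minimizing the slower group's total time."""
    best_cut, best_interval = 1, float("inf")
    for cut in range(1, len(layer_times)):
        interval = max(sum(layer_times[:cut]), sum(layer_times[cut:]))
        if interval < best_interval:
            best_cut, best_interval = cut, interval
    return best_cut, best_interval

# Placeholder layer times: the best cut makes both groups take 5 units,
# instead of the 10-unit interval a single unit would need for all layers.
print(partition([3, 2, 1, 2, 2]))
```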

Analysis on Dataflow
In the processing of the CONV layers, we introduce three kinds of dataflow, the OIDF dataflow, the DIOF dataflow, and their mixture, which apply to three kinds of CNN models, respectively. The OIDF dataflow is better suited to CNN models with a large number of filters at each layer, whereas the DIOF dataflow is designed for CNN models consisting of layers with fewer filters. In the third kind of CNN model, the former layers have few filters, but the number increases as the network goes deeper; in such situations, the mixture dataflow is more appropriate. In general, the first and the third kinds of CNN models are more popular. Figure 12 shows the comparison between these three kinds of dataflows on the four benchmarks after the CGTP optimization. Besides, Table 2 describes the task partitioning approach for each benchmark under every dataflow in detail. The specific partition changes with the dataflow, because the processing time of each layer differs under different dataflows. For instance, when the benchmark is AlexNet, the LCONV1 unit processes the CONV1 and CONV2 layers and the LCONV2 unit processes the CONV3-5 layers under the DIOF dataflow.
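The model-to-dataflow matching described above amounts to a simple rule on per-layer filter counts. The sketch below is a hedged reading of that rule; the threshold of 64 filters is an assumption for illustration, not a value from the paper.

```python
# Heuristic dataflow selection based on the number of filters per layer:
# many filters everywhere -> OIDF, few everywhere -> DIOF,
# few in early layers but growing with depth -> the mixture dataflow.

def choose_dataflow(filters_per_layer, many=64):
    many_flags = [n >= many for n in filters_per_layer]
    if all(many_flags):
        return "OIDF"
    if not any(many_flags):
        return "DIOF"
    return "mixture"   # few filters early, more in deeper layers

print(choose_dataflow([96, 256, 384, 384, 256]))  # AlexNet-like filter counts
```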

Benchmarks
Each layer of AlexNet, VGG-16, and D-Net has a relatively large number of filters. Hence, the throughput and array utilization are equally high when adopting the OIDF dataflow or the mixture, and lower with the DIOF dataflow. S-Net is a typical example of the third kind of CNN model; it is derived from D-Net by reducing the number of filters in all the CONV layers and the first FCN layer by 75%. Compared to D-Net, S-Net loses only about 3% in top-5 accuracy, whereas the number of parameters decreases substantially. In the test case of S-Net, we observe about a 30% degradation in throughput and array utilization when adopting the OIDF dataflow compared to the DIOF dataflow. It is also observed that S-Net with the mixture dataflow achieves nearly double the performance of the OIDF dataflow.
Across these benchmarks, this work achieves 659.41 GOPS throughput and 91.79% array utilization on average. Moreover, AlexNet, which has the biggest ratio between the computation load of the FCN layers and that of the CONV layers among the benchmarks, shows the best performance with 703.4 GOPS. The reason is that when we allocate the relative numbers of LPEs for the LCONV and LFCN units, we give priority to AlexNet because it has the heaviest FCN computation load. If we increased the ratio of the LPE numbers of the LCONV units to the LFCN unit, the processing time of the LFCN unit would be much longer than that of the other two units due to fewer LPEs when running AlexNet. So, to some extent, AlexNet is the bottleneck of these benchmarks. As shown in Table 3, the resource utilization of the LFCN unit is close to 100%. Besides, the utilization on the other three benchmarks is also about 90%, better than previous works. The accelerator is composed of control modules (controller and configuration module), computing units, and on-chip buffers. We synthesized the three parts with Synopsys Design Compiler separately, and the resulting power and area breakdown is illustrated in Figure 13. The logic modules take 9.91% of the area and 1.10% of the power. The computing units take 7.53% of the area and 8.99% of the power. The memory consumes 82.56% of the area and 89.91% of the power to store and transmit the weights, activations, and intermediate values.

Comparison with Previous Works
Some designs, like those in [21,34], cannot make full use of the arithmetic units because of shallow channels or varying kernel sizes. For example, in [34], the convolution engine is designed around the common 3 × 3 kernel size, which leads to heavy underutilization when processing convolutions with other window sizes, especially 1 × 1 convolutions. Table 4 shows the details of processing the benchmark AlexNet, including the layer description, the latency, the interval, and the effective throughput. It is noted that the latency of each layer scales linearly with its number of operations in our design, which indicates that our architecture and dataflow solve this problem well and improve the array utilization at each layer. When accelerating AlexNet, our design requires only 6.15 ms to classify an image, achieving a TOP-1 accuracy of 0.498.
Considering that works like Eyeriss [15] and Diannao [33] target conventional high-precision CNNs and are more complicated in computation, we only compare our work with low bit-width CNN accelerators. To make our 40 nm design comparable with the others, we follow a series of scaling equations according to [17,35]. This projection gives an idea of how the various implementations would perform in a recent technology.

Conclusions
In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. The proposed accelerator can process various sizes of CNN models with high throughput by applying the CGTP strategy, which builds on heterogeneous computing units and multi-pattern dataflows. Besides, a modified activation and quantification process is introduced to reduce computational redundancy. In addition, an efficient reconfigurable AQP unit is designed to support activation, quantification, and pooling operations. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture and reduce area overhead. The accelerator is implemented with TSMC 40 nm technology with a core size of 0.17 mm². The results show that this accelerator can achieve 7.03 TOPS/W energy efficiency and about 4.14 TOPS/mm² area efficiency, making it promising for integration into embedded IoT devices.

Figure 1 .
Figure 1. Illustration of CONV or FCN layers. (a) Pseudo code of CONV or FCN layers. (b) Descriptions of CONV or FCN layer parameters.

Figure 2 .
Figure 2. Basic stage of a CONV-based layer block.

Figure 4 .
Figure 4. Overall architecture, systolic-like computing arrays for CONV and FCN layers, and LPE.

Figure 5 .
Figure 5. Pseudo code of combination layers.

Algorithm 1
Coarse Grain Task Partitioning
Input: Number of CONV layers L; detailed dimensions of each CONV layer (M, C, H, R, E); detailed dimensions of LCONV (P, Q, W); dataflow mode
Output: Two groups A, B
1: for l = 1; l ≤ L; l++ do
2:

Figure 9 .
Figure 9. Processing details of (a) AQP mode, (b) AQ mode. The dark background indicates that the stage is blocked.

Figure 12 .
Figure 12. Combined performance of the proposed CGTP strategy and different dataflows: (a) array utilization, (b) throughput.
Figure 11. Task flow comparison between CGTP and the approach in [21].

Table 2 .
Detailed task partitions when applying different dataflows on four benchmarks.

Table 3 .
Detailed performance of each computing unit under the optimal dataflow.

Table 5 .
Comparison with previous works.