Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights

As a key ingredient of deep neural networks (DNNs), fully-connected (FC) layers are widely used in various artificial intelligence applications. However, FC layers contain a large number of parameters, so the efficient processing of FC layers is restricted by memory bandwidth. In this paper, we propose a compression approach combining block-circulant matrix-based weight representation and power-of-two quantization. Applying block-circulant matrices in FC layers reduces the storage complexity from O(k^2) to O(k). By quantizing the weights into integer powers of two, the multiplications in the inference can be replaced by shift and add operations. The memory usages of models for MNIST, CIFAR-10 and ImageNet can be compressed by 171x, 2731x and 128x with minimal accuracy loss, respectively. A configurable parallel hardware architecture is then proposed for processing the compressed FC layers efficiently. Without multipliers, a block matrix-vector multiplication module (B-MV) is used as the computing kernel. The architecture is flexible enough to support FC layers of various compression ratios with a small footprint. Simultaneously, the memory access can be significantly reduced by the configurable architecture. Measurement results show that the accelerator has a processing power of 409.6 GOPS and achieves 5.3 TOPS/W energy efficiency at 800 MHz.


Introduction
Deep neural networks (DNNs) have been widely applied to various artificial intelligence (AI) applications [1][2][3][4] and achieve great performance in many tasks such as image recognition [5][6][7], speech recognition [8] and object detection [9]. To complete the tasks with higher accuracy, larger and more complex DNN models emerge and become increasingly popular. Large scale DNNs, however, require massive parameters and high computation complexity. For instance, there are 138 million weights and 15.5 billion multiply-accumulate (MAC) operations in VGG-16 for image recognition [10]. As a result, the implementation of these DNN models is challenging for portable devices with restricted computational capability and memory resources.
Fully-connected (FC) layers are applied in various deep learning systems. Neurons in an FC layer have connections to all activations in the previous layer, making FC layers the most memory-intensive part of DNNs. The main contributions of this paper are as follows:
(1) An efficient compression technique combining block-circulant matrices and power-of-two quantization for fully-connected layers is proposed. This approach significantly reduces the networks' storage usage. Specifically, we directly train the FC layers in DNN models with block-circulant matrices, which reduces the storage requirement from O(n^2) to O(n). Based on these models, we further quantize each entry in the weight matrices to an integer power of two. The computational complexity can be significantly reduced because the multiplications in the forward pass can be replaced by shift operations.
(2) An efficient, configurable and scalable hardware architecture is proposed for the compressed networks. Instead of FFT-based acceleration methods, we design this architecture to take advantage of the circulant structure of the weight matrices and the power-of-two quantization. Notably, multipliers are replaced by customized processing elements (PEs) using shift and add operations. Because the sizes of the block-circulant matrices are adjustable, we design a configurable hardware architecture, and a reorganization scheme is used to realize the configurability across layers and reduce the memory access for parameters.
(3) Experiments on several datasets (MNIST, CIFAR-10, and ImageNet) are presented to demonstrate the general applicability of the proposed approach. According to our experiments, 4 bits are enough to represent the weights, so quantization provides an additional eightfold compression ratio.
The proposed hardware architecture was also implemented and evaluated. The implementation results demonstrate that the proposed design has high energy efficiency and area efficiency.

Computation of FC Layers
Matrix-vector multiplication (MV) is the key computation of FC layers in DNN inference. The computation in an FC layer is performed as

y = f(Wx + v),    (1)

where x is the input activation vector; y is the output activation vector; W ∈ R^(a×b) is the weight matrix of the layer, in which b input nodes connect with a output nodes; v ∈ R^a is the bias vector; and f(·) is the nonlinear activation function, for which the Rectified Linear Unit (ReLU) is widely adopted in various DNN models. The space complexity and computational complexity of an FC layer are both O(ab). In practice, the values of a and b in an FC layer can be large and there is no data reuse, so the computation of FC layers is memory intensive. For many neural network architectures, memory access is the bottleneck in processing FC layers efficiently.
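As a concrete illustration of Equation (1), a minimal NumPy sketch of one FC layer's forward pass (a toy example, not the paper's implementation) looks like this:

```python
import numpy as np

def fc_layer(W, x, v):
    """Forward pass of a fully-connected layer: y = ReLU(W @ x + v)."""
    return np.maximum(W @ x + v, 0.0)

# Toy layer: b = 4 inputs, a = 3 outputs (W is a x b, v has length a).
W = np.array([[ 1.0, -2.0, 0.5,  0.0],
              [ 0.0,  1.0, 1.0, -1.0],
              [-1.0,  0.0, 2.0,  3.0]])
x = np.array([1.0, 1.0, 2.0, -1.0])
v = np.array([0.0, 0.5, -1.0])
y = fc_layer(W, x, v)   # ReLU clips negative pre-activations to zero
```

Both the storage (entries of W) and the compute (one MAC per entry) scale as O(ab), which is what the compression below attacks.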

Block-Circulant Matrix
A circulant matrix W_c ∈ R^(k×k) is defined by a primitive vector w_c = (w_c^1, w_c^2, ..., w_c^k), which forms the first row of the matrix; each subsequent row is a cyclic shift of the row above it:

W_c = [ w_c^1  w_c^2  ...  w_c^k
        w_c^k  w_c^1  ...  w_c^(k-1)
        ...
        w_c^2  w_c^3  ...  w_c^1 ].

If a matrix is composed of a set of circulant sub-matrices (blocks), it is called a block-circulant matrix [24]. A block-circulant matrix W_b can be expressed as W_b = [W_c^(i,j)], where each W_c^(i,j) ∈ R^(k×k) is a circulant sub-matrix and can be represented by a primitive vector of length k. Thus, if a normal matrix is transformed into a block-circulant matrix, the number of parameters to be stored is reduced by a factor of k.
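The construction above can be sketched in a few lines of NumPy (the helper name `circulant` is ours):

```python
import numpy as np

def circulant(w):
    """Build a k x k circulant matrix whose first row is the primitive
    vector w; each subsequent row is the previous row cyclically shifted
    right by one position."""
    k = len(w)
    return np.array([np.roll(w, i) for i in range(k)])

w = np.array([1, 2, 3, 4])
Wc = circulant(w)
# Only the length-k primitive vector needs to be stored:
# k entries instead of k^2.
```

For k = 4 this stores 4 values instead of 16; the k-fold saving is exactly the O(k^2) to O(k) reduction stated above.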

Proposed Compression Method
The approach of representing the weight matrices of DNNs with block-circulant matrices was proposed in [24], where an FFT-based acceleration method is applied to speed up inference and training. Instead of the FFT-based method, we propose to employ power-of-two quantization in block-circulant matrix-based DNNs to further reduce the storage requirement and computational complexity. This section describes the proposed compression method combining block-circulant matrix-based weight representation and power-of-two quantization.

Block-Circulant Matrix-Based FC Layers
When a block-circulant matrix is applied in an FC layer, the original weight matrix W ∈ R^(a×b) can be represented by a 2D grid of square sub-matrices, where each sub-matrix is a circulant matrix. Assume that the weight matrix W is partitioned into p × q sub-matrices and the block size (the size of each sub-matrix) is denoted by k. Here, p = a/k and q = b/k. Then, W = [W_ij], i ∈ {1, ..., p}, j ∈ {1, ..., q}, and the primitive vector of W_ij is w_ij = (w_ij^1, w_ij^2, w_ij^3, ..., w_ij^k). Simultaneously, the input vector is divided into q sub-vectors x_j, and then (1) is given by (with bias and ReLU omitted)

y_i = Σ_{j=1}^{q} W_ij x_j,  i ∈ {1, ..., p},    (4)

where y_i ∈ R^k and y_i = [y_i^1, y_i^2, ..., y_i^k]^T. Figure 1 illustrates this representation: a weight matrix is partitioned into p × q circulant sub-matrices with block size 4, and the result of Wx is computed from the multiplications of sub-matrices and their corresponding sub-vectors. As described in Section 2.2, only the primitive vector of each sub-matrix needs to be stored, so the storage complexity of an FC layer is reduced from O(ab) (i.e., O(pqk^2)) to O(pqk). Because of the flexible size of the block-circulant matrix, the compression ratio is adjustable and determined by the block size. A larger block size brings a better compression ratio but more accuracy loss. Conversely, a smaller block size provides higher prediction accuracy, but the compression ratio degrades. Notably, block-circulant matrix-based DNNs have been proven by theoretical analysis to asymptotically approach the accuracy of the original networks [26].
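Equation (4) can be verified numerically with a small sketch (the helper names are ours; only the p × q primitive vectors are "stored"):

```python
import numpy as np

def circulant(w):
    """k x k circulant matrix whose rows are cyclic shifts of w."""
    k = len(w)
    return np.array([np.roll(w, i) for i in range(k)])

def block_circulant_mv(prim, x, k):
    """y = W x where W is stored only as a p x q grid of primitive
    vectors prim[i][j] (each of length k), per Equation (4)."""
    p, q = len(prim), len(prim[0])
    y = np.zeros(p * k)
    for i in range(p):
        for j in range(q):
            y[i*k:(i+1)*k] += circulant(prim[i][j]) @ x[j*k:(j+1)*k]
    return y

k = 4
rng = np.random.default_rng(0)
prim = [[rng.standard_normal(k) for _ in range(2)] for _ in range(2)]  # p = q = 2
x = rng.standard_normal(2 * k)

# Expanding the primitive vectors into the full 8 x 8 matrix gives
# the same result as the compressed computation.
W_full = np.block([[circulant(prim[i][j]) for j in range(2)] for i in range(2)])
assert np.allclose(block_circulant_mv(prim, x, k), W_full @ x)
```

Here p·q·k = 16 values are stored instead of p·q·k^2 = 64, matching the O(pqk) storage complexity.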
For the training of block-circulant matrix-based DNNs, the corresponding training algorithm based on backward propagation is proposed in [24]. In this work, we do not use FFT in the inference and training, so we remove FFT from the training algorithm in [24]. The compressed DNNs are directly trained with block-circulant matrix-based weights, where no retrain process is required.

Power-of-Two Quantization
Suppose that a DNN model, which uses block-circulant matrices in its FC layers, is pre-trained with full-precision weights. We propose to represent each entry of the weight matrix with an integer power of two. Thus, the weight matrix W can be approximated by a low-precision weight matrix W^, each entry w^ of which is defined by

w^ = sign(w) · 2^n,    (5)

where n is an integer and the sign function gives the sign of w^. Here, n is calculated by

n = clip(round(log2 |w|), n1, n2),    (6)

where n ∈ [n1, n2], n1 and n2 are two integers, and n1 < n2. Entries whose magnitudes are too small to reach the lowest level are quantized to zero, so all entries of W^ can be selected from a codebook set S_n = {0, ±2^n1, ..., ±2^n2}.
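The quantizer can be sketched as follows. This is a hedged reconstruction: it maps each weight to the nearest codebook value, which is one simple realization of Equations (5) and (6); the exact rounding rule in the original may differ.

```python
import numpy as np

def pot_quantize(w, n1, n2):
    """Map each weight to the nearest value in
    S_n = {0, ±2^n1, ..., ±2^n2} (nearest-level rounding; an
    assumption, the paper's exact rounding may differ)."""
    levels = np.concatenate(([0.0], 2.0 ** np.arange(n1, n2 + 1)))
    # For each |w|, pick the index of the closest magnitude level.
    idx = np.argmin(np.abs(np.abs(w)[..., None] - levels), axis=-1)
    return np.sign(w) * levels[idx]

w = np.array([0.3, -0.9, 0.01])
q = pot_quantize(w, n1=-3, n2=0)   # values drawn from {0, ±1/8, ±1/4, ±1/2, ±1}
```

With n1 = -3 and n2 = 0, the weight 0.3 snaps to 0.25 = 2^-2, -0.9 snaps to -1.0 = -2^0, and 0.01 snaps to 0.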
If we use b-bit indices to represent the entries of S_n, there are in total M = 2^b - 1 distinct quantization levels. To determine the value of n2, we find the maximal entry of W and apply Equation (6) to it. Since all entries are quantized into integer powers of two, the multiplications in the MV can be replaced by shift operations. The computational complexity of the inference is thereby significantly reduced and no multipliers are required.

Training and Quantization Strategy
A two-stage training process combining block-circulant matrix-based weight representation and a retrain-based quantization strategy is proposed. Algorithm 1 describes the training process. In the first stage, block-circulant matrices are employed in the FC layers of a DNN model. Specifically, the primitive vectors of the sub-matrices in the FC layers are randomly initialized with a normal distribution, and the other entries of the block-circulant matrices are generated from the primitive vectors. We then train the network with the block-circulant matrix-based training algorithm. In the second stage, based on the pre-trained model, we further quantize the weights with the power-of-two scheme and then retrain the network with the quantized parameters. Throughout, the network weight matrices are still represented with block-circulant matrices.
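The two-stage procedure can be sketched as below. The straight-through-style update (gradients taken at the quantized weights but applied to a full-precision copy) is our assumption about how the retraining in Algorithm 1 works, not a detail stated above; the helper names are hypothetical.

```python
import numpy as np

def train_step(prim_fp, grad_fn, lr, quantize=None):
    """One SGD update on the full-precision primitive vectors prim_fp.
    Stage 1 (quantize=None): plain block-circulant training.
    Stage 2: the loss gradient is evaluated at the quantized weights,
    but the update is applied to the full-precision copy, so small
    gradients can accumulate before a quantized value flips level."""
    w_used = quantize(prim_fp) if quantize is not None else prim_fp
    return prim_fp - lr * grad_fn(w_used)

# Toy quadratic loss C(w) = w^2, so grad = 2w.
pot = lambda v: np.sign(v) * 2.0 ** np.round(np.log2(np.abs(v)))
w = np.array([0.30])
w = train_step(w, grad_fn=lambda v: 2 * v, lr=0.1)                # stage 1
w = train_step(w, grad_fn=lambda v: 2 * v, lr=0.1, quantize=pot)  # stage 2
```

In the second call, 0.24 is quantized to 0.25 = 2^-2 for the gradient evaluation, while the stored full-precision value 0.24 receives the update.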

Algorithm 1 Training strategy with block-circulant matrices and weight quantization. PoT() is power-of-two quantization and C is the loss function.
Input: A minibatch of inputs I, targets a_t, previous weights W_t, and previous learning rate η_t.

Efficient Hardware Architecture
As shown in Figure 2, the weight matrix of an arbitrary FC layer can be represented by P_row × P_column circulant matrices, where P_row and P_column denote the row number and column number of blocks in the weight matrix, respectively. According to Equation (4), the MV in the computation of an FC layer can be partitioned into multiplications of circulant sub-matrices and sub-vectors. Each W_ij x_j can be computed independently, and the results of the W_ij x_j in the same row are then accumulated to obtain the corresponding result vector y_i. All the multiplications can be replaced by shift operations using the quantization scheme proposed in Section 3.2. To take full advantage of the quantization scheme, we do not use the FFT-based acceleration method in [24] because: (1) the FFT-based method limits the quantization of weights; in [24], the accuracy is low if the weights are quantized to 4 bits; and (2) if the weights are transformed by FFT, we cannot use shift operations to remove multiplications.

Reorganization of Block-Circulant Matrix
We can directly compute the sub-matrix and sub-vector multiplications when the block size is small. However, when the block size is large, e.g., 256 × 256, directly computing W_ij x_j is inefficient. In addition, because of the variable block size, a configurable hardware architecture is required. Thus, we propose to further divide every sub-matrix into smaller sub-blocks, each of which is still a circulant matrix. A similar chessboard division method is proposed in [27] for each sub-matrix. Figure 2 shows an example of the division, where each sub-matrix is partitioned into four sub-blocks. Specifically, we use a reorganization scheme [27] in each sub-matrix to transform it into a block pseudocirculant matrix [28,29]. The detailed transformation scheme is illustrated as follows.
We assume that a sub-matrix W_ij is partitioned into α^2 sub-blocks, each of which is a γ × γ (γ = k/α) short circulant matrix denoted by P_ij^β, where β = 1, 2, 3, ..., α. Correspondingly, the vector x_j is divided into α small vectors X_j^β, β = 1, 2, ..., α, where X_j^β gathers the entries of x_j with stride α starting from position β. The circulant matrix-vector multiplication (C-MV) y_i = W_ij x_j can then be represented in block pseudocirculant form: the (r, c) sub-block of W_ij is P_ij^β with β = ((c − r) mod α) + 1, premultiplied by the cyclic shift operator matrix S_γ whenever r > c. Each P_ij^β X_j^β product in the resulting expansion (Equation (11)) is a short C-MV and can be computed separately.
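Since the displayed equations of this decomposition did not survive extraction, the following NumPy check is our reconstruction (index and shift-direction conventions are ours): it verifies that the α^2 polyphase sub-blocks, with one extra cyclic shift of the primitive vector below the diagonal playing the role of S_γ, reproduce the original C-MV.

```python
import numpy as np

def circulant(w):
    """k x k circulant matrix whose rows are cyclic shifts of w."""
    return np.array([np.roll(w, i) for i in range(len(w))])

k, alpha = 8, 2
gamma = k // alpha
w = np.arange(1.0, k + 1.0)          # primitive vector of a k x k circulant W_ij
x = np.random.default_rng(1).standard_normal(k)

# Polyphase split of the primitive vector and the input (stride alpha).
u = [w[b::alpha] for b in range(alpha)]
X = [x[b::alpha] for b in range(alpha)]

# (r, c) sub-block: a gamma x gamma circulant built from u[(c - r) mod alpha];
# when r > c, the primitive vector gets one extra cyclic shift (the S_gamma factor).
Y = []
for r in range(alpha):
    acc = np.zeros(gamma)
    for c in range(alpha):
        p = u[(c - r) % alpha]
        if r > c:
            p = np.roll(p, 1)
        acc += circulant(p) @ X[c]
    Y.append(acc)

# Reassemble the interleaved outputs and compare with the direct C-MV.
y = np.zeros(k)
for r in range(alpha):
    y[r::alpha] = Y[r]
assert np.allclose(y, circulant(w) @ x)
```

Each inner `circulant(p) @ X[c]` is one short C-MV of the P_ij^β X_j^β form, which is exactly the unit the B-MV module processes.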

Architecture of Block-MV
Using the reorganization scheme in Section 4.1, we can transform a sub-matrix into smaller sub-blocks. We set 16 × 16 as the basic size of a sub-block, because all block sizes in our experiments were multiples of 16. If the block size is 16 × 16, we can directly compute the C-MV. If the block size is larger than 16, the matrix can be naturally reorganized into several 16 × 16 sub-blocks. As shown in Figure 3, the Block-MV (B-MV) module is the key computing module for processing the short C-MVs P_ij^β X_j^β. Suppose the block size is k = 64 and α = k/16 = 4, so W_ij x_j decomposes into 4 × 4 = 16 short C-MVs. When computing Y_i^1, the sub-vector X_j^1 = (x_j^1, x_j^5, x_j^9, ..., x_j^61) and the primitive vector of P_ij^1, p_ij^1 = (w_ij^1, w_ij^5, w_ij^9, ..., w_ij^61), are fetched first. Then, the primitive vector is shifted to form the other rows of P_ij^1. R_0 and C_0 are parameters denoting the row and column of each sub-block. If R_0 > C_0, p_ij^1 needs an additional left shift to implement the function of S_γ. Then, the activation vector and the corresponding weight vectors are sent to 16 row processing elements (R-PEs) to compute dot products in parallel.
While P_ij^1 X_j^1 is being computed, X_j^2 and p_ij^2 are prepared by a read controller (R-ctrl). After the calculation finishes, P_ij^2 X_j^2 is computed in the B-MV and the result is accumulated in the R-PEs. Next, P_ij^3 X_j^3 and P_ij^4 X_j^4 are computed successively. When the computation of the C-MVs in the first row is finished, the result is sent to an accumulator module. Y_i^2, Y_i^3, and Y_i^4 are computed in the B-MV in the same way. Finally, when the B-MV has finished all the C-MVs of one sub-matrix, it starts processing the following sub-matrix. With the parallel processing of the R-PEs, a sub-block can be processed in one clock cycle, so α^2 clock cycles are required to compute a sub-matrix.

Row Processing Element
As shown in Figure 4, R-PEs are the basic processing elements in the B-MV and each R-PE can compute m MACs in parallel. The output of an R-PE is

y = Σ_{t=1}^{m} sign(w_t) · (x_t << n_t),

where n_t is the power-of-two exponent decoded from the 4-bit weight index and << denotes an arithmetic shift (left for positive exponents, right for negative ones). Table 1 shows the shift operations of a BSU with 4-bit indices. When the sum is computed by the adder trees, the partial sum is stored in a register and accumulated every clock cycle. The reset signal in Figure 4 is asserted when the current partial sum is sent to the accumulator module for further processing.
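The shift-and-add MAC can be sketched as a small integer fixed-point toy (the function name and argument layout are ours; negative exponents become arithmetic right shifts, matching the truncation a hardware barrel shifter would perform):

```python
def shift_mac(acts, exps, signs):
    """Dot product sum(sign_t * act_t * 2^exp_t) using only shifts and adds.
    acts: integer activations; exps: power-of-two exponents n_t;
    signs: +1/-1 (a zero weight is simply skipped by the caller)."""
    acc = 0
    for a, n, s in zip(acts, exps, signs):
        term = a << n if n >= 0 else a >> -n   # Python's >> is arithmetic
        acc += term if s > 0 else -term
    return acc

# 8*2^-1 - 4*2^2 + 16*2^0 = 4 - 16 + 16
result = shift_mac([8, 4, 16], [-1, 2, 0], [1, -1, 1])
```

No multiplier is involved at any point, which is why the R-PE area is dominated by adders.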

Configurable Hardware Accelerator Architecture
The overall architecture of BPCA is shown in Figure 5. To accommodate the adjustable block size and the various network sizes of FC layers, BPCA is designed to provide flexibility for different needs. In general, designing a configurable accelerator involves several factors: P_size, the block size of one sub-matrix; P_row, the row number of sub-matrices in a weight matrix; P_column, the column number of sub-matrices in a weight matrix; and ACC_num, the number of accumulators in the Accumulator module. According to these configuration parameters, the configurable controller (Config-ctrl) controls the computation of the other components in BPCA. Weights, biases, and input activations are stored in three different SRAMs: W-RAM, B-RAM and A-RAM in Figure 5, respectively. The weights and input vectors are sent to the B-MV by the read controller (Read-ctrl) according to P_size. The output vectors of all the sub-matrices in one row are accumulated in the Accumulator module. In the following, we describe the configurable scheme in detail.
For different FC networks, BPCA requires the configurable parameters P_size, P_row and P_column. As described in Section 4.2, the B-MV processes one fixed-size sub-block at a time. According to P_size, the Read-ctrl feeds the 16 × 16 sub-blocks and the corresponding activations to the B-MV for processing. When all sub-blocks in one sub-matrix have been processed, the Config-ctrl moves on to the next sub-matrix in the same sub-matrix row. As shown in Figure 2, BPCA performs a row-wise computing scheme. Thus, according to P_column, the Config-ctrl judges whether the current sub-matrix is the last one in the current sub-matrix row. If it is, the partial sum in the Accumulator module is sent to the Bias-ReLU module for further processing, and the final result vector of that sub-matrix row is output. Because the size of a sub-matrix is adjustable, ACC_num is determined by the largest P_size that BPCA supports. Since the size of the output vector of each sub-matrix row varies, the Config-ctrl selects P_size accumulators in the Accumulator module in every computing procedure.
As the main computing element, the B-MV determines the processing time of an FC layer and the system throughput. In addition, the proposed hardware architecture is scalable, and the system throughput can be improved by adding more B-MVs.

Experiments of The Proposed Compression Method
We verified the effectiveness of the proposed method on three standard datasets: MNIST, CIFAR-10, and ImageNet. We compared the accuracy and storage cost of the compressed neural network models and the original neural network models. Since the effectiveness of applying block-circulant matrices in FC layers has been proven in [24,25], we pay more attention to the effectiveness of power-of-two quantization on block-circulant matrix-based FC layers. To test the influence of quantization precision on prediction accuracy, we selected different values of n1 for S_n and quantized the weights with different precisions. Notably, the aim of our experiments was not to seek state-of-the-art results on these datasets; the accuracy of the original neural networks is used only as a baseline for a fair comparison with the compressed neural networks.

Experiments on MNIST
MNIST is a digit dataset that contains 28 × 28 grey-scale images of ten handwritten digits (0 to 9). In MNIST, there are 60,000 images for training and 10,000 images for testing. The baseline model that we used is a three-layer DNN denoted by Model-A. The matrix sizes and parameters of the FC layers in Model-A are shown in Table 2. We tested the proposed compression methods on Model-A and the training results are given in Table 3. We first applied block-circulant matrices (k = 16) in the FC1 and FC2 layers. Then, we quantized the weights under different quantization precisions. When n1 = -2, the quantization precision is 1/4 and the weights can be represented with 3 bits, so quantization provides a 10.7x compression ratio (compared with 32-bit floating point). The other quantization precisions require 4-bit indices, which reduces the storage cost by eight times. We can see that quantization precision has little effect on the prediction accuracy. As a result, a block size of 16 together with 3-bit quantization can compress the FC layers by 170x with 1% accuracy loss.
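The reported ratios are consistent with a simple calculation: a block size of k stores k times fewer parameters, and q-bit indices reduce the bits per stored weight from 32 to q. A quick check (the reading of the CIFAR-10 figure as k = 256 with 3-bit weights is our inference, not stated above):

```python
def compression_ratio(k, q_bits, fp_bits=32):
    """Combined ratio: k x fewer parameters, fp_bits/q_bits fewer bits each."""
    return k * fp_bits / q_bits

mnist    = compression_ratio(16, 3)    # ~170.7x, reported as ~170x / 171x
imagenet = compression_ratio(16, 4)    # 128.0x, matches the AlexNet result
cifar    = compression_ratio(256, 3)   # ~2731x -- consistent with k = 256, 3 bits
```

The same function shows the trade-off directly: doubling k doubles the ratio, while each extra index bit divides it.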

Experiments on CIFAR
The CIFAR-10 dataset contains 60,000 natural 32 × 32 RGB images covering 10 object classes. There are 50,000 images for training and 10,000 images for testing. In our test, the network consisted of six convolutional layers and three fully-connected layers [18], with ReLU as the activation function. The proposed compression method was used in the FC4 and FC5 layers. We again selected different values of n1 to test the accuracies under different quantization precisions, as shown in Table 4. There is negligible accuracy loss when the weights are quantized to 3 or 4 bits. Thus, the FC layers can be compressed by 2731x with 0.8% accuracy loss.

Experiments on ImageNet
We further evaluated the effectiveness of the proposed compression methods on the ImageNet ILSVRC-2012 dataset, which contains images of 1000 object categories, with roughly 1000 images in each category. AlexNet [5] was used as the baseline model, denoted by Model-C, which consists of five convolutional layers, two FC layers and one final softmax layer. The Top-1 and Top-5 accuracies of Model-C are 56.257% and 79.018%, respectively. The FC layers of Model-C are shown in Table 2. We applied block-circulant weight matrices in the FC7, FC8 and FC9 layers with 16 × 16 sub-matrices, and then quantized the weights with 4 bits. The length of the output vector of FC9 was not a multiple of 16, so we added eight rows at the end of the weight matrix to make it a 4096 × 1008 matrix. The network could then be trained as normal, with the eight additional outputs dropped to obtain the results for the 1000 desired categories. As shown in Table 5, the network can be compressed by 128 times with 1.8% accuracy loss. When power-of-two quantization is used in large-scale networks, there is more accuracy loss than in small networks.

Result Analysis
The power-of-two quantization scheme is effective when combined with block-circulant matrix-based weight representation on all three models. In general, the accuracy of a network is dominated by the block size of the weight matrices. To achieve a trade-off between accuracy and compression ratio, the block size of each layer should be adjusted carefully. Quantization causes minimal accuracy loss on MNIST, CIFAR-10 and ImageNet when the weights are quantized to 4 bits with precision 1/64.

Hardware Implementation Results
We evaluated the performance of the hardware architecture proposed in Section 4. The RTL of the design was implemented in Verilog and synthesized by using the Synopsys Design Compiler (DC) under the TSMC 40 nm CMOS technology. We measured the area, power, and critical path delay of the accelerator. Benchmarks were obtained from Models A-C.

Evaluation Results and Analysis
The synthesis results of the top architecture are presented in Table 6. A frequency of 800 MHz was achieved. In our design, the basic sub-block size that can be processed is 16 × 16, and one B-MV containing 16 R-PEs is used for processing sub-blocks. The synthesis results of the R-PE are also presented in Table 6.
Table 6. Synthesis results of the proposed hardware architecture (logic part) and row processing element.

Top-Level Architecture
The on-chip SRAM consisted of W-RAM, A-RAM, and B-RAM. Notably, since the accelerator needs to process networks of different scales, the SRAM sizes were selected to be suitable for large-scale FC layers. The on-chip SRAM was modeled using CACTI [30] under a 45 nm process. In our design, 16 bits and 4 bits were used to represent activations and weights, respectively. The A-RAM was 9216 × 16 bit = 18 KB for storing activations, where the maximum input length was 9216. The width of the W-RAM was 64 bits and its size was 1.747 MB. Thus, all compressed weights in the FC layers of AlexNet could be stored in the W-RAM. In addition, we used 16 bits to represent biases, and the B-RAM was 4096 × 16 bit = 8 KB, where the maximum number of biases was 4096.

Row Processing Element
The power/area breakdown of the accelerator is presented in Table 7. BPCA occupies 12.72 mm^2, where the memory dominates the chip area with 98.87% of the total. The logic part of BPCA occupies only 0.16 mm^2. The total power consumption is 77.09 mW, of which the memory consumes 24.13 mW. To measure the performance of BPCA, we define the throughput T_n as the number of inputs processed per second. T_n is dominated by the B-MV and is given by

T_n = γ^2 · f_clk / P_size.

In our design, γ = 16 and f_clk = 800 MHz, thus T_n = 256 f_clk / P_size numbers/s. The bigger P_size is, the smaller the input throughput, which means less memory access is required. As for computing performance, the accelerator achieves 409.6 giga operations per second (GOPS) with respect to the equivalent uncompressed layer.
The computation time to process one a × b weight matrix in the steady state is

t = (a · b) / (γ^2 · f_clk).

Nine pipeline stages are introduced in the design, so the actual runtime of one layer requires an additional nine clock cycles.
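The timing model can be checked numerically. Assuming the reconstructed formula t = ab/(γ^2 · f_clk) with γ = 16 (one 16 × 16 sub-block per clock), the largest FC layer of AlexNet (4096 × 9216) lands exactly on the 184.32 µs reported for Layer C:

```python
def layer_time_us(a, b, f_clk_hz=800e6, gamma=16):
    """Steady-state time for an a x b layer: one gamma x gamma sub-block
    per clock, so a*b/gamma^2 cycles. The nine extra pipeline-fill
    cycles are ignored here."""
    cycles = (a * b) / gamma**2
    return cycles / f_clk_hz * 1e6   # microseconds

t_c = layer_time_us(4096, 9216)   # largest AlexNet FC layer
```

Note the time depends only on the matrix dimensions and clock, not on P_size; P_size affects only how often inputs must be re-fetched.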
We selected benchmarks from Model-A, Model-B, and Model-C to evaluate the performance of the accelerator. Layers A-E have different matrix sizes and block sizes. The configuration parameters and computation times are shown in Table 8. It takes 184.32 µs to process Layer C, the largest FC layer in AlexNet. Processing middle-scale layers, such as Layers A and B, takes only a few microseconds. The supported block sizes are 16, 32, 64, 128, and 256, and the basic sub-block size processed by a B-MV can be changed according to different requirements.
To fit larger weight matrices, BPCA can be scaled up by adding more B-MVs. As shown in Figure 6, a large weight matrix can be divided into two parts whose parameters are stored in different weight RAMs; the two parts can then be processed by two B-MVs in parallel. To further improve the system throughput, more B-MVs can be used in the system.

Comparisons with Related Works
Numerous DNN accelerators based on FPGA or ASIC platforms have been proposed. We compared BPCA with representative state-of-the-art ASIC implementation and synthesis results. Table 9 shows the comparisons with accelerators targeting compressed or uncompressed networks. In one PE of EIE, there is on-chip memory in the sparse-matrix read unit, the activation read/write unit and the pointer read unit for decoding the compressed weights; this memory takes 93.22% of the PE's area. The arithmetic unit in a PE performs multiply-accumulate operations using one multiplier and one adder. Running at 800 MHz, the logic part of one PE takes an area of 43,258 µm^2, resulting in an area efficiency of 37 GOPS/mm^2. The basic processing element in BPCA is the R-PE. One R-PE can perform 16 multiply-accumulate operations in parallel using 16 adders and no multipliers. Because there is no additional decoding process, the area of an R-PE is dominated by computing units. As shown in Table 7, at 800 MHz, an R-PE takes 6944 µm^2, resulting in an area efficiency of 3.7 TOPS/mm^2. Without multipliers and decoding units, the R-PE achieves significantly higher area efficiency than the PE of EIE.
In EIE, the sparse weight matrices, encoded in compressed sparse column (CSC) format, can all be stored in on-chip SRAM. With this encoding, additional 4-bit indices and pointer vectors are required to decode the compressed weight matrices. The weight SRAM of the 64 PEs is 8 MB and can hold the 4-bit weights and 4-bit indices of AlexNet's FC layers at a weight sparsity of at least 8x. Each PE has a pointer SRAM, and the total SRAM capacity for pointers is 2 MB. In addition, the activation SRAM in each PE is 2 KB, so the 64 PEs use 128 KB of SRAM for activations. The on-chip SRAM dominates the area in EIE, taking 93.22% of the total. At 800 MHz, EIE (64 PEs) yields a performance of 102 GOPS and an area efficiency of 2.5 GOPS/mm^2 (including SRAM). BPCA also stores the FC layers of AlexNet in SRAM. Using 16 × 16 block-circulant matrices, the weight matrix is compressed by 16x, and power-of-two quantization represents the weights with 4 bits. Storing block-circulant matrix-based layers requires neither indices nor pointers, unlike storing the sparse matrices in EIE. Dropping the indices halves the storage usage, and a further 2 MB is saved by using no pointers. In the accelerator, the weight SRAM is 1.772 MB and the activation SRAM is 18 KB. The on-chip SRAM takes 98.76% of BPCA's area. For the same benchmark, AlexNet, BPCA achieves an area efficiency of 31.8 GOPS/mm^2, which is 13x that of EIE.
Our work also outperforms EIE in energy efficiency. In one PE of EIE, the memory consumes 59.15% of the total power. The 64 PEs use 10.125 MB of SRAM, which consumes 348 mW. Compared with EIE, we use less SRAM and save 14x on memory power. The logic part of one EIE PE consumes 3.74 mW, whereas the power of an R-PE is only 57.5% of that. As shown in Table 9, the energy efficiency of BPCA reaches 5.3 TOPS/W, which is 31x that of EIE. DNPU [31] is a configurable heterogeneous accelerator supporting CNNs, FCLs, and RNNs. We mainly compare BPCA with the FC-layer part of DNPU. The weights in DNPU can be quantized to 4-7 bits using dynamic fixed point. However, without any further compression method, the parameters in DNPU still have to be fetched from external memory (DRAM). With the proposed compression method, our work saves much more DRAM access than DNPU. Besides, a compressed FC layer can be stored in on-chip SRAM, which consumes less energy than storing parameters in off-chip memory. In DNPU, quantization table-based matrix multiplication is used and 99% of the 16-bit fixed-point multiplications can be avoided. At 200 MHz, the energy efficiency of DNPU reaches 1.1 TOPS/W. BPCA likewise uses no multipliers and achieves 5.3 TOPS/W at 800 MHz. Compared with DNPU, our work achieves competitive performance and energy efficiency with less memory access.

Conclusions
In this paper, we present an effective compression method and an efficient configurable hardware architecture for fully-connected layers. Block-circulant matrices and power-of-two quantization are employed to compress the DNNs. Block-circulant matrices provide an adjustable compression ratio for weight matrices by changing the block size. Using power-of-two quantization, we can represent the weights with low precision and reduce the computational complexity by replacing multiplications with shift operations. The approach was applied to three classic datasets and provided remarkable compression ratios. For MNIST, the FC layers can be compressed by 171x with 1% accuracy loss. The results on CIFAR-10 show that the FC layers can be compressed by 2731x with negligible accuracy loss. Our experiment on ImageNet compresses the weight storage of AlexNet by 128x with 1.8% accuracy loss. The proposed compression method outperforms pruning-based methods in compression ratio and avoids the irregular computations of pruned networks.
To process the compressed neural networks efficiently, we developed a configurable, area-efficient and energy-efficient hardware architecture, BPCA. BPCA can be configured to process FC layers with different compression ratios. Without multipliers, considerable chip area is saved in the R-PEs and the system throughput is significantly improved. The proposed design was implemented under the TSMC 40 nm CMOS technology. Using 1.772 MB of SRAM, the accelerator delivers 409.6 GOPS in an area of 12.72 mm^2 while dissipating 77.09 mW. Our work outperforms the pruning-based accelerator EIE [12] in energy efficiency by 31x and in area efficiency by 13x.

Discussion
In this paper, the proposed compression scheme targets FC layers. In the experiments on Model-B and Model-C, we applied the proposed compression method only to the FC layers, and the accuracy of Model-C already shows some reduction on the ImageNet dataset. Since FC layers have a lot of redundancy, quantization causes only minimal accuracy loss there. Convolutional layers, by contrast, are sensitive to quantization, and the proposed method may cause substantial accuracy loss on them. Thus, the quantization scheme should be examined carefully in block-circulant matrix-based CNNs. In future work, we will evaluate the proposed method on convolutional layers and explore better quantization schemes for block-circulant matrix-based CNNs.