# CENNA: Cost-Effective Neural Network Accelerator


## Abstract


## 1. Introduction

## 2. Background and Related Works

#### 2.1. Convolutional Neural Network (CNN)

In a convolution layer, convolution operations between an input feature map (**inFmap**) and a convolution kernel (**cKernel**) are performed to extract features and generate an output feature map (**outFmap**).

#### 2.2. Key Issues in CNN Accelerator Implementation

#### 2.2.1. Computation Complexity

As shown in Figure 2a, the PE-array structure consists of first-in first-out buffers (**FIFO**) and arrays of PEs (**PE array**). Each PE consists of a multiplier and an adder. In the PE-array structure, typically, a buffer called the **Global Buffer** loads the data from an off-chip dynamic random-access memory (DRAM). The loaded data are sent to the FIFO, which distributes them to the PE array. The multiplications between the feature map and the convolution kernel can be executed in parallel if numerous PEs are available. Therefore, the implementations in [9,10,11] employ many PEs to achieve highly parallel computation.
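The per-PE behavior described above can be sketched as follows (illustrative Python; the class and function names are ours, not from any specific accelerator):

```python
# Minimal sketch of the PE dataflow: each PE contains one multiplier and
# one adder, i.e., it performs one multiply-accumulate (MAC) per step.

class PE:
    def __init__(self):
        self.acc = 0

    def mac(self, x, w):
        # one multiplier + one adder per PE
        self.acc += x * w
        return self.acc

def pe_array_dot(xs, ws):
    """Distribute element pairs over one PE each; the multiplications are
    independent, so in hardware they would execute in parallel."""
    pes = [PE() for _ in xs]
    return sum(pe.mac(x, w) for pe, x, w in zip(pes, xs, ws))
```

A dot product over n element pairs thus maps onto n PEs, which is why the implementations cited above scale the PE count to increase parallelism.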

The reduction-tree structure, shown in Figure 2b, consists of a **Reduction Tree** (including multipliers and adders), a buffer for distributing input values (**Distributor**), and a **Prefetch Buffer**. As in the PE-array structure, data are loaded from the off-chip DRAM and stored in the prefetch buffer. The distribution buffer takes data from the prefetch buffer and distributes input values to the multipliers. To fully utilize the parallelism in this structure, numerous multipliers and a large reduction tree are required [12,13].

The computation complexity of O(n^{3}) in naïve matrix multiplication is reduced to O(n^{2.807}) in Strassen's multiplication and O(n^{2.795}) in Winograd's multiplication. For example, to perform a 2 × 2 matrix multiplication, naïve multiplication requires eight multiplications, but both Strassen's method and Winograd's method require only seven [20]. Although the number of multiply operations is reduced in Strassen's and Winograd's methods, the number of add/sub operations and the number of computation steps in a matrix multiplication increase. This means that more complicated add/sub logic circuits and more memory transactions to store and retrieve intermediate results are needed.
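The seven-multiplication scheme for the 2 × 2 case can be written out concretely. The sketch below uses the standard textbook Strassen formulas (labeled M1–M7 here; the labeling may differ from the paper's figures):

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices (lists of lists) with 7 multiplications
    instead of the 8 required by the naive method."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # 7 multiplications on sums/differences of the inputs
    M1 = (a11 + a22) * (b11 + b22)
    M2 = (a21 + a22) * b11
    M3 = a11 * (b12 - b22)
    M4 = a22 * (b21 - b11)
    M5 = (a11 + a12) * b22
    M6 = (a21 - a11) * (b11 + b12)
    M7 = (a12 - a22) * (b21 + b22)
    # Recombination: note the uneven add/sub counts per result element,
    # which is the source of the irregular arithmetic steps discussed below
    c11 = M1 + M4 - M5 + M7
    c12 = M3 + M5
    c21 = M2 + M4
    c22 = M1 - M2 + M3 + M6
    return [[c11, c12], [c21, c22]]
```

Counting the add/sub operations above (10 before the multiplies, 8 after) illustrates the trade-off: one multiplication is saved at the cost of many extra additions and intermediate values.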

Among the four elements of the product matrix in Strassen's method, C_{11}, C_{12}, C_{21}, and C_{22}, computing C_{11} requires four arithmetic steps, whereas computing C_{12} requires three steps. However, as shown in Figure 3b and Figure 4b, in naïve multiplication, each result requires the same four arithmetic steps. The more arithmetic steps there are, the more memory they require, which leads to additional power consumption. Furthermore, the larger the matrix, the more irregular the arithmetic steps become. For example, to calculate a 4 × 4 Strassen's multiplication, the total number of arithmetic steps is eight, and computing an element requires six to eight steps depending on its position in the final product matrix. In contrast, naïve multiplication requires the same three steps for every element in the final result. This means that, as the size of the matrix increases, Strassen's multiplication requires more memory than naïve multiplication, and the steps involved in computing the result become more irregular. The performance and the hardware cost of a pipelined implementation are approximately determined by how appropriately the pipeline stages are divided [24,25]. In addition, if the delay in each pipeline stage is uneven, such irregularity causes a significant increase in complexity and energy inefficiency [26]. Thus, it is not straightforward to determine the pipeline stages and balance the delay of each stage given the irregular arithmetic steps.

#### 2.2.2. Data Reuse

In the PE-array structure, data loaded from the off-chip memory are stored in the **Global Buffer**, and this buffer is shared among PEs to avoid large amounts of data transfers from the off-chip memory. As shown in Figure 2a, each PE exchanges data with other PEs via a bus, and each is connected to the off-chip memory controller (**Memory Controller, MC**). Each PE includes not only the logic circuits for computation but also a controller to communicate with other PEs. In the communication between PEs, each PE sends and receives the data of a feature map and a convolution kernel. If the communication between PEs becomes complex, the hardware cost increases. The implementation cost of the PE-array-based method is typically very high [9,11]. The implementation in [9] shows that more than half of the power consumption is derived from hardware logic blocks for data reuse.

In the reduction-tree structure, two types of logic blocks (**Communicator and Distributor**) are utilized for data reuse. To reduce data movement from DRAM, this architecture typically stores reused data in a prefetch buffer (**Prefetch Buffer**) and distributes it to a reduction tree (**Reduction Tree**) using distribution logic circuits, as shown in Figure 2b. The reduction tree has a fixed data flow that performs multiplications in parallel and accumulates the results of the multiplications into an accumulator. In this fixed flow, data reuse is enabled by a communicator logic circuit (**Communicator**) that allows data reuse between multipliers, as shown in Figure 2b. The implementations in [10,11] employed a logic block for data sharing between multipliers to enable flexible data sharing, but heavy power consumption and a large silicon area were observed.

The distribution logic (**Distributor**) provides both the feature map data and the convolution kernel data to the multipliers. This makes it possible to reuse data either by pairing a single feature map with multiple kernels or multiple feature maps with a single kernel. Therefore, the distributor logic holds a lot of data, which leads to high power consumption and a large silicon area. In addition, the implementation in [12] employs both reuse schemes (**Communicator and Distributor**), and it suffers from excessive power consumption and a large chip area.

## 3. Cost-Effective Neural Network Accelerator (CENNA) Architecture

#### 3.1. Proposed Matrix Multiplication Engine

The proposed cost-centric multiplication partitions a 4 × 4 matrix into 2 × 2 sub-matrices and computes seven sub-matrix products (M_{1}–M_{7}), and those seven sub-matrices are added and subtracted in the same way as in Strassen's method, as shown in Figure 5b. The cost-centric multiplication operates in three arithmetic steps: (i) summation and difference of 2 × 2 sub-matrices (e.g., A_{11} + A_{22}, B_{12} − B_{22}); (ii) naïve multiplication of the summations and differences to obtain M_{1}–M_{7}; and (iii) summations and differences of some M_{i}'s to compute the result (C_{11}, C_{12}, C_{21}, and C_{22}).
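The three steps above can be sketched end to end. This is an illustrative Python reconstruction based on the description: the sub-matrix combinations follow the standard Strassen formulas, while every sub-matrix product M_{1}–M_{7} is a naïve 2 × 2 multiplication (8 multiplies each, so 7 × 8 = 56 multiplications in total); the helper names are ours:

```python
def madd(X, Y):  # element-wise 2x2 addition
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def msub(X, Y):  # element-wise 2x2 subtraction
    return [[X[i][j] - Y[i][j] for j in range(2)] for i in range(2)]

def mmul(X, Y):  # naive 2x2 multiplication: 8 multiplies, 4 adds
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def block(M, r, c):  # 2x2 sub-matrix of a 4x4 matrix at row r, column c
    return [row[c:c + 2] for row in M[r:r + 2]]

def cost_centric_4x4(A, B):
    A11, A12, A21, A22 = block(A,0,0), block(A,0,2), block(A,2,0), block(A,2,2)
    B11, B12, B21, B22 = block(B,0,0), block(B,0,2), block(B,2,0), block(B,2,2)
    # Steps (i)+(ii): sums/differences of sub-matrices, then naive products
    M1 = mmul(madd(A11, A22), madd(B11, B22))
    M2 = mmul(madd(A21, A22), B11)
    M3 = mmul(A11, msub(B12, B22))
    M4 = mmul(A22, msub(B21, B11))
    M5 = mmul(madd(A11, A12), B22)
    M6 = mmul(msub(A21, A11), madd(B11, B12))
    M7 = mmul(msub(A12, A22), madd(B21, B22))
    # Step (iii): recombine the M_i's into the four result blocks
    C11 = madd(msub(madd(M1, M4), M5), M7)
    C12 = madd(M3, M5)
    C21 = madd(M2, M4)
    C22 = madd(madd(msub(M1, M2), M3), M6)
    top = [C11[i] + C12[i] for i in range(2)]
    bot = [C21[i] + C22[i] for i in range(2)]
    return top + bot
```

Because each M_{i} is itself a naïve product, the computation keeps the regular three-step structure of naïve multiplication while retaining Strassen-style recombination at the block level.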

#### 3.2. CENNA Architecture

Figure 6 shows the CENNA architecture, which consists of a memory block (**64 KB static random-access memory, SRAM**) and the proposed matrix multiplier (**Matrix Engine**). The memory block stores convolution kernels and feature maps. Data from the external DRAM are stored in the memory block and are sent to the **Matrix Engine**. The **Matrix Engine** consists of components for the proposed matrix multiplication (**1st Addition**, **M_{1}–M_{7}**, and **2nd Addition**) and those for convolution operations (**fMap**, **cKernel**, **Accumulator**, **ReLU**, and **pSum**).

The **1st Addition** unit carries out the first step, which involves the summation and difference of the 2 × 2 sub-matrices (e.g., A_{11} + A_{22} and B_{12} − B_{22}). Each **M** unit in Figure 6 carries out naïve multiplication of the summations and differences produced by the **1st Addition** unit to obtain **M_{1}–M_{7}**, and the **2nd Addition** unit carries out summations and differences of some of the **M_{i}**'s to compute the results (C_{11}, C_{12}, C_{21}, and C_{22}). Each **M** unit contains 8 multipliers and 4 adders. Because CENNA includes 7 **M** units (**M_{1}–M_{7}**), a total of 56 multipliers operate in parallel.

The **fMap** buffer and the **cKernel** buffer store a portion of the feature map and the convolution kernel, respectively. A detailed description of **fMap** and **cKernel** is provided in Section 3.3. The **Accumulator** unit accumulates the results of matrix multiplication to obtain an output feature map of the CNN. The **pSum** buffer stores the result of the **Accumulator** unit, and the result is passed to either the **ReLU** unit or the memory block depending on whether an output feature map is completely computed. The **ReLU** unit performs the rectified linear unit (ReLU) function. When an output feature map is completely generated, the values go through the **ReLU** unit and are eventually stored in the **64K SRAM** block.

#### 3.3. Convolution Operation in CENNA

The convolution operation in CENNA proceeds in four steps. First, data are loaded into the buffers (**fMap**, **cKernel**): the feature map is stored as a 7 × 7 tile, and four types of convolution kernels are stored as 4 × 1 vectors. In CENNA, the kernel window moves along the 7 × 7 input feature map stored in the buffer. We discuss the loading process in detail in Section 3.4. Second, once data are loaded, a 4 × 4 matrix multiplication is performed. In the matrix multiplication between a feature map and a convolution kernel, the result is a partial sum of the output feature map. Third, partial sums are combined to produce an output feature map. Finally, the result is stored in the off-chip memory (**External Memory**) after passing through the activation function.
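The four steps can be sketched for one 7 × 7 tile and a set of 4 × 4 kernels. This is an illustrative Python sketch with the partial sums written in direct form (Section 3.4 describes how CENNA actually maps them onto matrix multiplications); accumulation across input channels and tiles is omitted for brevity:

```python
def relu(x):
    # rectified linear unit applied once the output value is complete
    return x if x > 0 else 0

def convolve_tile(tile, kernels):
    """Slide a 4x4 kernel window over a 7x7 tile (stride 1), accumulate the
    partial sums for each window position, then apply ReLU before storing.
    `tile` is 7x7; each kernel in `kernels` is 4x4."""
    out = []
    for W in kernels:                        # one output channel per kernel
        ch = [[0] * 4 for _ in range(4)]
        for i in range(4):                   # window row position
            for j in range(4):               # window column position
                psum = sum(tile[i + r][j + c] * W[r][c]
                           for r in range(4) for c in range(4))
                ch[i][j] = relu(psum)
        out.append(ch)
    return out
```

A 7 × 7 tile with a 4 × 4 kernel at stride 1 yields a 4 × 4 grid of window positions, which is why each tile produces a 4 × 4 block of the output feature map per kernel type.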

#### 3.4. Convolution Operation Using Matrix Multiplication

Figure 7 illustrates the convolution operation using matrix multiplication between an input feature map (x_{i,j}) and four types of convolution kernels of size 4 × 1 (w^{t}_{i,j}–w^{t}_{i,j+3}), where i, j, and t indicate the row position, the column position, and the kernel type, respectively. In the first computation, the result (p^{t1}_{1,1}) is a partial sum of the output feature map that pertains to the first row of the input feature map (x_{1,1}–x_{1,4}) and the first type of convolution kernel (w^{t1}_{1,1}–w^{t1}_{1,4}). The second computation (p^{t2}_{1,1}) is a partial sum of the output feature map pertaining to the first row of the input feature map (x_{1,1}–x_{1,4}) and the second type of convolution kernel (w^{t2}_{1,1}–w^{t2}_{1,4}), and so on. The computation between the second row of the input feature map (x_{2,1}–x_{2,4}) and the first type of convolution kernel (w^{t1}_{1,1}–w^{t1}_{1,4}) generates a partial sum (p^{t1}_{2,1}) that corresponds to the convolution operation when the kernel window moves to the second row of the feature map. The same process is repeated, and partial sums are thus generated (p^{t1}_{1,1}–p^{t4}_{4,1}).

After the matrix multiplication between the input feature map (x_{1,1}–x_{4,4}) and the components of the first row of the convolution kernels (w^{t}_{1,1}–w^{t}_{1,4}), the input feature map when the kernel window is moved down by one row (x_{2,1}–x_{5,4}) is multiplied by the values corresponding to the second row of the convolution kernels (w^{t}_{2,1}–w^{t}_{2,4}). Similarly, the remaining partial sums are calculated in the same way, as shown in Figure 8. Finally, all partial sums are combined to generate the output feature maps.
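The row-by-row partial-sum scheme can be checked against a direct convolution. The sketch below uses a single kernel type for clarity (extending to the four types t1–t4 adds one outer loop); names are illustrative:

```python
def conv_column_via_matmul(X, W):
    """Compute the first output column o[0..3] of a valid 4x4 convolution
    over a 7x7 feature-map tile X as four partial-sum passes, one per
    kernel row, mirroring the row-by-row scheme described in the text."""
    out = [0] * 4
    for r in range(4):                       # kernel row r
        for i in range(4):                   # output row position i
            # partial sum: row (i + r) of the feature map times kernel row r;
            # the four i-values are independent and can run in parallel
            out[i] += sum(X[i + r][c] * W[r][c] for c in range(4))
    return out
```

Each pass reuses one kernel row against four consecutive feature-map rows, which is exactly the reuse opportunity exploited in the next paragraph.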

The matrix multiplications between the first row of a convolution kernel (w^{t1}_{i,j}–w^{t1}_{i,j+3}) and four rows of an input feature map (x_{i,j}–x_{i+3,j+3}) reuse such values. That is, the results of matrix multiplication (p^{t1}_{1,1}–p^{t1}_{4,1}) are computed as one convolution kernel moves to the next row of the input feature map. In addition, these computations can be conducted in parallel.

#### 3.5. Tiling-Based Data Reorganization

A data tiling method (**DT**) is proposed in CENNA. The proposed tile-based data management partitions an input feature map into tiles of size 7 × 7 and a convolution kernel into four tiles of size 4 × 4, respectively. This approach simplifies the dataflow and reduces hardware implementation complexity by accessing the on-chip memory with a uniform size. To implement the proposed **DT** method, CENNA employs an on-chip memory hierarchy that processes feature maps in several stages.
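The partitioning can be sketched as follows. The tile stride is an assumption on our part: stride 4 is chosen so that each 7 × 7 tile yields a full 4 × 4 grid of 4 × 4 kernel windows, and adjacent tiles overlap so border data can be reused rather than reloaded:

```python
def tile_fmap(fmap, tile=7, stride=4):
    """Partition a feature map (list of lists) into overlapping tile x tile
    blocks. With tile=7 and stride=4, adjacent tiles share a 3-column (or
    3-row) overlap; this stride is an illustrative assumption, not the
    paper's exact address mapping."""
    H, W = len(fmap), len(fmap[0])
    tiles = []
    for r0 in range(0, H - tile + 1, stride):
        for c0 in range(0, W - tile + 1, stride):
            tiles.append([row[c0:c0 + tile] for row in fmap[r0:r0 + tile]])
    return tiles
```

The overlap between adjacent tiles is what allows a new tiled block to load "only the newly needed data", as described below for BLK_{1}.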

As shown in Figure 9a, two adjacent kernel windows (e.g., the windows generating a^{t}_{1,1} and a^{t}_{2,1}) have 12 overlapped elements in the feature map. Notably, most overlapped elements can be reused for the next kernel window if we reorganize the overlapped elements to be adjacent. As shown in Figure 9b, a 7 × 7 tiled block of an input feature map (**BLK_{0}**) is stored in the **fMap** buffer, and four types of 4 × 4 tiled convolution kernels are stored in the **cKernel** buffer. Next, a 4 × 4 kernel window moves across the current **fMap** window (**BLK_{0}**) and generates partial sums, which are stored in the **pSum** buffer. Through DT, the overlapped elements between adjacent 4 × 4 kernel windows can be reused when generating an output feature map (a^{t}_{1,1}, a^{t}_{2,1} and d^{t}_{3,1}, d^{t}_{4,1}), as depicted in Figure 9b. In addition, for a new 7 × 7 tiled block of the input feature map (**BLK_{1}**), only the newly needed data are loaded.

As shown in Figure 10, the execution pipeline of CENNA consists of the **Load**, **Matrix Multiplication**, **Convolution Operation**, and **Store** stages. As explained in Section 3.1, the **Matrix Multiplication** stage is further divided into three stages, which makes the entire pipeline a 6-stage pipeline. During the **Load** stage, a 7 × 7 tiled block (e.g., **BLK_{0}**) of an input feature map is fetched. In the **Matrix Multiplication** and **Convolution Operation** stages, CENNA carries out the convolution operation on the loaded 7 × 7 tiled block. The computed results (a^{t}_{1,1(4)}) are stored in the **pSum** buffer during the **Store** stage. Eventually, after repeatedly processing all the tiled blocks, the final result (a^{t}_{1,1(5)}) is obtained through the ReLU operation. It should be noted that when the pipeline is fully filled, five elements in the output feature maps are computed in parallel with only one set of execution units.

## 4. Hardware Implementation

## 5. Evaluation

#### 5.1. Latency and Throughput of CENNA

#### 5.2. Performance Comparison with the State-of-the-Art Accelerators

We compared the accelerators in terms of **real throughput** and **peak throughput**. In addition, we compared the **frame rate** (frames/s) when the accelerator is running at **real throughput**.

#### 5.2.1. Processing Elements (PE) Array-Based Accelerators

We compared CENNA with three PE-array-based accelerators that employ different dataflows (**row stationary**, **2D-SIMD**, and **1D chain**). The Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks (Eyeriss) [9] is a well-known PE-array accelerator that offers an efficient dataflow model called **row stationary**. Row stationary is a way to increase the reusability of data; it is designed to maximize data reusability inside a PE. As shown in Table 7, Eyeriss consumes 4.99 times more power and takes 8.88 times more silicon area than CENNA. More than 45% of its power consumption is in the PE network block, such as the clock network and the PE controller circuit. In terms of **peak throughput**, Eyeriss and CENNA are similar. However, there is a large gap between **peak throughput** and **real throughput** when **real throughput** is measured while executing convolution operations. This is because it takes a long time to pass data to each PE in the PE array. In Eyeriss, the time to transfer data to each PE differs per convolution layer; in the worst case, the time for data transfer is about half the total execution time. The Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS (ConvNet) [10] employs variable precision for each convolution layer to reduce power consumption. It includes a special PE that can compute results with variable precision, and its PE-array architecture employs a dataflow structure called **2D-SIMD**. 2D-SIMD can exploit parallelism using a PE array configured in a mesh topology [32] and takes advantage of computing 2D pixels of an image in parallel [33]. The **peak throughput** and the **real throughput** of ConvNet are much better than those of CENNA; however, its **frame rate** is lower than CENNA's. ConvNet's 2D-SIMD is optimized for 16 × 16 matrix multiplication. Therefore, when computing small kernels on models like VGG-16, its average multiply-and-accumulate (MAC) utilization rate is less than 55%. The energy-efficient 1D chain architecture for accelerating deep convolutional neural networks (Chain-NN) is an implementation that reduces the communication overhead between PEs using a **1D chain** [11]. In conventional communication structures in PE-array-based accelerators, one PE is connected to multiple PEs to maximize data reuse. In **1D chain** communication, however, each PE is connected to only one adjacent PE, like a chain. Compared to other methods, its hardware cost is very high, but it can achieve high computing performance. In terms of **peak throughput**, **real throughput**, and **frame rate**, Chain-NN is much better than CENNA. Because Chain-NN focuses on maximizing computing performance at the expense of high hardware cost, it uses more SRAM and operators than the other accelerators. It achieves 11.69 times better **real throughput** than CENNA, but it requires 7.75 times more silicon area and 11.99 times more power. When compared based on the 65 nm technology, Chain-NN requires 17.98 times more silicon area and 27.82 times more power than CENNA.

#### 5.2.2. Reduction Tree-Based Accelerators

We also compared CENNA with two reduction-tree-based accelerators that employ different data reuse schemes (**communicator**, **filter bank**). The multiply-accumulate engine with reconfigurable interconnects (MAERI) [12] allows data to be reused through a logic block called the **communicator** that enables communication between multipliers and adders. It employs switchable adder and multiplier logic blocks, and if there is an opportunity for data reuse, the data are forwarded to adjacent operators. For example, in the case of convolution-kernel reuse, a switchable multiplier forwards the weight of a convolution kernel to an adjacent multiplier. However, this requires a large amount of power: more than 50% of the power is dissipated in the switchable logic blocks. In addition, these logic blocks take up more than a quarter of the total silicon area. As shown in Table 8, the **real throughput** and **frame rate** of MAERI are similar to those of CENNA. However, MAERI consumes 7.82 times more power than CENNA. Origami [13] employs 12-bit precision computation and includes hardware logic blocks that can compute 7 × 7 convolution kernels, called sum-of-product (SOP) units. To reuse data in Origami, four SOPs share a register called the **filter bank**, which holds the weights of a convolution layer. Therefore, in Origami, the area cost and power consumption due to the **filter bank** are quite high. Its **real throughput** is similar to that of CENNA, but it consumes 1.96 times more power and holds more than 4 KB of data in registers, whereas CENNA stores convolution kernels in SRAM and uses registers minimally.

#### 5.3. Efficiency Comparison with the State-of-the-Art Accelerators

We evaluated three efficiency metrics. First, power efficiency is defined as **real throughput** (tera-operations per second, TOPS) per watt. Second, area efficiency is defined as **real throughput** (giga-operations per second, GOPS) per area (mm^{2}). Lastly, **overall efficiency** is defined as **real throughput** per watt per unit area. In **power efficiency**, CENNA turns out to be the most efficient accelerator, up to 12.1 times more efficient than the compared designs. In **area efficiency**, Chain-NN outperforms the other accelerators; it is 1.58 times more area-efficient than CENNA. However, when implemented in the 65 nm technology, the area efficiency of CENNA is 1.47 times better than that of Chain-NN. CENNA also achieves 2.33 times better area efficiency than Origami. When the **overall efficiency**, which takes real throughput, power consumption, and silicon area into account, is compared, CENNA is at least 4.63 times and up to 88 times better than the compared implementations. Therefore, we conclude that the proposed CENNA architecture is very cost-effective.
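The three metrics for CENNA can be reproduced directly from its Table 7 figures (57.17 GOPS real throughput, 47.34 mW, 1.38 mm^{2}); the variable names below are ours:

```python
# CENNA figures from Table 7
real_gops = 57.17      # real throughput in GOPS
power_w = 47.34e-3     # power in watts
area_mm2 = 1.38        # silicon area in mm^2

# Power efficiency: real throughput (TOPS) per watt
power_eff_tops_per_w = (real_gops / 1000) / power_w      # ~1.21 TOPS/W

# Area efficiency: real throughput (GOPS) per mm^2
area_eff_gops_per_mm2 = real_gops / area_mm2             # ~41.43 GOPS/mm^2

# Overall efficiency: real throughput per watt per mm^2
overall = power_eff_tops_per_w / area_mm2                # ~0.88 TOPS/W/mm^2
```

These values match the CENNA column of Table 9 (1.21 TOPS/W, 41.43 GOPS/mm^{2}, 0.88 TOPS/W/mm^{2}).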

## 6. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012; pp. 1097–1105.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 91–99.
- Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
- Abdel-Hamid, O.; Mohamed, A.-R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. **2014**, 22, 1533–1545.
- Lee, G.; Jeong, J.; Seo, S.; Kim, C.; Kang, P. Sentiment classification with word attention based on weakly supervised learning with a convolutional neural network. arXiv **2017**, arXiv:1709.09885.
- Cong, J.; Xiao, B. Minimizing computation in convolutional neural networks. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2014; pp. 281–290.
- Motamedi, M.; Fong, D.; Ghiasi, S. Fast and energy-efficient CNN inference on IoT devices. arXiv **2016**, arXiv:1611.07151.
- Chen, Y.-H.; Emer, J.; Sze, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News; Association for Computing Machinery: New York, NY, USA, 2016; pp. 367–379.
- Moons, B.; Verhelst, M. An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS. IEEE J. Solid State Circuits **2016**, 52, 903–914.
- Wang, S.; Zhou, D.; Han, X.; Yoshimura, T. Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks. arXiv **2017**, arXiv:1703.01457.
- Kwon, H.; Samajdar, A.; Krishna, T. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In ACM SIGPLAN Notices; Association for Computing Machinery: New York, NY, USA, 2018; pp. 461–475.
- Cavigelli, L.; Benini, L. Origami: A 803-GOp/s/W convolutional network accelerator. IEEE Trans. Circuits Syst. Video Technol. **2016**, 27, 2461–2475.
- Kyrkou, C.; Plastiras, G.; Theocharides, T.; Venieris, S.I.; Bouganis, C.-S. DroNet: Efficient convolutional neural network detector for real-time UAV applications. arXiv **2018**, arXiv:1807.06789v1.
- Guo, T. Cloud-based or on-device: An empirical study of mobile deep inference. In Proceedings of the 2018 IEEE International Conference on Cloud Engineering (IC2E), Orlando, FL, USA, 17–20 April 2018; pp. 184–190.
- Tang, J.; Liu, S.; Yu, B.; Shi, W. PI-Edge: A low-power edge computing system for real-time autonomous driving services. arXiv **2018**, arXiv:1901.04978.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. arXiv **2016**, arXiv:1608.06993.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Canziani, A.; Paszke, A.; Culurciello, E. An analysis of deep neural network models for practical applications. arXiv **2016**, arXiv:1605.07678.
- Wu, S.; Li, G.; Chen, F.; Shi, L. Training and inference with integers in deep neural networks. arXiv **2018**, arXiv:1802.04680.
- Horowitz, M. Energy table for 45 nm process. Stanford VLSI Wiki, 2014. Available online: https://en.wikipedia.org/wiki/VLSI_Project (accessed on 1 December 2019).
- Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021.
- Merchant, F.; Vatwani, T.; Chattopadhyay, A.; Raha, S.; Nandy, S.; Narayan, R. Accelerating BLAS on custom architecture through algorithm-architecture co-design. arXiv **2016**, arXiv:1610.06385.
- Hamilton, K.C. Optimization of energy and throughput for pipelined VLSI interconnect. In UC San Diego; California Digital Library: Oakland, CA, USA, 2010.
- Zyuban, V.; Brooks, D.; Srinivasan, V.; Gschwind, M.; Bose, P.; Strenski, P.N.; Emma, P.G. Integrated analysis of power and performance for pipelined microprocessors. IEEE Trans. Comput. **2004**, 53, 1004–1016.
- Sartori, J.; Ahrens, B.; Kumar, R. Power balanced pipelines. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, New Orleans, LA, USA, 25–29 February 2012; pp. 1–12.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556.
- Mackey, L.W.; Jordan, M.I.; Talwalkar, A. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2011; pp. 1134–1142.
- Balasubramonian, R.; Kahng, A.B.; Muralimanohar, N.; Shafiee, A.; Srinivas, V. CACTI 7: New tools for interconnect exploration in innovative off-chip memories. ACM Trans. Archit. Code Optim. **2017**, 14, 14.
- Wu, S.; Wang, G.; Tang, P.; Chen, F.; Shi, L. Convolution with even-sized kernels and symmetric padding. arXiv **2019**, arXiv:1903.08385.
- Yao, S.; Han, S.; Guo, K.; Wangni, J.; Wang, Y. Hardware-friendly convolutional neural network with even-number filter size. Comput. Sci. **2016**. Available online: https://pdfs.semanticscholar.org/10b9/92e86ee96cd4c5d73f3d667059beb4749ce3.pdf (accessed on 1 December 2019).
- Cucchiara, R.; Piccardi, M. DARPA benchmark image processing on SIMD parallel machines. In Proceedings of the 1996 IEEE Second International Conference on Algorithms and Architectures for Parallel Processing, Singapore, 11–13 June 1996; pp. 171–178.
- Kim, K.; Choi, K. SoC architecture for automobile vision system. In Algorithm & SoC Design for Automotive Vision Systems; Springer: Berlin/Heidelberg, Germany, 2014; pp. 163–195.

**Figure 1.** Convolution between convolution kernels and an input feature map. (**a**) Illustration of the convolution operation between one feature map and T types of convolution kernels; (**b**) pseudo code of the convolution layer.

**Figure 2.** Two types of hardware accelerators: (**a**) processing element (PE) array structure and (**b**) reduction tree structure.

**Figure 5.** Proposed 4 × 4 matrix multiplication method: (**a**) partitioning of a 4 × 4 matrix into 2 × 2 sub-matrices; (**b**) proposed matrix multiplication: naïve (×) and Strassen (•).

**Figure 7.** Convolution operation inside the CENNA architecture: (**a**) pseudo code of the convolution layer inside CENNA; (**b**) equation of convolution using matrix multiplication; (**c**) illustration of matrix multiplication.

**Figure 9.** Tiling-based data reuse in CENNA: (**a**) example of overlapped elements in adjacent windows; (**b**) data tiling and memory hierarchy used for CENNA.

**Figure 10.** Execution flow of the 6-stage pipeline in CENNA (6 stages: Load, 3-stage Matrix Multiplication, Convolution Operation, and Store).

**Table 1.** Computation and parameter size requirements in convolutional neural networks [19].

| CNN Model | AlexNet * | VGG-16 ** | GoogLeNet *** | ResNet-152 **** |
|---|---|---|---|---|
| Operation (MAC) ^{1} | 0.73 M | 16 G | 2 G | 11 G |
| Parameter Mem (Byte) ^{2} | 233 M | 528 M | 51 M | 220 M |

Multiply-and-accumulate operation ^{1}; total weights including convolution kernel and bias ^{2}. Neural networks designed by Alex Krizhevsky */the Visual Geometry Group **/Google ***; residual network ****.

**Table 2.** Rough relative cost in 45 nm, 0.9 V, from Eyeriss [20].

| Operation | Multiplier Energy (pJ) | Adder Energy (pJ) | Multiplier Area (µm^{2}) | Adder Area (µm^{2}) |
|---|---|---|---|---|
| 8-bit INT ^{1} | 0.2 | 0.03 | 282 | 36 |
| 16-bit FP ^{2} | 1.1 | 0.4 | 1640 | 1360 |
| 32-bit FP ^{2} | 3.7 | 0.9 | 7700 | 4184 |

Integer operation ^{1}; floating-point operation ^{2}.

| | Cycle/Time (s) (10^{9} 4 × 4 Matrices) | Maximum Frequency | # of MUL/ADD | Area | Power |
|---|---|---|---|---|---|
| Naïve ^{1} | (10^{9} + 2)/2.00 | 500 MHz | 64/48 | 0.270 mm^{2} | 21.45 mW |
| Strassen ^{2} | (10^{9} + 3)/3.16 | 370 MHz | 49/198 | 0.203 mm^{2} | 21.70 mW |
| Cost-Centric ^{3} | (10^{9} + 2)/2.00 | 500 MHz | 56/100 | 0.242 mm^{2} | 18.58 mW |

3-stage (**M**-A-A) ^{1}, 4-stage (A-**2 × 2 Strassen**-A-A) ^{2}, 3-stage (A-**2 × 2 Naïve**-A) ^{3}; **bold** marks the critical path.

| Design | Naïve | Strassen | CENNA |
|---|---|---|---|
| # of MUL/ADD | 64/108 | 49/258 | 56/160 |
| Frequency | 500 MHz | 370 MHz | 500 MHz |
| Local Buffer ^{1} | 448 B | 544 B | 400 B |
| SRAM | 64 KB | 64 KB | 64 KB |
| Area | 1.411 mm^{2} | 1.345 mm^{2} | 1.384 mm^{2} |
| Power | 50.191 mW | 50.462 mW | 47.344 mW |

Local buffers for intermediate results (M_{1}–M_{7}, 2nd Addition, and Accumulator) ^{1}.

| Layer | Input (W/H/C) ^{1} | Output (W/H/C) | # of MAC (Giga) | Total Time (ms) ^{2} | Memory Access Time (ms) |
|---|---|---|---|---|---|
| Conv1-1 | 224 × 224 × 3 | 224 × 224 × 64 | 0.16 | 4.86 | 0.29 |
| Conv1-2 | 224 × 224 × 64 | 224 × 224 × 64 | 2.62 | 103.61 | 6.09 |
| Conv2-1 | 112 × 112 × 64 | 112 × 112 × 128 | 1.26 | 47.85 | 1.44 |
| Conv2-2 | 112 × 112 × 128 | 112 × 112 × 128 | 2.2 | 95.71 | 2.88 |
| Conv3-1 | 56 × 56 × 128 | 56 × 56 × 256 | 1.42 | 43.16 | 0.66 |
| Conv3-2 | 56 × 56 × 256 | 56 × 56 × 256 | 2.84 | 86.35 | 1.31 |
| Conv3-3 | 56 × 56 × 256 | 56 × 56 × 256 | 2.84 | 86.36 | 1.31 |
| Conv4-1 | 28 × 28 × 256 | 28 × 28 × 512 | 1.21 | 38.26 | 0.31 |
| Conv4-2 | 28 × 28 × 512 | 28 × 28 × 512 | 2.42 | 76.52 | 0.63 |
| Conv4-3 | 28 × 28 × 512 | 28 × 28 × 512 | 2.42 | 76.52 | 0.63 |
| Conv5-1 | 14 × 14 × 512 | 14 × 14 × 512 | 0.42 | 21.05 | 0.63 |
| Conv5-2 | 14 × 14 × 512 | 14 × 14 × 512 | 0.42 | 21.05 | 0.20 |
| Conv5-3 | 14 × 14 × 512 | 14 × 14 × 512 | 0.42 | 21.05 | 0.20 |
| Total | | | 20.65 | 722.35 | 16.58 |

Width/height/channel ^{1}; including the computation time and memory access time ^{2}.

| Design | Naïve | Strassen | CENNA |
|---|---|---|---|
| Total Time | 4.86 ms | 7.68 ms | 4.86 ms |
| Real Throughput ^{1} | 65.84 GOPS | 41.67 GOPS | 65.84 GOPS |
| Peak Throughput ^{2} | 86 GOPS | 63.64 GOPS | 86 GOPS |
| Efficiency ^{3} | 1.31 TOPS/W | 0.83 TOPS/W | 1.39 TOPS/W |

Throughput measured while executing convolution operations ^{1}; theoretical performance ^{2}; real throughput (tera-operations) per watt ^{3}.

| Metrics | Eyeriss [9] | ConvNet [10] | Chain-NN [11] | CENNA |
|---|---|---|---|---|
| Precision (bit) | 16 | 1–16 | 16 | 16 |
| Process Technology (nm) | 65 | 40 | 28 | 65 |
| Area (mm^{2}) ^{1} | 12.25 | 2.4 | 10.69 | 1.38 |
| Frequency (MHz) | 200 | 204 | 700 | 500 |
| # of operators ^{2} | 336 | 204 | 1152 | 232 |
| SRAM (KB) | 192 | 144 | 352 | 64 |
| Peak Throughput (GOPS) | 84 | 102 | 806.4 | 86 |
| Real Throughput (GOPS) ^{3} | 24.6 | 63.38 | 668.16 | 57.17 |
| Frame Rate (Frame/s) ^{3} | 0.6 | 1.54 | 16.8 | 1.38 |
| Power (mW) ^{4} | 236 | 220 | 567.5 | 47.34 |

Scaled to 65 nm: … mm^{2}/358 mW (ConvNet), 24.81 mm^{2}/1,317 mW (Chain-NN) ^{1,4}; total number of multipliers and adders ^{2}; VGG-16 scaled up to a similar amount of computation as 20.65 giga MAC operations (GMAC) ^{3}.

| Metrics | MAERI [12] | Origami [13] | CENNA |
|---|---|---|---|
| Precision (bit) | 16 | 12 | 16 |
| Process Technology (nm) | 28 | 65 | 65 |
| Area (mm^{2}) | 3.84 | 3.09 | 1.38 |
| Freq (MHz) | 200 | 189 | 500 |
| # of operators | 336 | 388 | 232 |
| SRAM (KB) | 80 | 43 | 64 |
| Peak Throughput (GOPS) | 67.2 | 74 | 86 |
| Real Throughput (GOPS) ^{1} | 58.27 | 55 | 57.17 |
| Frame Rate (Frame/s) ^{1} | 1.4 | 7.26 | 1.38 |
| Power (mW) | 370 | 93 | 47.34 |

VGG-16 scaled up to a similar amount of computation as 20.65 GMAC ^{1}.

| Metric | Eyeriss [9] | ConvNet [10] | Chain-NN [11] | MAERI [12] | Origami [13] | CENNA |
|---|---|---|---|---|---|---|
| Power (TOPS/W) | 0.10 | 0.29 * | 1.18 * | 0.16 * | 0.59 | 1.21 |
| Area (GOPS/mm^{2}) | 2.01 | 26.41 ** | 62.5 ** | 15.17 ** | 17.80 | 41.43 |
| Overall (TOPS/W/mm^{2}) | 0.01 | 0.12 *** | 0.11 *** | 0.04 *** | 0.19 | 0.88 |

When scaled to 65 nm, the marked values become … (TOPS/W) *, … (GOPS/mm^{2}) **, and 0.05/0.02/0.01 (TOPS/W/mm^{2}) ***, respectively.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Park, S.-S.; Chung, K.-S.
CENNA: Cost-Effective Neural Network Accelerator. *Electronics* **2020**, *9*, 134.
https://doi.org/10.3390/electronics9010134
