Enhancing Computation-Efficiency of Deep Neural Network Processing on Edge Devices through Serial/Parallel Systolic Computing
Abstract
1. Introduction
- (1) We propose a serial/parallel systolic array (SPSA) architecture and data flow;
- (2) We introduce bit-serial processing of activations with zero-skipping capability (see the sketch after this list);
- (3) Our design exploits activation precision adjustment in a systolic array accelerator;
- (4) We improve energy efficiency by replacing complicated multipliers with simpler, low-cost serial circuits.
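To make contribution (2) concrete, the following is a minimal Python sketch of a bit-serial multiply-accumulate with zero skipping. It is our illustration of the idea, not the authors' RTL; the function name, default bit width, and unsigned-activation assumption are ours.

```python
def bit_serial_mac(activation: int, weight: int, acc: int = 0, n_bits: int = 8) -> int:
    """Accumulate activation * weight by streaming the activation one bit
    at a time, LSB first. Zero-valued bits issue no add (zero skipping),
    so the number of adds tracks the set bits, not the full bit width."""
    for i in range(n_bits):
        if (activation >> i) & 1:   # zero skipping: a 0-bit costs nothing
            acc += weight << i      # shift-and-add replaces the multiplier
    return acc

assert bit_serial_mac(13, 7) == 13 * 7  # 13 = 0b1101: only three adds issued
# Runtime precision adjustment (contribution (3)): an activation quantized
# to 4 bits needs only 4 serial cycles instead of 8.
assert bit_serial_mac(13, 7, n_bits=4) == 13 * 7
```

Because each cycle needs only a bit test, a shift, and an add, the serial PE avoids the wide multiplier entirely, which is the basis of contribution (4).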
2. Baseline System
- Precision is fixed at design time;
- Trade-offs among operational factors such as accuracy are made statically;
- Latency (cycle time) is high due to complicated bit-parallel multiplier operations (contrast with the baseline PE sketch below).
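For contrast with the serial approach above, here is a minimal sketch of a conventional bit-parallel, weight-stationary processing element of the kind these drawbacks refer to. The class and method names are our own illustration, not code from the paper.

```python
class ParallelPE:
    """Baseline weight-stationary PE: one full-width multiply per cycle."""

    def __init__(self, weight: int):
        self.weight = weight  # datapath precision is fixed at design time

    def step(self, act_in: int, psum_in: int) -> tuple[int, int]:
        # The wide multiplier dominates area and power and sits on the
        # critical path, producing the long cycle time noted above. The
        # activation and updated partial sum are forwarded to neighbors.
        return act_in, psum_in + act_in * self.weight
```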
3. Related Work
4. Octet Serial Processing Approach
5. SPSA Accelerator Architecture
5.1. Overall Accelerator Architecture
5.2. SPSA-MAC Processing Elements
6. Evaluation and Comparison
6.1. Computation Pruning by Zero Skipping
6.2. Energy Efficiency Improvement
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Implementation results for the baseline parallel systolic array (PSA) and the proposed serial/parallel design (SPSA) and its octet-serial variant (OSPSA); improvements are relative to the PSA baseline:

Feature | PSA (16 × 16) | SPSA (16 × 16) | Improvement vs. PSA | OSPSA (2 × 16) | Improvement vs. PSA |
---|---|---|---|---|---|
Area | 243,624 | 63,475 | +74% | 37,401 | +84% |
Leakage Power | 1.456 mW | 0.504 mW | +65% | 0.299 mW | +79% |
Dynamic Power | 17.88 mW | 4.42 mW | +72% | 2.44 mW | +88% |
Latency (Cycle Time) | 2.30 ns | 1.00 ns | +56% | 1.38 ns | +40% |
Max Frequency | 434 MHz | 1000 MHz | +130% | 724 MHz | +67% |
Max Performance | 111 GMac/s | 32.0 GMac/s | −71% | 23.2 GMac/s | −79% |
Max Perf./Area (PPA, Mac/s per unit area) | 456 × 10³ | 504 × 10³ | +10% | 619 × 10³ | +35% |
Max Perf./Watt (PPW, Mac/s/W) | 5.74 × 10¹² | 6.50 × 10¹² | +13% | 8.47 × 10¹² | +47% |
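For readers checking the derived rows, PPA and PPW follow from the raw entries by simple division. The short Python check below is our own arithmetic, assuming PPA = max performance / area and PPW = max performance / (leakage + dynamic power).

```python
# (area, leakage in W, dynamic power in W, performance in Mac/s), per the table
designs = {
    "PSA":   (243_624, 1.456e-3, 17.88e-3, 111e9),
    "SPSA":  ( 63_475, 0.504e-3,  4.42e-3,  32.0e9),
    "OSPSA": ( 37_401, 0.299e-3,  2.44e-3,  23.2e9),
}

for name, (area, leak, dyn, perf) in designs.items():
    ppa = perf / area          # Mac/s per unit area
    ppw = perf / (leak + dyn)  # Mac/s per watt
    print(f"{name}: PPA = {ppa:,.0f}, PPW = {ppw:.2e}")
# PSA:   PPA ≈ 456 × 10³, PPW ≈ 5.74 × 10¹²
# SPSA:  PPA ≈ 504 × 10³, PPW ≈ 6.50 × 10¹²
# OSPSA: PPA ≈ 620 × 10³, PPW ≈ 8.47 × 10¹²
```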