1. Introduction
In recent years, artificial intelligence and deep learning have been extensively used to solve many real-world problems. Currently, convolutional neural networks (CNNs) are one of the most advanced deep learning algorithms and are used to solve recognition problems in several scenarios. CNNs are more accurate than conventional algorithms. However, the many parameters of the convolution operation require a considerable amount of computational resources and memory access [1]. This is a computational challenge for the central processing unit (CPU), as it consumes excessive power. Instead, hardware accelerators such as the graphics processing unit (GPU), field-programmable gate array (FPGA), and application-specific integrated circuit (ASIC) have been used to increase the throughput of CNNs [2,3]. When CNNs are implemented in hardware, latency improves and energy consumption is reduced. GPUs are the most widely used processors and can accelerate both the training and inference of CNNs. However, GPUs consume excessive power, which is a key performance metric in modern digital systems. ASIC designs achieve high throughput and low power consumption but require more development time and cost. By contrast, FPGAs provide abundant hardware resources, including thousands of floating-point computing units, at lower power consumption. Therefore, FPGA-based accelerators, like ASICs, are an efficient alternative that offers high throughput and configurability at low power consumption and a reasonable cost.
With the development of FPGA-based hardware accelerators, algorithms for improving the accuracy of CNNs are also evolving. Advances in CNN algorithms require many convolution parameters, which increases both their complexity and their computational load. Many object detection algorithms, such as the two-stage R-CNN family [4,5,6] and the one-stage YOLO family [7,8], have been developed to improve speed and accuracy alike. However, deploying CNNs for edge computing is restricted by this complexity and by the requirements associated with the increased computation. To address these problems, CNN model compression methods have been attracting considerable attention.
CNN model compression technology involves simplifying the deep learning model structure, reducing the number of model parameters, reducing the bit width of the model, reducing the amount of computation, improving the inference speed of the deep learning model, and reducing the required storage resources. Edge devices require fast response, low memory usage, and low energy consumption. Deep learning model compression can efficiently improve model inference speed, reduce model storage space, and reduce model energy consumption. For large deep learning models, model compression can improve product competitiveness by reducing edge-equipment cost, increasing efficiency, and lowering the carbon footprint. In existing deep learning model compression methods, the compression techniques are relatively complex, and the difficulty is primarily related to the following aspects:
- 1. The reliance of compression methods on the training process: Current model compression methods rely on the training process without considering its time cost. Moreover, owing to a lack of raw data or algorithm complexity, users often cannot obtain the training code and cannot reproduce the training process.
- 2. The variety of model compression algorithms and the difficulty of parameter tuning: Taking post-training quantization as an example, the classic and commonly used offline quantization algorithms include MSE, ABS_MAX, Bias Correction, AVG, HIST, KLD, AdaRound, and EMD. Each offline quantization algorithm has 2–4 parameters. Efficiently selecting an appropriate offline quantization algorithm and its parameters for a given model and scenario is a significant problem when putting model compression into practice.
- 3. The complexity of combining multiple compression strategies: In addition to offline quantization, model compression includes techniques such as pruning and distillation, and combinations of compression algorithms are increasingly used as the demand for model miniaturization grows. Compression algorithms affect one another, so their effects cannot simply be accumulated. Selecting a suitable set of compression algorithms from the many candidates therefore depends heavily on human experience and long-term experimentation.
- 4. The variety of compressed model structures and the complexity of deployment environments: Backbone networks improve rapidly, and activation functions continue to evolve; different structures and activation functions have different sensitivities and lossless compression ratios. In addition, deployment-environment factors such as the FPGA characteristics and the optimization details of the inference library must be considered during compression. Given both the model structure and the deployment environment, manual compression rarely achieves the expected goal.
In this study, we develop an automatic model compression tool set that can significantly reduce the size of deep learning models by combining model optimization methods with model compression, thereby addressing the aforementioned problems. We also examine the automatic parallelization of FPGA designs, so that software and hardware coordinate CNN tasks, and the restructuring of CNNs to better fit the FPGA and advanced RISC machine (ARM) NEON architectures.
Section 2.1 and Section 2.2 discuss recent studies related to FPGA-based CNN designs and CNN optimization. Section 3 focuses on the proposed research method, which addresses these problems through the following main ideas.
We propose a model compression method that considers memory resources and enables software and hardware acceleration through a lightweight CNN model. In the proposed method, the quantization scheme is determined through distillation, and the pruning method balances accuracy and speed to ensure maximum CNN prediction performance.
The resulting pruned weights are grouped to achieve an appropriate balance between CNN prediction accuracy and group-specific memory access. Because the data flow differs per memory, this grouping supports parallel computation. To improve the computational efficiency of CNNs, we propose a CPU-FPGA cooperative computation method. Based on the CNN architecture and the advanced pruning method, we design the software and hardware so that CNN computation is parallelized. This approach achieves high hardware parallelism, covering both the weight groups and the memory access methods, and enables parallelism across the ARM NEON architecture and the FPGA.
We propose an automated method to design an optimal accelerator that balances ARM and FPGA performance by utilizing a design-space exploration model. The proposed method generates a CNN auto-accelerator through network analysis, layer decomposition, software-defined system-on-chip (SDSoC) template mapping, and long short-term memory (LSTM)-based CNN auto-generator design steps. Based on this process, an automatic CNN model optimization engine is designed by implementing a CNN-model-generating LSTM that satisfies the FPGA design performance required for CNN implementation.
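To make the last step concrete, the sketch below shows one common way such an LSTM-based generator could be structured, following the controller style used in neural architecture search: the LSTM samples a sequence of discrete design choices that are then scored against the FPGA performance target. The design space, embedding scheme, and scoring are illustrative assumptions, not the exact generator used in this work.

```python
import torch
import torch.nn as nn

# Hypothetical per-layer accelerator design knobs; the real design space would come from the
# network analysis and layer decomposition steps described above.
DESIGN_SPACE = {
    "pe_parallelism": [2, 4, 8, 16],   # processing-element parallelism
    "tile_size": [7, 14, 28],          # feature-map tile size
    "precision_bits": [4, 8, 16],      # arithmetic precision
}

class DesignController(nn.Module):
    """Tiny LSTM that samples one design choice per step (NAS-controller style)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.LSTMCell(hidden, hidden)
        self.embed = nn.ParameterDict(
            {k: nn.Parameter(0.01 * torch.randn(len(v), hidden)) for k, v in DESIGN_SPACE.items()})
        self.heads = nn.ModuleDict(
            {k: nn.Linear(hidden, len(v)) for k, v in DESIGN_SPACE.items()})

    def sample(self):
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        x = torch.zeros(1, self.hidden)
        config, log_probs = {}, []
        for key, options in DESIGN_SPACE.items():
            h, c = self.cell(x, (h, c))
            dist = torch.distributions.Categorical(logits=self.heads[key](h))
            idx = dist.sample()                  # pick one option for this design knob
            log_probs.append(dist.log_prob(idx))
            config[key] = options[idx.item()]
            x = self.embed[key][idx]             # feed the chosen option back as the next input
        return config, torch.stack(log_probs).sum()
```

Each sampled configuration would be evaluated (for example, by the estimated FPGA latency and resource use of the mapped SDSoC template), and the controller would be updated with a policy-gradient reward so that better-performing designs become more likely.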
3. Materials and Methods
3.2. CNN Model Compression Method through Distillation
We primarily evaluated the effects of automated compression on open-source models for image classification, image semantic segmentation, and image object detection. Automatic compression also supports inference models created in PyTorch and TensorFlow. In contrast to conventional manual compression, the automatic aspect is primarily reflected in four areas: separation from the deep learning training code, offline quantization hyperparameter search, automatic algorithm combination, and hardware-aware modeling. By providing only an inference model and unlabeled data, users can apply compression methods that normally rely on the training process, such as quantization training and sparse training. Automated compression adds the required training logic to the inference model through distillation.
Figure 4 shows the distillation process, and Figure 5 shows the compression process. First, the user-specified inference model file is loaded; the inference model is copied into memory as the teacher model for knowledge distillation, and the original model is copied as the student model. The tool then automatically analyzes the model structure to find suitable layers for adding distillation losses, namely layers with learnable parameters. Finally, the teacher model supervises the quantization training of the original model through the distillation loss.
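The following is a minimal PyTorch sketch of this teacher-student setup, assuming the inference model is available as an nn.Module; the layer-selection heuristic, hook-based feature matching, and MSE distillation loss are illustrative choices rather than the exact logic of the tool set.

```python
import copy
import torch
import torch.nn as nn

def build_distillation_pair(inference_model: nn.Module):
    """Copy the loaded inference model into a frozen teacher and a trainable student."""
    teacher = copy.deepcopy(inference_model).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    student = copy.deepcopy(inference_model).train()
    return teacher, student

def distillable_layers(model: nn.Module):
    """Heuristic: attach distillation losses to layers that hold learnable parameters."""
    return [name for name, m in model.named_modules()
            if isinstance(m, (nn.Conv2d, nn.Linear))]

def distillation_loss(teacher, student, x, layer_names, weight=1.0):
    """Sum of per-layer MSE losses between teacher and student feature maps."""
    feats = {"t": {}, "s": {}}
    hooks = []

    def save(tag, name):
        def hook(_module, _inputs, output):
            feats[tag][name] = output
        return hook

    for name, m in teacher.named_modules():
        if name in layer_names:
            hooks.append(m.register_forward_hook(save("t", name)))
    for name, m in student.named_modules():
        if name in layer_names:
            hooks.append(m.register_forward_hook(save("s", name)))

    with torch.no_grad():
        teacher(x)          # teacher activations act as soft supervision
    student(x)
    for h in hooks:
        h.remove()

    return weight * sum(nn.functional.mse_loss(feats["s"][n], feats["t"][n].detach())
                        for n in layer_names)
```

During quantization training, only the student (the copy being quantized) is updated, and unlabeled data suffice because the loss is computed against the teacher's activations rather than ground-truth labels.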
In search scenarios with many models and fast iteration rates, offline quantization is the most suitable compression method. Various offline quantization algorithms are implemented, and several of them can be used in combination.
The automatic compression function analyzes the model structure and automatically selects an appropriate combination algorithm according to the user-specified model structure characteristics and deployment environment. Determining the parameters of each compression algorithm after selecting a joint compression algorithm is challenging. Setting the parameters of the compression algorithm is closely related to the deployment environment. Various factors, such as chip characteristics and the degree of optimization of the inference library, must be considered. As an agent of the deployment environment, the hardware awareness module models and learns the characteristics of the deployment environment and provides a performance inquiry service for parameter setting.
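The sketch below illustrates one way such a selection could be organized: candidate offline quantization algorithms and their parameters are enumerated, hardware-infeasible settings are filtered out through the hardware-awareness query, and the remaining candidates are scored on unlabeled calibration data against the original model. The parameter grids and the quantize, score, and query_latency helpers are assumptions for illustration, not the tool's actual interface.

```python
from itertools import product

# Hypothetical candidate space: offline quantization algorithm -> parameter grid
# (algorithm names follow the list given earlier; the grids themselves are invented).
CANDIDATES = {
    "HIST": {"percentile": [0.999, 0.9999]},
    "KLD":  {"num_bins": [1024, 2048]},
    "MSE":  {"search_steps": [50, 100]},
}

def select_offline_quantization(model, calib_data, quantize, score, query_latency, target_speedup):
    """Enumerate (algorithm, parameters), drop settings that miss the latency target, keep the best score."""
    baseline = query_latency(model)
    best = None
    for algo, grid in CANDIDATES.items():
        keys, values = zip(*grid.items())
        for combo in product(*values):
            params = dict(zip(keys, combo))
            q_model = quantize(model, algo, params, calib_data)      # assumed quantization helper
            if query_latency(q_model) > baseline / target_speedup:
                continue                                             # fails the speed-up requirement
            err = score(q_model, model, calib_data)                  # e.g., output error vs. original
            if best is None or err < best[0]:
                best = (err, algo, params)
    return best
```

In practice the search would be pruned rather than exhaustive, but the structure (enumerate, filter by the hardware-awareness service, score on calibration data) stays the same.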
The relationship between compression parameters and inference speed is not linear, because it is constrained by optimizations such as operator fusion in the inference library. Taking sparsity as an example, an inference library may accelerate matrix multiplication only when sparsity exceeds 75%; in that case, 60% sparsity and 10% sparsity yield no inference acceleration at all, so setting the sparsity to 60% is impractical. In addition, the sparse acceleration effect is also affected by the input shapes of the matrix multiplication operator. In conclusion, accurately evaluating the relationship between compression parameters and inference speed across various model structures and deployment environments using human experience or simple formulas is impossible.
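As a toy illustration of this step-like, shape-dependent behavior (the 75% threshold comes from the example above; the overhead constants are invented), consider:

```python
def estimated_sparse_speedup(sparsity: float, m: int, k: int, n: int) -> float:
    """Toy latency model for an (m x k) @ (k x n) matmul: no gain below the library threshold."""
    if sparsity < 0.75:
        return 1.0                                    # e.g., 10% or 60% sparsity: no acceleration
    density = 1.0 - sparsity
    overhead = 0.15 if min(m, k, n) < 64 else 0.05    # indexing overhead hurts small matmuls more
    return 1.0 / (density + overhead)
```

Real libraries behave far less regularly than this, which is exactly why the hardware delay estimation function described next is needed.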
To this end, we developed a hardware delay estimation function. This feature combines data tables with deep learning models to capture the factors that affect inference speed and to guide the parameter settings of the combined compression algorithm. The two key modules of the hardware delay estimation function are the delay estimation table and the predictor.
Estimation table: The inference performance of multiple operators is sampled and measured for each deployment environment and recorded in a data table. Each row of the data table contains the operator type, the operator's own parameters (e.g., input shape, stride, and padding), sparsity, quantization, and other information. The estimation table returns accurate latencies for the operators it contains; however, it cannot cover all possible parameter values of an operator.
Predictor: Data from the estimation table are used to train a predictor for each operator type that predicts inference performance. The predictor is less accurate than the estimation table; however, it generalizes better and can cover many more operator parameter values.
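A minimal sketch of training such a per-operator predictor from the table data, assuming each row stores the operator parameters and the latency measured on the target device (the column names are illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor

# One predictor is trained per operator type (e.g., conv2d, depthwise_conv, matmul).
def train_latency_predictor(table_rows):
    X = [[r["in_h"], r["in_w"], r["in_c"], r["out_c"],
          r["kernel"], r["stride"], r["sparsity"], r["bits"]] for r in table_rows]
    y = [r["latency_ms"] for r in table_rows]
    predictor = GradientBoostingRegressor(n_estimators=200, max_depth=4)
    predictor.fit(X, y)
    return predictor
```

Any regressor with reasonable generalization would do; the point is that the predictor fills in operator configurations the table never measured.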
The workflow for this function is shown in the flowchart on the right in Figure 6.
Step 1: Analyze the model structure and perform OP fusion on the inference model to obtain the operators that are actually executed during deployment.
Step 2: For every OP of the fused inference model from Step 1, look it up in the estimation table; if no matching entry is found, query the predictor instead.
Step 3: Accumulate the times of all OPs to obtain the final inference performance of the candidate model.
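A sketch of these three steps, assuming the fused graph is available as a list of operator records, the estimation table is a dictionary keyed by operator signature, and predictors holds one trained model per operator type (as sketched above):

```python
def estimate_model_latency(fused_ops, table, predictors):
    """Steps 1-3 in code: accumulate per-operator latency, preferring exact table hits."""
    total_ms = 0.0
    for op in fused_ops:                          # Step 1 output: operators after OP fusion
        key = (op["type"], op["signature"])       # signature = shapes, stride, sparsity, bits, ...
        if key in table:                          # Step 2: exact estimation-table lookup
            total_ms += table[key]
        else:                                     # Step 2 fallback: per-operator-type predictor
            total_ms += float(predictors[op["type"]].predict([op["features"]])[0])
    return total_ms                               # Step 3: estimated latency of the candidate model
```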
With these functions, the model inference performance under various compression parameters can be obtained quickly, and a small number of candidate models can be shortlisted according to the user-specified inference acceleration factor on the specific hardware. Finally, the accuracy of the candidate models is verified individually.
After verification, the CNN branches of the candidate model are optimized by pruning so that they can be computed in parallel in the CPU-FPGA co-design method. The branches are optimized with unstructured pruning for the CPU and structured pruning for the FPGA. For the CPU, we focus on unstructured pruning, which removes individual CNN weights arbitrarily. Unstructured pruning can achieve high sparsity, around 90% on average, which reduces the on-chip storage needed when computing CNNs. However, high sparsity does not necessarily lead to high speed-ups owing to the additional encoding and indexing overhead, workload imbalance, and poor data locality; performance degrades further when the sparsity distribution is heavily skewed. Significant progress has recently been made in structured pruning, which prunes networks according to specific sparsity patterns: starting from a strict sparsity pattern that follows particular mathematical properties, the hardware is designed to support the necessary mathematical transformations. These techniques achieve the high regularity and computational efficiency suitable for FPGA hardware designs. A typical convolution is a computationally intensive kernel that traverses multidimensional tensors (feature maps and weights) to perform multiplications and additions. We therefore reconstruct the CNN branch weights into blocks to increase the parallelism of the CNN architecture. Multiple branches access the tensors simultaneously, with each processor relieving memory bottlenecks and improving parallel computing performance during computation.
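The sketch below illustrates one simple way the branch weights could be regrouped into regular blocks for structured pruning, so that whole tiles are zeroed and the surviving groups map cleanly onto FPGA block memory; the tile size, keep ratio, and magnitude-based scoring are assumptions, not the exact scheme used here.

```python
import torch

def block_prune_mask(weight: torch.Tensor, block: int = 4, keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero entire (block x block) tiles of a 2D weight matrix, keeping the highest-magnitude tiles."""
    out_c, in_c = weight.shape
    assert out_c % block == 0 and in_c % block == 0
    tiles = weight.reshape(out_c // block, block, in_c // block, block)
    scores = tiles.abs().sum(dim=(1, 3))                      # one importance score per tile
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    tile_mask = (scores >= threshold).float()
    # Broadcast the per-tile decision back to element granularity.
    return tile_mask[:, None, :, None].expand_as(tiles).reshape(out_c, in_c)

# Usage: pruned = weight * block_prune_mask(weight); the surviving blocks stay contiguous,
# so each weight group can be streamed to FPGA local memory as a unit.
```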
3.3. Hardware Acceleration through CPU–FPGA Co-Design
A parallel operation method is required in the joint CPU-FPGA design so that resources are utilized more efficiently during CNN inference. Because most recent CNNs frequently use multi-branch architectures, we performed hardware acceleration through a CPU-FPGA co-design computation method. Implementing CNN inference with only one processor of the system-on-chip (SoC) board wastes considerable computational resources; in particular, the CPU sits largely idle while the FPGA performs the operation. Therefore, if the CPU and FPGA perform calculations simultaneously, high efficiency and speed in resource use can be obtained.
Hardware such as FPGAs is better suited than CPUs for highly parallel tasks such as the convolutions in CNNs. In a CNN multi-branch structure, the operation time differs between a branch with many convolutional layers and a branch with relatively few or none. In addition, owing to the branch-by-branch pruning performed earlier, the groups also differ structurally, not only in time. A branch optimized through structured pruning is executed on the FPGA, while a branch optimized through unstructured pruning is executed on the CPU; the CPU and FPGA can then be used simultaneously. If the convolutional layers of a structurally pruned branch are consecutive, they can be executed sequentially on the FPGA.
As shown in Figure 7, we stored the weights of the CNN, divided into groups, in dynamic random-access memory (DRAM) for computation. The CPU and FPGA logical spaces inside DRAM can be copied between each other; therefore, the weights can be arranged so that they move to each processor's memory at CNN runtime. During pruning, the weights are sorted by pruning type and copied into the corresponding CPU or FPGA logical space. When the corresponding CNN branch is executed, the weight groups copied to each processor space are copied again to different memories for operation: the weight groups from unstructured pruning are moved from the CPU logical space to the SRAM, whereas the weight groups from structured pruning are moved from the FPGA logical space to the FPGA local memory.
To measure the operation time of the CPU and FPGA for each branch of the CNN, the branch operation time is defined as t_u for unstructured pruning and t_s for structured pruning. For a CNN architecture with several branches, the computing times can be represented as t_s1, t_s2, t_s3, …, t_u1, t_u2, and t_u3. When branch1 and branch2 of Figure 3 run simultaneously on different processors (e.g., branch1 on the FPGA and branch2 on the CPU), the operation time is determined by the longer of the two branch times, which can be expressed as T_diff = max(t_s1, t_u2).
If branch1 and branch2 are instead assigned as CPU-CPU or FPGA-FPGA, the minimum time is calculated differently. When the two branches run on the same processor, they are computed serially, so the computation time on a single processor is the sum of the branch times: T_same = t_u1 + t_u2 for CPU-CPU, or T_same = t_s1 + t_s2 for FPGA-FPGA.
By comparing the serial operation time T_same on the same processor with the CPU-FPGA parallel operation time T_diff on different processors, the optimal processors for branch1 and branch2 can be selected as those giving the minimum value. The CPU and FPGA execution times are measured for each branch, and the total computation time for both branches is T_total = min(T_same, T_diff).
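Under this time model, selecting the processors for two branches reduces to comparing the four possible assignments; a minimal sketch, assuming the per-branch CPU and FPGA times have been profiled on the target SoC:

```python
def choose_processors(t_cpu_b1, t_fpga_b1, t_cpu_b2, t_fpga_b2):
    """Pick the branch-to-processor mapping that minimizes the total time T_total."""
    candidates = {
        ("CPU", "FPGA"):  max(t_cpu_b1, t_fpga_b2),   # different processors run in parallel: T_diff
        ("FPGA", "CPU"):  max(t_fpga_b1, t_cpu_b2),
        ("CPU", "CPU"):   t_cpu_b1 + t_cpu_b2,        # same processor runs serially: T_same
        ("FPGA", "FPGA"): t_fpga_b1 + t_fpga_b2,
    }
    mapping, t_total = min(candidates.items(), key=lambda kv: kv[1])
    return mapping, t_total

# Example with profiled times in ms: choose_processors(4.0, 2.5, 3.0, 6.0)
# returns (("FPGA", "CPU"), 3.0), i.e., branch1 on the FPGA and branch2 on the CPU.
```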
Consequently, the CNN inference process is computed with the processor combination that yields the minimum total computation time T_total. Running on this processor combination, CNN inference achieves high speed and high hardware utilization through the CPU-FPGA co-design approach.
Author Contributions
Data curation, S.J.; Formal analysis, S.J. and W.L.; Funding acquisition, Y.C.; Methodology, W.L.; Software, S.J.; Supervision, Y.C.; Validation, W.L.; Writing—original draft, S.J.; Writing—review & editing, Y.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Korea Evaluation Institute of Industrial Technology (KEIT) under the Industrial Embedded System Technology Development (R&D) Program 20016341. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.
Institutional Review Board Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Qi, X.; Liu, C. Enabling Deep Learning on IoT Edge: Approaches and Evaluation. In Proceedings of the IEEE/ACM Symposium on Edge Computing (SEC), Seattle, WA, USA, 25–27 October 2018; pp. 25–27. [Google Scholar] [CrossRef]
- Li, Q.; Zhang, X.; Xiong, J.; Hwu, W.; Chen, D. Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, Tokyo, Japan, 23 January 2019; pp. 21–24. [Google Scholar]
- Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 22–24. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Wang, J.; Lou, Q.; Zhang, X.; Zhu, C.; Lin, Y.; Chen, D. Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In Proceedings of the 2018 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, Ireland, 27–31 August 2018; pp. 27–31. [Google Scholar]
- Aydonat, U.; O'Connell, S.; Capalija, D.; Ling, A.C.; Chiu, G.R. An OpenCL™ deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 22–24. [Google Scholar]
- Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In Proceedings of the 2019 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; pp. 6–9. [Google Scholar]
- Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 21–23. [Google Scholar]
- Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W. Optimized Compression for Implementing Convolutional Neural Networks on FPGA. Electronics 2019, 8, 295. [Google Scholar] [CrossRef] [Green Version]
- Zeng, H.; Chen, R.; Zhang, C.; Prasanna, V. A framework for generating high throughput CNN implementations on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 25–27 February 2018; pp. 25–27. [Google Scholar]
- Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, S. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks. In Proceedings of the 2017 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, 4–8 September 2017; pp. 4–8. [Google Scholar]
- Guan, Y.; Liang, H.; Xu, N.; Wang, W.; Shi, S.; Chen, X.; Sun, G.; Zhang, W.; Cong, J. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In Proceedings of the 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, USA, 30 April–2 May 2017. [Google Scholar]
- Guo, Y.; Yao, A.; Chen, Y. Dynamic network surgery for efficient dnns. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
- Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
- He, Y.; Zhang, X.; Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 27–29 October 2017. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Howard, G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Ma, N.; Zhang, X.; Zheng, H.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Iandola, F.; Han, S.; Moskewicz, M.; Ashraf, K.; Dally, W.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Wu, D.; Zhang, Y.; Jia, X.; Tian, L.; Li, T.; Sui, L.; Xie, D.; Shan, Y. A High-Performance CNN Processor Based on FPGA for MobileNets. In International Conference on Field Programmable Logic and Applications (FPL); IEEE: Piscataway, NJ, USA, 2019; pp. 136–143. [Google Scholar]
- Bai, L.; Zhao, Y.; Huang, X. A CNN Accelerator on FPGA using Depthwise Separable Convolution. In IEEE Transactions on Circuits and Systems II: Express Briefs; IEEE: Piscataway, NJ, USA, 2018; Volume 65, pp. 1415–1419. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).