Efficient CNN Accelerator Based on Low-End FPGA with Optimized Depthwise Separable Convolutions and Squeeze-and-Excite Modules
Abstract
1. Introduction
- To address the large resource and power consumption of existing convolutional neural network accelerators that target high-end FPGAs, a depthwise separable convolution network accelerator suited to low-end FPGAs is proposed. The accelerator provides a flexible, configurable computing architecture by optimizing the computation flow of the depthwise separable convolution and the squeeze-and-excite module.
- The impact of parallel computing on resource usage and throughput is analyzed in detail, and a strategy of adjusting the degree of parallelism to the application's requirements is proposed to balance computing speed against resource consumption. The expand module and the pointwise convolution module can be parallelized by stacking multipliers, which improves computational efficiency.
- By optimizing the data flow and caching mechanism, the buffering requirements for intermediate data are reduced and computational efficiency is improved. Intermediate results produced by one stage are not cached; they are consumed directly by the next stage, avoiding frequent memory reads and writes. In addition, a simple forward-backward valid-signal synchronization mechanism is proposed to guarantee correct data transfer between modules while keeping the control logic simple. A functional sketch of the resulting bottleneck dataflow is given after this list.
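As a minimal functional sketch only (Python/NumPy with placeholder weights; ReLU stands in for the network's actual activations, and a stride-1, zero-padded 3 × 3 depthwise kernel is assumed), the code below models the bottleneck dataflow the accelerator targets: expand pointwise convolution, depthwise convolution, squeeze-and-excite, and pointwise projection, with each stage consuming the previous stage's output directly.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hard_sigmoid(x):
    # Piecewise-linear sigmoid approximation used in MobileNetV3-style SE blocks.
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def pointwise_conv(x, w):
    # x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out); a 1x1 convolution is a
    # per-pixel matrix-vector product over the channel dimension.
    return np.einsum("hwc,co->hwo", x, w)

def depthwise_conv3x3(x, w):
    # x: (H, W, C), w: (3, 3, C); each channel is filtered independently by its
    # own 3x3 kernel (stride 1, zero padding).
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w, axis=(0, 1))
    return out

def squeeze_excite(x, w1, w2):
    # Global average pool -> FC -> ReLU -> FC -> hard-sigmoid -> channel-wise scale.
    s = x.mean(axis=(0, 1))              # squeeze: (C,)
    e = hard_sigmoid(relu(s @ w1) @ w2)  # excite: (C,)
    return x * e                         # reweight each channel

def bottleneck(x, w_expand, w_dw, w_se1, w_se2, w_project):
    # Stages are chained directly; the output of one stage feeds the next without
    # an intermediate buffer, mirroring the dataflow described above.
    y = relu(pointwise_conv(x, w_expand))   # expand module
    y = relu(depthwise_conv3x3(y, w_dw))    # depthwise convolution module
    y = squeeze_excite(y, w_se1, w_se2)     # squeeze-and-excite module
    return pointwise_conv(y, w_project)     # regular pointwise (projection)

# Example shapes taken from the 7x7x96 bottleneck listed later (expand to 576,
# SE squeeze width 144, project back to 96); the weights are random placeholders.
rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 96))
out = bottleneck(x,
                 rng.standard_normal((96, 576)) * 0.01,
                 rng.standard_normal((3, 3, 576)) * 0.01,
                 rng.standard_normal((576, 144)) * 0.01,
                 rng.standard_normal((144, 576)) * 0.01,
                 rng.standard_normal((576, 96)) * 0.01)
print(out.shape)  # (7, 7, 96)
```

Each function above loosely corresponds to one pipeline stage of the accelerator; how many of the underlying multiplications are issued per cycle is governed by the parallelism discussed in Section 3.2.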
2. Related Theories
2.1. Depthwise Separable Convolution
2.2. Squeeze and Excitation
2.3. MobileNet
3. The Proposed Computing Architecture
3.1. Module Design
3.1.1. Overall Design
3.1.2. Pointwise Convolution Module
Algorithm 1 Pointwise Convolution in Expand Module

Algorithm 2 Regular Pointwise Convolution
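As an illustration of the operation that Algorithms 1 and 2 compute (not a transcription of the algorithms themselves), the sketch below writes the 1 × 1 (pointwise) convolution as explicit multiply-accumulate loops, with the input-channel loop processed in chunks to mimic a group of stacked multipliers working in the same cycle; the chunk width of 16 is an arbitrary example, not the paper's configuration.

```python
import numpy as np

def pointwise_conv_loops(x, w, parallelism=16):
    # x: (H, W, C_in) input feature map, w: (C_in, C_out) 1x1 kernels.
    # The input-channel loop is processed in chunks of `parallelism` elements;
    # each chunk models one group of multipliers followed by an adder-tree reduction.
    H, W, C_in = x.shape
    C_out = w.shape[1]
    out = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            for co in range(C_out):
                acc = 0.0
                for ci0 in range(0, C_in, parallelism):
                    chunk = slice(ci0, min(ci0 + parallelism, C_in))
                    # `parallelism` multiplications issued together, then summed.
                    acc += float(np.dot(x[i, j, chunk], w[chunk, co]))
                out[i, j, co] = acc
    return out

# Quick check against a direct matrix formulation of the 1x1 convolution.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 4, 32))
w = rng.standard_normal((32, 8))
ref = np.einsum("hwc,co->hwo", x, w)
assert np.allclose(pointwise_conv_loops(x, w), ref)
```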
3.1.3. Depthwise Convolution Module
Algorithm 3 Depthwise Convolution
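Likewise, the body of Algorithm 3 is not shown here; the sketch below only illustrates a standard 3 × 3, stride-1, zero-padded depthwise convolution, in which each channel is filtered by its own kernel and no accumulation across channels occurs.

```python
import numpy as np

def depthwise_conv3x3_loops(x, w):
    # x: (H, W, C) input feature map, w: (3, 3, C) one 3x3 kernel per channel.
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding keeps the spatial size
    out = np.zeros((H, W, C))
    for c in range(C):                 # channels are independent: no cross-channel sum
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for ki in range(3):
                    for kj in range(3):
                        acc += xp[i + ki, j + kj, c] * w[ki, kj, c]
                out[i, j, c] = acc
    return out

# A 3x3 window slides over each channel; in hardware a line buffer would supply the
# nine window pixels while the nine multiplications are performed in parallel.
x = np.arange(5 * 5 * 2, dtype=float).reshape(5, 5, 2)
w = np.ones((3, 3, 2)) / 9.0          # per-channel averaging kernel as a smoke test
print(depthwise_conv3x3_loops(x, w)[2, 2])  # local 3x3 means of each channel
```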
3.1.4. Squeeze-and-Excite Module
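As a generic, assumption-laden illustration of a squeeze-and-excite block (not the authors' hardware design), the sketch below computes it in a streaming style: the global average (squeeze) is accumulated pixel by pixel as the depthwise output arrives, two small fully connected layers produce per-channel scales, and the scales reweight the feature map. The hard-sigmoid follows the MobileNetV3 convention and is an assumption here.

```python
import numpy as np

def hard_sigmoid(x):
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def se_scales_streaming(pixel_stream, n_pixels, w1, w2):
    # Squeeze: accumulate per-channel sums pixel by pixel, divide once at the end.
    acc = None
    for px in pixel_stream:              # px: (C,) vector for one spatial position
        acc = px.copy() if acc is None else acc + px
    squeeze = acc / n_pixels             # global average pooling result, shape (C,)
    # Excite: two small fully connected layers, ReLU then hard-sigmoid.
    hidden = np.maximum(squeeze @ w1, 0.0)
    return hard_sigmoid(hidden @ w2)     # per-channel scale factors, shape (C,)

# Example with a 7x7x576 feature map and squeeze width 144, matching the bottleneck
# configuration listed later; the weights are random placeholders.
rng = np.random.default_rng(2)
fmap = rng.standard_normal((7, 7, 576))
scales = se_scales_streaming((fmap[i, j] for i in range(7) for j in range(7)),
                             49,
                             rng.standard_normal((576, 144)),
                             rng.standard_normal((144, 576)))
rescaled = fmap * scales                 # channel-wise reweighting of the feature map
print(scales.shape, rescaled.shape)      # (576,) (7, 7, 576)
```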
3.2. Parallelism
4. Experimental Results and Analysis
4.1. Simulation Result
4.2. Experimental Results and Analysis
4.2.1. Improvements on Modules
4.2.2. Comparison Against Other Implementations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, C.; Zhang, Q.; Wang, X.; Zhou, L.; Li, Q.; Xia, Z.; Ma, B.; Shi, Y.Q. Light-Field Image Multiple Reversible Robust Watermarking Against Geometric Attacks. IEEE Trans. Dependable Secur. Comput. 2015, 189, 106896.
- Liu, Y.; Wang, C.; Lu, M.; Yang, J.; Gui, J.; Zhang, S. From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5449–5462.
- Ma, B.; Li, K.; Xu, J.; Wang, C.; Li, X. A High-Performance Image Steganography Scheme Based on Dual-Adversarial Networks. IEEE Signal Process. Lett. 2024, 31, 2655–2659.
- Wan, Y.; Xie, X.; Chen, J.; Xie, K.; Yi, D.; Lu, Y.; Gai, K. ADS-CNN: Adaptive Dataflow Scheduling for lightweight CNN accelerator on FPGAs. Future Gener. Comput. Syst. 2024, 158, 138–149.
- Wan, Y.; Xie, X.; Yi, L.; Jiang, B.; Chen, J.; Jiang, Y. Pflow: An end-to-end heterogeneous acceleration framework for CNN inference on FPGAs. J. Syst. Archit. 2024, 150, 103113.
- Nakamura, T.; Saito, S.; Fujimoto, K.; Kaneko, M.; Shiraga, A. Spatial- and time-division multiplexing in CNN accelerator. Parallel Comput. 2022, 111, 102922.
- Hu, X.; Yu, S.; Zheng, J.; Fang, Z.; Zhao, Z.; Qu, X. A hybrid CNN-LSTM model for involuntary fall detection using wrist-worn sensors. Adv. Eng. Inform. 2025, 65, 103178.
- Xu, Z.; Zhang, B.; Luo Fan, L.; Hengzhou Yan, E.; Li, D.; Zhao, Z.; Sze Yip, W.; To, S. Deep-learning-driven intelligent tool wear identification of high-precision machining with multi-scale CNN-BiLSTM-GCN. Adv. Eng. Inform. 2025, 65, 103234.
- Nguyen Thanh, P.; Cho, M.Y. Advanced AIoT for failure classification of industrial diesel generators based hybrid deep learning CNN-BiLSTM algorithm. Adv. Eng. Inform. 2024, 62, 102644.
- Liu, T.; Zheng, H.; Zheng, P.; Bao, J.; Wang, J.; Liu, X.; Yang, C. An expert knowledge-empowered CNN approach for welding radiographic image recognition. Adv. Eng. Inform. 2023, 56, 101963.
- Wu, H.; Li, H.; Chi, H.L.; Peng, Z.; Chang, S.; Wu, Y. Thermal image-based hand gesture recognition for worker-robot collaboration in the construction industry: A feasible study. Adv. Eng. Inform. 2023, 56, 101939.
- Liu, C.; Jiang, Q.; Peng, D.; Kong, Y.; Zhang, J.; Xiong, L.; Duan, J.; Sun, C.; Jin, L. QT-TextSR: Enhancing scene text image super-resolution via efficient interaction with text recognition using a Query-aware Transformer. Neurocomputing 2025, 620, 129241.
- Yang, Z.; Tian, Y.; Wang, L.; Zhang, J. Enhancing Generalization in Camera Trap Image Recognition: Fine-Tuning Visual Language Models. Neurocomputing 2025, 634, 129826.
- Fu, C.; Zhou, T.; Guo, T.; Zhu, Q.; Luo, F.; Du, B. CNN-Transformer and Channel-Spatial Attention based network for hyperspectral image classification with few samples. Neural Netw. 2025, 186, 107283.
- Wei, H.; Yi, D.; Hu, S.; Zhu, G.; Ding, Y.; Pang, M. Multi-granularity classification of upper gastrointestinal endoscopic images. Neurocomputing 2025, 626, 129564.
- Mou, K.; Gao, S.; Deveci, M.; Kadry, S. Hyperspectral Image Classification Based on Dual Linear Latent Space Constrained Generative Adversarial Networks. Appl. Soft Comput. 2025, 112962.
- Shu, X.; Li, Z.; Tian, C.; Chang, X.; Yuan, D. An active learning model based on image similarity for skin lesion segmentation. Neurocomputing 2025, 630, 129690.
- Xu, Z.; Wang, H.; Yang, R.; Yang, Y.; Liu, W.; Lukasiewicz, T. Aggregated Mutual Learning between CNN and Transformer for semi-supervised medical image segmentation. Knowl. Based Syst. 2025, 311, 113005.
- Li, K.; Yuan, F.; Wang, C. An effective multi-scale interactive fusion network with hybrid Transformer and CNN for smoke image segmentation. Pattern Recognit. 2024, 159, 111177.
- Ding, W.; Huang, Z.; Huang, Z.; Tian, L.; Wang, H.; Feng, S. Designing efficient accelerator of depthwise separable convolutional neural network on FPGA. J. Syst. Archit. 2019, 97, 278–286.
- Yan, Y.; Ling, Y.; Huang, K.; Chen, G. An efficient real-time accelerator for high-accuracy DNN-based optical flow estimation in FPGA. J. Syst. Archit. 2022, 136, 102818.
- Karamimanesh, M.; Abiri, E.; Shahsavari, M.; Hassanli, K.; van Schaik, A.; Eshraghian, J. Spiking neural networks on FPGA: A survey of methodologies and recent advancements. Neural Netw. 2025, 186, 107256.
- Lin, Y.; Xie, Z.; Chen, T.; Cheng, X.; Wen, H. Image privacy protection scheme based on high-quality reconstruction DCT compression and nonlinear dynamics. Expert Syst. Appl. 2024, 257, 124891.
- Zhang, L.; Lin, Y.; Yang, X.; Chen, T.; Cheng, X.; Cheng, W. From Sample Poverty to Rich Feature Learning: A New Metric Learning Method for Few-Shot Classification. IEEE Access 2024, 12, 124990–125002.
- Rani, M.; Yadav, J.; Rathee, N.; Panjwani, B. Optifusion: Advancing visual intelligence in medical imaging through optimized CNN-TQWT fusion. Vis. Comput. 2024, 40, 7075–7092.
- Shang, J.; Zhang, K.; Zhang, Z.; Li, C.; Liu, H. A high-performance convolution block oriented accelerator for MBConv-Based CNNs. Integration 2022, 88, 298–312.
- Cittadini, E.; Marinoni, M.; Buttazzo, G. A hardware accelerator to support deep learning processor units in real-time image processing. Eng. Appl. Artif. Intell. 2025, 145, 110159.
- Liu, F.; Li, H.; Hu, W.; He, Y. Review of neural network model acceleration techniques based on FPGA platforms. Neurocomputing 2024, 610, 128511.
- Nandanwar, H.; Katarya, R. Deep learning enabled intrusion detection system for Industrial IOT environment. Expert Syst. Appl. 2024, 249, 123808.
- Wang, B.; Yu, D. Orthogonal Progressive Network for Few-shot Object Detection. Expert Syst. Appl. 2025, 264, 125905.
- Li, T.; Jiang, H.; Mo, H.; Han, J.; Liu, L.; Mao, Z.G. Approximate Processing Element Design and Analysis for the Implementation of CNN Accelerators. J. Comput. Sci. Technol. 2023, 38, 309–327.
- Chatterjee, S.; Pandit, S.; Das, A. Coupling of a lightweight model of reduced convolutional autoencoder with linear SVM classifier to detect brain tumours on FPGA. Expert Syst. Appl. 2025, 290, 128444.
- Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16), New York, NY, USA, 21–23 February 2016; pp. 26–35.
- Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W. Optimized Compression for Implementing Convolutional Neural Networks on FPGA. Electronics 2019, 8, 295.
- Luo, Y.; Cai, X.; Qi, J.; Guo, D.; Che, W. FPGA-accelerated CNN for real-time plant disease identification. Comput. Electron. Agric. 2023, 207, 107715.
- Kim, J.; Kang, J.K.; Kim, Y. A Low-Cost Fully Integer-Based CNN Accelerator on FPGA for Real-Time Traffic Sign Recognition. IEEE Access 2022, 10, 84626–84634.
- Shi, K.; Wang, M.; Tan, X.; Li, Q.; Lei, T. Efficient Dynamic Reconfigurable CNN Accelerator for Edge Intelligence Computing on FPGA. Information 2023, 14, 194.
- Bouguezzi, S.; Fredj, H.B.; Belabed, T.; Valderrama, C.; Faiedh, H.; Souani, C. An Efficient FPGA-Based Convolutional Neural Network for Classification: Ad-MobileNet. Electronics 2021, 10, 2272.
- Cai, L.; Wang, C.; Xu, Y. A Real-Time FPGA Accelerator Based on Winograd Algorithm for Underwater Object Detection. Electronics 2021, 10, 2889.
- Fuketa, H.; Katashita, T.; Hori, Y.; Hioki, M. Multiplication-Free Lookup-Based CNN Accelerator Using Residual Vector Quantization and Its FPGA Implementation. IEEE Access 2024, 12, 102470–102480.
- Xuan, L.; Un, K.F.; Lam, C.S.; Martins, R.P. An FPGA-Based Energy-Efficient Reconfigurable Depthwise Separable Convolution Accelerator for Image Recognition. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 4003–4007.
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 21–26 July 2017; pp. 1800–1807.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 18–22 June 2018; pp. 7132–7141.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 18–22 June 2018; pp. 4510–4520.
- Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA, 27 October–2 November 2019; pp. 1314–1324.
- Neris, R.; Rodríguez, A.; Guerra, R.; López, S.; Sarmiento, R. FPGA-Based Implementation of a CNN Architecture for the On-Board Processing of Very High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3740–3750.
- Zhang, Y.; Wang, H.; Pan, Z. An efficient CNN accelerator for pattern-compressed sparse neural networks on FPGA. Neurocomputing 2025, 611, 128700.
- Zhang, Y.; Jiang, H.; Liu, X.; Cao, H.; Du, Y. High-efficient MPSoC-based CNNs accelerator with optimized storage and dataflow. J. Supercomput. 2022, 78, 3205–3225.
- Guo, K.; Sui, L.; Qiu, J.; Yu, J.; Wang, J.; Yao, S.; Han, S.; Wang, Y.; Yang, H. Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 37, 35–47.
| Input Size | Input Channel | Expand Channel | Squeeze Channel | Output Size | Output Channel |
|---|---|---|---|---|---|
| 7 × 7 | 96 | 576 | 144 | 7 × 7 | 96 |
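As a rough back-of-the-envelope estimate only (multiply-accumulate counts, ignoring biases and activation functions; a 3 × 3 depthwise kernel and stride 1 are assumed, since the table does not state them), the configuration above implies on the order of a few million MACs per bottleneck:

```python
# Hypothetical MAC-count estimate for the 7x7, 96 -> 576 -> 96 bottleneck above
# (3x3 depthwise kernel and stride 1 assumed; SE squeeze width 144 as listed).
H = W = 7
c_in, c_exp, c_se, c_out = 96, 576, 144, 96
k = 3  # assumed depthwise kernel size

macs_expand    = H * W * c_in * c_exp          # 1x1 expand:      2,709,504
macs_depthwise = H * W * c_exp * k * k         # 3x3 depthwise:     254,016
macs_se        = 2 * c_exp * c_se              # two SE FC layers:  165,888
macs_project   = H * W * c_exp * c_out         # 1x1 project:     2,709,504

total = macs_expand + macs_depthwise + macs_se + macs_project
print(f"total MACs ~ {total:,}")  # ~5.8 million
```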
| | This Work (16-1-1-16) | [41] | [47] |
|---|---|---|---|
| Network | MobileNetV3-Small | MobileNetV2 | MobileNetV1-Lite |
| FPGA Model | Xilinx XC7Z020 | Xilinx VC709 | Xilinx XCKU040 |
| Quantization | 16b fixed | float | 16b fixed |
| Logic | 49,074/53,200 | 107,325/433,200 | 59,862/242,400 |
| DSP | 34/220 | 0/3600 | 603/1920 |
| BRAM | 124/140 | 13.7 Mb/51.7 Mb | 233/1200 |

| | [48] | [49] | [50] |
|---|---|---|---|
| Network | VGG-16 | VGG-16 | VGG-16 |
| FPGA Model | Xilinx ZCU102 | Xilinx ZCU102 | Xilinx XC7Z045 |
| Quantization | 8b fixed | 8b fixed | 16b fixed |
| Logic | 324 K (64%) | - | 182,616/218,600 |
| DSP | 528 (21%) | 1024/2520 | 780/900 |
| BRAM | 1108 (60%) | - | 486/545 |