Search Results (79)

Search Parameters:
Keywords = FPGA-based CNN accelerators

22 pages, 831 KB  
Article
Energy-Efficient Dual-Core RISC-V Architecture for Edge AI Acceleration with Dynamic MAC Unit Reuse
by Cristian Andy Tanase
Computers 2026, 15(4), 219; https://doi.org/10.3390/computers15040219 - 1 Apr 2026
Viewed by 359
Abstract
This paper presents a dual-core RISC-V architecture designed for energy-efficient AI acceleration at the edge, featuring dynamic MAC unit sharing, frequency scaling (DFS), and FIFO-based resource arbitration. The system comprises two RISC-V cores that compete for shared computational resources—a single Multiply–Accumulate (MAC) unit and a shared external memory subsystem—governed by a channel-based arbitration mechanism with CPU-priority semantics, while each core maintains private instruction and data caches. The architecture implements a tightly coupled Neural Processing Unit (NPU) with CONV, GEMM, and POOL operations that execute opportunistically in the background when the MAC unit is available. Dynamic frequency scaling (DFS) with three levels (100/200/400 MHz) is applied to the shared MAC unit, allowing the dynamic acceleration of CNN workloads. The arbitration mechanism uses SystemC sc_fifo channels with CPU-priority polling, ensuring that CPU execution is minimally impacted by background AI processing while the NPU makes progress during idle MAC slots. The NPU supports 3 × 3 convolutions, matrix multiplication (GEMM) with 10 × 10 tiles, and pooling operations. The implementation is cycle-accurate in SystemC, targeting FPGA deployment. Experimental evaluation demonstrates that the dual-core architecture achieves 1.87× speedup with 93.5% efficiency for parallel workloads, while DFS enables 70% power reduction at low frequency. The system successfully executes simultaneous CPU and AI workloads, with CPU-priority arbitration ensuring no CPU starvation under contention. The proposed design offers a practical solution for embedded AI applications requiring both general-purpose computation and neural network acceleration, validated through comprehensive SystemC simulation on modern FPGA platforms. Full article
(This article belongs to the Special Issue High-Performance Computing (HPC) and Computer Architecture)
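As a rough illustration of the arbitration policy this abstract describes, the sketch below models CPU-priority slot granting for a shared MAC unit in plain Python. The paper itself implements this with SystemC `sc_fifo` channels; the request names and slot model here are hypothetical.

```python
from collections import deque

def arbitrate(cpu_requests, npu_requests, n_slots):
    """Grant each MAC slot to a pending CPU request first; the NPU only
    makes progress in slots the CPU leaves idle (CPU-priority polling)."""
    cpu_q, npu_q = deque(cpu_requests), deque(npu_requests)
    schedule = []
    for _ in range(n_slots):
        if cpu_q:
            schedule.append(("CPU", cpu_q.popleft()))
        elif npu_q:
            schedule.append(("NPU", npu_q.popleft()))
        else:
            schedule.append(("IDLE", None))
    return schedule

# Both CPU MAC requests run before any background NPU work starts.
sched = arbitrate(["mac0", "mac1"], ["conv_a", "conv_b", "conv_c"], 6)
```

Under this policy the CPU is never starved by background AI work, which is the no-starvation property the abstract's evaluation highlights.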

20 pages, 1680 KB  
Article
Efficient Inference of Neural Networks with Cooperative Integer-Only Arithmetic on a SoC FPGA for Onboard LEO Satellite Network Routing
by Bogeun Jo, Heoncheol Lee, Bongsoo Roh and Myonghun Han
Aerospace 2026, 13(3), 277; https://doi.org/10.3390/aerospace13030277 - 16 Mar 2026
Viewed by 236
Abstract
Low Earth orbit (LEO) satellite networks require real-time routing to cope with dynamic topology variations caused by continuous orbital motion. As an alternative to conventional routing approaches, deep reinforcement learning (DRL) has recently gained attention as an effective means for optimizing routing paths. To solve routing problems modeled as a grid-based Markov decision process (grid-based MDP), DRL methods such as CNN-based Dueling DQN have been proposed. However, these approaches are difficult to implement in practice. In particular, the substantial floating-point computation and memory traffic of CNN inference make real-time onboard inference challenging under the stringent power and resource constraints of satellite platforms. To address these constraints, this paper proposes an INT8 quantization and hardware–software co-design framework using heterogeneous SoC FPGA acceleration. We offload compute-intensive CNN inference to the programmable logic (PL), while the processing system (PS) orchestrates overall control and data movement, forming a collaborative PS–PL architecture. Furthermore, we integrate the NITI-style two-pass scaling with PS–PL exponent propagation to preserve end-to-end integer consistency without floating-point conversion. To demonstrate its practical onboard feasibility, we employ standard accelerator implementation choices—such as output-stationary scheduling and on-chip prefetching—and conduct an ablation study over independently tunable axes (PE array size and PS-side buffer reuse) to quantify their incremental contributions. Experimental results show that the proposed PS–PL cooperative scheme dramatically reduces computation time compared to a PS-only reference implementation on the same platform. Full article
(This article belongs to the Section Astronautics & Space Science)
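The "integer-only" inference this abstract emphasizes comes down to accumulating INT8 products in INT32 and rescaling with an integer shift instead of a floating-point multiply. A minimal NumPy sketch follows; the shift value and matrices are illustrative, not the paper's full NITI-style two-pass scheme.

```python
import numpy as np

def int8_matmul_requant(a_q, b_q, shift):
    """INT8 x INT8 -> INT32 accumulate, then rescale back to INT8 with an
    arithmetic right shift (power-of-two scale), keeping the whole
    pipeline integer-only."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    out = np.right_shift(acc, shift)  # floor division by 2**shift
    return np.clip(out, -128, 127).astype(np.int8)

a = np.array([[10, -20], [30, 40]], dtype=np.int8)
b = np.array([[1, 2], [3, -4]], dtype=np.int8)
y = int8_matmul_requant(a, b, shift=4)
```

Because the scale is a power of two, only a small exponent needs to travel between PS and PL, which is the kind of exponent propagation the abstract refers to.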

27 pages, 12041 KB  
Article
FPGA-Based CNN Acceleration on Zynq-7020 for Embedded Ship Recognition in Unmanned Surface Vehicles
by Abdelilah Haijoub, Aissam Bekkari, Anas Hatim, Mounir Arioua, Mohamed Nabil Srifi and Antonio Guerrero-Gonzalez
Sensors 2026, 26(5), 1626; https://doi.org/10.3390/s26051626 - 5 Mar 2026
Viewed by 444
Abstract
Unmanned surface vehicles (USVs) increasingly rely on vision-based perception for safe navigation and maritime surveillance, while onboard computing is constrained by strict size, weight, and power (SWaP) budgets. Although deep convolutional neural networks (CNNs) offer strong recognition performance, their computational and memory requirements pose significant challenges for deployment on low-cost embedded platforms. This paper presents a hardware–software co-design architecture and deployment study for CNN acceleration on a heterogeneous ARM–FPGA system, targeting energy-efficient near-sensor processing for embedded maritime applications. The proposed approach exploits a fully streaming hardware architecture in the FPGA fabric, based on line-buffered convolutions and AXI-Stream dataflow, while the ARM processing system is responsible for lightweight configuration, scheduling, and data movement. The architecture was evaluated using representative CNN models trained on a maritime ship dataset. Our experimental results on a Zynq-7020 system-on-chip demonstrate that the proposed co-design strategy achieves a balanced trade-off between throughput, resource utilisation, and power consumption under tight embedded constraints, highlighting its suitability as a practical building block for onboard perception in USVs. Full article
(This article belongs to the Section Vehicular Sensing)
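The line-buffered convolution mentioned above can be sketched as follows: rows arrive one at a time and only a three-row buffer is ever resident, which is what lets the fabric process a stream without storing whole frames. This is a simplified single-channel model, not the paper's AXI-Stream design.

```python
import numpy as np

def stream_conv3x3(image, kernel):
    """3x3 'valid' convolution (correlation form, as is conventional in
    CNNs) consuming the image one row at a time, holding only a 3-row
    line buffer, as a streaming FPGA datapath would."""
    h, w = image.shape
    line_buf, out_rows = [], []
    for row in image:                # rows arrive as a stream
        line_buf.append(row)
        if len(line_buf) > 3:
            line_buf.pop(0)          # oldest line retires
        if len(line_buf) == 3:
            window = np.stack(line_buf)   # 3 x w working set
            out_rows.append([np.sum(window[:, c:c + 3] * kernel)
                             for c in range(w - 2)])
    return np.array(out_rows)

img = np.arange(16, dtype=float).reshape(4, 4)
out = stream_conv3x3(img, np.ones((3, 3)))
```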

24 pages, 1430 KB  
Article
Lightweight CNN-CEM for Efficient Hyperspectral Target Detection on Resource-Constrained Edge Devices
by Teng Yun, Jinrong Yang, Fang Gao, Jiaoyang Xing, Jingyan Fang, Tong Zhu, Huaixi Zhu, Ran Zhou and Yikun Wang
Appl. Sci. 2026, 16(4), 1719; https://doi.org/10.3390/app16041719 - 9 Feb 2026
Viewed by 402
Abstract
Efficient target detection in hyperspectral images faces significant deployment challenges on resource-constrained edge platforms due to the large data volume and high computational complexity of detection algorithms. This paper proposes a CEM target detection method based on 1D-CNN feature dimensionality reduction. A lightweight 1D-CNN reduces spectral dimensions from L bands to 16 features, decreasing the core matrix inversion complexity from O(L3) to O(163). Unlike PCA-based dimensionality reduction requiring online eigenvalue decomposition, the proposed approach employs fixed pre-trained weights with simple convolution operations, enabling high parallelizability for FPGA implementation. A Zynq-based PS + PL collaborative acceleration scheme is designed, deploying CNN on the PL side through RTL implementation and CEM on the PS side using double-precision floating-point computation. Experimental validation on multiple hyperspectral datasets demonstrates that the proposed method achieves an AUC of 0.9953 with less than 1% difference compared to traditional CEM, processes 40,000 pixels in approximately 10.8 s, and consumes only 2.067 W, making it suitable for power-sensitive edge applications such as UAV reconnaissance and satellite on-board processing. The system achieves a processing rate of 3704 pixels/s. Full article
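The complexity argument in this abstract centers on the CEM filter w = R^-1 d / (d^T R^-1 d), whose dominant cost is inverting the L x L correlation matrix R. A small NumPy sketch with L already reduced to 16 features; the data is synthetic.

```python
import numpy as np

def cem_filter(X, d):
    """Constrained Energy Minimization: w = R^-1 d / (d^T R^-1 d), where
    R is the sample correlation matrix of the N x L pixel matrix X.
    Solving against R costs O(L^3), which is why shrinking L (to 16
    learned features in the paper) shrinks the dominant term."""
    R = X.T @ X / X.shape[0]
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d @ Rinv_d)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))   # 500 pixels, 16 reduced features
d = rng.normal(size=16)          # target spectral signature
w = cem_filter(X, d)
```

By construction the filter responds with unit gain to the target signature (w @ d == 1) while minimizing output energy over the background.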

19 pages, 2072 KB  
Article
A Reconfigurable CNN-2D Hardware Architecture for Real-Time Brain Cancer Multi-Classification on FPGA
by Ayoub Mhaouch, Wafa Gtifa, Ibtihel Nouira, Abdessalem Ben Abdelali and Mohsen Machhout
Algorithms 2026, 19(2), 107; https://doi.org/10.3390/a19020107 - 1 Feb 2026
Viewed by 571
Abstract
Brain cancer classification using deep learning has gained significant attention due to its potential to improve early diagnosis and treatment planning. In this work, we propose a reconfigurable and hardware-optimized CNN-2D architecture implemented on FPGA for multiclass classification of brain tumors from MRI images. The contribution of this study lies in the development of a lightweight CNN model and a modular hardware design, where three key IP cores (Conv2D, MaxPooling, and ReLU) are architected with parameterizable kernels, efficient dataflow, and optimized memory reuse to support real-time processing on resource-constrained platforms. These IPs are iteratively reconfigured to process each CNN layer, enabling flexibility while maintaining low latency. To evaluate the proposed architecture, we first implement the model in software on a Dual-Core Cortex-A9 processor and then deploy the hardware-accelerated version on an XC7Z020 FPGA. Performance is assessed in terms of execution time, power consumption, and classification accuracy. The FPGA implementation achieves a 93.21% reduction in latency and a 67.5% reduction in power consumption, while maintaining a competitive accuracy of 96.09% compared with 98.43% for the software version. These results demonstrate that the proposed reconfigurable FPGA-based architecture offers a strong balance between accuracy, real-time performance, and energy efficiency, making it highly suitable for embedded brain tumor classification systems. Full article
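The iterative reuse of three small IP cores can be mimicked in software: one convolution, one pooling, and one activation routine are re-invoked per layer with fresh parameters instead of instantiating a block per layer. A toy single-channel sketch; the layer list and sizes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def maxpool2(x):
    """2x2 max pooling, stride 2 (trailing odd row/column dropped)."""
    h, w = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def conv2d_valid(x, k):
    """'Valid' 2D convolution (correlation form)."""
    kh, kw = k.shape
    h, w = x.shape
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k)
                      for j in range(w - kw + 1)] for i in range(h - kh + 1)])

# The same three "IP cores" are re-invoked layer by layer with new
# parameters, mirroring the paper's iterative reconfiguration.
layers = [("conv", np.ones((3, 3))), ("relu", None), ("pool", None)]
x = np.arange(36, dtype=float).reshape(6, 6)
for op, param in layers:
    if op == "conv":
        x = conv2d_valid(x, param)
    elif op == "relu":
        x = relu(x)
    elif op == "pool":
        x = maxpool2(x)
```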

19 pages, 1607 KB  
Article
Real-Time Bird Audio Detection with a CNN-RNN Model on a SoC-FPGA
by Rodrigo Lopes da Silva, Gustavo Jacinto, Mário Véstias and Rui Policarpo Duarte
Electronics 2026, 15(2), 354; https://doi.org/10.3390/electronics15020354 - 13 Jan 2026
Viewed by 708
Abstract
Monitoring wildlife has become increasingly important for understanding the evolution of species and ecosystem health. Acoustic monitoring offers several advantages over video-based approaches, enabling continuous 24/7 observation and robust detection under challenging environmental conditions. Deep learning models have demonstrated strong performance in audio classification. However, their computational complexity poses significant challenges for deployment on low-power embedded platforms. This paper presents a low-power embedded system for real-time bird audio detection. A hybrid CNN–RNN architecture is adopted, redesigned, and quantized to significantly reduce model complexity while preserving classification accuracy. To support efficient execution, a custom hardware accelerator was developed and integrated into a Zynq UltraScale+ ZU3CG FPGA. The proposed system achieves an accuracy of 87.4%, processes up to 5 audio samples per second, and operates at only 1.4 W, demonstrating its suitability for autonomous, energy-efficient wildlife monitoring applications. Full article

27 pages, 3492 KB  
Article
Filter-Wise Mask Pruning and FPGA Acceleration for Object Classification and Detection
by Wenjing He, Shaohui Mei, Jian Hu, Lingling Ma, Shiqi Hao and Zhihan Lv
Remote Sens. 2025, 17(21), 3582; https://doi.org/10.3390/rs17213582 - 29 Oct 2025
Cited by 3 | Viewed by 1162
Abstract
Pruning and acceleration have become an essential and promising technique for convolutional neural networks (CNNs) in remote sensing image processing, especially for deployment on resource-constrained devices. However, maintaining model accuracy while achieving satisfactory acceleration remains a challenging and valuable problem. To break this limitation, we introduce a novel pruning pattern of filter-wise mask by enforcing extra filter-wise structural constraints on pattern-based pruning, which achieves the benefits of both unstructured and structured pruning. The newly introduced filter-wise mask enhances fine-grained sparsity with more hardware-friendly regularity. We further design an acceleration architecture with optimization of calculation parallelism and memory access, aiming to fully translate weight pruning to hardware performance gain. The proposed pruning method is first proven on classification networks. The pruning rate reaches 75.1% for VGG-16 and 84.6% for ResNet-50 without accuracy compromise. We then apply our method to the widely used object detection model, the you only look once (YOLO) CNN. On the aerial image dataset, the pruned YOLOv5s achieves a pruning rate of 53.43% with a slight accuracy degradation of 0.6%. Meanwhile, we implement the acceleration architecture on a field-programmable gate array (FPGA) to evaluate its practical execution performance. The throughput reaches up to 809.46 MOPS. The pruned network achieves a speedup of 2.23× and 4.4×, with a compression rate of 2.25× and 4.5×, respectively, converting the model compression to execution speedup effectively. The proposed pruning and acceleration approach provides crucial technology to facilitate the application of remote sensing with CNN, especially in scenarios such as on-board real-time processing, emergency response, and low-cost monitoring. Full article
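The filter-wise mask idea (pattern-based pruning plus a per-filter structural constraint) can be illustrated as below: every kernel in an output filter shares one sparsity pattern, so hardware can skip the same weight positions for a whole filter. The magnitude-based pattern selection here is a simplification, not the paper's method.

```python
import numpy as np

def filterwise_pattern_prune(weights, n_keep=4):
    """For each output filter, pick ONE 3x3 pattern (the n_keep positions
    with the largest summed magnitude across that filter's input channels)
    and apply it to every kernel of the filter. Sharing one pattern per
    filter is the 'filter-wise' regularity that makes the fine-grained
    sparsity hardware-friendly."""
    out_c, in_c, kh, kw = weights.shape
    pruned = np.zeros_like(weights)
    for f in range(out_c):
        score = np.abs(weights[f]).sum(axis=0).ravel()  # 9 position scores
        keep = np.argsort(score)[-n_keep:]
        mask = np.zeros(kh * kw)
        mask[keep] = 1
        pruned[f] = weights[f] * mask.reshape(kh, kw)
    return pruned

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 4, 3, 3))   # 8 filters, 4 input channels
wp = filterwise_pattern_prune(w)
```

Keeping 4 of 9 positions per kernel gives the unstructured-style sparsity ratio, while the shared pattern gives the structured-style regularity the abstract claims.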

19 pages, 819 KB  
Article
Efficient CNN Accelerator Based on Low-End FPGA with Optimized Depthwise Separable Convolutions and Squeeze-and-Excite Modules
by Jiahe Shen, Xiyuan Cheng, Xinyu Yang, Lei Zhang, Wenbin Cheng and Yiting Lin
AI 2025, 6(10), 244; https://doi.org/10.3390/ai6100244 - 1 Oct 2025
Cited by 13 | Viewed by 2722
Abstract
With the rapid development of artificial intelligence technology in the field of intelligent manufacturing, convolutional neural networks (CNNs) have shown excellent performance and generalization capabilities in industrial applications. However, the huge computational and resource requirements of CNNs pose major obstacles to their deployment on low-end hardware platforms. To address this issue, this paper proposes a scalable CNN accelerator that can operate on low-performance Field-Programmable Gate Arrays (FPGAs), aimed at tackling the challenge of efficiently running complex neural network models on resource-constrained hardware platforms. This study specifically optimizes depthwise separable convolution and the squeeze-and-excite module to improve their computational efficiency. The proposed accelerator allows for the flexible adjustment of hardware resource consumption and computational speed through configurable parameters, making it adaptable to FPGAs with varying performance and different application requirements. By fully exploiting the characteristics of depthwise separable convolution, the accelerator optimizes the convolution computation process, enabling flexible and independent module stacking at different stages of computation. This results in an optimized balance between hardware resource consumption and computation time. Compared to ARM CPUs, the proposed approach yields at least a 1.47× performance improvement, and compared to other FPGA solutions, it uses over 90% fewer Digital Signal Processor (DSP) blocks. Additionally, the optimized computational flow significantly reduces the accelerator’s reliance on internal caches, minimizing data latency and further improving overall processing efficiency. Full article
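The efficiency of depthwise separable convolution, which this accelerator is built around, comes from splitting a k x k convolution into a per-channel (depthwise) stage and a 1 x 1 (pointwise) stage. A quick MAC count for a representative layer shape; the dimensions are illustrative.

```python
def conv_macs(h, w, cin, cout, k):
    """MACs for a standard k x k convolution (stride 1, 'same' output size)."""
    return h * w * cin * cout * k * k

def dws_macs(h, w, cin, cout, k):
    """Depthwise (k x k per input channel) plus pointwise (1 x 1) MACs."""
    return h * w * cin * k * k + h * w * cin * cout

# A 56x56 feature map, 64 -> 128 channels, 3x3 kernels.
std = conv_macs(56, 56, 64, 128, 3)
dws = dws_macs(56, 56, 64, 128, 3)
ratio = std / dws   # theoretical saving: 1 / (1/cout + 1/k^2)
```

For these dimensions the separable form needs roughly 8.4x fewer MACs, which is the headroom the accelerator converts into DSP savings.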

19 pages, 6410 KB  
Article
Optimized FPGA Architecture for CNN-Driven Subsurface Geotechnical Defect Detection
by Xiangyu Li, Linjian Che, Shunjiong Li, Zidong Wang and Wugang Lai
Electronics 2025, 14(13), 2585; https://doi.org/10.3390/electronics14132585 - 26 Jun 2025
Cited by 1 | Viewed by 1193
Abstract
Convolutional neural networks (CNNs) are widely used in geotechnical engineering. Real-time detection in complex geological environments, combined with the strict power constraints of embedded devices, makes Field-Programmable Gate Array (FPGA) platforms ideal for accelerating CNNs. Conventional parallelization strategies in FPGA-based accelerators often result in imbalanced resource utilization and computational inefficiency due to varying kernel sizes. To address this issue, we propose a customized heterogeneous hybrid parallel strategy and refine the bit-splitting approach for Digital Signal Processor (DSP) resources, improving timing performance and reducing Look-Up Table (LUT) consumption. Using this strategy, we deploy the lightweight YOLOv5n network on an FPGA platform, creating a high-speed, low-power subsurface geotechnical defect-detection system. A layer-wise quantization strategy reduces the model size with negligible mean average precision (mAP) loss. Operating at 300 MHz, the system reduces LUT usage by 33%, achieves a peak throughput of 328.25 GOPs in convolutional layers, and an overall throughput of 157.04 GOPs, with a power consumption of 9.4 W and energy efficiency of 16.7 GOPs/W. This implementation demonstrates more balanced performance improvements than existing solutions. Full article

12 pages, 870 KB  
Article
An Improved Strategy for Data Layout in Convolution Operations on FPGA-Based Multi-Memory Accelerators
by Yongchang Wang and Hongzhi Zhao
Electronics 2025, 14(11), 2127; https://doi.org/10.3390/electronics14112127 - 23 May 2025
Cited by 2 | Viewed by 1571
Abstract
Convolutional Neural Networks (CNNs) are fundamental to modern AI applications but often suffer from significant memory bottlenecks due to non-contiguous access patterns during convolution operations. Although previous work has optimized data layouts at the software level, hardware-level solutions for multi-memory accelerators remain underexplored. In this paper, we propose a hardware-level approach to mitigate memory row conflicts in FPGA-based CNN accelerators. Specifically, we introduce a dynamic DDR controller generated using Vivado 2019.1, which optimizes feature map allocation across memory banks and operates in conjunction with a multi-memory architecture to enable parallel access. Our method reduces row conflicts by up to 21% and improves throughput by 17% on the KCU1500 FPGA, with validation across YOLOv2, VGG16, and AlexNet. The key innovation lies in the layer-specific address mapping strategy and hardware-software co-design, providing a scalable and efficient solution for CNN inference across both edge and cloud platforms. Full article
(This article belongs to the Special Issue FPGA-Based Reconfigurable Embedded Systems)
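The row-conflict problem this abstract targets can be seen with a toy address map: if the tiles a convolution touches back-to-back fall in the same DRAM bank, their accesses serialize; rotating banks at a finer stride lets them overlap. Both mappings below are illustrative, not the paper's Vivado-generated controller.

```python
def bank_of(addr, n_banks=4, row_bytes=1024):
    """Naive mapping: the bank changes only once per DRAM row, so
    consecutive feature-map tiles pile into one bank and conflict."""
    return (addr // row_bytes) % n_banks

def interleaved_bank_of(addr, n_banks=4, stride=256):
    """Layer-aware interleaving: rotate banks every `stride` bytes so
    back-to-back tile accesses land in different banks and can proceed
    in parallel across the multi-memory architecture."""
    return (addr // stride) % n_banks

# Four consecutive 256-byte tile accesses.
addrs = [0, 256, 512, 768]
naive = [bank_of(a) for a in addrs]
interleaved = [interleaved_bank_of(a) for a in addrs]
```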

22 pages, 3160 KB  
Article
HE-BiDet: A Hardware Efficient Binary Neural Network Accelerator for Object Detection in SAR Images
by Dezheng Zhang, Zehan Liang, Rui Cen, Zhihong Yan, Rui Wan and Dong Wang
Micromachines 2025, 16(5), 549; https://doi.org/10.3390/mi16050549 - 30 Apr 2025
Cited by 1 | Viewed by 1638
Abstract
Convolutional Neural Network (CNN)-based Synthetic Aperture Radar (SAR) target detection eliminates manual feature engineering and improves robustness but suffers from high computational costs, hindering on-satellite deployment. To address this, we propose HE-BiDet, an ultra-lightweight Binary Neural Network (BNN) framework co-designed with hardware acceleration. First, we develop an ultra-lightweight SAR ship detection model. Second, we design a BNN accelerator leveraging four directions of parallelism and an on-chip data buffer with optimized addressing to feed the computing array efficiently. To accelerate post-processing, we introduce a hardware-based threshold filter to eliminate redundant anchor boxes early and a dedicated Non-Maximum Suppression (NMS) unit. Evaluated on SAR-Ship, AirSAR-Ship 2.0, and SSDD, our model achieves 91.3%, 71.0%, and 92.7% accuracy, respectively. Implemented on a Xilinx Virtex-XC7VX690T FPGA, the system achieves 189.3 FPS, demonstrating real-time capability for spaceborne deployment. Full article
(This article belongs to the Section E: Engineering and Technology)
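In a BNN like HE-BiDet, a dot product over +1/-1 values reduces to XNOR plus popcount, which is why binary accelerators are so cheap in FPGA fabric. A small sketch of the identity dot = 2 * popcount(XNOR(a, b)) - n; the bit packing is illustrative.

```python
def bin_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors packed as bit words
    (bit set = +1): dot = 2 * popcount(~(a ^ b) & mask) - n."""
    mask = (1 << n) - 1
    same = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where signs agree
    return 2 * bin(same).count("1") - n

def to_pm1(bits, n):
    """Decode an n-bit word into a +1/-1 vector (bit i -> element i)."""
    return [1 if (bits >> i) & 1 else -1 for i in range(n)]

# Cross-check the bitwise trick against plain +1/-1 arithmetic.
x = bin_dot(0b1011, 0b0011, 4)
ref = sum(p * q for p, q in zip(to_pm1(0b1011, 4), to_pm1(0b0011, 4)))
```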

18 pages, 5095 KB  
Article
FPGA-Based Low-Power High-Performance CNN Accelerator Integrating DIST for Rice Leaf Disease Classification
by Jingwen Zheng, Zefei Lv, Dayang Li, Chengbo Lu, Yang Zhang, Liangzun Fu, Xiwei Huang, Jiye Huang, Dongmei Chen and Jingcheng Zhang
Electronics 2025, 14(9), 1704; https://doi.org/10.3390/electronics14091704 - 22 Apr 2025
Cited by 6 | Viewed by 3830
Abstract
Agricultural pest and disease monitoring has recently become a crucial aspect of modern agriculture. Toward this end, this study investigates methodologies for implementing low-power, high-performance convolutional neural networks (CNNs) on agricultural edge detection devices. Recognizing the potential of field-programmable gate arrays (FPGAs) to enhance inference parallelism, we leveraged their computational capabilities and intensive storage to propose an embedded FPGA-based CNN accelerator design aimed at optimizing rice leaf disease image classification. Additionally, we trained the MobileNetV2 network using multimodal image data and employed knowledge distillation from a stronger teacher (DIST) as the hardware benchmark. The solution was deployed on the ZYNQ-AC7Z020 hardware platform using High-Level Synthesis (HLS) design tools. Through a combination of fine-grained pipelining, matrix blocking, and linear buffering optimizations, the proposed system achieved a power consumption of 3.21 W, an accuracy of 97.41%, and an inference speed of 43 ms per frame, making it a practical solution for edge-based rice leaf disease classification. Full article

27 pages, 6389 KB  
Article
FPGA-Accelerated Lightweight CNN in Forest Fire Recognition
by Youming Zha and Xiang Cai
Forests 2025, 16(4), 698; https://doi.org/10.3390/f16040698 - 18 Apr 2025
Cited by 1 | Viewed by 1537
Abstract
Using convolutional neural networks (CNNs) to recognize forest fires in complex outdoor environments is a hot research direction in the field of intelligent forest fire recognition. Due to the storage-intensive and computing-intensive characteristics of CNN algorithms, it is difficult to implement them at edge terminals with limited memory and computing resources. This paper uses an FPGA (Field-Programmable Gate Array) to accelerate a CNN for forest fire recognition in the field environment, addressing the difficulty of achieving both accuracy and speed when deploying a forest fire recognition network on edge terminal equipment. First, a simple seven-layer lightweight network, LightFireNet, is designed. The network is compressed using a knowledge distillation method, with the classical network ResNet50 as the teacher network supervising the learning of LightFireNet, so that its accuracy reaches 97.60%. Compared with ResNet50, the scale of LightFireNet is significantly reduced: it has 24 K model parameters and 9.11 M operations, which are 0.1% and 1.2% of ResNet50, respectively. Second, the hardware acceleration circuit of LightFireNet is designed and implemented on the FPGA development board ZYNQ Z7-Lite 7020. To further compress the network and speed up the forest fire recognition circuit, three methods are used to optimize the circuit: (1) the network convolution layers adopt a depthwise separable convolution structure; (2) the BN (batch normalization) layer is fused with the preceding convolution (or fully connected) layer; (3) half-float or ap_fixed<16,6>-type data is used to represent feature data and model parameters. After the circuit function is realized, the LightFireNet terminal circuit is obtained through circuit parallel optimization using loop tiling, ping-pong operation, and multi-channel data transmission. Finally, it is verified on the test dataset that the accuracy of forest fire recognition on the LightFireNet FPGA edge terminal is 96.70%, the recognition speed is 64 ms per frame, and the power consumption is 2.23 W. These results show that a low-power, high-accuracy, and fast forest fire recognition terminal has been realized, which can be applied to forest fire monitoring. Full article
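Fusing the BN layer into the layer before it, as listed among the circuit optimizations above, is a standard inference-time rewrite: the BN scale and shift are folded into the preceding layer's weights and bias so the BN computation disappears. A NumPy sketch with a fully connected layer standing in for the convolution; all tensors are synthetic.

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into the preceding layer so that
    BN(w @ x + b) == w_f @ x + b_f for every input x:
      scale = gamma / sqrt(var + eps)
      w_f = scale * w,  b_f = scale * (b - mean) + beta."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None], (b - mean) * scale + beta

rng = np.random.default_rng(2)
w, b = rng.normal(size=(3, 5)), rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mean, var = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)
x = rng.normal(size=5)

w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
y_bn = gamma * (w @ x + b - mean) / np.sqrt(var + 1e-5) + beta
y_fold = w_f @ x + b_f
```

At inference time the folded weights are computed once offline, so the hardware never sees a separate BN stage.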

20 pages, 732 KB  
Article
VCONV: A Convolutional Neural Network Accelerator for FPGAs
by Srikanth Neelam and A. Amalin Prince
Electronics 2025, 14(4), 657; https://doi.org/10.3390/electronics14040657 - 8 Feb 2025
Cited by 5 | Viewed by 3484
Abstract
Field Programmable Gate Arrays (FPGAs), with their wide portfolio of configurable resources such as Look-Up Tables (LUTs), Block Random Access Memory (BRAM), and Digital Signal Processing (DSP) blocks, are the best option for custom hardware designs. Their low power consumption and cost-effectiveness give them an advantage over Graphics Processing Units (GPUs) and Central Processing Units (CPUs) in providing efficient accelerator solutions for compute-intensive Convolutional Neural Network (CNN) models. CNN accelerators are dedicated hardware modules capable of performing compute operations such as convolution, activation, normalization, and pooling with minimal intervention from a host. Designing accelerators for deeper CNN models requires FPGAs with vast resources, which impact its advantages in terms of power and price. In this paper, we propose the VCONV Intellectual Property (IP), an efficient and scalable CNN accelerator architecture for applications where power and cost are constraints. VCONV, with its configurable design, can be deployed across multiple smaller FPGAs instead of a single large FPGA to provide better control over cost and parallel processing. VCONV can be deployed across heterogeneous FPGAs, depending on the performance requirements of each layer. The IP’s performance can be evaluated using embedded monitors to ensure that the accelerator is configured to achieve the best performance. VCONV can be configured for data type format, convolution engine (CE) and convolution unit (CU) configurations, as well as the sequence of operations based on the CNN model and layer. VCONV can be interfaced through the Advanced Peripheral Bus (APB) for configuration and the Advanced eXtensible Interface (AXI) stream for data transfers. The IP was implemented and validated on the Avnet Zedboard and tested on the first layer of AlexNet, VGG16, and ResNet18 with multiple CE configurations, demonstrating 100% performance from MAC units with no idle time. 
We also synthesized multiple VCONV instances required for AlexNet, achieving the lowest BRAM utilization of just 1.64 Mb and delivering a performance of 56 GOPs. Full article
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 3rd Edition)

13 pages, 2045 KB  
Article
A Hardware Accelerator for Real-Time Processing Platforms Used in Synthetic Aperture Radar Target Detection Tasks
by Yue Zhang, Yunshan Tang, Yue Cao and Zhongjun Yu
Micromachines 2025, 16(2), 193; https://doi.org/10.3390/mi16020193 - 7 Feb 2025
Cited by 1 | Viewed by 1845
Abstract
The deep learning object detection algorithm has been widely applied in the field of synthetic aperture radar (SAR). By utilizing deep convolutional neural networks (CNNs) and other techniques, these algorithms can effectively identify and locate targets in SAR images, thereby improving the accuracy and efficiency of detection. In recent years, achieving real-time monitoring of regions has become a pressing need, leading to the direct completion of real-time SAR image target detection on airborne or satellite-borne real-time processing platforms. However, current GPU-based real-time processing platforms struggle to meet the power consumption requirements of airborne or satellite applications. To address this issue, a low-power, low-latency deep learning SAR object detection accelerator was designed in this study to enable real-time target detection on airborne and satellite SAR platforms. This accelerator proposes a Process Engine (PE) suitable for multidimensional convolution parallel computing, making full use of Field-Programmable Gate Array (FPGA) computing resources to reduce convolution computing time. Furthermore, a unique memory arrangement design based on this PE aims to enhance memory read/write efficiency, while dataflow patterns suitable for FPGA computing are applied to the accelerator to reduce computation latency. Our experimental results demonstrate that deploying the Yolov5s-based SAR object detection algorithm on this accelerator, mounted on a Virtex 7 690t chip, consumes only 7 W of dynamic power while detecting 52.19 SAR images (512 × 512 pixels) per second. Full article
(This article belongs to the Section E: Engineering and Technology)
