Search Results (248)

Search Parameters:
Keywords = FPGA acceleration architecture

27 pages, 4977 KB  
Article
SpChipADF: An Architecture Design Framework for Radar Signal Processing Hardware Accelerators
by Huan Wang, Shu Yang, Zhen Chen, Haoyu Sun, Yang Shen, Hang Li, Zhiyu Jiang, Yanlei Li and Xingdong Liang
Micromachines 2026, 17(5), 535; https://doi.org/10.3390/mi17050535 - 27 Apr 2026
Viewed by 131
Abstract
Lightweight Unmanned Aerial Vehicles (UAVs) have limited space, low payload capacity, and constrained power supply capabilities, so their payloads are subject to strict size, weight, and power (SWaP) constraints. Designing edge-side signal processing architectures for UAV payloads therefore faces severe challenges. Traditional ASIC design based on manual optimization struggles to meet the demands of low latency and low resource occupancy in edge-side applications. To address this challenge, this paper proposes a signal processing hardware accelerator architecture design framework based on algorithm-hardware co-design. The framework employs a cross-level dataflow graph representation to formally capture task characteristics. Reconfigurable dataflow templates and reusable operator IP components are systematically constructed on top of this representation. Through multi-objective design space exploration, the framework achieves Pareto-optimal mapping from algorithmic specifications to hardware implementations. Finally, automatic generation of top-level hardware descriptions enables rapid FPGA-based prototyping and functional validation. Taking synthetic aperture radar (SAR) imaging as a case study, the scheme reduces the equivalent gate count by 51.4% compared with non-reconfigurable architectures, without increasing processing latency. Compared with a conventional reconfigurable dataflow architecture, the design improves energy efficiency from 12.8 MS/J to 16.0 MS/J, a 25.4% enhancement, while scaling the supported data processing size by a factor of 4×. It provides a high-performance and scalable hardware acceleration solution for lightweight edge-side computing platforms. Full article
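The multi-objective design space exploration this abstract describes reduces, at its core, to retaining the non-dominated (Pareto-optimal) design points. A minimal sketch of that filter in Python, assuming just two minimization objectives (say, area and latency) held as tuples; the paper's actual DSE operates over richer cost models:

```python
def pareto_front(points):
    """Keep the non-dominated (obj1, obj2) design points, both objectives
    minimized. A point is dropped if some other point is at least as good
    in both objectives and different from it."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# e.g. (area, latency) candidates from a DSE sweep (illustrative numbers)
candidates = [(1, 5), (2, 4), (3, 3), (2, 6), (4, 4)]
print(pareto_front(candidates))  # the trade-off curve survives
```

The quadratic scan is fine for the dozens-to-thousands of candidates a template-based DSE typically enumerates; sort-based O(n log n) variants exist for larger spaces.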
34 pages, 5833 KB  
Article
High-Level Synthesis-Based FPGA Hardware Accelerator for Generalized Hebbian Learning Algorithm for Neuromorphic Computing
by Shivani Sharma and Darshika G. Perera
Electronics 2026, 15(8), 1725; https://doi.org/10.3390/electronics15081725 - 18 Apr 2026
Viewed by 669
Abstract
With the advent of AI and the smart systems era, neuromorphic computing will be imperative to support next-generation AI-related applications. Existing intelligent systems (such as smart cities and robotics) face many challenges and requirements, including high performance, adaptability, scalability, dynamic decision-making, and low power. Neuromorphic computing is emerging as a complementary solution to address these challenges and requirements of next-gen intelligent systems: it offers traits such as adaptive, low-power, scalable, parallel computing that satisfy the requirements of future intelligent systems. Innovative solutions (in terms of models, architectures, and techniques) are needed to overcome the several challenges hindering the advancement of neuromorphic computing. In this research work, we introduce a novel and efficient FPGA-HLS-based hardware accelerator for the Generalized Hebbian Algorithm (GHA) for neuromorphic computing applications. We focus on GHA because it enables online and incremental learning and provides a hardware-efficient unsupervised learning framework that aligns closely with the principles of biological adaptation, traits that are vital for neuromorphic computing applications. In addition, our previous work showed that FPGAs offer many features, such as low power, customized circuits, parallel computing capabilities, low latency, and especially an adaptive nature, which make them suitable for neuromorphic computing applications. We propose two hardware versions of the FPGA-HLS-based GHA accelerator: one based on a memory-mapped interface and one based on a streaming interface. Our streaming interface-based GHA hardware IP achieves up to 51.13× speedup compared to its embedded software counterpart, while maintaining the small area and low power requirements of neuromorphic computing applications. Our experimental results show great potential in utilizing FPGA-based architectures to support neuromorphic computing applications. Full article
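The Generalized Hebbian Algorithm accelerated here is Sanger's rule for online principal component extraction. A minimal NumPy sketch of one update step, as an illustration of the algorithm itself rather than of the paper's HLS hardware (the learning rate and shapes are illustrative):

```python
import numpy as np

def gha_update(W, x, lr=0.01):
    """One Generalized Hebbian (Sanger's rule) update.
    W: (k, d) weight matrix whose rows converge to the top-k principal
    components; x: (d,) input sample."""
    y = W @ x                          # neuron outputs
    # Sanger's rule: dW = lr * (y x^T - LT(y y^T) W),
    # where LT(.) keeps the lower-triangular part (output ordering).
    LT = np.tril(np.outer(y, y))
    W += lr * (np.outer(y, x) - LT @ W)
    return W
```

For a single output neuron the lower-triangular term collapses and the step reduces to Oja's rule; iterating over a data stream drives the rows of `W` toward the leading eigenvectors of the input covariance.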
16 pages, 7078 KB  
Article
FPGA Implementation of a Radar-Based Fall Detection System Using Binarized Convolutional Neural Networks
by Hyeongwon Cho, Soongyu Kang and Yunho Jung
Sensors 2026, 26(8), 2469; https://doi.org/10.3390/s26082469 - 17 Apr 2026
Viewed by 277
Abstract
As the number of elderly individuals living alone increases, the risk of fall-related accidents correspondingly rises, underscoring the need for rapid fall detection systems. Because falls are difficult to predict in terms of location, detection systems must be deployed in a distributed manner, which in turn requires compact and low-power implementations. Unlike camera sensors, radar sensors do not raise privacy concerns and are not limited by line-of-sight constraints. Moreover, compared with wearable sensors, radar enables continuous monitoring without user intervention. However, prior radar-based approaches incur high computational complexity, leading to increased power consumption and larger hardware area, thereby necessitating efficient hardware design. This paper proposes a lightweight fall detection system based on continuous-wave (CW) radar and a binarized convolutional neural network (BCNN). Radar signals are preprocessed using short-time Fourier transform (STFT) to generate binary spectrograms, which are then fed into a BCNN-based classification network. The proposed system performs binary classification of five fall activities and seven non-fall activities with an accuracy of 96.1%. The preprocessing module and classification network were implemented as hardware accelerators and integrated with a microprocessor in a system-on-chip (SoC) architecture on a field-programmable gate array (FPGA). Compared with the software implementation, the proposed hardware achieved speedups of 387.5× and 86.7× for the preprocessing and classification modules, respectively. Furthermore, the overall system processing time was 2.58 ms, corresponding to an 89.5× speedup over the software baseline. Full article
(This article belongs to the Special Issue Sensor-Based Movement Signal Acquisition, Processing and Analysis)
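The preprocessing chain this abstract describes (STFT, log magnitude, 1-bit thresholding into a binary spectrogram) can be sketched in a few lines of NumPy. The window length, hop, and threshold below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def binary_spectrogram(x, n_fft=64, hop=32, thresh_db=-20.0):
    """STFT -> log magnitude -> 1-bit threshold, producing the kind of
    binary time-frequency map a BCNN consumes (parameters illustrative)."""
    win = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = x[start:start + n_fft] * win
        frames.append(np.abs(np.fft.rfft(seg)))     # one STFT column
    S = np.array(frames).T                          # (freq_bins, time)
    S_db = 20.0 * np.log10(S / (S.max() + 1e-12) + 1e-12)
    return (S_db > thresh_db).astype(np.uint8)      # 1-bit spectrogram
```

Thresholding to one bit is what lets the downstream network's first layer use XNOR/popcount arithmetic instead of multipliers.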
23 pages, 42794 KB  
Article
Crypto-Agile FPGA Architecture with Single-Cycle Switching for OFDM-Based Vehicular Networks
by Mahmoud Elomda, Ahmed A. Ibrahim and Mahmoud Abdelaziz
Signals 2026, 7(2), 38; https://doi.org/10.3390/signals7020038 - 16 Apr 2026
Viewed by 364
Abstract
This paper presents a hardware-accelerated signal processing architecture for OFDM-based vehicular networks that integrates crypto-agile adaptive encryption on a Xilinx Kintex-7 FPGA. The encryption layer is tightly coupled to the OFDM modulation/demodulation pipeline, enabling secure real-time signal processing for V2X communications without disrupting the baseband chain. A context-aware pre-selection unit dynamically selects among hardware cipher primitives based on latency constraints, security requirements, and channel conditions. The current prototype implements and synthesizes AES-128 as the primary block cipher, while ASCON (NIST lightweight AEAD) and Keccak (SHA-3 foundation) are validated through RTL simulation and architectural integration, demonstrating crypto-agility across block, AEAD, and sponge-based primitives. DES is retained solely as a legacy reference for backward-compatibility evaluation and is not recommended for secure V2X deployment. The design adopts a modular decoupling strategy in which cryptographic engines interface with a unified buffering and interleaving subsystem, enabling hardware-based single-cycle cipher switching without partial reconfiguration. FPGA results demonstrate sub-microsecond cryptographic processing latencies with moderate resource utilization, preserving the timing budget of latency-sensitive vehicular services. AES-128 provides standard-strength encryption, while ASCON and Keccak offer lightweight and sponge-based alternatives suited to constrained IoV platforms. Specifically, the implemented AES-128 core achieves a throughput of 1.02 Gbps with a switching latency of 86 ns, verified across 10 randomized transitions with a 99.99% success rate and zero data corruption. The ASCON and Keccak cores attain throughput-to-area efficiencies of 2.01 and 1.47 Mbps/LUT, respectively, at a unified clock frequency of 50 MHz. 
Full article
19 pages, 151357 KB  
Article
An Energy-Efficient Zero-Shot AI-ISP for Real-Time Low-Light Enhancement with Intelligent Vehicles
by Fangzhou He, Bowen Liu, Zhicheng Dong, Jie Li, Jun Luo and Dongcai Zhao
Mathematics 2026, 14(8), 1324; https://doi.org/10.3390/math14081324 - 15 Apr 2026
Viewed by 390
Abstract
Conventional Image Signal Processors (ISPs) employ manually crafted designs with limited adaptability, resulting in suboptimal performance in dynamic environments for both visual quality and machine vision applications. While deep learning facilitates adaptive AI-ISPs, supervised approaches encounter domain shift limitations and substantial computational demands that impede edge deployment. This work introduces an adaptive zero-shot AI-ISP that dynamically optimizes processing pipelines without requiring paired training data. The proposed architecture implements dual specialized subnetworks for illumination estimation and denoising enhancement, operating collaboratively under Retinex theory principles to achieve boundary-aware illumination mapping and noise-resilient image restoration. Additionally, a physically constrained loss function is introduced to enhance color fidelity and noise suppression. For practical implementation, an FPGA-accelerated computing engine replaces transposed convolution with optimized bilinear interpolation, effectively eliminating artifacting while achieving superior memory efficiency through customized buffering architectures. A comprehensive evaluation demonstrates highly competitive performance, achieving a PSNR of 19.91/16.62 and an SSIM of 0.591/0.475 on LSRW-Huawei/Nikon datasets, alongside NIQE scores of 2.065/3.025 on DCIM and TM-DIED datasets. The hardware implementation attains 42.5 GOPS/W power efficiency, representing 35.4× and 7.3× improvements over conventional CPU and GPU platforms, establishing a comprehensive edge deployment solution for next-generation intelligent image processing systems. Full article
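Replacing transposed convolution with bilinear interpolation, as the hardware engine above does, avoids the uneven kernel overlap that produces checkerboard artifacts. A minimal 2× bilinear upsampler with half-pixel sample centers and edge clamping, as a generic floating-point sketch rather than the paper's fixed-point datapath:

```python
import numpy as np

def bilinear_upsample2x(img):
    """2x bilinear upsampling of a single-channel image: fixed weights,
    so no learned transposed-conv kernel and no checkerboard pattern."""
    h, w = img.shape
    # output sample centers mapped back into input coordinates
    ys = (np.arange(2 * h) + 0.5) / 2.0 - 0.5
    xs = (np.arange(2 * w) + 0.5) / 2.0 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    fy = np.clip(ys - y0, 0.0, 1.0)[:, None]        # vertical weights
    fx = np.clip(xs - x0, 0.0, 1.0)[None, :]        # horizontal weights
    top = (1 - fx) * img[np.ix_(y0, x0)] + fx * img[np.ix_(y0, x1)]
    bot = (1 - fx) * img[np.ix_(y1, x0)] + fx * img[np.ix_(y1, x1)]
    return (1 - fy) * top + fy * bot
```

In hardware the two fractional weights are constants per output phase, so each output pixel costs a handful of fixed-point multiply-adds fed from a small line buffer.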
21 pages, 8614 KB  
Article
Breaking the DSP Wall: A Software–Hardware Co-Designed, Adaptive Error-Compensated MAC Architecture for Efficient Edge AI
by Changyan Liu and Juntai Heiyan
Electronics 2026, 15(8), 1586; https://doi.org/10.3390/electronics15081586 - 10 Apr 2026
Viewed by 381
Abstract
The deployment of Convolutional Neural Networks (CNNs) on entry-level Edge FPGAs is severely constrained by the scarcity of Digital Signal Processing (DSP) blocks, a phenomenon termed the “DSP Wall”. To circumvent this bottleneck, this paper presents AEMAC, a Software–Hardware Co-Designed accelerator architecture that decouples arithmetic computation from DSP availability. The proposed methodology synergizes a software-level Dynamic Integer Scaling strategy with a hardware-level Adaptive Error-Compensated Multiply-Accumulate unit. By mapping floating-point activations to an optimal integer domain and employing a DSP-free, LUT-based tri-mode datapath, the architecture achieves extreme resource efficiency. To mitigate the precision loss inherent in logic-based truncation, a statistical bias compensation mechanism is integrated into the accumulator chain. Experimental validation on a Xilinx Zynq-7020 FPGA demonstrates a strictly zero-DSP implementation with minimal logic utilization (100 LUTs). Post-implementation timing simulations confirm a dynamic power of 0.490 W for a 64-core cluster under worst-case random workloads, yielding a verified energy efficiency of 26.1 GOPS/W. Micro-level analysis confirms a 16.7% reduction in arithmetic Mean Absolute Error (MAE) compared to naive truncation. Furthermore, macro-level evaluation on the CIFAR-10 dataset reveals that the co-design strategy recovers system accuracy to 64.74%, outperforming the uncompensated baseline by 0.55% and achieving statistical comparability to floating-point baselines. To ensure absolute internal consistency, all hardware metrics are strictly validated via SAIF-based post-implementation simulations. Based on a conservative full-chip projection that incorporates a routing derating model, these internally consistent results establish AEMAC as a highly scalable and reliable solution for breaking the DSP wall in resource-constrained edge intelligence. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
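The idea of an error-compensated truncated MAC can be illustrated in a few lines: drop low product bits to shrink the LUT datapath, then fold the expected truncation error back into the accumulator once. The bit count and compensation constant below are assumptions for non-negative 8-bit operands, not AEMAC's actual parameters:

```python
def dot_truncated(xs, ws, drop_bits=8, compensate=True):
    """Dot product with per-term truncation of the low product bits
    (cheaper logic-only multipliers), plus a statistical bias term added
    once to the accumulator. Assumes non-negative integer operands, since
    Python's >> floors toward negative infinity."""
    acc = 0
    for a, b in zip(xs, ws):
        acc += (a * b) >> drop_bits     # truncated partial product
    if compensate:
        # each truncation loses on average ~0.5 LSB of the shifted
        # result for uniformly distributed low bits; restore it in bulk
        acc += len(xs) >> 1
    return acc
```

The point of the statistical (rather than per-term) compensation is that it costs one constant add per accumulation window instead of a rounding adder in every multiplier.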
31 pages, 7359 KB  
Article
LwAMP-Net: A Lightweight Network-Based AMP Detector on FPGA for Massive MIMO
by Zhijie Lin, Yuewen Fan, Yujie Chen, Liyan Liang, Yishuo Meng, Jianfei Wang and Chen Yang
Electronics 2026, 15(7), 1494; https://doi.org/10.3390/electronics15071494 - 2 Apr 2026
Viewed by 316
Abstract
The rapid growth of 5G necessitates wireless receivers capable of high-speed, low-latency communication under complex channel conditions. Traditional receivers struggle with the performance–complexity trade-off in massive MIMO systems, where linear detectors underperform and maximum likelihood (ML) detection becomes computationally prohibitive. Deep-learning-based model-driven approaches have demonstrated a favorable balance between detection performance and computational cost. However, despite their algorithmic promise, the transition of these learned detectors into practical, real-time systems is critically hampered by inefficient hardware mapping, resulting in suboptimal throughput, high resource overhead, and limited scalability. To bridge this gap, this paper presents LwAMP-Net, a dedicated FPGA accelerator for a lightweight learned AMP detector. We propose a modular and multi-mode hardware architecture for LwAMP-Net, featuring an outer-product-based dataflow that mitigates pipeline stalls and multi-mode processing elements that adapt to diverse computation patterns. These innovations jointly enhance computational parallelism and resource utilization on the FPGA. Implemented on a Xilinx XC7VX690T FPGA for a 128 × 8 MIMO system with 16QAM, the accelerator achieves a 49.2% higher normalized throughput per iteration, an 85.4% improvement in throughput per LUT slice, and a 12.7% improvement in throughput per DSP compared to the state-of-the-art methods. This work provides a complete architectural solution for deploying high-performance, hardware-efficient learned MIMO detectors in real-world systems. Full article
(This article belongs to the Special Issue From Circuits to Systems: Embedded and FPGA-Based Applications)
22 pages, 831 KB  
Article
Energy-Efficient Dual-Core RISC-V Architecture for Edge AI Acceleration with Dynamic MAC Unit Reuse
by Cristian Andy Tanase
Computers 2026, 15(4), 219; https://doi.org/10.3390/computers15040219 - 1 Apr 2026
Viewed by 755
Abstract
This paper presents a dual-core RISC-V architecture designed for energy-efficient AI acceleration at the edge, featuring dynamic MAC unit sharing, frequency scaling (DFS), and FIFO-based resource arbitration. The system comprises two RISC-V cores that compete for shared computational resources—a single Multiply–Accumulate (MAC) unit and a shared external memory subsystem—governed by a channel-based arbitration mechanism with CPU-priority semantics, while each core maintains private instruction and data caches. The architecture implements a tightly coupled Neural Processing Unit (NPU) with CONV, GEMM, and POOL operations that execute opportunistically in the background when the MAC unit is available. Dynamic frequency scaling (DFS) with three levels (100/200/400 MHz) is applied to the shared MAC unit, allowing the dynamic acceleration of CNN workloads. The arbitration mechanism uses SystemC sc_fifo channels with CPU-priority polling, ensuring that CPU execution is minimally impacted by background AI processing while the NPU makes progress during idle MAC slots. The NPU supports 3 × 3 convolutions, matrix multiplication (GEMM) with 10 × 10 tiles, and pooling operations. The implementation is cycle-accurate in SystemC, targeting FPGA deployment. Experimental evaluation demonstrates that the dual-core architecture achieves 1.87× speedup with 93.5% efficiency for parallel workloads, while DFS enables 70% power reduction at low frequency. The system successfully executes simultaneous CPU and AI workloads, with CPU-priority arbitration ensuring no CPU starvation under contention. The proposed design offers a practical solution for embedded AI applications requiring both general-purpose computation and neural network acceleration, validated through comprehensive SystemC simulation on modern FPGA platforms. Full article
(This article belongs to the Special Issue High-Performance Computing (HPC) and Computer Architecture)
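The CPU-priority arbitration described above can be modeled cycle by cycle: a pending CPU request always wins the shared MAC slot, and the NPU advances only in idle cycles. A deliberately simplified Python model (one request served per cycle; the actual design uses SystemC sc_fifo channels and grants whole operation bursts):

```python
from collections import deque

def run_mac_arbiter(cpu_reqs, npu_reqs, cycles):
    """Cycle-accurate toy model of CPU-priority arbitration over one
    shared MAC unit. Returns a trace of (owner, request) per cycle."""
    cpu, npu = deque(cpu_reqs), deque(npu_reqs)
    trace = []
    for _ in range(cycles):
        if cpu:                                  # CPU always preempts
            trace.append(('cpu', cpu.popleft()))
        elif npu:                                # NPU fills idle slots
            trace.append(('npu', npu.popleft()))
        else:
            trace.append(('idle', None))
    return trace
```

The property this models is the paper's no-starvation claim for the CPU side: CPU latency is unaffected by NPU backlog, while NPU progress is opportunistic.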
20 pages, 1680 KB  
Article
Efficient Inference of Neural Networks with Cooperative Integer-Only Arithmetic on a SoC FPGA for Onboard LEO Satellite Network Routing
by Bogeun Jo, Heoncheol Lee, Bongsoo Roh and Myonghun Han
Aerospace 2026, 13(3), 277; https://doi.org/10.3390/aerospace13030277 - 16 Mar 2026
Viewed by 345
Abstract
Low Earth orbit (LEO) satellite networks require real-time routing to cope with dynamic topology variations caused by continuous orbital motion. As an alternative to conventional routing approaches, deep reinforcement learning (DRL) has recently gained attention as an effective means for optimizing routing paths. To solve routing problems modeled as a grid-based Markov decision process (grid-based MDP), DRL methods such as CNN-based Dueling DQN have been proposed. However, these approaches are difficult to implement in practice. In particular, the substantial floating-point computation and memory traffic of CNN inference make real-time onboard inference challenging under the stringent power and resource constraints of satellite platforms. To address these constraints, this paper proposes an INT8 quantization and hardware–software co-design framework using heterogeneous SoC FPGA acceleration. We offload compute-intensive CNN inference to the programmable logic (PL), while the processing system (PS) orchestrates overall control and data movement, forming a collaborative PS–PL architecture. Furthermore, we integrate the NITI-style two-pass scaling with PS–PL exponent propagation to preserve end-to-end integer consistency without floating-point conversion. To demonstrate its practical onboard feasibility, we employ standard accelerator implementation choices—such as output-stationary scheduling and on-chip prefetching—and conduct an ablation study over independently tunable axes (PE array size and PS-side buffer reuse) to quantify their incremental contributions. Experimental results show that the proposed PS–PL cooperative scheme dramatically reduces computation time compared to a PS-only reference implementation on the same platform. Full article
(This article belongs to the Section Astronautics & Space Science)
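Integer-only inference of the kind described above typically pairs INT8 tensors with power-of-two scales, so that rescaling between layers (the "exponent propagation") is a pure bit shift rather than a floating-point multiply. A generic sketch of that quantization scheme, not the paper's exact NITI-style two-pass scaling:

```python
import numpy as np

def quantize_int8_pow2(x):
    """Symmetric INT8 quantization with scale = 2**exp, chosen so the
    largest magnitude fits in [-127, 127]. Only the integer tensor and
    the exponent need to travel through the PS-PL pipeline."""
    max_abs = float(np.max(np.abs(x))) + 1e-12
    exp = int(np.ceil(np.log2(max_abs / 127.0)))      # power-of-two scale
    q = np.clip(np.round(x / 2.0 ** exp), -127, 127).astype(np.int8)
    return q, exp

def dequantize(q, exp):
    """Reference reconstruction; on the accelerator this multiply is a shift."""
    return q.astype(np.float32) * 2.0 ** exp
```

With power-of-two scales, combining the scales of two INT8 operands is just adding their exponents, which is what keeps the accumulate path free of floating-point conversion.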
27 pages, 3783 KB  
Article
FPGA-Based Front-End Low-Light Enhancement for Deterministic Vision-Only Driving Perception
by Fuwen Xie, Hanhui Jing, Zhiting Lu, Shaoxin Ju, Bochun Peng, Tianle Xie, Linfang Yang, Wenman Han, Zhizhong Wang and Gaole Sai
Electronics 2026, 15(6), 1224; https://doi.org/10.3390/electronics15061224 - 15 Mar 2026
Viewed by 388
Abstract
Vision-only driving perception systems are highly sensitive to illumination variations, particularly under low-light conditions where reduced contrast and structural degradation impair detection and segmentation accuracy. Rather than treating enhancement as a post-processing step, this work investigates the system-level impact of relocating low-light enhancement to the FPGA-based front end within a heterogeneous FPGA–ARM architecture. A hardware-accelerated visual pipeline is designed to perform color space conversion, fixed-point convolutional enhancement, and multi-channel fusion prior to high-level perception on the ARM processor. Experimental results demonstrate that the proposed FPGA-based front-end enhancement introduces only 13 ms of additional processing latency, which executes in parallel with the preceding frame’s neural network inference and therefore imposes zero net overhead on the end-to-end pipeline. In contrast, an equivalent software-based back-end enhancement approach would add its full processing time serially to the inference stage, increasing total system latency proportionally. The system achieves a sustained throughput of 58 fps while supporting real-time multi-task perception including lane detection (YOLOPv2, 539 ms per frame), object detection and emergency braking (YOLOv5, 432 ms per frame), and hardware-level multi-camera synchronization. Full article
(This article belongs to the Special Issue Hardware and Software Co-Design in Intelligent Systems)
18 pages, 4228 KB  
Article
Design Space Exploration on Blind Equalization Algorithms: Numerical Representation Analysis for SoC-FPGA
by David Marquez-Viloria, L. J. Morantes-Guzman, Neil Guerrero-Gonzalez and Marin B. Marinov
Appl. Sci. 2026, 16(6), 2777; https://doi.org/10.3390/app16062777 - 13 Mar 2026
Viewed by 333
Abstract
Field-Programmable Gate Arrays (FPGAs) have become an important platform for accelerating real-time communication systems, and System-on-Chip (SoC) devices provide the flexibility to design and optimize architectures that support high data rates, different modulation formats, and channel equalization schemes. Selecting the appropriate architecture can be guided through Design Space Exploration (DSE) using high-level synthesis tools, which enables the identification of numerical representations that balance performance with reduced hardware resource consumption. Despite their relevance, recent developments in communication systems often overlook the impact of numerical precision in Digital Signal Processing algorithms, particularly the trade-offs between floating- and fixed-point arithmetic when targeting hardware implementations. In this work, two widely used blind equalization algorithms, the Constant Modulus Algorithm (CMA) and the Multi-Modulus Algorithm (MMA), were implemented on a low-cost Ultra96 SoC-FPGA to analyze the effect of a fixed-point representation. A multi-objective Design Space Exploration methodology was applied to minimize hardware utilization while maintaining reliable transmission performance. Resource consumption, latency, and throughput were measured across different binary formats using the Minimum Mean Square Error (MMSE) criterion. Parallelization techniques were incorporated to improve throughput. The DSE generated comprehensive performance surfaces quantifying latency, MMSE convergence, and FPGA resource utilization (DSP48E/FF/LUT/BRAM) across fixed-point formats, achieving optimal 4 MS/s throughput configurations. 
Although this throughput is naturally lower than the Gigabit speeds required in backbone optical networks, the results demonstrate the effectiveness of numerical representation optimization in resource-constrained SoC-FPGA devices, offering a practical approach for real-time Edge and IoT implementations where cost and hardware limitations are critical. Full article
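The Constant Modulus Algorithm studied above adapts equalizer taps so that the output magnitude approaches a fixed modulus, using no training symbols. One stochastic-gradient tap update, sketched in floating-point NumPy (the paper's subject is precisely the fixed-point version of this arithmetic):

```python
import numpy as np

def cma_step(w, x, mu=1e-3, R2=1.0):
    """One CMA tap update for output y = w^H x.
    w: equalizer taps; x: input window (same length); R2: target modulus
    squared. The error term drives |y| toward sqrt(R2)."""
    y = np.vdot(w, x)                 # np.vdot conjugates w: y = w^H x
    e = y * (R2 - np.abs(y) ** 2)     # constant-modulus error
    w_new = w + mu * np.conj(e) * x   # gradient step on (|y|^2 - R2)^2
    return w_new, y
```

When `|y|**2 == R2` the error vanishes and the taps are a fixed point, which is the blind-equalization criterion the DSE evaluates across binary formats.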
20 pages, 3159 KB  
Article
ROM-Less Co(Sine) Synthesizer
by Florentina-Giulia Stoica, Alex Calinescu and Marius Enachescu
Electronics 2026, 15(5), 1093; https://doi.org/10.3390/electronics15051093 - 5 Mar 2026
Viewed by 2189
Abstract
Sine and cosine wave synthesis is used to generate sinusoidal values in the digital domain. While this task is commonly handled in software, dedicated hardware such as Direct Digital Synthesis (DDS) is also available. However, both methods rely on memory resources, such as look-up tables and Read-Only Memories (ROMs), which incur additional memory-access latency on top of additional silicon area. With the advent of real-time arithmetic for sine wave approximation, this paper presents a digital module that employs iterative multiply-accumulate (MAC) operations for sine and cosine synthesis. To support the integration of this module into Systems-on-Chip (SoCs), Field-Programmable Gate Arrays (FPGAs), and standalone Application-Specific Integrated Circuits (ASICs), a comprehensive figure-of-merit (FoM) comparison against various ROM-less methods is provided. When implemented on a Xilinx (AMD) XC7A100T-3CSG324 FPGA, the proposed architecture, compared to other ROM-less solutions such as the Taylor approximation, achieves 80.80% lower resource utilization, 80.89% reduced propagation delay, and 36.66% higher accuracy in sine and cosine wave approximation, with both operating as 32-bit systems producing one sample per clock cycle. Furthermore, the proposed sine accelerator, its accompanying control and communication IPs, and custom firmware were deployed on an FPGA-based function generator platform and experimentally validated. Full article
(This article belongs to the Section Circuit and Signal Processing)
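An iterative MAC-based sine synthesizer of the kind this abstract describes can be built on the two-tap recurrence sin((n+1)θ) = 2·cos(θ)·sin(nθ) − sin((n−1)θ): one multiply and one subtract per sample, and no table. A floating-point sketch (the paper's module is fixed-point; seeding the same recurrence with cos(0) and cos(θ) yields the cosine channel):

```python
import math

def sine_wave(theta, n_samples):
    """ROM-less sine synthesis via the digital-resonator recurrence
    s[n] = 2*cos(theta)*s[n-1] - s[n-2].
    theta: phase increment per sample in radians."""
    k = 2.0 * math.cos(theta)          # the single MAC coefficient
    s_prev, s = 0.0, math.sin(theta)   # seeds: sin(0), sin(theta)
    out = [s_prev, s]
    for _ in range(n_samples - 2):
        s_prev, s = s, k * s - s_prev  # one multiply, one subtract
        out.append(s)
    return out
```

In fixed-point hardware the practical concern is the slow amplitude drift of this marginally stable resonator, usually handled by periodic reseeding or extra coefficient precision.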
27 pages, 12041 KB  
Article
FPGA-Based CNN Acceleration on Zynq-7020 for Embedded Ship Recognition in Unmanned Surface Vehicles
by Abdelilah Haijoub, Aissam Bekkari, Anas Hatim, Mounir Arioua, Mohamed Nabil Srifi and Antonio Guerrero-Gonzalez
Sensors 2026, 26(5), 1626; https://doi.org/10.3390/s26051626 - 5 Mar 2026
Viewed by 629
Abstract
Unmanned surface vehicles (USVs) increasingly rely on vision-based perception for safe navigation and maritime surveillance, while onboard computing is constrained by strict size, weight, and power (SWaP) budgets. Although deep convolutional neural networks (CNNs) offer strong recognition performance, their computational and memory requirements pose significant challenges for deployment on low-cost embedded platforms. This paper presents a hardware–software co-design architecture and deployment study for CNN acceleration on a heterogeneous ARM–FPGA system, targeting energy-efficient near-sensor processing for embedded maritime applications. The proposed approach exploits a fully streaming hardware architecture in the FPGA fabric, based on line-buffered convolutions and AXI-Stream dataflow, while the ARM processing system is responsible for lightweight configuration, scheduling, and data movement. The architecture was evaluated using representative CNN models trained on a maritime ship dataset. Our experimental results on a Zynq-7020 system-on-chip demonstrate that the proposed co-design strategy achieves a balanced trade-off between throughput, resource utilisation, and power consumption under tight embedded constraints, highlighting its suitability as a practical building block for onboard perception in USVs. Full article
(This article belongs to the Section Vehicular Sensing)
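As a rough illustration of the line-buffered streaming convolution described in the abstract above, the following C++ sketch feeds a row-major pixel stream through two line buffers so a full 3x3 window is available at every cycle without re-reading the frame. The function name, image width, and integer types are assumptions for illustration, not taken from the paper; in FPGA fabric the deques would be BRAM shift registers and the loops would be pipelined with AXI-Stream I/O.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

constexpr int W = 8;  // image width (illustrative)
constexpr int K = 3;  // kernel size

// Consume a row-major pixel stream one sample at a time and emit
// 'valid' (no padding) 3x3 convolutions, mimicking a streaming stage.
std::vector<int32_t> stream_conv3x3(const std::vector<int32_t>& pixels,
                                    int height,
                                    const std::array<int32_t, K * K>& kernel) {
    // Two W-deep line buffers hold rows r-2 and r-1, pre-filled with zeros.
    std::deque<int32_t> row2(W, 0), row1(W, 0);
    int32_t window[K][K] = {};  // sliding 3x3 window
    std::vector<int32_t> out;

    int idx = 0;
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < W; ++c) {
            int32_t px = pixels[idx++];
            int32_t top = row2.front(); row2.pop_front();
            int32_t mid = row1.front(); row1.pop_front();
            row2.push_back(mid);  // row r-1 becomes the next row's r-2
            row1.push_back(px);   // current row becomes the next row's r-1
            // Shift the window left and insert the new column on the right.
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K - 1; ++j) window[i][j] = window[i][j + 1];
            window[0][K - 1] = top;
            window[1][K - 1] = mid;
            window[2][K - 1] = px;
            if (r >= K - 1 && c >= K - 1) {  // window fully covers real pixels
                int32_t acc = 0;
                for (int i = 0; i < K; ++i)
                    for (int j = 0; j < K; ++j)
                        acc += window[i][j] * kernel[i * K + j];
                out.push_back(acc);
            }
        }
    }
    return out;
}
```

The point of the structure is that each input pixel is read exactly once, so throughput is one pixel per cycle regardless of kernel size.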

20 pages, 1420 KB  
Article
High-Level Synthesis (HLS)-Enabled Field-Programmable Gate Array (FPGA) Algorithms for Latency-Critical Plasma Diagnostics and Neural Trigger Prototyping in Next-Generation Energy Projects
by Radosław Cieszewski, Krzysztof Poźniak, Ryszard Romaniuk and Maciej Linczuk
Energies 2026, 19(4), 1091; https://doi.org/10.3390/en19041091 - 21 Feb 2026
Viewed by 717
Abstract
Large-scale advanced energy systems, including fusion devices, high-power plasma sources, and accelerator-driven energy platforms, increasingly depend on real-time, hardware-level data processing for diagnostics, control, and protection. In such installations, ultra-low latency, deterministic throughput, and multi-decade operational lifetimes are not optional design goals but strict system-level requirements. While similar timing constraints exist in high-energy physics infrastructures, energy applications place a stronger emphasis on long-term stability, maintainability, and reproducibility of digital signal processing pipelines. This work investigates whether high-level synthesis (HLS) provides a practical and sustainable design methodology for implementing both classical pattern-based and compact neural network (NN) trigger logic on Field-Programmable Gate Arrays (FPGAs) under realistic energy-system constraints. Using representative commercial toolchains (Intel HLS and hls4ml) as reference workflows, we demonstrate the capabilities of fixed-point, fully pipelined streaming architectures, while also identifying critical shortcomings of pragma-driven HLS approaches in terms of architecture transparency, long-term portability, and systematic multi-objective design-space exploration, all of which are crucial for long-lived energy projects and plasma diagnostic systems. These limitations directly motivate the development of a custom, vendor-agnostic, extensible HLS framework (PyHLS), specifically oriented toward the deterministic latency, reproducibility, and physics-grade verification demands of advanced energy infrastructures. Gas Electron Multipliers (GEMs) are modern gaseous detectors increasingly employed in plasma diagnostics, radiation monitoring, and high-power energy experiments, where high rate capability, fine spatial resolution, and radiation tolerance are required. Their massively parallel signal structure and continuous data streams make GEMs a representative and demanding benchmark for FPGA-based real-time trigger and preprocessing systems in energy-related environments. The primary objective of this study is to establish a pragmatic technological baseline, demonstrating that contemporary HLS workflows can reliably support both template-based and neural inference-based trigger architectures within the strict timing, resource, and power constraints typical of advanced energy installations. Furthermore, we outline a scalable development path toward multi-channel and two-dimensional (pixelated) GEM readout architectures, directly applicable to fusion diagnostics, plasma accelerators, beam–plasma interaction studies, and radiation-hard energy monitoring platforms. Although the proposed methodology remains fully transferable to large-scale physics trigger systems, its principal relevance is directed toward real-time diagnostics and protection layers in next-generation energy systems. Full article
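The fixed-point, fully pipelined streaming style that the abstract above attributes to HLS trigger logic can be sketched as a small dense-layer neuron evaluated with Q8.8 multiply-accumulates and a saturating requantisation step. The function names, Q8.8 format choice, and bit widths below are assumptions for illustration, not the paper's actual design.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int FRAC = 8;  // Q8.8 fixed point: 8 fractional bits

// Saturating requantisation: fold a Q16.16 accumulator back to Q8.8.
int16_t requantize(int32_t acc) {
    int32_t shifted = acc >> FRAC;
    return static_cast<int16_t>(std::clamp(shifted, -32768, 32767));
}

// One dense neuron, dot(x, w) + bias, all operands in Q8.8.
// A fully pipelined HLS implementation would unroll this loop so
// each MAC maps to its own multiplier, producing one result per clock.
template <std::size_t N>
int16_t fixed_point_neuron(const std::array<int16_t, N>& x,
                           const std::array<int16_t, N>& w,
                           int16_t bias) {
    int32_t acc = static_cast<int32_t>(bias) << FRAC;  // promote bias to Q16.16
    for (std::size_t i = 0; i < N; ++i)
        acc += static_cast<int32_t>(x[i]) * w[i];      // Q8.8 * Q8.8 -> Q16.16
    return requantize(acc);
}
```

Keeping the accumulator wider than the operands (here 32 bits for Q16.16 products) is what makes the latency deterministic: no normalisation or rounding mode depends on the data, unlike floating point.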

19 pages, 662 KB  
Article
FPGA Programmable Logic Block Architecture with High-Density MAC for Deep Learning Inference
by Yanlin Wang, Lijiang Gao and Haigang Yang
Electronics 2026, 15(4), 801; https://doi.org/10.3390/electronics15040801 - 13 Feb 2026
Viewed by 563
Abstract
Compared to half- or single-precision floating-point, reducing the precision of Deep Neural Network (DNN) inference accelerators can yield significant efficiency gains with little to no accuracy degradation by enabling more multiplication operations per unit area. The variable-precision capabilities of FPGAs are extremely valuable, as a wide range of precisions fall on the Pareto-optimal curve of hardware efficiency versus accuracy, with no single precision dominating. We propose seven variants across three types of logic block designs to improve the area efficiency of multiply-accumulate (MAC) operations implemented in soft structures, and we use the COFFE and VTR tools to fully evaluate these enhancements. The 2-bit adder BLE (ADD2_BLE) architecture achieves a 7.3% area optimization with only a 1.7% increase in tile area by improving the fracturability of LUTs in the baseline BLE and adding an additional 1-bit adder, though at the expense of reduced speed. The 9-bit Compact Multiplier (CMUL) architecture based on ADD2_BLE achieves the greatest optimization among the six CMUL-based variants, reducing the delay-area product (DAP) by up to 72% on average. Nonetheless, it incurs a 13% increase in logic tile area for general-purpose benchmarks that do not use multiplication. Full article
(This article belongs to the Special Issue FPGA-Based Accelerators for Deep Neural Networks)
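The decomposition that the logic-block variants above target, a soft multiplier built from LUT-computed partial products reduced by carry-chain addition, can be modelled in software. The function name and operand handling below are illustrative assumptions, not the paper's architecture.

```cpp
#include <cassert>
#include <cstdint>

// Model of a 9-bit unsigned soft multiplier: each multiplier bit
// gates a shifted copy of the multiplicand (the AND rows a fractured
// LUT would compute), and the rows are summed (the work the BLE's
// hard adders and carry chains would do in fabric).
uint32_t soft_mul9(uint16_t a, uint16_t b) {
    a &= 0x1FF;  // restrict both operands to 9 bits
    b &= 0x1FF;
    uint32_t acc = 0;
    for (int i = 0; i < 9; ++i) {
        // Partial product row i: a << i if bit i of b is set, else 0.
        uint32_t partial = ((b >> i) & 1u) ? (static_cast<uint32_t>(a) << i) : 0u;
        acc += partial;  // carry-chain accumulation
    }
    return acc;  // up to 18 bits
}
```

The hardware cost the paper optimises corresponds to the nine AND rows (LUT area) and the eight additions (adder/carry area) in this loop; packing more of each into one tile is what a denser BLE buys.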
