Search Results (339)

Search Parameters:
Keywords = FPGA hardware acceleration

27 pages, 9802 KiB  
Article
Flight-Safe Inference: SVD-Compressed LSTM Acceleration for Real-Time UAV Engine Monitoring Using Custom FPGA Hardware Architecture
by Sreevalliputhuru Siri Priya, Penneru Shaswathi Sanjana, Rama Muni Reddy Yanamala, Rayappa David Amar Raj, Archana Pallakonda, Christian Napoli and Cristian Randieri
Drones 2025, 9(7), 494; https://doi.org/10.3390/drones9070494 - 14 Jul 2025
Viewed by 244
Abstract
Predictive maintenance (PdM) is a proactive strategy that enhances safety, minimizes unplanned downtime, and optimizes operational costs by forecasting equipment failures before they occur. This study presents a novel Field Programmable Gate Array (FPGA)-accelerated predictive maintenance framework for UAV engines using a Singular Value Decomposition (SVD)-optimized Long Short-Term Memory (LSTM) model. The model performs binary classification to predict the likelihood of imminent engine failure by processing normalized multi-sensor data, including temperature, pressure, and vibration measurements. To enable real-time deployment on resource-constrained UAV platforms, the LSTM's weight matrices are compressed using SVD, significantly reducing computational complexity while preserving predictive accuracy. The compressed model is executed on a Xilinx ZCU-104 FPGA and uses a pipelined, AXI-based hardware accelerator with efficient memory mapping and parallelized gate calculations tailored for low-power onboard systems. Unlike prior works, this study integrates a tailored SVD compression strategy with a custom hardware accelerator co-designed for real-time, flight-safe inference in UAV systems. Experimental results demonstrate 98% classification accuracy, a 24% reduction in latency, and substantial FPGA resource savings (a 26% decrease in BRAM usage and a 37% reduction in DSP consumption) relative to a 32-bit floating-point SVD-compressed FPGA implementation, rather than a CPU or GPU baseline. These findings establish the proposed system as an efficient and scalable solution for real-time UAV engine health monitoring, enhancing in-flight safety through timely fault prediction and enabling autonomous engine monitoring without reliance on ground communication.
(This article belongs to the Special Issue Advances in Perception, Communications, and Control for Drones)
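
The core compression step described in this abstract, factoring an LSTM weight matrix through a truncated SVD so that one wide matrix-vector product becomes two thin ones, can be sketched as follows. The sizes and rank below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs, rank = 64, 64, 8           # assumed sizes; the paper's are not given here

# Input weights for the four LSTM gates stacked: shape (4*hidden, inputs)
W = rng.standard_normal((4 * hidden, inputs))

# Truncated SVD: W ~= U_r @ diag(s_r) @ Vt_r
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r, s_r, Vt_r = U[:, :rank], s[:rank], Vt[:rank, :]

x = rng.standard_normal(inputs)
y_full = W @ x                              # original gate pre-activations
y_svd = U_r @ (s_r * (Vt_r @ x))            # two thin matvecs replace one wide one

# The rank trades accuracy against MAC count; a trained, low-rank-friendly W
# loses far less than this random matrix does.
print("max abs error:", np.max(np.abs(y_full - y_svd)))
print("MACs full:", W.size, " MACs SVD:", U_r.size + Vt_r.size)
```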

26 pages, 1929 KiB  
Article
PASS: A Flexible Programmable Framework for Building Integrated Security Stack in Public Cloud
by Wenwen Fu, Jinli Yan, Jian Zhang, Yinhan Sun, Yong Wang, Ziwen Zhang, Qianming Yang and Yongwen Wang
Electronics 2025, 14(13), 2650; https://doi.org/10.3390/electronics14132650 - 30 Jun 2025
Viewed by 234
Abstract
Integrated security stacks, which offer diverse security function chains in a single device, hold substantial potential to satisfy the security requirements of multiple tenants on a public cloud. However, it is difficult for a software-only or hardware-customized security stack to strike a good tradeoff between performance and flexibility. SmartNICs overcome these limitations by providing a programmable platform for implementing these functions with hardware acceleration. However, without a professional CPU/SmartNIC co-design, developing security function chains from scratch with low-level APIs is challenging and tedious for network operators. This paper presents PASS, a flexible programmable framework for the fast development of high-performance security stacks with SmartNIC acceleration. In the data plane, PASS provides modular abstractions that extract the shared security logic and eliminate redundant operations by reusing intermediate results through customized metadata. In the control plane, PASS offloads the tedious security policy conversion to a proposed security auxiliary plane. With well-defined APIs, developers need only focus on the core logic instead of labor-intensive shared logic. We built a PASS prototype on a CPU-FPGA platform and developed three typical security components. Compared to implementations from scratch, PASS reduces code size by 65% on average. Additionally, PASS improves security processing performance by 76% compared to software-only implementations and reduces the latency of policy translation and distribution by 90% versus an architecture without offloading.
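
PASS's data-plane idea of extracting shared logic and reusing intermediate results via customized metadata can be illustrated conceptually. Everything below (field names, the two toy functions) is invented for illustration; the abstract does not expose PASS's actual APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Metadata:
    src_ip: str
    dst_ip: str
    dst_port: int
    notes: dict = field(default_factory=dict)  # intermediate results shared downstream

def parse_packet(raw: dict) -> Metadata:
    # Shared logic runs exactly once per packet instead of once per function.
    return Metadata(raw["src"], raw["dst"], raw["port"])

def acl(md: Metadata) -> bool:
    allowed = md.dst_port in (80, 443)
    md.notes["acl"] = allowed                  # cache the decision for later stages
    return allowed

def rate_limit(md: Metadata) -> bool:
    # Reuses the parsed 5-tuple and the cached ACL verdict; no re-parsing.
    return md.notes.get("acl", False)

chain = [acl, rate_limit]                      # a two-stage toy security function chain
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "port": 443}
md = parse_packet(pkt)
print("accepted:", all(fn(md) for fn in chain))
```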

22 pages, 4426 KiB  
Article
High-Radix Taylor-Optimized Tone Mapping Processor for Adaptive 4K HDR Video at 30 FPS
by Xianglong Wang, Zhiyong Lai, Lei Chen and Fengwei An
Sensors 2025, 25(13), 3887; https://doi.org/10.3390/s25133887 - 22 Jun 2025
Viewed by 282
Abstract
High Dynamic Range (HDR) imaging is capable of capturing vivid and lifelike visual effects, which are crucial for fields such as computer vision, photography, and medical imaging. However, real-time processing of HDR content remains challenging due to the computational complexity of tone mapping algorithms and the inherent limitations of Low Dynamic Range (LDR) capture systems. This paper presents an adaptive HDR tone mapping processor that achieves high computational efficiency and robust image quality under varying exposure conditions. By integrating an exposure-adaptive factor into a bilateral filtering framework, we dynamically optimize parameters to achieve consistent performance across fluctuating illumination conditions. Further, we introduce a high-radix Taylor expansion technique to accelerate floating-point logarithmic and exponential operations, significantly reducing resource overhead while maintaining precision. The proposed architecture, implemented on a Xilinx XCVU9P FPGA, operates at 250 MHz and processes 4K video at 30 frames per second (FPS), outperforming state-of-the-art designs in both throughput and hardware efficiency. Experimental results demonstrate superior image fidelity, with an average Tone Mapping Quality Index (TMQI) of 0.9314, and 43% fewer logic resources compared to existing solutions, enabling real-time HDR processing for high-resolution applications.
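
The high-radix Taylor technique named in the abstract combines a small lookup table for the high-order mantissa bits with a short Taylor series for the residual. A minimal software model for log2, with an assumed radix of 16 and two Taylor terms, might look like this.

```python
import math

RADIX_BITS = 4
R = 1 << RADIX_BITS                 # 16 table segments across [1, 2); an assumption
LOG_TABLE = [math.log2(1 + k / R) for k in range(R)]   # precomputed, like a small ROM

def log2_approx(x: float) -> float:
    e = math.frexp(x)[1] - 1        # exponent so that x = m * 2**e with m in [1, 2)
    m = x / (2.0 ** e)
    k = min(int((m - 1.0) * R), R - 1)
    hi = 1.0 + k / R                # high part: handled by table lookup
    r = m / hi - 1.0                # small residual, |r| < 1/R
    ln2 = math.log(2.0)
    taylor = (r - r * r / 2.0) / ln2   # two Taylor terms suffice for small r
    return e + LOG_TABLE[k] + taylor

for x in (0.75, 1.0, 3.14159, 4096.0):
    print(x, log2_approx(x), math.log2(x))
```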

14 pages, 949 KiB  
Article
A New Approach to ORB Acceleration Using a Modern Low-Power Microcontroller
by Jorge Aráez, Santiago Real and Alvaro Araujo
Sensors 2025, 25(12), 3796; https://doi.org/10.3390/s25123796 - 18 Jun 2025
Viewed by 294
Abstract
A key component in visual Simultaneous Localization and Mapping (SLAM) systems is feature extraction and description. One common algorithm that accomplishes this purpose is Oriented FAST and Rotated BRIEF (ORB), which is used in state-of-the-art SLAM systems like ORB-SLAM. While ORB is faster than other feature detectors like SIFT (340 times faster) or SURF (15 times faster), it remains one of the most computationally expensive algorithms in these types of systems. This problem has commonly been solved by delegating the task to hardware-accelerated solutions such as FPGAs or ASICs. While useful, this approach incurs a higher economic cost. This work proposes a solution for feature extraction and description based on a modern low-power mainstream microcontroller. The execution time of ORB, along with power consumption, is analyzed in relation to the number of feature points and internal variables. The results show a maximum of 0.6 s for ORB execution on 1241 × 376 resolution images, which is significantly slower than other hardware-accelerated solutions but remains viable for certain applications. Additionally, the power consumption ranges between 30 and 40 milliwatts, which is lower than FPGA solutions. The design also leaves room for future optimizations that would improve these results.
(This article belongs to the Special Issue Sensors and Sensory Algorithms for Intelligent Transportation Systems)
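
For readers who want a baseline for the reported numbers, the kind of measurement the paper performs (ORB execution time versus requested feature count, at the paper's 1241 × 376 resolution) can be reproduced on a host PC with OpenCV's ORB; the microcontroller port itself is not shown in the abstract.

```python
import time
import cv2
import numpy as np

# Synthetic grayscale frame at the resolution quoted in the abstract.
img = np.random.randint(0, 256, (376, 1241), dtype=np.uint8)

for n in (100, 250, 500, 1000):
    orb = cv2.ORB_create(nfeatures=n)
    t0 = time.perf_counter()
    kps, desc = orb.detectAndCompute(img, None)
    dt = time.perf_counter() - t0
    print(f"requested {n:5d}  found {len(kps):5d}  {dt*1e3:7.2f} ms")
```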

20 pages, 999 KiB  
Article
Efficient Real-Time Isotope Identification on SoC FPGA
by Katherine Guerrero-Morejón, José María Hinojo-Montero, Jorge Jiménez-Sánchez, Cristian Rocha-Jácome, Ramón González-Carvajal and Fernando Muñoz-Chavero
Sensors 2025, 25(12), 3758; https://doi.org/10.3390/s25123758 - 16 Jun 2025
Viewed by 265
Abstract
Efficient real-time isotope identification is a critical challenge in nuclear spectroscopy, with important applications such as radiation monitoring, nuclear waste management, and medical imaging. This work presents a novel approach for isotope classification using a System-on-Chip FPGA, integrating hardware-accelerated principal component analysis (PCA) for feature extraction and a software-based random forest classifier. The system leverages the FPGA's parallel processing capabilities to implement PCA, reducing the dimensionality of digitized nuclear signals and optimizing computational efficiency. A key feature of the design is its ability to perform real-time classification without storing ADC samples, directly processing nuclear pulse data as it is acquired. The extracted features are classified by a random forest model running on the embedded microprocessor. PCA quantization is applied to minimize power consumption and resource usage without compromising accuracy. The experimental validation was conducted using datasets from high-resolution pulse-shape digitization, including closely matched isotope pairs (¹²C/¹³C, ³⁶Ar/⁴⁰Ar, and ⁸⁰Kr/⁸⁴Kr). The results demonstrate that the proposed SoC FPGA system significantly outperforms conventional software-only implementations, reducing latency while maintaining classification accuracy above 98%. This study provides a scalable, precise, and energy-efficient solution for real-time isotope identification.
(This article belongs to the Section Internet of Things)
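
The processing chain described, PCA feature extraction followed by a random forest classifier, can be prototyped in a few lines; in the paper the PCA stage runs in FPGA fabric and the forest on the embedded processor, whereas this sketch uses scikit-learn on synthetic pulses with invented dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_per_class, n_samples = 300, 128            # assumed digitizer record length

# Two synthetic "isotopes": same amplitude, slightly different decay constants.
t = np.arange(n_samples)
pulses = np.concatenate([
    np.exp(-t / 20.0) + 0.05 * rng.standard_normal((n_per_class, n_samples)),
    np.exp(-t / 23.0) + 0.05 * rng.standard_normal((n_per_class, n_samples)),
])
labels = np.repeat([0, 1], n_per_class)

Xtr, Xte, ytr, yte = train_test_split(pulses, labels, random_state=0)
pca = PCA(n_components=8).fit(Xtr)           # the hardware-accelerated block in the paper
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(pca.transform(Xtr), ytr)             # the software-side classifier
print("accuracy:", clf.score(pca.transform(Xte), yte))
```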

19 pages, 2339 KiB  
Article
Parallel Processing of Sobel Edge Detection on FPGA: Enhancing Real-Time Image Analysis
by Sanmugasundaram Ravichandran, Hui-Kai Su, Wen-Kai Kuo, Dileepan Dhanasekaran, Manikandan Mahalingam and Jui-Pin Yang
Sensors 2025, 25(12), 3649; https://doi.org/10.3390/s25123649 - 11 Jun 2025
Viewed by 527
Abstract
Detection of object boundaries and significant features within an image is one of the most important processes in image processing and computer vision. In applications such as autonomous vehicles, surveillance systems, and medical imaging, real-time processing has become increasingly important, which calls for hardware accelerators. In this paper, an improved Sobel edge detection algorithm was implemented in Verilog as an FPGA-based design for real-time edge detection, targeting RGB images in particular. The design applies the horizontal and vertical Sobel kernels in parallel, computing gradient magnitudes over 3 × 3 pixel windows for 1028 × 720 RGB images. This work focuses on reducing algorithmic complexity through an eight-directional approach, while parallel processing reduces architectural resource utilization.
(This article belongs to the Section Sensor Networks)
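
A software reference for the parallel kernel evaluation the design performs, applying the horizontal and vertical Sobel kernels to each 3 × 3 window and combining them into a gradient magnitude, is sketched below; the |gx| + |gy| approximation is a common hardware-friendly choice and an assumption here.

```python
import numpy as np

GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.int32)
GY = GX.T   # the vertical kernel is the transpose of the horizontal one

def sobel(gray: np.ndarray) -> np.ndarray:
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    for y in range(h - 2):
        for x in range(w - 2):
            win = gray[y:y + 3, x:x + 3].astype(np.int32)
            gx = int((win * GX).sum())       # horizontal kernel
            gy = int((win * GY).sum())       # vertical kernel (evaluated in parallel on FPGA)
            out[y, x] = min(abs(gx) + abs(gy), 255)   # |gx|+|gy| avoids a square root
    return out

img = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
print(sobel(img).shape)   # (30, 30): one-pixel border lost to the 3x3 window
```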

18 pages, 3282 KiB  
Article
Hardware Accelerator for Approximation-Based Softmax and Layer Normalization in Transformers
by Raehyeong Kim, Dayoung Lee, Jinyeol Kim, Joungmin Park and Seung Eun Lee
Electronics 2025, 14(12), 2337; https://doi.org/10.3390/electronics14122337 - 7 Jun 2025
Viewed by 875
Abstract
Transformer-based models have achieved remarkable success across various AI tasks, but their growing complexity has led to significant computational and memory demands. While most optimization efforts have focused on linear operations such as matrix multiplications, non-linear functions like Softmax and layer normalization (LayerNorm) are increasingly dominating inference latency, especially for long sequences and high-dimensional inputs. To address this emerging bottleneck, we present a hardware accelerator that jointly approximates these non-linear functions using piecewise linear approximation for the exponential in Softmax and Newton–Raphson iteration for the square root in LayerNorm. The proposed unified architecture dynamically switches operation modes while reusing hardware resources. The proposed accelerator was implemented on a Xilinx VU37P FPGA and evaluated with BERT and GPT-2 models. Experimental results demonstrate speedups of up to 7.6× for Softmax and 2.0× for LayerNorm, while maintaining less than 1% accuracy degradation on classification tasks with conservative approximation settings. However, generation tasks showed greater sensitivity to approximation, underscoring the need for task-specific tuning.
(This article belongs to the Special Issue Feature Papers in "Computer Science & Engineering", 2nd Edition)
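
Both approximations named in the abstract are easy to model in software: a piecewise linear exp() for Softmax and a Newton–Raphson recurrence for the square root in LayerNorm. Segment and iteration counts below are illustrative, not the paper's settings.

```python
import numpy as np

# --- piecewise linear exp over a clipped input range ---
XS = np.linspace(-8.0, 0.0, 17)              # 16 segments (assumed)
YS = np.exp(XS)                              # breakpoint values, a small ROM in hardware

def exp_pwl(x):
    x = np.clip(x, XS[0], XS[-1])
    return np.interp(x, XS, YS)              # linear interpolation between breakpoints

def softmax_approx(logits):
    z = logits - logits.max()                # inputs now lie in (-inf, 0]; clip covers the tail
    e = exp_pwl(z)
    return e / e.sum()

# --- Newton-Raphson square root: s <- 0.5 * (s + v / s) ---
def sqrt_nr(v, iters=3):
    s = np.where(v > 1.0, v, 1.0)            # crude initial guess
    for _ in range(iters):
        s = 0.5 * (s + v / s)
    return s

def layernorm_approx(x, eps=1e-5):
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return (x - mu) / sqrt_nr(var + eps)

x = np.random.default_rng(2).standard_normal(16)
exact = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print("softmax err  :", np.abs(softmax_approx(x) - exact).max())
print("layernorm err:", np.abs(layernorm_approx(x) - (x - x.mean()) / np.sqrt(x.var() + 1e-5)).max())
```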

22 pages, 20735 KiB  
Article
High-Throughput ORB Feature Extraction on Zynq SoC for Real-Time Structure-from-Motion Pipelines
by Panteleimon Stamatakis and John Vourvoulakis
J. Imaging 2025, 11(6), 178; https://doi.org/10.3390/jimaging11060178 - 28 May 2025
Viewed by 556
Abstract
This paper presents a real-time system for feature detection and description, the first stage in a structure-from-motion (SfM) pipeline. The proposed system leverages an optimized version of the ORB algorithm (oriented FAST and rotated BRIEF) implemented on the Digilent Zybo Z7020 FPGA board equipped with the Xilinx Zynq-7000 SoC. The system accepts real-time video input (60 fps, 1920 × 1080 resolution, 24-bit color) via HDMI or a camera module. To support high frame rates on full-HD images, a double-data-rate pipeline scheme was adopted for the Harris functions. Gray-scale video with features marked in red is exported through a separate HDMI port. Feature descriptors are calculated inside the FPGA by the Zynq's programmable logic and verified using Xilinx's ILA IP block on a connected computer running Vivado. The implemented system achieves a latency of 192.7 microseconds, which is suitable for real-time applications. The proposed architecture is evaluated in terms of repeatability, matching retention, and matching accuracy under several image transformations, and achieves satisfactory accuracy and performance given that changes between successive frames are slight. This work paves the way for future research on implementing the remaining stages of a real-time SfM pipeline on the proposed hardware platform.
(This article belongs to the Special Issue Recent Techniques in Image Feature Extraction)
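
Since the abstract highlights a double-data-rate pipeline for the Harris functions, a plain software reference for the Harris corner response may help fix ideas; the 3 × 3 window and the constant k below are common defaults assumed for illustration.

```python
import numpy as np

def harris_response(gray: np.ndarray, k: float = 0.04) -> np.ndarray:
    g = gray.astype(np.float64)
    # Central-difference image gradients.
    ix = np.zeros_like(g); iy = np.zeros_like(g)
    ix[:, 1:-1] = (g[:, 2:] - g[:, :-2]) / 2.0
    iy[1:-1, :] = (g[2:, :] - g[:-2, :]) / 2.0

    # 3x3 box sums of the structure tensor entries.
    def box3(a):
        s = np.zeros_like(a)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                s[1:-1, 1:-1] += a[1 + dy:a.shape[0] - 1 + dy,
                                   1 + dx:a.shape[1] - 1 + dx]
        return s

    sxx, syy, sxy = box3(ix * ix), box3(iy * iy), box3(ix * iy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace           # corners show up as strong positive peaks

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print("max response:", harris_response(img).max())
```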

17 pages, 3044 KiB  
Article
High-Speed SMVs Subscriber Design for FPGA Architectures
by Mihai-Alexandru Pisla, Bogdan-Adrian Enache, Vasilis Argyriou, Panagiotis Sarigiannidis, Teodor-Iulian Voicila and George-Calin Seritan
Electronics 2025, 14(11), 2135; https://doi.org/10.3390/electronics14112135 - 24 May 2025
Viewed by 303
Abstract
Modern power systems, particularly those integrating smart grid and microgrid functionalities, demand efficient high-speed data processing to manage increasingly complex operational requirements. In response to these challenges, this paper proposes a high-speed Sampled Measured Values (SMVs) subscriber design that leverages the programmability of Multi-Processor System-on-Chip (MPSoC) technology and the parallel processing capabilities of Field-Programmable Gate Arrays (FPGAs). By offloading SMVs data decoding to dedicated FPGA hardware, the approach significantly reduces processing latency and delivers deterministic performance, thereby surpassing traditional software-based implementations. This hardware acceleration is achieved without sacrificing flexibility, ensuring compatibility with emerging standards in IEC 61850 and offering scalability for expanding substation and grid communication networks. Experimental validations demonstrate lower end-to-end delays and improved throughput, highlighting the potential of the proposed system to meet stringent real-time requirements for monitoring and control in evolving smart grids.
(This article belongs to the Section Circuit and Signal Processing)
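
As a rough illustration of what an SMV subscriber decodes, the sketch below parses only the fixed header of an IEC 61850-9-2 frame (EtherType 0x88BA, then APPID, length, and two reserved words); the ASN.1-coded savPdu that follows, which is the part the FPGA offload accelerates, is left unparsed, and the synthetic frame bytes are placeholders.

```python
import struct

def parse_sv_header(frame: bytes) -> dict:
    dst, src = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack(">H", frame[12:14])
    if ethertype != 0x88BA:                    # SV EtherType; untagged frame assumed
        raise ValueError("not a Sampled Values frame")
    appid, length, _res1, _res2 = struct.unpack(">HHHH", frame[14:22])
    # 'length' covers APPID + Length + the two reserved words + the APDU.
    return {"dst": dst.hex(":"), "src": src.hex(":"),
            "appid": hex(appid), "apdu": frame[22:14 + length]}

# Synthetic frame: zeroed MAC addresses, APPID 0x4000, and placeholder APDU
# bytes (not a valid savPdu encoding).
apdu = b"\x60\x06\x80\x01\x01\xa2\x01\x00"
frame = bytes(12) + struct.pack(">H", 0x88BA) \
        + struct.pack(">HHHH", 0x4000, 8 + len(apdu), 0, 0) + apdu
print(parse_sv_header(frame))
```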

12 pages, 870 KiB  
Article
An Improved Strategy for Data Layout in Convolution Operations on FPGA-Based Multi-Memory Accelerators
by Yongchang Wang and Hongzhi Zhao
Electronics 2025, 14(11), 2127; https://doi.org/10.3390/electronics14112127 - 23 May 2025
Viewed by 382
Abstract
Convolutional Neural Networks (CNNs) are fundamental to modern AI applications but often suffer from significant memory bottlenecks due to non-contiguous access patterns during convolution operations. Although previous work has optimized data layouts at the software level, hardware-level solutions for multi-memory accelerators remain underexplored. In this paper, we propose a hardware-level approach to mitigate memory row conflicts in FPGA-based CNN accelerators. Specifically, we introduce a dynamic DDR controller generated using Vivado 2019.1, which optimizes feature map allocation across memory banks and operates in conjunction with a multi-memory architecture to enable parallel access. Our method reduces row conflicts by up to 21% and improves throughput by 17% on the KCU1500 FPGA, with validation across YOLOv2, VGG16, and AlexNet. The key innovation lies in the layer-specific address mapping strategy and hardware-software co-design, providing a scalable and efficient solution for CNN inference across both edge and cloud platforms.
(This article belongs to the Special Issue FPGA-Based Reconfigurable Embedded Systems)
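
The row-conflict problem the paper targets can be shown with a toy DRAM model: when two feature-map streams share a bank, alternating accesses keep reopening rows, while a layout that splits the maps across banks removes the conflicts. All geometry below is invented for illustration, not the paper's mapping.

```python
ROW_SIZE = 1024                              # bytes per DRAM row (assumed)

def count_row_conflicts(accesses):
    open_row = {}                            # bank -> currently open row
    conflicts = 0
    for bank, row in accesses:
        if open_row.get(bank, row) != row:   # a different row forces a precharge/activate
            conflicts += 1
        open_row[bank] = row
    return conflicts

def stream(base, n, bank):
    # Sequential 64-byte bursts from one feature map placed at 'base'.
    return [(bank, (base + 64 * i) // ROW_SIZE) for i in range(n)]

# Read stream A and write stream B alternate, as in a convolution layer.
a = stream(0, 64, bank=0)
b_same = stream(1 << 20, 64, bank=0)         # both maps mapped into bank 0
b_split = stream(1 << 20, 64, bank=1)        # layer-aware mapping: separate bank

interleave = lambda s, t: [x for pair in zip(s, t) for x in pair]
print("same-bank conflicts :", count_row_conflicts(interleave(a, b_same)))
print("split-bank conflicts:", count_row_conflicts(interleave(a, b_split)))
```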

16 pages, 1263 KiB  
Article
Accelerating CRYSTALS-Kyber: High-Speed NTT Design with Optimized Pipelining and Modular Reduction
by Omar S. Sonbul, Muhammad Rashid and Amar Y. Jaffar
Electronics 2025, 14(11), 2122; https://doi.org/10.3390/electronics14112122 - 23 May 2025
Viewed by 641
Abstract
The Number Theoretic Transform (NTT) is a cornerstone of efficient polynomial multiplication, which is fundamental to lattice-based cryptographic algorithms such as CRYSTALS-Kyber, a leading candidate in post-quantum cryptography (PQC). However, existing NTT accelerators often rely on integer multiplier-based modular reduction techniques, such as Barrett or Montgomery reduction, which introduce significant computational overhead and hardware resource consumption. These accelerators also lack optimized unified architectures for forward (FNTT) and inverse (INTT) transformations. Addressing these gaps, this paper introduces a high-speed NTT accelerator tailored specifically for CRYSTALS-Kyber. The proposed design employs a shift-add modular reduction mechanism, eliminating the need for integer multipliers, thereby reducing critical path delay and raising the operating frequency. A unified pipelined butterfly unit, capable of performing FNTT and INTT operations through Cooley–Tukey and Gentleman–Sande configurations, is integrated into the architecture. Additionally, an efficient data handling mechanism based on register banks supports seamless memory access, ensuring continuous and parallel processing. The complete architecture, implemented in Verilog HDL, has been evaluated on FPGA platforms (Virtex-5, Virtex-6, and Virtex-7). Post place-and-route results demonstrate a maximum operating frequency of 261 MHz on Virtex-7, achieving a throughput of 290.69 Kbps, 1.45× and 1.24× higher than on Virtex-5 and Virtex-6, respectively. The design also achieves a throughput-per-slice of 111.63, underscoring its resource efficiency. With a 1.27× reduction in computation time compared to state-of-the-art single-butterfly-unit NTT accelerators, this work advances secure and scalable cryptographic hardware.
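
The shift-add reduction idea can be modeled concretely for Kyber's modulus q = 3329: since 2^12 = 767 (mod 3329) and 767 = 1024 - 256 - 1, a product folds down with shifts and adds alone. The folding schedule below is an illustration, not necessarily the paper's exact circuit.

```python
import random

Q = 3329                                       # Kyber's modulus

def reduce_shift_add(t: int) -> int:
    # Fold 12 bits at a time: t = hi*2^12 + lo  ==>  t = hi*767 + lo (mod q),
    # where hi*767 = (hi << 10) - (hi << 8) - hi needs no multiplier.
    while t >= 1 << 12:
        hi, lo = t >> 12, t & 0xFFF
        t = (hi << 10) - (hi << 8) - hi + lo
    return t - Q if t >= Q else t              # final conditional subtraction

def ct_butterfly(a: int, b: int, zeta: int):
    # Cooley-Tukey butterfly; the twiddle product is the only multiplication.
    t = reduce_shift_add(b * zeta)
    return (a + t) % Q, (a - t) % Q

# Self-check of the reduction against plain modular arithmetic.
for _ in range(1000):
    x = random.randrange(Q * Q)
    assert reduce_shift_add(x) == x % Q
print("reduction ok; butterfly(1, 2, 17) =", ct_butterfly(1, 2, 17))
```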

35 pages, 2630 KiB  
Article
AHA: Design and Evaluation of Compute-Intensive Hardware Accelerators for AMD-Xilinx Zynq SoCs Using HLS IP Flow
by David Berrazueta-Mena and Byron Navas
Computers 2025, 14(5), 189; https://doi.org/10.3390/computers14050189 - 13 May 2025
Viewed by 800
Abstract
The increasing complexity of algorithms in embedded applications has amplified the demand for high-performance computing. Heterogeneous embedded systems, particularly FPGA-based systems-on-chip (SoCs), enhance execution speed by integrating hardware accelerator intellectual property (IP) cores. However, traditional low-level IP-core design presents significant challenges. High-level synthesis (HLS) offers a promising alternative, enabling efficient FPGA development through high-level programming languages. Yet, effective methodologies for designing and evaluating heterogeneous FPGA-based SoCs remain crucial. This study surveys HLS tools and design concepts and presents the development of the AHA IP cores, a set of five benchmarking accelerators for rapid Zynq-based SoC evaluation. These accelerators target compute-intensive tasks, including matrix multiplication, Fast Fourier Transform (FFT), Advanced Encryption Standard (AES), Back-Propagation Neural Network (BPNN), and Artificial Neural Network (ANN). We establish a streamlined design flow using AMD-Xilinx tools for rapid prototyping and testing FPGA-based heterogeneous platforms. We outline criteria for selecting algorithms to improve speed and resource efficiency in HLS design. Our performance evaluation across various configurations highlights performance–resource trade-offs and demonstrates that ANN and BPNN achieve significant parallelism, while AES optimization increases resource utilization the most. Matrix multiplication shows strong optimization potential, whereas FFT is constrained by data dependencies.
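
As one concrete example of the five AHA kernels, a matrix multiplication written HLS-style, as a tiled loop nest whose fixed-bound inner loops are what pragmas can unroll or pipeline, is sketched below in software; the tile size is an assumption, and the matrix size is assumed to divide by it.

```python
import numpy as np

TILE = 4   # assumed tile size; matrix dimension must be a multiple of it

def matmul_tiled(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):
                # Fixed-bound inner loops: candidates for UNROLL/PIPELINE pragmas in HLS.
                for i in range(i0, i0 + TILE):
                    for j in range(j0, j0 + TILE):
                        acc = C[i, j]
                        for k in range(k0, k0 + TILE):
                            acc += A[i, k] * B[k, j]
                        C[i, j] = acc
    return C

A = np.arange(64, dtype=np.int64).reshape(8, 8)
B = np.ones((8, 8), dtype=np.int64)
assert np.array_equal(matmul_tiled(A, B), A @ B)
print("ok")
```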

26 pages, 442 KiB  
Article
Improving the Fast Fourier Transform for Space and Edge Computing Applications with an Efficient In-Place Method
by Christoforos Vasilakis, Alexandros Tsagkaropoulos, Ioannis Koutoulas and Dionysios Reisis
Software 2025, 4(2), 11; https://doi.org/10.3390/software4020011 - 12 May 2025
Viewed by 1231
Abstract
Satellite and edge computing designers develop algorithms that restrict resource utilization and execution time. Among these design efforts, optimization of the Fast Fourier Transform (FFT), key to many tasks, has led mainly to in-place FFT-specific hardware accelerators. Aiming to improve FFT performance on processors and computing devices with limited resources, this paper enhances the efficiency of the radix-2 FFT by exploring the benefits of an in-place technique. First, we present the advantages of organizing the single memory bank of a processor to store two FFT elements in each memory address, providing parallel load and store of each FFT pair of data. Second, we optimize the floating point (FP) and block floating point (BFP) configurations to improve the FFT signal-to-noise ratio (SNR) and resource utilization. The resulting techniques halve the memory requirements and significantly improve execution time for the prevailing BFP representation. Executions of inputs ranging from 1K to 16K FFT points, using 8-bit or 16-bit FP or BFP numbers, on the space-proven Atmel AVR32 and the Intel Movidius Myriad 2 Vision Processing Unit (VPU), the edge device Raspberry Pi Zero 2W, and a low-cost accelerator on a Xilinx Zynq 7000 Field Programmable Gate Array (FPGA) validate the method's performance improvement.
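
A functional model of the radix-2 in-place FFT being optimized is given below; comments mark the butterfly loads and stores that the paper's layout serves with packed two-element memory words. The packing itself is a memory-layout detail not reproduced here.

```python
import cmath

def bit_reverse_permute(a):
    n = len(a)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

def fft_inplace(a):
    n = len(a)                         # n must be a power of two
    bit_reverse_permute(a)
    size = 2
    while size <= n:
        step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1.0 + 0j
            for k in range(size // 2):
                i, j = start + k, start + k + size // 2
                u, v = a[i], w * a[j]      # one wide read can fetch a packed element pair
                a[i], a[j] = u + v, u - v  # one wide write stores the pair back in place
                w *= step
        size *= 2

x = [complex(i, 0) for i in range(8)]
fft_inplace(x)
print([round(abs(v), 3) for v in x])
```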

23 pages, 3101 KiB  
Article
A Sea-Surface Radar Target-Detection Method Based on an Improved U-Net and Its FPGA Implementation
by Gangyi Zhai, Jianjiang Zhou, Haocheng Yang and Yutao Zhang
Electronics 2025, 14(10), 1944; https://doi.org/10.3390/electronics14101944 - 10 May 2025
Viewed by 377
Abstract
Existing radar target-detection methods exhibit suboptimal performance when applied to sea-surface target detection. This is due to the difficulty of detecting weak targets, interference from sea clutter, and the inability of statistical models to accurately model sea-surface targets, all of which degrade detection performance. With the development of artificial intelligence technologies, research based on deep learning methods has gained momentum in the field of radar target detection. Considering the complexity of neural networks and the real-time requirements of radar target-detection algorithms, this paper investigates a sea-surface radar target-detection method based on an improved U-Net network and its FPGA implementation, achieving real-time radar target detection without relying on GPUs. The lightweight U-Net network was first selected through a survey and analysis. The original U-Net was then structurally optimized using network volume-reduction methods. Based on the characteristics of the network structure, optimization strategies such as pipelining and parallel processing, hybrid-layer design, and convolution-layer optimization were applied to the accelerator system. These optimizations reduced the system's hardware-resource requirements and enabled complete deployment of the network onto the accelerator. The accelerator system was implemented using high-level synthesis (HLS) with modular and template-based design approaches. Experiments show that the proposed method offers significant advantages in improving detection probability, reducing false-alarm rates, and achieving real-time processing.
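
The network volume-reduction step can be pictured as scaling a U-Net's channel widths before porting it to the accelerator. The sketch below (PyTorch, with an assumed width multiplier and block layout) only illustrates the parameter savings; the paper's slimmed topology is not given in the abstract.

```python
import torch
import torch.nn as nn

def double_conv(cin, cout):
    # The classic U-Net building block: two 3x3 convolutions with BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

def encoder(width_mult=1.0):
    # Classic U-Net widths 64/128/256, scaled down for the accelerator.
    w = [max(8, int(c * width_mult)) for c in (64, 128, 256)]
    return nn.Sequential(
        double_conv(1, w[0]), nn.MaxPool2d(2),
        double_conv(w[0], w[1]), nn.MaxPool2d(2),
        double_conv(w[1], w[2]),
    )

full, slim = encoder(1.0), encoder(0.25)   # 0.25 is an assumed multiplier
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"full: {count(full):,} params  slim: {count(slim):,} params")
x = torch.randn(1, 1, 64, 64)
print(slim(x).shape)   # torch.Size([1, 64, 16, 16])
```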

19 pages, 494 KiB  
Article
Hardware-Accelerated Data Readout Platform Using Heterogeneous Computing for DNA Data Storage
by Xiaopeng Gou, Qi Ge, Quan Guo, Menghui Ren, Tingting Qi, Rui Qin and Weigang Chen
Appl. Sci. 2025, 15(9), 5050; https://doi.org/10.3390/app15095050 - 1 May 2025
Viewed by 421
Abstract
DNA data storage has emerged as a promising alternative to traditional storage media due to its high density and durability. However, large-scale DNA storage systems generate massive sequencing reads, posing substantial computational complexity and latency challenges for data readout. Here, we propose a novel heterogeneous computing architecture based on a field-programmable gate array (FPGA) to accelerate DNA data readout. The software component, running on a general computing platform, manages data distribution and schedules acceleration kernels. Meanwhile, the hardware acceleration kernel is deployed on an Alveo U200 data center accelerator card, executing multiple logical computing units within modules and utilizing task-level pipeline structures between modules to handle sequencing reads step by step. This heterogeneous computing acceleration system enables the efficient execution of the entire readout process for DNA data storage. We benchmark the proposed system against a CPU-based software implementation under various error rates and coverages. The results indicate that under high-error, low-coverage conditions (error rate of 1.5% and coverage of 15×), the accelerator achieves a peak speedup of up to 373.1 times, enabling the readout of 59.4 MB of stored data in just 12.40 s. Overall, the accelerator delivers a speedup of two orders of magnitude. Our proposed heterogeneous computing acceleration strategy provides an efficient solution for large-scale DNA data readout.
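
The software half's role, distributing read batches across the accelerator's compute units in a task-level pipeline, can be mimicked with a thread pool; the kernel call below is a mock, and the batch size and slot count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

KERNEL_SLOTS = 4                 # concurrent compute units on the card (assumed)
BATCH = 1000                     # reads per kernel invocation (assumed)

def run_kernel(batch):
    # Stand-in for enqueueing a batch to the FPGA and waiting for results;
    # here it simply reports how many reads were handled.
    return len(batch)

def readout(reads):
    batches = [reads[i:i + BATCH] for i in range(0, len(reads), BATCH)]
    with ThreadPoolExecutor(max_workers=KERNEL_SLOTS) as pool:
        return sum(pool.map(run_kernel, batches))   # keeps all slots busy

reads = ["ACGT" * 30] * 12345    # synthetic sequencing reads
print("processed:", readout(reads))
```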