Search Results (361)

Search Parameters:
Keywords = FPGA-based accelerators

37 pages, 483 KB  
Review
Lattice-Based Cryptographic Accelerators for the Post-Quantum Era: Architectures, Optimizations, and Implementation Challenges
by Hua Yan, Lei Wu, Qiming Sun and Pengzhou He
Electronics 2026, 15(2), 475; https://doi.org/10.3390/electronics15020475 - 22 Jan 2026
Abstract
The imminent threat of large-scale quantum computers to modern public-key cryptographic devices has led to extensive research into post-quantum cryptography (PQC). Lattice-based schemes have proven to be the top candidate among existing PQC schemes due to their strong security guarantees, versatility, and relatively efficient operations. However, the computational cost of lattice-based algorithms—including various arithmetic operations such as Number Theoretic Transform (NTT), polynomial multiplication, and sampling—poses considerable performance challenges in practice. This survey offers a comprehensive review of hardware acceleration for lattice-based cryptographic schemes—covering both the architectural and implementation details of the standardized algorithms CRYSTALS-Kyber, CRYSTALS-Dilithium, and FALCON (Fast Fourier Lattice-Based Compact Signatures over NTRU). It examines optimization measures at various levels, such as algorithmic optimization, arithmetic unit design, memory hierarchy management, and system integration. The paper compares the various performance measures (throughput, latency, area, and power) of Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) implementations. We also address major implementation issues: side-channel resistance, resource constraints within IoT (Internet of Things) devices, and the trade-offs between performance and security. Finally, we point out new research opportunities and existing challenges, with implications for hardware accelerator design in the post-quantum cryptographic environment.
22 pages, 2066 KB  
Article
A Unified FPGA/CGRA Acceleration Pipeline for Time-Critical Edge AI: Case Study on Autoencoder-Based Anomaly Detection in Smart Grids
by Eleftherios Mylonas, Chrisanthi Filippou, Sotirios Kontraros, Michael Birbas and Alexios Birbas
Electronics 2026, 15(2), 414; https://doi.org/10.3390/electronics15020414 - 17 Jan 2026
Abstract
The ever-increasing need for energy-efficient implementation of AI algorithms has driven the research community towards the development of many hardware architectures and frameworks for AI. A lot of work has been presented around FPGAs, while more sophisticated architectures like CGRAs have also been at the center of attention. However, AI ecosystems are isolated and fragmented, with no standardized way to compare different frameworks with detailed Power–Performance–Area (PPA) analysis. This paper bridges the gap by presenting a unified, fully open-source hardware-aware AI acceleration pipeline that enables seamless deployment of neural networks on both FPGA and CGRA architectures. Built around the Brevitas quantization framework, it supports two distinct backend flows: FINN for high-performance dataflow accelerators and CGRA4ML for low-power coarse-grained reconfigurable designs. To facilitate this, a model translation layer from QONNX to QKeras is also introduced. To demonstrate its effectiveness, we use an autoencoder model for anomaly detection in wind turbines. We deploy our accelerated models on the AMD ZCU104 board and benchmark them against a Raspberry Pi. Evaluation on a realistic cyber–physical testbed shows that the hardware-accelerated solutions achieve substantial performance and energy-efficiency gains—up to 10× and 37× faster inference per flow and over 11× higher efficiency—while maintaining acceptable reconstruction accuracy.
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
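The paper's autoencoder flags anomalies by reconstruction error. As an illustration of that general idea only (not the paper's quantized network or its FINN/CGRA4ML flows), the sketch below uses a linear autoencoder fitted via PCA in NumPy; the synthetic data, latent size `k`, and 99th-percentile threshold are all arbitrary assumptions.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Fit a k-dimensional linear autoencoder (equivalent to PCA) on normal data."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T                 # tied encoder/decoder weights, shape (features, k)
    return mu, W

def reconstruction_error(X, mu, W):
    Z = (X - mu) @ W             # encode
    Xh = mu + Z @ W.T            # decode
    return np.mean((X - Xh) ** 2, axis=1)

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8)) * 0.1  # synthetic "healthy" data
mu, W = fit_linear_autoencoder(normal, k=4)
threshold = np.percentile(reconstruction_error(normal, mu, W), 99)

sample = rng.normal(size=(1, 8)) * 5.0   # grossly out-of-distribution reading
is_anomaly = bool(reconstruction_error(sample, mu, W)[0] > threshold)
```

A deployed detector would replace the PCA step with the trained, quantized network; the threshold-on-reconstruction-error logic stays the same.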

19 pages, 1607 KB  
Article
Real-Time Bird Audio Detection with a CNN-RNN Model on a SoC-FPGA
by Rodrigo Lopes da Silva, Gustavo Jacinto, Mário Véstias and Rui Policarpo Duarte
Electronics 2026, 15(2), 354; https://doi.org/10.3390/electronics15020354 - 13 Jan 2026
Abstract
Monitoring wildlife has become increasingly important for understanding the evolution of species and ecosystem health. Acoustic monitoring offers several advantages over video-based approaches, enabling continuous 24/7 observation and robust detection under challenging environmental conditions. Deep learning models have demonstrated strong performance in audio classification. However, their computational complexity poses significant challenges for deployment on low-power embedded platforms. This paper presents a low-power embedded system for real-time bird audio detection. A hybrid CNN–RNN architecture is adopted, redesigned, and quantized to significantly reduce model complexity while preserving classification accuracy. To support efficient execution, a custom hardware accelerator was developed and integrated into a Zynq UltraScale+ ZU3CG FPGA. The proposed system achieves an accuracy of 87.4%, processes up to 5 audio samples per second, and operates at only 1.4 W, demonstrating its suitability for autonomous, energy-efficient wildlife monitoring applications.

32 pages, 8110 KB  
Article
A Secure and Efficient Sharing Framework for Student Electronic Academic Records: Integrating Zero-Knowledge Proof and Proxy Re-Encryption
by Xin Li, Minsheng Tan and Wenlong Tian
Future Internet 2026, 18(1), 47; https://doi.org/10.3390/fi18010047 - 12 Jan 2026
Abstract
A sharing framework based on Zero-Knowledge Proof (ZKP) and Proxy Re-encryption (PRE) technologies offers a promising solution for sharing Student Electronic Academic Records (SEARs). As core credentials in the education sector, student records are characterized by strong identity binding, the need for long-term retention, frequent cross-institutional verification, and sensitive information. Compared with electronic health records and government archives, they face more complex security, privacy protection, and storage scalability challenges during sharing. These records not only contain sensitive data such as personal identity and academic performance but also serve as crucial evidence in key scenarios such as further education, employment, and professional title evaluation. Leakage or tampering could have irreversible impacts on a student’s career development. Furthermore, traditional blockchain technology faces storage capacity limitations when storing massive academic records, and existing general electronic record sharing solutions struggle to meet the high-frequency verification demands of educational authorities, universities, and employers for academic data. This study proposes a dedicated sharing framework for students’ electronic academic records, leveraging PRE technology and the distributed ledger characteristics of blockchain to ensure transparency and immutability during sharing. By integrating the InterPlanetary File System (IPFS) with Ethereum Smart Contract (SC), it addresses blockchain storage bottlenecks, enabling secure storage and efficient sharing of academic records. Relying on optimized ZKP technology, it supports verifying the authenticity and integrity of records without revealing sensitive content. Furthermore, the introduction of gate circuit merging, constant folding techniques, Field-Programmable Gate Array (FPGA) hardware acceleration, and the efficient Bulletproofs algorithm alleviates the high computational complexity of ZKP, significantly reducing proof generation time. The experimental results demonstrate that the framework, while ensuring strong privacy protection, can meet the cross-scenario sharing needs of student records and significantly improve sharing efficiency and security. Therefore, this method exhibits superior security and performance in privacy-preserving scenarios. This framework can be applied to scenarios such as cross-institutional academic certification, employer background checks, and long-term management of academic records by educational authorities, providing secure and efficient technical support for the sharing of electronic academic credentials in the digital education ecosystem.

28 pages, 1828 KB  
Article
Edge Detection on a 2D-Mesh NoC with Systolic Arrays: From FPGA Validation to GDSII Proof-of-Concept
by Emma Mascorro-Guardado, Susana Ortega-Cisneros, Francisco Javier Ibarra-Villegas, Jorge Rivera, Héctor Emmanuel Muñoz-Zapata and Emilio Isaac Baungarten-Leon
Appl. Sci. 2026, 16(2), 702; https://doi.org/10.3390/app16020702 - 9 Jan 2026
Abstract
Edge detection is a key building block in real-time image-processing applications such as drone-based infrastructure inspection, autonomous navigation, and remote sensing. However, its computational cost remains a challenge for resource-constrained embedded systems. This work presents a hardware-accelerated edge detection architecture based on a homogeneous 2D-mesh Network-on-Chip (NoC) integrating systolic arrays to efficiently perform the convolution operations required by the Sobel filter. The proposed architecture was first developed and validated as a 3 × 3 mesh prototype on FPGA (Xilinx Zynq-7000, Zynq-7010, XC7Z010-CLG400A, Zybo board, utilizing 26,112 LUTs, 24,851 flip-flops, and 162 DSP blocks), achieving a throughput of 8.8 Gb/s with a power consumption of 0.79 W at 100 MHz. Building upon this validated prototype, a reduced 2 × 2 node cluster with 14-bit word width was subsequently synthesized at the physical level as a proof-of-concept using the OpenLane RTL-to-GDSII open-source flow targeting the SkyWater 130 nm PDK (sky130A). Post-layout analysis confirms the manufacturability of the design, with a total power consumption of 378 mW and compliance with timing constraints, demonstrating the feasibility of mapping the proposed architecture to silicon and its suitability for drone-based infrastructure monitoring applications.
(This article belongs to the Special Issue Advanced Integrated Circuit Design and Applications)
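The Sobel filter the NoC accelerates is a pair of 3 × 3 convolutions whose per-pixel multiply-accumulates are exactly what the systolic arrays parallelize. A plain software reference of that computation (correlation-style, no kernel flip, as is typical in imaging pipelines; the image and its size are made up for illustration):

```python
import numpy as np

# 3x3 Sobel kernels for horizontal and vertical gradients.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
KY = KX.T

def conv2d_valid(img, k):
    """Direct 3x3 'valid' convolution: the per-window MAC a systolic array maps to PEs."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def sobel_magnitude(img):
    gx = conv2d_valid(img, KX)
    gy = conv2d_valid(img, KY)
    return np.hypot(gx, gy)

# A vertical step edge: the response peaks at the boundary columns.
img = np.zeros((8, 8))
img[:, 4:] = 255.0
mag = sobel_magnitude(img)
```

The hardware version streams windows through the mesh instead of looping, but computes the same nine-term sums per output pixel.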

16 pages, 2077 KB  
Article
Cross Comparison Between Thermal Cycling and High Temperature Stress on I/O Connection Elements
by Mamta Dhyani, Tsuriel Avraham, Joseph B. Bernstein and Emmanuel Bender
Micromachines 2026, 17(1), 88; https://doi.org/10.3390/mi17010088 - 9 Jan 2026
Abstract
This work examines resistance drift in FPGA I/O paths subjected to combined electrical and thermal stress, using a Xilinx Spartan-6 device as a representative platform. A multiplexed measurement approach was employed, in which multiple I/O pins were externally shorted and sequentially activated, enabling precise tracking of voltage, current, and effective series resistance over time, under controlled bias conditions. Two accelerated stress modes were investigated: high-temperature dwell in the range of 80–120 °C and thermal cycling between 80 and 140 °C. Both stress modes exhibited similar sub-linear (power-law) time dependence on resistance change, indicating cumulative degradation behavior. However, Arrhenius analysis revealed a strong contrast in effective activation energy: approximately 0.62 eV for high-temperature dwell and approximately 1.3 eV for thermal cycling. This divergence indicates that distinct physical mechanisms dominate under each stress regime. The lower activation energy is consistent with electrically and thermally driven on-die degradation within the FPGA I/O macro, including bias-related aging of output drivers and pad-level structures. In contrast, the higher activation energy observed under thermal cycling is characteristic of diffusion- and creep-dominated thermo-mechanical damage in package-level interconnects, such as solder joints. These findings demonstrate that resistance-based monitoring of FPGA I/O paths can discriminate between device-dominated and package-dominated aging mechanisms, providing a practical foundation for reliability assessment and self-monitoring methodologies in complex electronic systems.
(This article belongs to the Special Issue Emerging Packaging and Interconnection Technology, Second Edition)
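The contrast in activation energies (≈0.62 eV vs. ≈1.3 eV) translates directly into very different Arrhenius acceleration factors. A quick sketch, assuming a hypothetical 55 °C use temperature against the 120 °C stress point from the abstract:

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor of a stress temperature relative to use conditions."""
    t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
    return math.exp(ea_ev / K_B * (1.0 / t_use - 1.0 / t_stress))

# Activation energies reported in the abstract; the 55 C use point is an assumption.
af_dwell = arrhenius_af(0.62, 55.0, 120.0)  # on-die degradation mechanism
af_cycle = arrhenius_af(1.3, 55.0, 120.0)   # package-level mechanism
```

Because the activation energy sits in the exponent, the roughly 2× difference in E_a means the same stress temperature accelerates the package-level mechanism by orders of magnitude more than the on-die one.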

24 pages, 1630 KB  
Article
Hardware-Oriented Approximations of Softmax and RMSNorm for Efficient Transformer Inference
by Yiwen Kang and Dong Wang
Micromachines 2026, 17(1), 84; https://doi.org/10.3390/mi17010084 - 7 Jan 2026
Abstract
With the rapid advancement of Transformer-based large language models (LLMs), these models have found widespread applications in industrial domains such as code generation and non-functional requirement (NFR) classification in software engineering. However, recent research has primarily focused on optimizing linear matrix operations, while nonlinear operators remain relatively underexplored. This paper proposes hardware-efficient approximation and acceleration methods for the Softmax and RMSNorm operators to reduce resource cost and accelerate Transformer inference while maintaining model accuracy. For the Softmax operator, an additional range reduction based on the SafeSoftmax technique enables the adoption of a bipartite lookup table (LUT) approximation and acceleration. The bit-width configuration is optimized through Pareto frontier analysis to balance precision and hardware cost, and an error compensation mechanism is further applied to preserve numerical accuracy. The division is reformulated as a logarithmic subtraction implemented with a small LOD-driven lookup table, eliminating expensive dividers. For RMSNorm, LOD is further leveraged to decompose the reciprocal square root into mantissa and exponent parts, enabling parallel table lookup and a single multiplication. Based on these optimizations, an FPGA-based pipelined accelerator is implemented, achieving low operator-level latency and power consumption with significantly reduced hardware resource usage while preserving model accuracy.
(This article belongs to the Special Issue Advances in Field-Programmable Gate Arrays (FPGAs))
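SafeSoftmax makes every exponent input non-positive, which is what lets a finite lookup table stand in for exp(). A minimal floating-point sketch of that idea, with an assumed clipping range and table size rather than the paper's Pareto-optimized bit widths, and a plain divider where the paper uses logarithmic subtraction:

```python
import numpy as np

R = 16.0        # assumed clipping range: exp(-16) ~ 1e-7 is negligible
ENTRIES = 1024  # assumed table size, not the paper's optimized configuration
LUT = np.exp(-np.linspace(0.0, R, ENTRIES))

def exp_lut(x):
    """Approximate exp(x) for x <= 0 by table lookup."""
    idx = np.clip((-x / R * (ENTRIES - 1)).astype(int), 0, ENTRIES - 1)
    return LUT[idx]

def softmax_lut(x):
    x = x - x.max()      # SafeSoftmax range reduction: inputs now lie in (-inf, 0]
    e = exp_lut(x)
    return e / e.sum()   # the paper replaces this divider with log-domain subtraction

x = np.array([3.0, 1.0, 0.2])
approx = softmax_lut(x)
exact = np.exp(x - x.max())
exact = exact / exact.sum()
```

With 1024 entries over [-16, 0] the step is about 0.016, so the approximation tracks the exact softmax to within a couple of percent; a bipartite table trades that single large LUT for two smaller ones.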

17 pages, 9165 KB  
Article
An FPGA-Based Reconfigurable Accelerator for Real-Time Affine Transformation in Industrial Imaging Heterogeneous SoC
by Yang Zhang, Dejun Chen, Huixiong Ruan, Hongyu Jia, Yong Liu and Ying Luo
Sensors 2026, 26(1), 316; https://doi.org/10.3390/s26010316 - 3 Jan 2026
Abstract
Real-time affine transformation, a core operation for image correction and registration of industrial cameras and scanners, faces challenges including the high computational cost of interpolation and inefficient data access. In this study, we propose a reconfigurable accelerator architecture based on a heterogeneous system-on-chip (SoC). The architecture decouples tasks into control and data paths: the ARM core in the processing system (PS) handles parameter matrix generation and scheduling, whereas the FPGA-based acceleration module in programmable logic (PL) implements the proposed PATRM algorithm. By integrating multiplication-free design and affine matrix properties, PATRM adopts Q15.16 fixed-point computation and AXI4 burst transmission for efficient block data prefetching and pipelined processing. Experimental results demonstrate 25 frames per second (FPS) for 2095×2448 resolution images, representing a 128.21 M pixel/s throughput, which is 5.3× faster than the Block AT baseline with a peak signal-to-noise ratio (PSNR) exceeding 26 dB. Featuring low resource consumption and dynamic reconfigurability, the accelerator meets the real-time requirements of industrial scanner correction and other high-performance image processing tasks.
(This article belongs to the Section Sensing and Imaging)
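Q15.16 fixed point keeps 16 fractional bits, so the product of two fixed-point values needs one right shift to renormalize. A sketch of an affine coordinate mapping in that format, with hypothetical helper names (this illustrates the number format only, not the PATRM algorithm itself):

```python
import math

# Q15.16: 1 sign bit, 15 integer bits, 16 fractional bits.
FRAC = 16
ONE = 1 << FRAC

def to_fix(x):
    return int(round(x * ONE))

def fix_mul(a, b):
    # Product of two Q15.16 values is Q30.32; the shift restores Q15.16.
    return (a * b) >> FRAC

def affine_fix(x, y, m):
    """Map pixel (x, y) through a 2x3 affine matrix, entirely in Q15.16.
    Hypothetical helper for illustration."""
    a, b, tx, c, d, ty = m
    xs = fix_mul(a, to_fix(x)) + fix_mul(b, to_fix(y)) + tx
    ys = fix_mul(c, to_fix(x)) + fix_mul(d, to_fix(y)) + ty
    return xs / ONE, ys / ONE   # back to float only for inspection

# A 30-degree rotation, with the matrix converted to Q15.16 once up front.
th = math.radians(30)
m = [to_fix(math.cos(th)), to_fix(-math.sin(th)), 0,
     to_fix(math.sin(th)), to_fix(math.cos(th)), 0]
xs, ys = affine_fix(100, 50, m)
```

The 2^-16 quantization of the matrix entries keeps coordinate error well below a pixel even at the far corner of a 2448-wide image, which is why this format suits the application.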

25 pages, 7245 KB  
Article
A Hardware-Friendly Joint Denoising and Demosaicing System Based on Efficient FPGA Implementation
by Jiqing Wang, Xiang Wang and Yu Shen
Micromachines 2026, 17(1), 44; https://doi.org/10.3390/mi17010044 - 29 Dec 2025
Abstract
This paper designs a hardware-implementable joint denoising and demosaicing acceleration system. Firstly, a lightweight network architecture with multi-scale feature extraction based on partial convolution is proposed at the algorithm level. The partial convolution scheme can reduce the redundancy of filters and feature maps, thereby reducing memory accesses, and achieve excellent visual effects with a smaller model complexity. In addition, multi-scale extraction can expand the receptive field while reducing model parameters. Then, we apply separable convolution and partial convolution to reduce the parameters of the model. Compared with the standard convolutional solution, the parameters and MACs are reduced by 83.38% and 77.71%, respectively. Moreover, different networks bring different memory access and complex computing methods; thus, we introduce a unified and flexibly configurable hardware acceleration processing platform and implement it on the Xilinx Zynq UltraScale+ FPGA board. Finally, compared with the state-of-the-art neural network solution on the Kodak24 set, the peak signal-to-noise ratio and the structural similarity index measure are approximately improved by 2.36 dB and 0.0806, respectively, and the computing efficiency is improved by 2.09×. Furthermore, the hardware architecture supports multi-parallelism and can adapt to different edge-embedded scenarios. Overall, the image processing task solution proposed in this paper offers clear advantages in the joint denoising and demosaicing system.
(This article belongs to the Special Issue Advances in Field-Programmable Gate Arrays (FPGAs))

13 pages, 1258 KB  
Article
A Binary Convolution Accelerator Based on Compute-in-Memory
by Wenpeng Cui, Zhe Zheng, Pan Li, Ming Li, Yu Liu and Yingying Chi
Electronics 2026, 15(1), 117; https://doi.org/10.3390/electronics15010117 - 25 Dec 2025
Abstract
As AI workloads move to edge devices, the von Neumann architecture is hindered by memory- and power-wall limitations. We present an SRAM-based compute-in-memory binary convolution accelerator that stores and transports only 1-bit weights and activations, maps MACs to bitwise XNOR–popcount, and fuses BatchNorm, HardTanh, and binarization into a single affine-and-threshold unit. Residual paths are handled by in-accumulator summation to minimize data movement. FPGA validation shows 87.6% CIFAR-10 accuracy consistent with a bit-accurate software reference, a compute-only latency of 2.93 ms per 32 × 32 image at 50 MHz, sustained at only 1.52 W. These results demonstrate an efficient and practical path to deploying edge models under tight power and memory budgets.
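The XNOR-popcount mapping rests on the identity that, for ±1 values encoded as bits {0, 1}, the dot product equals 2·popcount(XNOR(w, a)) − n. A bit-level sketch checked against explicit ±1 arithmetic (the vector length and data are illustrative):

```python
import numpy as np

def xnor_popcount_mac(w_bits, a_bits):
    """Binary dot product: {-1,+1} values encoded as bits {0,1}.
    sum(w*a) == 2*popcount(XNOR(w, a)) - n."""
    n = len(w_bits)
    xnor = ~(w_bits ^ a_bits) & 1   # 1 wherever the bits agree
    return 2 * int(xnor.sum()) - n

rng = np.random.default_rng(1)
w_bits = rng.integers(0, 2, 64)     # 64 binary weights
a_bits = rng.integers(0, 2, 64)     # 64 binary activations

# Reference result using explicit +/-1 arithmetic.
w = 2 * w_bits - 1
a = 2 * a_bits - 1
ref = int((w * a).sum())
mac = xnor_popcount_mac(w_bits, a_bits)
```

In SRAM-based compute-in-memory, the XNOR happens at the bit cell and the popcount in the read path, so a whole MAC row costs roughly one memory access.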

30 pages, 1176 KB  
Article
Towards Secure and Adaptive AI Hardware: A Framework for Optimizing LLM-Oriented Architectures
by Sabya Shtaiwi and Dheya Mustafa
Computers 2026, 15(1), 10; https://doi.org/10.3390/computers15010010 - 25 Dec 2025
Abstract
With the increasing computational demands of large language models (LLMs), there is a pressing need for more specialized hardware architectures capable of supporting their dynamic and memory-intensive workloads. This paper examines recent studies on hardware acceleration for AI, focusing on three critical aspects: energy efficiency, architectural adaptability, and runtime security. While notable advancements have been made in accelerating convolutional and deep neural networks using ASICs, FPGAs, and compute-in-memory (CIM) approaches, most existing solutions remain inadequate for the scalability and security requirements of LLMs. Our comparative analysis highlights two key limitations: restricted reconfigurability and insufficient support for real-time threat detection. To address these gaps, we propose a novel architectural framework grounded in modular adaptivity, memory-centric processing, and security-by-design principles. The paper concludes with a proposed evaluation roadmap and outlines promising future research directions, including RISC-V-based secure accelerators, neuromorphic co-processors, and hybrid quantum-AI integration.

14 pages, 2142 KB  
Article
Accelerating Post-Quantum Cryptography: A High-Efficiency NTT for ML-KEM on RISC-V
by Duc-Thuan Dam, Khai-Duy Nguyen, Duc-Hung Le and Cong-Kha Pham
Electronics 2026, 15(1), 100; https://doi.org/10.3390/electronics15010100 - 24 Dec 2025
Abstract
Post-quantum cryptography (PQC) is rapidly being standardized, with key primitives such as Key Encapsulation Mechanisms (KEMs) and Digital Signature Algorithms (DSAs) moving into practical applications. While initial research focused on pure software and hardware implementations, the focus is shifting toward flexible, high-efficiency solutions suitable for widespread deployment. A system-on-chip is a viable option with the ability to coordinate between hardware and software flexibly. However, the main drawback of this system is the latency in exchanging data during computation. Currently, most SoCs are implemented on FPGAs, and there is a lack of SoCs realized on ASICs. This paper introduces a complete RISC-V SoC design in an ASIC for Module Lattice-based KEM. Our system features a RISC-V processor tightly integrated with a high-efficiency Number Theoretic Transform (NTT) accelerator. This accelerator leverages custom instructions to accelerate cryptographic operations. Our research has achieved the following results: (1) The accelerator provides a speedup of up to 14.51× for NTT and 16.75× for inverse NTT operations compared to other RISC-V platforms; (2) This leads to end-to-end performance improvements for ML-KEM of up to 56.5% for security level I, 50.9% for level III, and 45.4% for level V; (3) The ASIC design is fabricated using a 180 nm CMOS process at a maximum operating frequency of 118 MHz with an area overhead of 8.7%. The chip achieved a minimum power consumption of 5.913 μW at 10 kHz and 0.9 V of supply voltage.
(This article belongs to the Special Issue Recent Advances in Quantum Information)
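The NTT such an accelerator speeds up is a discrete Fourier transform over a prime field. A toy reference over Z_17 with n = 8 makes the forward/inverse pair concrete (ML-KEM itself uses q = 3329 and n = 256, and hardware uses the O(n log n) butterfly network rather than this O(n²) form):

```python
Q, N, W = 17, 8, 2            # toy parameters: 2 is a primitive 8th root of unity mod 17
W_INV = pow(W, -1, Q)         # modular inverse of the root of unity
N_INV = pow(N, -1, Q)         # scaling factor for the inverse transform

def ntt(coeffs, root):
    """O(n^2) reference NTT: evaluate the polynomial at successive powers of 'root'."""
    return [sum(c * pow(root, i * j, Q) for j, c in enumerate(coeffs)) % Q
            for i in range(N)]

def intt(values):
    return [(x * N_INV) % Q for x in ntt(values, W_INV)]

a = [3, 1, 4, 1, 5, 9, 2, 6]  # polynomial coefficients in Z_17
a_hat = ntt(a, W)             # forward transform
round_trip = intt(a_hat)      # inverse transform recovers the coefficients
```

The point of the transform is that polynomial multiplication becomes pointwise multiplication of the transformed coefficients, which is what makes a dedicated NTT unit so effective for lattice schemes.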

23 pages, 2239 KB  
Article
SparseDroop: Hardware–Software Co-Design for Mitigating Voltage Droop in DNN Accelerators
by Arnab Raha, Shamik Kundu, Arghadip Das, Soumendu Kumar Ghosh and Deepak A. Mathaikutty
J. Low Power Electron. Appl. 2026, 16(1), 2; https://doi.org/10.3390/jlpea16010002 - 23 Dec 2025
Abstract
Modern deep neural network (DNN) accelerators must sustain high throughput while avoiding performance degradation from supply voltage (VDD) droop, which occurs when large arrays of multiply–accumulate (MAC) units switch concurrently and induce high peak current (ICCmax) transients on the power delivery network (PDN). In this work, we focus on ASIC-class DNN accelerators with tightly synchronized MAC arrays rather than FPGA-based implementations, where such cycle-aligned switching is most pronounced. Conventional guardbanding and reactive countermeasures (e.g., throttling, clock stretching, or emergency DVFS) either waste energy or incur non-trivial throughput penalties. We propose SparseDroop, a unified hardware-conscious framework that proactively shapes instantaneous current demand to mitigate droop without reducing sustained computing rate. SparseDroop comprises two complementary techniques. (1) SparseStagger, a lightweight hardware-friendly droop scheduler that exploits the inherent unstructured sparsity already present in the weights and activations—it does not introduce any additional sparsification. SparseStagger dynamically inspects the zero patterns mapped to each processing element (PE) column and staggers MAC start times within a column so that high-activity bursts are temporally interleaved. This fine-grain reordering smooths ICC trajectories, lowers the probability and depth of transient VDD dips, and preserves cycle-level alignment at tile/row boundaries—thereby maintaining no throughput loss and negligible control overhead. (2) SparseBlock, an architecture-aware, block-wise-structured sparsity induction method that intentionally introduces additional sparsity aligned with the accelerator’s dataflow. By co-designing block layout with the dataflow, SparseBlock reduces the likelihood that all PEs in a column become simultaneously active, directly constraining ICCmax and peak dynamic power on the PDN. Together, SparseStagger’s opportunistic staggering (from existing unstructured weight zeros) and SparseBlock’s structured, layout-aware sparsity induction (added to prevent peak-power excursions) deliver a scalable, low-overhead solution that improves voltage stability, energy efficiency, and robustness, integrates cleanly with the accelerator dataflow, and preserves model accuracy with modest retraining or fine-tuning.

28 pages, 2463 KB  
Article
Design of an Energy-Efficient SHA-3 Accelerator on Artix-7 FPGA for Secure Network Applications
by Abdulmunem A. Abdulsamad and Sándor R. Répás
Computers 2026, 15(1), 3; https://doi.org/10.3390/computers15010003 - 21 Dec 2025
Abstract
As the demand for secure communication and data integrity in embedded and networked systems continues to grow, there is an increasing need for cryptographic solutions that provide robust security while efficiently using energy and hardware resources. Although software-based implementations of SHA-3 provide design flexibility, they often struggle to meet the performance and power limitations of constrained environments. This study introduces a hardware-accelerated SHA-3 solution tailored for the Xilinx Artix-7 FPGA. The architecture includes a fully pipelined Keccak-f [1600] core and incorporates design strategies such as selective loop unrolling, clock gating, and pipeline balancing to enhance overall efficiency. Developed in VHDL and synthesised using Vivado 2024.2.2, the design achieves a throughput of 1.35 Gbps at 210 MHz, with a power consumption of 0.94 W—yielding an energy efficiency of 1.44 Gbps/W. Validation using NIST SHA-3 vectors confirms its reliable performance, making it a promising candidate for secure embedded systems, including IoT platforms, edge devices, and real-time authentication applications.
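Validation against NIST SHA-3 vectors can be reproduced in software: Python's hashlib exposes the same Keccak-based SHA3-256, and the FIPS 202 digest of the empty message is a convenient first check for any hardware core's output.

```python
import hashlib

# FIPS 202 test vector: SHA3-256 of the empty message.
EXPECTED = "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a"

digest = hashlib.sha3_256(b"").hexdigest()
ok = digest == EXPECTED   # a hardware core's digest can be compared the same way
```

Longer NIST vectors (short-message and long-message sets) exercise the padding and multi-block absorb phases that a pipelined Keccak-f[1600] core must also get right.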

30 pages, 10600 KB  
Article
Edge-to-Cloud Continuum Orchestrator Based on Heterogeneous Nodes for Urban Traffic Monitoring
by Pietro Ruiu, Andrea Lagorio, Claudio Rubattu, Matteo Anedda, Michele Sanna and Mauro Fadda
Future Internet 2025, 17(12), 574; https://doi.org/10.3390/fi17120574 - 13 Dec 2025
Abstract
This paper presents an edge-to-cloud orchestrator capable of supporting services running at the edge on heterogeneous nodes based on general-purpose processing units and a Field-Programmable Gate Array (FPGA) platform (i.e., the AMD Kria K26 SoM) in an urban environment, integrated with a series of cloud-based services and capable of minimizing energy consumption. A use case of vehicle traffic monitoring is considered in a mobility scenario involving computing nodes equipped with video acquisition systems to evaluate the feasibility of the system. Since the use case concerns the monitoring of vehicular traffic by AI-based image and video processing, specific support for application orchestration in the form of containers was required. The development concerned the feasibility of managing containers with hardware acceleration derived from the Vitis AI design flow, leveraged to accelerate AI inference on the AMD Kria K26 SoM. A Kubernetes-based controller node was designed to facilitate the tracking and monitoring of specific vehicles. These vehicles may either be flagged by law enforcement authorities due to legal concerns or identified by the system itself through detection mechanisms deployed in computing nodes. Strategically distributed across the city, these nodes continuously analyze traffic, identifying vehicles that match the search criteria. Using containerized microservices and Kubernetes orchestration, the infrastructure ensures that tracking operations remain uninterrupted even in high-traffic scenarios.
(This article belongs to the Special Issue Convergence of IoT, Edge and Cloud Systems)
