Hardware Acceleration for Machine Learning

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 15 March 2026 | Viewed by 11518

Special Issue Editors


Guest Editor
Department of Electronic Engineering, University of Rome Tor Vergata, 00133 Rome, Italy
Interests: electronics and communication engineering; digital signal processing; machine learning; FPGA

Guest Editor
Department of Electronic Engineering, University of Rome Tor Vergata, 00133 Rome, Italy
Interests: digital signal processing; machine learning; digital architectures; digital electronics for space

Special Issue Information

Dear Colleagues,

Non-functional constraints such as execution time, memory capacity, and energy consumption pose significant challenges for designers in the field of machine learning systems. New applications are being proposed that integrate various functionalities into everyday objects, imposing several additional requirements on embedded system designers, such as the following:

  • Increased computing workloads when processing and fusing data from multiple sources, even when using advanced machine learning techniques;
  • Reduced power consumption, allowing for smaller batteries and renewable power sources;
  • Faster interaction with the environment, requiring a level of data-processing performance that is often only reached with hardware implementations.

For example, the physical dimensions and power consumption of embedded Internet of Things systems are frequently of interest. The need for small systems, however, does not preclude greater demands on functionality and speed. At the same time, designers must respond to a growing need for more powerful edge systems capable of managing vast fleets of connected devices while running resource-intensive algorithms such as sensor fusion, feedback control, and machine learning. In this environment, developers must understand hardware architectures, and embedded design in general, as well as the strategies for extracting their full performance potential.

This Special Issue invites researchers to contribute original research, case studies, and reviews that address topics related to the design and application of hardware accelerators for machine learning.

The topics relevant for this Special Issue include (but are not limited to) the following:

  • Low-power IoT applications;
  • Embedded FPGA and SoC implementations;
  • Embedded ASIC implementations;
  • Machine learning on the edge;
  • Efficient data processing algorithms.

Dr. Sergio Spanò
Prof. Dr. Gian Carlo Cardarilli
Dr. Luca Di Nunzio
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • embedded systems
  • digital electronics
  • low-power IoT
  • edge computing
  • FPGA
  • ASIC
  • systems-on-chip
  • machine learning

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)


Research

37 pages, 2861 KB  
Article
AdamN: Accelerating Deep Learning Training via Nested Momentum and Exact Bias Handling
by Mohamed Aboulsaad and Adnan Shaout
Electronics 2026, 15(3), 670; https://doi.org/10.3390/electronics15030670 - 3 Feb 2026
Abstract
This paper introduces AdamN, a nested-momentum adaptive optimizer that replaces the single Exponential Moving Average (EMA) numerator in Adam/AdamW with a compounded EMA of gradients plus an EMA of that EMA, paired with an exact double-EMA bias correction. This yields a smoother, curvature-aware search direction at essentially first-order cost, with longer, more faithful gradient-history memory and a stable, warmup-free start. Under comparable wall-clock time per epoch, AdamN matches AdamW’s final accuracy on ResNet-18/CIFAR-100, while reaching 80% and 90% training-accuracy milestones ~127 s and ~165 s earlier, respectively. On pre-benchmarking workloads (toy problems and CIFAR-10), AdamN shows the same pattern: faster early-phase convergence with similar or slightly better final accuracy. On language modeling with token-frequency imbalance—Wikitext-2-style data with training-only token corruption and a 10% low-resource variant—AdamN lowers rare-token perplexity versus AdamW without warmup while matching head and mid-frequency performance. In full fine-tuning of Llama 3.1–8B on a small dataset, AdamN reaches AdamW’s final perplexity in roughly half the steps (≈2.25× faster time-to-quality). Finally, on a ViT-Base/16 transferred to CIFAR-100 (batch size 256), AdamN achieves 88.8% test accuracy vs. 84.2% for AdamW and reaches 40–80% validation-accuracy milestones in the first epoch (AdamW reaches 80% by epoch 59), reducing epochs, energy use, and cost. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
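
To make the nested-momentum idea concrete, the sketch below shows one plausible Adam-style update in which the first-moment EMA is itself smoothed by a second EMA and both are bias-corrected in closed form. It is an illustration of the concept described in the abstract, not the authors' reference implementation of AdamN: the function name, hyperparameter defaults, and the particular bias-correction formula (obtained by summing the double-EMA recursion for a constant gradient) are assumptions.

```python
import torch

@torch.no_grad()
def nested_momentum_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                         eps=1e-8, weight_decay=0.0):
    """Illustrative Adam-style update with a nested (double) first-moment EMA.

    Hypothetical sketch, not the AdamN reference code: `m` is the usual EMA of
    gradients, `n` is an EMA of `m`, and the update direction uses the
    bias-corrected `n` in place of Adam's bias-corrected `m`.
    """
    state.setdefault("step", 0)
    state.setdefault("m", torch.zeros_like(param))   # EMA of gradients
    state.setdefault("n", torch.zeros_like(param))   # EMA of m (nested momentum)
    state.setdefault("v", torch.zeros_like(param))   # EMA of squared gradients

    state["step"] += 1
    t = state["step"]
    m, n, v = state["m"], state["n"], state["v"]

    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # m_t = b1*m_{t-1} + (1-b1)*g_t
    n.mul_(beta1).add_(m, alpha=1 - beta1)               # n_t = b1*n_{t-1} + (1-b1)*m_t
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # v_t = b2*v_{t-1} + (1-b2)*g_t^2

    # Closed-form bias corrections: 1 - b1^t for a single EMA and, for the
    # nested EMA (assuming a constant gradient stream when summing the
    # recursion), 1 - b1^t - t*(1 - b1)*b1^t.
    n_hat = n / (1 - beta1 ** t - t * (1 - beta1) * beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    if weight_decay:                                     # decoupled, AdamW-style
        param.mul_(1 - lr * weight_decay)
    param.addcdiv_(n_hat, v_hat.sqrt().add_(eps), value=-lr)
```

A full optimizer would wrap a per-parameter step like this in a torch.optim.Optimizer subclass; the point of the sketch is only that the extra smoothing costs one additional EMA buffer and a few scalar operations per step.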
26 pages, 3967 KB  
Article
A General-Purpose AXI Plug-and-Play Hyperdimensional Computing Accelerator
by Rocco Martino, Marco Pisani, Marco Angioli, Marcello Barbirotta, Antonio Mastrandrea, Antonello Rosato and Mauro Olivieri
Electronics 2026, 15(2), 489; https://doi.org/10.3390/electronics15020489 - 22 Jan 2026
Viewed by 139
Abstract
Hyperdimensional Computing (HDC) offers a robust and energy-efficient paradigm for edge intelligence; however, current hardware accelerators are often proprietary, tailored to the target learning task and tightly coupled to specific CPU microarchitectures, limiting portability and adoption. To address this, and democratize the deployment of HDC hardware, we present a general-purpose, plug-and-play accelerator IP that implements the Binary Spatter Code framework as a standalone, host-agnostic module. The design is compliant with the AMBA AXI4 standard and provides an AXI4-Lite control plane and DMA-driven AXI4-Stream datapaths coupled to a banked scratchpad memory. The architecture supports synthesis-time scalability, enabling high-throughput transfers independently of the host processor, while employing microarchitectural optimizations to minimize silicon area. A multi-layer C++ software (GitHub repository commit 3ae3b46) stack running in Linux userspace provides a unified programming model, abstracting low-level hardware interactions and enabling the composition of complex HDC pipelines. Implemented on a Xilinx Zynq XC7Z020 SoC, the accelerator achieves substantial gains over an ARM Cortex-A9 baseline, with primitive-level speedups of up to 431×. On end-to-end classification benchmarks, the system delivers average speedups of 68.45× for training and 93.34× for inference. The complete RTL and software stack are released as open-source hardware to support reproducible research and rapid adoption on heterogeneous SoCs. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
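
For readers unfamiliar with the Binary Spatter Code (BSC) framework the accelerator implements, here is a minimal software sketch of the primitives it offloads (binding, bundling, and similarity). The 10,000-bit vector width, function names, and tie-breaking rule are illustrative assumptions, not the IP's actual interface, which is driven through the AXI4-Lite control plane and AXI4-Stream datapaths described above.

```python
import numpy as np

DIM = 10_000  # illustrative hypervector width; an assumption, not the IP's fixed size

def random_hv(rng: np.random.Generator) -> np.ndarray:
    """Draw a random binary hypervector (a BSC atomic symbol)."""
    return rng.integers(0, 2, size=DIM, dtype=np.uint8)

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Binding in BSC is element-wise XOR; it associates two symbols."""
    return np.bitwise_xor(a, b)

def bundle(hvs: list) -> np.ndarray:
    """Bundling is an element-wise majority vote; it superimposes symbols."""
    votes = np.sum(np.stack(hvs), axis=0)
    return (2 * votes > len(hvs)).astype(np.uint8)   # exact ties resolve to 0

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized Hamming similarity: 1.0 for identical, ~0.5 for random pairs."""
    return 1.0 - np.count_nonzero(a != b) / DIM

# Tiny usage example: encode a key/value pair and query it back.
rng = np.random.default_rng(0)
key, value = random_hv(rng), random_hv(rng)
record = bind(key, value)                 # XOR binding is its own inverse
assert similarity(bind(record, key), value) == 1.0
```

Operations like these are independent across the vector width, which is the kind of parallelism a wide datapath over a banked scratchpad can exploit.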
22 pages, 2066 KB  
Article
A Unified FPGA/CGRA Acceleration Pipeline for Time-Critical Edge AI: Case Study on Autoencoder-Based Anomaly Detection in Smart Grids
by Eleftherios Mylonas, Chrisanthi Filippou, Sotirios Kontraros, Michael Birbas and Alexios Birbas
Electronics 2026, 15(2), 414; https://doi.org/10.3390/electronics15020414 - 17 Jan 2026
Viewed by 254
Abstract
The ever-increasing need for energy-efficient implementation of AI algorithms has driven the research community towards the development of many hardware architectures and frameworks for AI. A lot of work has been presented around FPGAs, while more sophisticated architectures like CGRAs have also been at the center. However, AI ecosystems are isolated and fragmented, with no standardized way to compare different frameworks with detailed Power–Performance–Area (PPA) analysis. This paper bridges the gap by presenting a unified, fully open-source hardware-aware AI acceleration pipeline that enables seamless deployment of neural networks on both FPGA and CGRA architectures. Built around the Brevitas quantization framework, it supports two distinct backend flows: FINN for high-performance dataflow accelerators and CGRA4ML for low-power coarse-grained reconfigurable designs. To facilitate this, a model translation layer from QONNX to QKeras is also introduced. To demonstrate its effectiveness, we use an autoencoder model for anomaly detection in wind turbines. We deploy our accelerated models on the AMD’s ZCU104 and benchmark it against a Raspberry Pi. Evaluation on a realistic cyber–physical testbed shows that the hardware-accelerated solutions achieve substantial performance and energy-efficiency gains—up to 10× and 37× faster inference per flow and over 11× higher efficiency—while maintaining acceptable reconstruction accuracy. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
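
As a rough illustration of the entry point to such a pipeline, the sketch below defines a small quantized autoencoder with Brevitas' QuantLinear/QuantReLU layers. The layer sizes, bit widths, and class name are assumptions rather than the configuration used in the paper, and the export/backend hand-off is only indicated in the closing comment.

```python
import torch.nn as nn
from brevitas.nn import QuantLinear, QuantReLU

class QuantAutoencoder(nn.Module):
    """Small fixed-point autoencoder for reconstruction-based anomaly detection.

    Hypothetical sizes and bit widths, chosen only to illustrate the flow.
    """
    def __init__(self, n_features=64, bottleneck=8, bit_width=4):
        super().__init__()
        self.encoder = nn.Sequential(
            QuantLinear(n_features, 32, bias=True, weight_bit_width=bit_width),
            QuantReLU(bit_width=bit_width),
            QuantLinear(32, bottleneck, bias=True, weight_bit_width=bit_width),
            QuantReLU(bit_width=bit_width),
        )
        self.decoder = nn.Sequential(
            QuantLinear(bottleneck, 32, bias=True, weight_bit_width=bit_width),
            QuantReLU(bit_width=bit_width),
            QuantLinear(32, n_features, bias=True, weight_bit_width=bit_width),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# After quantization-aware training, a model like this would be exported to
# QONNX and handed either to FINN (FPGA dataflow backend) or, via the
# QONNX-to-QKeras translation layer, to CGRA4ML; at run time an input is
# flagged as anomalous when its reconstruction error exceeds a threshold.
```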
26 pages, 1607 KB  
Article
Analyzing Performance of Data Preprocessing Techniques on CPUs vs. GPUs with and Without the MapReduce Environment
by Sikha S. Bagui, Colin Eller, Rianna Armour, Shivani Singh, Subhash C. Bagui and Dustin Mink
Electronics 2025, 14(18), 3597; https://doi.org/10.3390/electronics14183597 - 10 Sep 2025
Viewed by 1267
Abstract
Data preprocessing is usually necessary before running most machine learning classifiers. This work compares three different preprocessing techniques, minimal preprocessing, Principal Components Analysis (PCA), and Linear Discriminant Analysis (LDA). The efficiency of these three preprocessing techniques is measured using the Support Vector Machine (SVM) classifier. Efficiency is measured in terms of statistical metrics such as accuracy, precision, recall, the F-1 measure, and AUROC. The preprocessing times and the classifier run times are also compared using the three differently preprocessed datasets. Finally, a comparison of performance timings on CPUs vs. GPUs with and without the MapReduce environment is performed. Two newly created Zeek Connection Log datasets, collected using the Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework, UWF-ZeekData22 and UWF-ZeekDataFall22, are used for this work. Results from this work show that binomial LDA, on average, performs the best in terms of statistical measures as well as timings using GPUs or MapReduce GPUs. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
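
The comparison in this paper can be reproduced in spirit with a few scikit-learn pipelines; the snippet below is a CPU-only sketch on synthetic data. The dataset, feature counts, and component counts are placeholders, and it says nothing about the GPU or MapReduce configurations benchmarked in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data; the paper uses the UWF-ZeekData22 / UWF-ZeekDataFall22 Zeek logs.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=12,
                           n_classes=2, random_state=0)

pipelines = {
    "minimal": make_pipeline(StandardScaler(), LinearSVC(dual=False)),
    "pca":     make_pipeline(StandardScaler(), PCA(n_components=10),
                             LinearSVC(dual=False)),
    "lda":     make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=1),
                             LinearSVC(dual=False)),  # binary task: at most 1 component
}

for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name:8s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```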
12 pages, 7716 KB  
Article
Hardware Accelerator Design by Using RT-Level Power Optimization Techniques on FPGA for Future AI Mobile Applications
by Achyuth Gundrapally, Yatrik Ashish Shah, Sai Manohar Vemuri and Kyuwon (Ken) Choi
Electronics 2025, 14(16), 3317; https://doi.org/10.3390/electronics14163317 - 20 Aug 2025
Cited by 2 | Viewed by 1612
Abstract
In resource-constrained edge environments—such as mobile devices, IoT systems, and electric vehicles—energy-efficient Convolution Neural Network (CNN) accelerators on mobile Field Programmable Gate Arrays (FPGAs) are gaining significant attention for real-time object detection tasks. This paper presents a low-power implementation of the Tiny YOLOv4 object detection model on the Xilinx ZCU104 FPGA platform by using Register Transfer Level (RTL) optimization techniques. We proposed three RTL techniques in the paper: (i) Local Explicit Clock Enable (LECE), (ii) operand isolation, and (iii) Enhanced Clock Gating (ECG). A novel low-power design of Multiply-Accumulate (MAC) operations, which is one of the main components in the AI algorithm, was proposed to eliminate redundant signal switching activities. The Tiny YOLOv4 model, trained on the COCO dataset, was quantized and compiled using the Tensil tool-chain for fixed-point inference deployment. Post-implementation evaluation using Vivado 2022.2 demonstrates around 29.4% reduction in total on-chip power. Our design supports real-time detection throughput while maintaining high accuracy, making it ideal for deployment in battery-constrained environments such as drones, surveillance systems, and autonomous vehicles. These results highlight the effectiveness of RTL-level power optimization for scalable and sustainable edge AI deployment. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
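
Operand isolation is easiest to see in a small behavioral model: if the multiplier's inputs are held steady on cycles when the MAC is not enabled, the useless input toggling (and the dynamic power it would cause) disappears while the accumulated result stays identical. The Python model below is a conceptual illustration only, not the paper's RTL; the 16-bit width, 30% enable rate, and function names are assumptions.

```python
import random

WIDTH = 16  # illustrative operand width

def toggles(prev, curr):
    """Count bit flips between consecutive values (a crude proxy for switching power)."""
    return bin((prev ^ curr) & ((1 << WIDTH) - 1)).count("1")

def mac_switching(samples, isolate):
    """Cycle-level MAC model; returns (accumulated result, input-toggle count).

    With isolate=True the multiplier operands are frozen whenever `enable` is
    low, so idle-cycle data changes never reach the multiplier inputs.
    """
    acc, held_a, held_b, activity = 0, 0, 0, 0
    for enable, a, b in samples:
        nxt_a, nxt_b = (a, b) if (enable or not isolate) else (held_a, held_b)
        activity += toggles(held_a, nxt_a) + toggles(held_b, nxt_b)
        held_a, held_b = nxt_a, nxt_b
        if enable:
            acc += held_a * held_b
    return acc, activity

rng = random.Random(0)
stream = [(rng.random() < 0.3, rng.randrange(1 << WIDTH), rng.randrange(1 << WIDTH))
          for _ in range(10_000)]
assert mac_switching(stream, True)[0] == mac_switching(stream, False)[0]  # same result
print("input toggles without isolation:", mac_switching(stream, False)[1])
print("input toggles with isolation:   ", mac_switching(stream, True)[1])
```

Clock gating (the LECE and ECG techniques in the paper) attacks the same waste from the clock-tree side: registers that will not capture a new value on a given cycle simply do not receive a clock edge.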
20 pages, 766 KB  
Article
Accelerating Deep Learning Inference: A Comparative Analysis of Modern Acceleration Frameworks
by Ishrak Jahan Ratul, Yuxiao Zhou and Kecheng Yang
Electronics 2025, 14(15), 2977; https://doi.org/10.3390/electronics14152977 - 25 Jul 2025
Cited by 4 | Viewed by 7350
Abstract
Deep learning (DL) continues to play a pivotal role in a wide range of intelligent systems, including autonomous machines, smart surveillance, industrial automation, and portable healthcare technologies. These applications often demand low-latency inference and efficient resource utilization, especially when deployed on embedded or edge devices with limited computational capacity. As DL models become increasingly complex, selecting the right inference framework is essential to meeting performance and deployment goals. In this work, we conduct a comprehensive comparison of five widely adopted inference frameworks: PyTorch, ONNX Runtime, TensorRT, Apache TVM, and JAX. All experiments are performed on the NVIDIA Jetson AGX Orin platform, a high-performance computing solution tailored for edge artificial intelligence workloads. The evaluation considers several key performance metrics, including inference accuracy, inference time, throughput, memory usage, and power consumption. Each framework is tested using a wide range of convolutional and transformer models and analyzed in terms of deployment complexity, runtime efficiency, and hardware utilization. Our results show that certain frameworks offer superior inference speed and throughput, while others provide advantages in flexibility, portability, or ease of integration. We also observe meaningful differences in how each framework manages system memory and power under various load conditions. This study offers practical insights into the trade-offs associated with deploying DL inference on resource-constrained hardware. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
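
Measuring the metrics the authors compare takes some care on a GPU-backed platform, because kernel launches are asynchronous. The harness below is a minimal PyTorch-only sketch of a latency/throughput measurement (warm-up, explicit synchronization, batch-normalized throughput); it is not the paper's methodology, and the model, batch size, and iteration counts in the usage comment are placeholders.

```python
import time
import torch

def benchmark(model, example_input, warmup=20, iters=100):
    """Return (mean latency in ms, throughput in samples/s) for one input shape.

    Minimal sketch: warm up first (caches, autotuners), then synchronize around
    the timed loop so asynchronous GPU work is actually included in the timing.
    """
    device = example_input.device
    model.eval()
    with torch.inference_mode():
        for _ in range(warmup):
            model(example_input)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
        elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    throughput = iters * example_input.shape[0] / elapsed
    return latency_ms, throughput

# Hypothetical usage on a CUDA-capable device such as a Jetson module:
#   model = torchvision.models.resnet18().eval().cuda()
#   x = torch.randn(8, 3, 224, 224, device="cuda")
#   print(benchmark(model, x))
```

Memory and power would be read out separately (e.g., from the platform's power rails or a tool such as tegrastats on Jetson), and comparing ONNX Runtime, TensorRT, TVM, or JAX requires the equivalent session or engine setup for each framework.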