Hardware Acceleration for Machine Learning

Spanò, Sergio; Cardarilli, Gian Carlo; Di Nunzio, Luca

doi:10.3390/electronics15091857

Open Access Editorial

by Sergio Spanò *, Gian Carlo Cardarilli and Luca Di Nunzio

Department of Electronic Engineering, Tor Vergata University of Rome, 00133 Rome, Italy

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1857; https://doi.org/10.3390/electronics15091857

Submission received: 22 April 2026 / Accepted: 25 April 2026 / Published: 28 April 2026

(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)

1. Introduction

In recent years, hardware acceleration for machine learning has made significant strides, evolving from general-purpose solutions to increasingly specialized heterogeneous platforms designed to integrate artificial intelligence directly into embedded systems and edge architectures. This evolution is driven by the need to maintain or even increase computational performance while simultaneously reducing power consumption, latency, and memory footprint, objectives that software advances alone can no longer achieve [1,2,3].

Recent literature documents how embedded machine learning accelerators are converging toward “versatile” solutions capable of supporting a variety of models and workloads while remaining constrained by tight power budgets. Garreau et al. [1] provide a comprehensive overview of this landscape, analyzing architectures and hardware–software co-design strategies for integrated ML systems. In parallel, several proposals systematically address low-power VLSI architectures for the edge, with dedicated processors capable of executing AI inference on IoT nodes under extreme energy constraints, as shown in works on low-power edge AI processors [4,5].

Beyond the purely circuit-level dimension, comprehensive architectures have emerged for the efficient processing of complex neural networks in embedded and real-time contexts. In the FPGA domain, Dehnavi et al. [6] propose a CNN accelerator based on a novel “Convolutional Processing Element” capable of reducing idle states and external memory accesses, thereby simultaneously improving throughput and energy efficiency. Complementarily, the Pflow framework introduces an end-to-end heterogeneous chain for CNN inference on FPGAs, decoupling model description from hardware details and achieving significant speedups and efficiency gains [7]. At the system level, this trend toward heterogeneous platforms is further exemplified by solutions like TaPaSCo-AIE, which leverages AMD AI Engines with Vitis 2022.2 and streaming infrastructure for highly energy-efficient acceleration in large-scale neural scenarios [8].

Energy and performance optimization of inference architectures is also addressed along methodological lines, with proposals combining quantization, approximation, and dataflow-driven acceleration techniques to drastically reduce consumption while maintaining accuracy. An example is the VLSI architecture for edge AI described in Progress in Electronics and Communication Engineering, which integrates strategies such as approximate computing, clock gating, and quantization to achieve power savings exceeding 40% compared to baseline solutions [9]. In parallel, comprehensive surveys relate model compression, accelerator specialization, and new forms of algorithm–hardware co-design, highlighting how efficiency increasingly depends not only on peak throughput but also on data movement optimization and the memory hierarchy [2].

It is noteworthy that hardware acceleration for AI now has a cross-cutting impact on diverse application domains beyond classical signal or image processing. In the materials field, for instance, Zhang et al. [10] analyze how advanced deep learning techniques paired with hardware acceleration solutions enable new modes of materials design and information extraction from large scientific databases. These results reinforce the idea that architectural choices, from general-purpose processors to accelerator-centric systems, neuromorphic solutions, and heterogeneous frameworks, are now an integral part of the innovation process across multiple sectors.

In this context, the Special Issue “Hardware Acceleration for Machine Learning” in Electronics aims to provide an updated and coherent snapshot of this rapidly evolving ecosystem. The selected contributions cover the full spectrum from low-power IoT applications to embedded FPGA and SoC implementations, ASIC edge solutions, and efficient data-processing algorithms, continuing the path traced by previous Special Issues of the journal on advanced devices and circuits for extreme operational scenarios. This collection thus serves as a reference for researchers and designers interested in combining machine learning’s potential with truly implementable, scalable, and energy-sustainable hardware platforms, while opening new research perspectives for next-generation intelligent systems.

2. Overview of Contributions

The eight articles comprising the “Hardware Acceleration for Machine Learning” Special Issue synthesize cutting-edge developments in hardware acceleration strategies for machine learning, showcasing the latest research advances across key embedded platforms and optimization techniques. These contributions encompass comparative framework analyses for deep learning inference, low-power RTL designs for CNN accelerators, performance analyses of preprocessing techniques, unified FPGA/CGRA pipelines for edge AI, general-purpose hyperdimensional computing accelerators, adaptive optimizers for training efficiency, DSP-free MAC architectures for edge AI, and integer-state dynamics in quantized SNNs, as detailed in Contributions 1–8.

Ratul et al. (Contribution 1) present an exhaustive, multi-dimensional benchmarking study evaluating five leading deep learning inference frameworks: PyTorch version 2.3.0, ONNX Runtime version 1.17.1, TensorRT version 8.6.2.3, Apache TVM version 0.21.0, and JAX version 0.4.28. All frameworks undergo rigorous testing on the NVIDIA Jetson AGX Orin edge AI platform, a high-performance embedded solution with an Ampere GPU architecture. The evaluation encompasses traditional convolutional neural networks, such as ResNet-152 with 60.2 million parameters, MobileNetV2 optimized for mobile deployment, the ultra-compact SqueezeNet, the balanced EfficientNet-B0, and the classic VGG16, as well as the Swin Transformer architecture and the real-time YOLOv5s object detector. Testing utilizes the complete ImageNet ILSVRC2012 validation dataset, comprising 50,000 high-resolution images across 1000 classes, and measures critical performance dimensions, including inference latency, throughput expressed as images per second, peak system memory utilization, instantaneous power consumption, and classification accuracy through both Top-1 and Top-5 metrics across batch sizes ranging from 2 to 128. Key findings reveal that TensorRT achieves superior performance in larger batch configurations through sophisticated layer and tensor fusion, kernel-level autotuning for Jetson hardware, and dynamic tensor memory allocation that minimizes fragmentation. ONNX Runtime demonstrates superior cross-platform portability, leveraging execution providers and graph optimizations like constant folding and operator fusion, while PyTorch establishes a flexible research baseline with cuDNN acceleration. Apache TVM excels through AutoTVM empirical cost-model tuning, which generates hardware-specific kernel schedules, and JAX shows promising just-in-time compilation via XLA, though it is constrained by memory limitations at scale.
The study maintains near-identical prediction accuracy across all frameworks after FP16 mixed-precision optimization, providing developers with actionable guidance on framework selection, balancing deployment complexity against runtime efficiency and hardware utilization characteristics.
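
The latency and throughput dimensions measured in such a study can be illustrated with a minimal timing harness. The sketch below is not the benchmarking code of Contribution 1: the function names `benchmark` and `dummy_model`, the warm-up and iteration counts, and the synthetic workload are all assumptions introduced here for illustration.

```python
import time
from statistics import median

def benchmark(run_batch, batch_size, warmup=3, iters=10):
    """Time a callable that processes one batch; return (latency_s, images_per_s)."""
    for _ in range(warmup):              # warm-up runs are discarded
        run_batch(batch_size)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch(batch_size)
        samples.append(time.perf_counter() - start)
    latency = median(samples)            # median is less sensitive to outliers
    return latency, batch_size / latency

def dummy_model(batch_size):
    # Hypothetical stand-in workload whose cost grows with the batch size.
    return sum(i * i for i in range(500 * batch_size))

for bs in (2, 8, 32, 128):
    lat, ips = benchmark(dummy_model, bs)
    print(f"batch={bs:4d}  latency={lat * 1e3:8.3f} ms  throughput={ips:10.1f} img/s")
```

When a real framework dispatches work asynchronously to a GPU, the timed region generally has to include an explicit device synchronization, otherwise only the launch overhead is measured.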

Gundrapally et al. (Contribution 2) present a power-optimized implementation of the Tiny YOLOv4 object detection network targeting the AMD Xilinx ZCU104 FPGA-SoC (Advanced Micro Devices, Inc. (AMD), Santa Clara, CA 95054, USA) evaluation platform. Starting from a register-transfer-level (RTL) baseline generated by the Tensil toolchain for fixed-point quantized inference of a COCO-trained model, the authors introduce three RTL-level low-power design methodologies applied to the computational critical paths. Local Explicit Clock Enable (LECE) provides fine-grained control over flip-flop and pipeline register updates by gating clock signals so that registers update only during valid computation cycles; operand isolation decouples inactive input datapaths from downstream functional units using conditional logic guards, suppressing unnecessary signal transitions within multipliers and adders; and Enhanced Clock Gating (ECG) employs XOR-based multi-stage gating logic across wide datapaths and deep pipelines to minimize switching activity with lower power than conventional AND/OR schemes. Vivado 2022.2 post-place-and-route analysis reports significant power improvements, including a 29.4 percent total on-chip power reduction from 5.044 watts (baseline) to 3.561 watts (optimized configuration), alongside 33.9 percent dynamic power savings from 4.338 watts to 2.866 watts. Resource utilization is also substantially reduced to 9.0 percent CLB LUTs versus 20.9 percent (baseline), 1.4 percent CLB registers versus 10.8 percent, and 1.3 percent DSP slices versus 41.9 percent, while maintaining 32.2 percent Block RAM tile occupancy, equivalent to 3.77 megabytes for feature map buffering.
Real-time validation through the PYNQ Jupyter Notebook interface demonstrates 9.87 frames per second live object detection processing from USB camera input, achieving 43.9 GOPs per watt efficiency, representing a 1.37 times improvement over comparable FPGA implementations. These results validate RTL power optimization efficacy for battery-constrained applications, including drones, surveillance systems, and autonomous vehicles, without compromising COCO-trained detection accuracy.
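
The quoted power savings follow directly from the reported wattages, as the short check below shows. The helper name `reduction_percent` is introduced here for illustration, and the static power figures are not stated in the text: they are inferred under the assumption that total on-chip power is the sum of dynamic and static power.

```python
def reduction_percent(baseline, optimized):
    """Relative saving of `optimized` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - optimized) / baseline

total = reduction_percent(5.044, 3.561)      # total on-chip power, watts
dynamic = reduction_percent(4.338, 2.866)    # dynamic power, watts

# Assumption: static power = total - dynamic (not quoted in the text).
static_base = 5.044 - 4.338
static_opt = 3.561 - 2.866

print(f"total saving:   {total:.1f} %")
print(f"dynamic saving: {dynamic:.1f} %")
print(f"inferred static power: {static_base:.3f} W -> {static_opt:.3f} W")
```

Both percentages round to the values given above (29.4 and 33.9), and the inferred static power changes only marginally, which is consistent with the savings coming from reduced switching activity.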

Bagui et al. (Contribution 3) conduct a methodologically rigorous comparative analysis quantifying the impact of three distinct data preprocessing pipelines on support vector machine classifier performance for tactic-level cybersecurity threat detection under the MITRE ATT&CK framework. The study leverages two large-scale Zeek network connection log datasets generated at the University of West Florida Cyber Range facilities: UWF-ZeekData22, containing 9.28 million attack records, dominated by the Reconnaissance (9.28 M instances) and Discovery (2086 instances) tactics, matched against 9.28 million benign samples; and UWF-ZeekDataFall22, with 350 K attack and 350 K benign records spanning the Resource Development (275 K), Reconnaissance (51 K), Discovery (17 K), Defense Evasion (3 K), and Privilege Escalation (3 K) tactics. Processing occurs within the Hadoop distributed file system, utilizing Apache Spark version 3.5.0 with explicit MapReduce-style parallelization across GPU-enabled clusters. The preprocessing pipelines compare minimal processing (mean imputation, StringIndexer conversion, and dropping of the uid and label columns) against dimensionality reduction via normalized Principal Component Analysis and supervised Linear Discriminant Analysis. Binary classification targets individual tactics versus benign traffic, while a multinomial One-vs-Rest SVM is used for multi-tactic discrimination. Evaluation metrics encompass accuracy, precision, recall, F1 score, and AUROC, alongside preprocessing and classifier execution times across CPU, GPU, and MapReduce-GPU execution environments. Binomial LDA consistently performs best across all statistical measures and demonstrates the fastest execution times, particularly under GPU and MapReduce-GPU acceleration, outperforming both PCA and minimal preprocessing baselines.
This represents a pioneering tactic-level Zeek log classification approach, advancing production cybersecurity machine learning pipelines through dimensionality reduction optimized for massive-scale network telemetry analysis.
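
The One-vs-Rest decomposition and the per-tactic metrics mentioned above can be sketched in a few lines. This is a plain-Python illustration and not the Spark pipeline of Contribution 3: the function names and the toy labels are hypothetical, and no classifier is trained here.

```python
def one_vs_rest_labels(labels, positive):
    """Binary view of a multi-class label list: 1 for `positive`, 0 otherwise."""
    return [1 if y == positive else 0 for y in labels]

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ovr_predict(scores_per_class):
    """One-vs-Rest decision: pick the class whose binary scorer is most confident."""
    return max(scores_per_class, key=scores_per_class.get)

# Hypothetical toy data: true and predicted tactic per connection record.
truth = ["Reconnaissance", "Discovery", "benign", "Discovery"]
pred = ["Reconnaissance", "Discovery", "Discovery", "benign"]

p, r, f1 = precision_recall_f1(one_vs_rest_labels(truth, "Discovery"),
                               one_vs_rest_labels(pred, "Discovery"))
print(p, r, f1)
```

Each tactic yields one binary problem of this kind, and the multi-class prediction is the tactic whose binary model returns the highest score.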

Mylonas et al. (Contribution 4) construct a fully open-source unified hardware-aware artificial intelligence acceleration pipeline that seamlessly bridges the Brevitas version 0.12.0 quantization-aware training and post-training quantization frontend with two distinct acceleration backends. The FINN version 0.10.1 flow generates high-throughput customized FPGA dataflow architectures emphasizing streaming processing elements and parallel pipelines, while the CGRA4ML framework targets low-power coarse-grained reconfigurable array mappings through temporal processing element reuse and dynamic runtime reconfiguration. A novel model intermediate representation translation layer from QONNX version 0.4.0 to QKeras version 0.9.0 handles Conv/ConvTranspose nodes alongside standard activation functions, including ReLU, facilitating Brevitas model portability across flows. The pipeline demonstrates practical effectiveness through the comprehensive deployment of an autoencoder neural network for anomaly detection targeting wind turbine fleet monitoring within realistic smart grid cyber-physical systems testbed environments. Quantization-aware trained and post-training quantized models deploy equivalently across both acceleration paths on the AMD ZCU104 evaluation platform, achieving up to a 10 times inference speedup per individual flow relative to the software baseline, alongside a 37 times aggregate throughput improvement and over an 11 times higher energy efficiency through detailed power-performance-area analysis. Reconstruction mean-squared error accuracy remains preserved across hardware-accelerated implementations compared to the reference execution, with FINN excelling in ultralow latency requirements and CGRA4ML demonstrating advantages in sustained low-power operation for continuous, time-critical edge monitoring applications within industrial energy infrastructure.
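
The effect of bitwidth on reconstruction error, which underlies both quantization-aware training and post-training quantization, can be illustrated with a symmetric uniform quantizer. The sketch below is a generic textbook scheme and does not reproduce the Brevitas, FINN, or CGRA4ML implementations: the function names and the example weights are assumptions.

```python
def quantize(values, bits):
    """Symmetric uniform quantization; returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [c * scale for c in codes]

def mse(a, b):
    """Mean-squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.42, -0.17, 0.05, -0.88, 0.31]   # hypothetical layer weights
for bits in (8, 4, 2):
    codes, scale = quantize(weights, bits)
    err = mse(weights, dequantize(codes, scale))
    print(f"{bits}-bit codes={codes} mse={err:.6f}")
```

Because the largest magnitude maps exactly onto the largest code, no value is clipped and each element is off by at most half a quantization step.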

Martino et al. (Contribution 5) construct a host-agnostic, AMBA AXI4-compliant, plug-and-play Binary Spatter Code (BSC) hyperdimensional computing accelerator intellectual property core targeting Xilinx Zynq XC7Z020 system-on-chip platforms hosted on Digilent Zybo Z7-20 evaluation boards. The design advances beyond prior instruction-set-coupled approaches through a fully standalone, modular implementation featuring synthesis-time scalable SIMD parallel datapaths, runtime-configurable hypervector dimensionality via dedicated control-status registers, a multi-bank scratchpad memory hierarchy with configurable bank count and depth, and optimized functional units that eliminate the area-intensive permutation shifter through block-cyclic memory address reordering while preserving the bijectivity and decorrelation properties essential for BSC algebra. The redesigned clipping unit employs packed comparator arrays with shift-append reconstruction, reducing dynamic multiplexers to 16 LUTs and 34 flip-flops versus hundreds in the baseline. An AXI4-Lite memory-mapped control plane handles configuration, while high-bandwidth AXI4-Stream DMA-driven data paths couple to external DDR through the Xilinx DMA engine, with asynchronous FIFOs isolating the clock domain crossing. A complete multi-layer C software stack for Linux userspace, released via GitHub repository commit 3ae3b46, provides a unified dual-mode API supporting both pure software emulation for verification and hardware dispatch, abstracting low-level register programming, DMA buffer management via pre-allocated contiguous physical memory regions, and composite pipeline orchestration for encoding, training, and inference workflows.
Compared with an embedded dual-core ARM Cortex-A9 software baseline, primitive-level speedups reach 431 times across binding (XOR), bundling (saturating counters), permutation, similarity (Hamming distance accumulation), clipping (majority vote), and associative search operations, yielding average 68.45 times training and 93.34 times inference acceleration across end-to-end classification benchmarks. Full RTL code, testbenches, and software artifacts enable reproducible research and accelerate broader adoption of hyperdimensional computing for robust, noise-tolerant edge intelligence applications.
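
The BSC primitives listed above have compact software definitions, which the following sketch expresses on Python integers used as bit vectors. It is a software illustration only, not the accelerator's RTL or its C API: the dimensionality `D`, the function names, and the tie-breaking rule in `bundle` are assumptions.

```python
import random

D = 1024                      # hypervector dimensionality (illustrative choice)
MASK = (1 << D) - 1

def random_hv(rng):
    """Draw a random dense binary hypervector."""
    return rng.getrandbits(D)

def bind(a, b):
    """Binding: bitwise XOR, which is its own inverse."""
    return a ^ b

def permute(a, shift=1):
    """Permutation: cyclic rotation by `shift` positions (a bijection)."""
    shift %= D
    return ((a << shift) | (a >> (D - shift))) & MASK

def hamming(a, b):
    """Similarity: number of differing bits."""
    return bin(a ^ b).count("1")

def bundle(vectors):
    """Bundling with clipping: bitwise majority vote; ties resolve to 0."""
    out = 0
    for bit in range(D):
        ones = sum((v >> bit) & 1 for v in vectors)
        if 2 * ones > len(vectors):
            out |= 1 << bit
    return out

def classify(query, prototypes):
    """Associative search: label of the prototype nearest in Hamming distance."""
    return min(prototypes, key=lambda label: hamming(query, prototypes[label]))

rng = random.Random(0)
a, b, c, d = (random_hv(rng) for _ in range(4))
proto = {"class_abc": bundle([a, b, c]), "class_d": d}
print(classify(a, proto))
```

A bundled vector stays measurably closer to each of its constituents than to an unrelated random vector, which is the property that associative search relies on.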

Aboulsaad and Shaout (Contribution 6) introduce AdamN, a drop-in replacement for the widely adopted AdamW optimizer that enhances first-order stochastic optimization through nested momentum accumulation and exact bias correction. The core methodological advance replaces AdamW’s conventional single exponential moving average (EMA) numerator with a compounded EMA-of-EMA structure applied to raw gradients, producing triangular-with-exponential-tail temporal kernels that extend the effective gradient-history memory while preserving responsiveness to fresh gradient information. An analytically derived exact double-EMA debiasing factor simultaneously eliminates both inner and outer cold-start initialization biases without requiring ad hoc warmup schedules, enabling stable convergence from iteration zero at identical first-order computational complexity. Extensive empirical validation across diverse architectures and tasks confirms the practical benefits. On ResNet-18 trained for CIFAR-100 classification, AdamN matches AdamW’s final test accuracy while reaching the 80 percent and 90 percent training accuracy milestones 127 s and 165 s faster in wall-clock time, respectively. Fine-tuning the Llama 3.1-8B language model on small-domain datasets halves the required optimization steps, corresponding to a 2.25 times faster time-to-target perplexity. Vision Transformer transfer learning from ViT-Base/16 pretraining to CIFAR-100 with batch size 256 achieves 88.8 percent test accuracy versus AdamW’s 84.2 percent, reaching the 40 and 80 percent validation milestones by epoch three, whereas AdamW reaches the 80 percent threshold only at epoch 59. Language modeling on token-frequency imbalanced Wikitext-2-style datasets with training-only token corruption reduces rare-token perplexity without warmup dependency while matching head- and mid-frequency performance.
Controlled ablation studies isolate nested momentum and exact debiasing contributions, confirming their synergistic role in reducing total training epochs, energy consumption, and deployment costs across vision and natural language processing workloads.
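
The idea of an EMA-of-EMA numerator with an exact cold-start correction can be illustrated as follows. This is not the published AdamN update rule: it is a sketch derived here under the assumptions that both averages share the same decay `beta` and start at zero, in which case the weights of the nested average sum to 1 - beta^t (1 + t (1 - beta)) after t steps. The function name `nested_ema` is hypothetical.

```python
def nested_ema(grads, beta=0.9):
    """Yield (debiased single EMA, debiased EMA-of-EMA) for each step."""
    m1 = m2 = 0.0
    for t, g in enumerate(grads, start=1):
        m1 = beta * m1 + (1 - beta) * g       # inner EMA of raw gradients
        m2 = beta * m2 + (1 - beta) * m1      # outer EMA of the inner EMA
        c1 = 1 - beta ** t                            # Adam-style correction
        c2 = 1 - beta ** t * (1 + t * (1 - beta))     # sum of nested-EMA weights
        yield m1 / c1, m2 / c2

# With a constant gradient, both debiased estimates equal it from step one.
for step, (single, nested) in enumerate(nested_ema([3.0] * 5), start=1):
    print(step, round(single, 6), round(nested, 6))
```

Without the correction `c2`, the nested average would start near zero and take many steps to reach the gradient scale, which is the cold-start bias that warmup schedules otherwise compensate for.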

Liu and Heiyan (Contribution 7) introduce AEMAC, a software–hardware co-designed edge artificial intelligence accelerator that circumvents the “DSP wall” constraining convolutional neural network scalability on entry-level Xilinx Zynq-7020 FPGAs, which possess only 220 DSP48E1 slices that are rapidly exhausted by competing SoC workloads. The architecture decouples arithmetic computation from DSP resource availability through a two-level strategy. On the software side, a dynamic integer scaling mechanism maps narrow-range floating-point activations (typically clustering near zero post-ReLU) into optimal integer domains via 99.9th percentile alignment, preventing outlier-induced saturation. On the hardware side, a strictly zero-DSP tri-mode computing engine leverages the abundant slice LUT fabric with a 2-bit Tiny mode for extreme sparsity regions (|x| < 3), a 4-bit Approx mode for moderate values (|x| < 12), and an 8-bit Precise mode for large magnitudes. A statistical bias compensation mechanism derived from quantization-noise analysis counteracts the systematic −0.5 N negative bias inherent to logic-efficient floor truncation through conditional +1 injection via the accumulator CARRYIN port during low-precision modes, achieving global error cancellation via mode hybridization, where the Tiny/Approx positive bias offsets the Precise-mode negative bias, yielding near-zero expected error E ≈ 0 across sparsity-dominated neural distributions (β ≈ 0.5–0.8). The single-core implementation, validated in Vivado 2020.2, consumes 56 LUTs and 33 registers with a shallow four-stage critical path ensuring 100 MHz operation. A 64-core cluster sustains 26.1 GOPs/W at 0.490 W under worst-case random workloads. Micro-level arithmetic fidelity analysis confirms a 16.7 percent reduction in mean absolute error versus naive truncation.
On a CIFAR-10 custom lightweight CNN, end-to-end Top-1 accuracy is recovered to 64.74 ± 0.12 percent, matching the FP32 baseline of 64.64 percent compared with 64.19 ± 0.15 percent for uncompensated quantization. The approach scales to an industry-standard ResNet-20 (92.02 percent) and MobileNetV2 (91.30 percent via depthwise-to-group convolution adaptation satisfying the N ≥ 36 statistical boundary for Law of Large Numbers-based error cancellation). ImageNet ResNet-18 bit-accurate simulation recovers accuracy within 0.71 percent versus an 8.34 percent drop under naive quantization. Trace-driven hardware verification confirms bit-true equivalence against the software golden model even under a worst-case N = 4608 accumulation depth.
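
The origin of the −0.5 N bias can be seen in a small integer experiment: dropping low-order bits by floor truncation loses, on average, close to half a unit per accumulated term. The sketch below is an illustration and not the AEMAC datapath: the operand sequence is a deterministic stand-in, and the compensation rule (injecting +1 on every other term) is a deliberately simple assumption that only mimics the spirit of the conditional CARRYIN injection.

```python
def truncated_sum(products, shift):
    """Accumulate products after dropping `shift` low bits by floor truncation."""
    return sum(p >> shift for p in products)

def compensated_sum(products, shift):
    """Same accumulation with +1 injected on every other term (stand-in rule)."""
    return sum((p >> shift) + (i & 1) for i, p in enumerate(products))

shift = 4
products = list(range(1, 2 ** 12, 7))      # deterministic stand-in operands
n = len(products)
exact = sum(products) / 2 ** shift         # result without truncation

err_trunc = truncated_sum(products, shift) - exact
err_comp = compensated_sum(products, shift) - exact

print(f"N = {n}")
print(f"truncation error:  {err_trunc:.4f}  (per term {err_trunc / n:.4f})")
print(f"compensated error: {err_comp:.4f}")
```

The truncation error grows linearly with the accumulation depth, which is why it matters at depths such as N = 4608, whereas a compensation that adds back the expected loss keeps the residual small.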

Zhang (Contribution 8) systematically formalizes quantized spiking neural networks as deterministic discrete-time dynamical systems modeled through bounded integer-state update maps incorporating hardware-relevant implementation semantics, including shift-based leakage approximation, unsigned saturation clipping rather than wraparound overflow, and no-reset post-spike dynamics. The integer-state neuron evolution follows the canonical form $V_{i,t+1} = (V_{i,t} \gg k) + \sum_{j} w_{ij} S_{j,t}$ with $S_{i,t+1} = [V_{i,t+1} \geq \theta_{i}]$, generating spikes via threshold comparison, where the right shift implements an exponential decay factor $1/2^{k}$, avoiding costly multipliers. Theoretical analysis establishes the network as a discrete map $F : \mathbb{Z}_{b}^{N} \to \mathbb{Z}_{b}^{N}$ on a bounded integer lattice of cardinality $(2^{b})^{N}$, guaranteeing a finite number of reachable states and eventual periodicity, since deterministic updates from any initial condition must eventually revisit prior states, forming cycles. Extensive simulation sweeps characterize quantization-sensitive temporal regimes across network sizes N = 30–130 neurons, connection densities 0.1–0.9, and bitwidths b = 1–16, over T = 1000 steps. One-to-two-bit configurations collapse to quiescence due to insufficient state resolution for sustained suprathreshold dynamics; 3-bit systems mark a critical transition exhibiting intermittent spiking; 4–16 bitwidths sustain rich recurrent activity patterns. For N = 64 and sparsity = 0.5, the median empirical recurrence lengths are 8.0 and 10.5 steps for 4-bit and higher precisions, respectively. Observed attractor structures, including fixed points (silence/saturation), short periodic orbits (rhythmic synchronization), and long bounded transients, underscore precision as an active dynamical design parameter rather than a mere approximation error source. The findings position bitwidth selection as a strategic lever controlling spike timing stability, activity persistence, and computational repertoire in hardware spiking systems, motivating precision-aware training, FPGA/ASIC co-design frameworks, and attractor-based analysis beyond continuous-time approximations.
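
The update map and the eventual-periodicity argument can be made concrete with a small simulation that records visited states until one repeats. The sketch follows the canonical form quoted above (right-shift leak, saturating accumulation, threshold comparison, no reset), but several details are assumptions made here for illustration: unsigned random weights, a single threshold `theta` shared by all neurons, and the function names.

```python
import random

def make_network(n, density, b, seed=0):
    """Random weight matrix; hypothetical unsigned weights, 0 = no connection."""
    rng = random.Random(seed)
    wmax = 2 ** b - 1
    return [[rng.randint(1, wmax) if rng.random() < density else 0
             for _ in range(n)] for _ in range(n)]

def step(v, s, w, k, b, theta):
    """One update: V <- sat((V >> k) + sum_j w_ij S_j), S <- [V >= theta]."""
    vmax = 2 ** b - 1
    v_next = []
    for i in range(len(v)):
        drive = sum(w[i][j] for j in range(len(v)) if s[j])
        v_next.append(min(vmax, (v[i] >> k) + drive))   # saturate, no wraparound
    s_next = [1 if x >= theta else 0 for x in v_next]   # no reset after a spike
    return v_next, s_next

def recurrence(v, s, w, k, b, theta, max_steps=1000):
    """Return (transient length, period) of the first revisited state, or None."""
    seen = {}
    for t in range(max_steps):
        key = (tuple(v), tuple(s))
        if key in seen:
            return seen[key], t - seen[key]
        seen[key] = t
        v, s = step(v, s, w, k, b, theta)
    return None

n, b = 4, 2
w = make_network(n, density=0.5, b=b, seed=1)
result = recurrence([3, 0, 1, 2], [1, 0, 0, 1], w, k=1, b=b, theta=2)
print("transient, period:", result)
```

Since the spike vector is a function of the membrane state after the first step, at most (2^b)^N + 1 distinct states can be visited, so a repeat is certain for this small network well within `max_steps`.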

In conclusion, driven by escalating requirements for resource-efficient machine learning deployment across the edge computing continuum, from battery-constrained IoT sensors to industrial cyber-physical infrastructure, these contributions establish comprehensive advances across the acceleration spectrum. Framework benchmarking guides practical inference-engine selection; RTL power methodologies enable sustainable CNN deployment; unified quantization pipelines democratize FPGA/CGRA accessibility; nested optimizers accelerate development-to-production timelines; preprocessing advances cybersecurity telemetry analysis; hyperdimensional accelerators unlock noise-robust paradigms; DSP-free architectures shatter resource bottlenecks; and integer-state SNN theory reframes quantization as dynamical control. Collectively, these works illuminate pathways toward next-generation heterogeneous SoC integration, advanced quantization-aware training regimes, neuromorphic–digital hybrid systems, and beyond-CMOS computational substrates capable of sustaining exponential AI scaling within planetary energy constraints.

Author Contributions

Conceptualization, S.S., L.D.N. and G.C.C.; methodology, S.S.; validation, S.S. and L.D.N.; formal analysis, S.S. and G.C.C.; investigation, S.S. and L.D.N.; resources, S.S. and G.C.C.; data curation, S.S. and L.D.N.; writing—original draft preparation, S.S.; writing—review and editing, S.S., L.D.N. and G.C.C.; visualization, S.S. and L.D.N.; supervision, G.C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

List of Contributions

1. Ratul, I.J.; Zhou, Y.; Yang, K. Accelerating deep learning inference: A comparative analysis of modern acceleration frameworks. Electronics 2025, 14, 2977.
2. Gundrapally, A.; Shah, Y.A.; Vemuri, S.M.; Choi, K. Hardware Accelerator Design by Using RT-Level Power Optimization Techniques on FPGA for Future AI Mobile Applications. Electronics 2025, 14, 3317.
3. Bagui, S.S.; Eller, C.; Armour, R.; Singh, S.; Bagui, S.C.; Mink, D. Analyzing Performance of Data Preprocessing Techniques on CPUs vs. GPUs with and Without the MapReduce Environment. Electronics 2025, 14, 3597.
4. Mylonas, E.; Filippou, C.; Kontraros, S.; Birbas, M.; Birbas, A. A Unified FPGA/CGRA Acceleration Pipeline for Time-Critical Edge AI: Case Study on Autoencoder-Based Anomaly Detection in Smart Grids. Electronics 2026, 15, 414.
5. Martino, R.; Pisani, M.; Angioli, M.; Barbirotta, M.; Mastrandrea, A.; Rosato, A.; Olivieri, M. A General-Purpose AXI Plug-and-Play Hyperdimensional Computing Accelerator. Electronics 2026, 15, 489.
6. Aboulsaad, M.; Shaout, A. AdamN: Accelerating Deep Learning Training via Nested Momentum and Exact Bias Handling. Electronics 2026, 15, 670.
7. Liu, C.; Heiyan, J. Breaking the DSP Wall: A Software–Hardware Co-Designed, Adaptive Error-Compensated MAC Architecture for Efficient Edge AI. Electronics 2026, 15, 1586.
8. Zhang, L. Integer-State Dynamics in Quantized Spiking Neural Networks: Implications for Hardware-Oriented Design. Electronics 2026, 15, 1756.

References

1. Garreau, P.; Cotret, P.; Francq, J.; Cexus, J.-C.; Lagadec, L. A survey on versatile embedded Machine Learning hardware acceleration. J. Syst. Archit. 2025, 167, 103501.
2. Xu, B.; Banerjee, A.; Gupta, S. Hardware Acceleration for Neural Networks: A Comprehensive Survey. arXiv 2025, arXiv:2512.23914.
3. Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Matta, M.; Re, M.; Silvestri, F.; Spanò, S. Efficient ensemble machine learning implementation on FPGA using partial reconfiguration. In International Conference on Applications in Electronics Pervading Industry, Environment and Society; Springer International Publishing: Cham, Switzerland, 2018; pp. 253–259.
4. Bhaggiaraj, S.; Antony, S.M.; Sankar, B.U. Design and optimization of low-power VLSI circuits of IoT edge devices. ICTAT J. Microelectron. 2024, 10, 1727–1731.
5. Bennet, M. Low-Power VLSI Architectures for Edge Computing: Advancing Energy-Efficient AI Inference at the Device Level. Int. J. Emerg. Trends Comput. Sci. Inf. Technol. 2023, 4, 1–9.
6. Dehnavi, M.; Ghasemi, A.; Alizadeh, B. FPGA-based CNN accelerator using Convolutional Processing Element to reduce idle states. J. Syst. Archit. 2025, 167, 103468.
7. Wan, Y.; Xie, X.; Yi, L.; Jiang, B.; Chen, J.; Jiang, Y. Pflow: An end-to-end heterogeneous acceleration framework for CNN inference on FPGAs. J. Syst. Archit. 2024, 150, 103113.
8. Heinz, C.; Kalkhof, T.; Lavan, Y.; Koch, A. TaPaSCo-AIE: An Open-Source Framework for Streaming-Based Heterogeneous Acceleration Using AMD AI Engines. In 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW); IEEE: New York, NY, USA, 2024; pp. 155–161.
9. Vaduganathan, D.; Brinda, B.M. Design and Optimization of Energy-Efficient VLSI Architectures for Edge AI in Internet of Things (IoT) Applications. Prog. Electron. Commun. Eng. 2026, 3, 24–28.
10. Zhang, J.; Yu, J.; Qiu, N.; Liu, Y.; Song, Y.; Zhang, L.; Du, S. Advanced artificial intelligence algorithms and hardware acceleration techniques applied to material structure design. AI Mater. 2026, 2, 3.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
