Edge AI in Practice: A Survey and Deployment Framework for Neural Networks on Embedded Systems
Abstract
1. Introduction
2. Literature Search Methodology
- Block A (core concepts): “Edge AI” OR “embedded deep learning” OR “TinyML”.
- Block B (techniques and deployment): “neural network optimization” OR “quantization” OR “pruning” OR “compact networks” OR “NPU acceleration” OR “Edge TPU” OR “FPGA inference” OR “YOLO optimization” OR “model compression” OR “hardware-software co-design”.
(“Edge AI” OR “embedded deep learning” OR “TinyML”) AND (“neural network optimization” OR “quantization” OR “pruning” OR “compact networks” OR “NPU acceleration” OR “Edge TPU” OR “FPGA inference” OR “YOLO optimization” OR “model compression” OR “hardware-software co-design”).
“cloud computing” OR “cloud inference” OR “data center”
3. Core Concepts and Metrics
3.1. The Inference Process
- Pre-processing: Input data are transformed into a format compatible with the neural network. This may involve data type conversion, value normalization, or encoding of categorical variables, including the creation of tensors or arrays with the appropriate structure.
- Model Execution: Pre-processed data are fed into the neural network and flow through its layers, which apply mathematical operations and non-linear transformations.
- Post-processing: The neural network output, which may be a vector of numbers or an internal representation, is transformed into a format useful for the specific application or task. This may involve conversion to classes, probabilities, numerical values, or other required representations.
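The three stages above map directly onto typical inference code. The following minimal sketch uses the TensorFlow Lite Python interpreter as one possible runtime; the model file, input layout, and label list are illustrative placeholders, and a float32 classification model is assumed.

```python
import numpy as np
import tensorflow as tf

# Load a (hypothetical) converted model and allocate its tensors once at start-up.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Pre-processing: cast, normalize, and add the batch dimension."""
    x = frame.astype(np.float32) / 255.0      # value normalization
    return np.expand_dims(x, axis=0)          # tensor of shape (1, H, W, C)

def infer(x: np.ndarray) -> np.ndarray:
    """Model execution: feed the tensor and run the forward pass."""
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])

def postprocess(logits: np.ndarray, labels: list) -> tuple:
    """Post-processing: turn raw outputs into a class label and probability."""
    probs = np.exp(logits[0]) / np.sum(np.exp(logits[0]))  # softmax
    idx = int(np.argmax(probs))
    return labels[idx], float(probs[idx])
```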
3.2. Inference Evaluation Metrics
3.2.1. Model Performance Metrics
- Latency: This refers to the time it takes for the model to process a single input, typically measured in milliseconds (ms) or microseconds (μs). Lower latency indicates a faster response in real-time scenarios. However, latency alone does not fully capture the overall processing capacity of the system; rather, it primarily reflects how quickly each individual request can be handled [10].
- Throughput: The number of inputs that the model can process per unit of time. It is measured in Inferences Per Second (IPS) or Frames Per Second (FPS). Higher throughput means that the model can process more inputs in the same amount of time.
- Energy Consumption: The amount of energy the model consumes to complete an inference, reported in joules or, more commonly, as the average power drawn during inference in Watts (W) or milliwatts (mW). Lower consumption means the model is more energy-efficient.
- Memory Usage: Memory usage is the amount of RAM that the model uses during inference. It is measured in Kilobytes (KB) or Megabytes (MB). Lower memory usage allows larger models to run on devices with limited memory.
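Latency and throughput can be measured directly in software, whereas energy and memory usage generally require an external power meter or operating-system tools. The sketch below is a minimal latency/throughput harness; the warm-up and run counts are arbitrary choices, and `infer_fn` stands for any single-input inference callable such as the one sketched in Section 3.1.

```python
import time
import statistics

def benchmark(infer_fn, sample, warmup: int = 10, runs: int = 100):
    """Measure per-inference latency (ms) and derive throughput (IPS)."""
    for _ in range(warmup):                  # warm-up runs avoid cold-start bias
        infer_fn(sample)
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer_fn(sample)
        timings.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
    latency_ms = statistics.median(timings)  # median is robust to outliers
    throughput_ips = 1000.0 / latency_ms     # inferences per second at batch size 1
    return latency_ms, throughput_ips
```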
3.2.2. Hardware Accelerator Metrics
- Tera Operations Per Second (TOPS): This represents the theoretical maximum throughput, measured in trillions of operations per second. While a higher value suggests greater computing capacity, it does not necessarily reflect real-world performance, as it depends on implementation details and actual workload distribution.
- Tera Operations Per Second per Watt (TOPS/W): This metric measures the energy efficiency of the accelerator, indicating the number of operations performed per watt consumed. Although a higher value suggests better energy efficiency, it may not accurately represent performance if the workload is low or highly variable.
- Tera Operations Per Second per Watt per Megahertz (TOPS/W/MHz): This relates energy efficiency to clock frequency, facilitating comparisons between accelerators operating at different speeds. However, not all architectures scale linearly with frequency, making this metric less universally applicable.
- Giga Operations Per Second per Watt (GOPS/W): Similar to TOPS/W but expressed in giga-operations, this metric is commonly used for lower-power devices, such as mobile platforms and embedded systems.
- Floating-Point Operations Per Second (FLOPS): This measures the number of floating-point operations a hardware accelerator can perform per second. Unlike GOPS and TOPS, which account for both integer and floating-point operations, FLOPS specifically quantifies floating-point performance, making it crucial for applications requiring high numerical precision.
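Because peak TOPS rarely translates into delivered performance, it can be useful to back-compute the sustained operation rate from a measured frame rate. The snippet below is a purely illustrative calculation; the per-frame operation count, frame rate, peak rating, and power draw are assumed example values, not benchmark results.

```python
def effective_tops(ops_per_inference: float, fps: float) -> float:
    """Achieved tera-operations per second from a measured frame rate."""
    return ops_per_inference * fps / 1e12

# Illustrative numbers only: an INT8 detector needing ~30 GOP per frame,
# running at 120 FPS on an accelerator rated at 26 peak TOPS and drawing ~5 W.
achieved = effective_tops(30e9, 120)       # ≈ 3.6 TOPS actually sustained
utilization = achieved / 26.0              # fraction of the datasheet peak
delivered_tops_per_watt = achieved / 5.0   # delivered efficiency, not peak
print(f"{achieved:.2f} TOPS, {utilization:.0%} utilization, "
      f"{delivered_tops_per_watt:.2f} TOPS/W delivered")
```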
3.2.3. System-Level Metrics
3.3. The Role of Standardized Benchmarking in Edge AI
Specific Benchmarks for the Edge: MLPerf Tiny and Edge AIBench
3.4. Empirical Analysis of Performance Metrics on Edge Platforms
- Performance and Latency: The Google Coral Dev Board offers much better throughput and latency for quantized models. This is due to its specialized hardware, such as its 8 MB SRAM cache and its systolic array architecture, which is optimized for the matrix operations of neural networks and reduces memory access.
- Energy Consumption: Edge TPU is significantly more energy-efficient during active inference, consuming much less power to complete the same task. However, the Jetson Nano demonstrates much lower idle consumption. This introduces a key trade-off: Coral is superior for continuous AI computation, whereas the Jetson might be more efficient for sporadic tasks where the device spends most of its time in an idle state.
- Memory Usage: The difference in memory usage is notable. Edge TPU, thanks to its architecture and the use of quantized models, operates using only a fraction of the RAM (42–131 MB), while Jetson Nano’s GPU requires a substantially larger amount (approx. 1.2 GB) to manage the models.
4. State of the Art: Model and Inference Optimization
4.1. Model Compression Techniques
4.1.1. Pruning
4.1.2. Quantization
- Binarization and Ternarization: These are the most aggressive forms, reducing weights to just one bit ({−1, +1}) or two bits ({−1, 0, +1}), respectively. These techniques offer the highest level of compression but require specialized training strategies, often involving knowledge distillation, to mitigate a significant loss of accuracy, as detailed in Table 4.
- Sub-4-bit Quantization (4, 3 and 2 bits): This presents a promising frontier for efficiency but introduces severe challenges. The drastic precision reduction often leads to a considerable degradation in model performance, a problem especially acute in already optimized architectures like MobileNet, which have less inherent redundancy. A key cause for this is the Activation Instability Induced by Weight Quantization (AIWQ), where training becomes unstable and fails to converge because small weight updates cause large, destabilizing oscillations in the quantized output activations.
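As a concrete illustration of the precision levels discussed above, the following sketch applies post-training symmetric uniform quantization and a simple ternarization rule to a random weight tensor; the per-tensor scale and the ternary threshold ratio are illustrative choices rather than values taken from the cited works.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 8):
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax                 # one scale per tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def ternarize(w: np.ndarray, threshold_ratio: float = 0.05) -> np.ndarray:
    """Map weights to {-1, 0, +1}; values below the threshold are zeroed."""
    t = threshold_ratio * np.max(np.abs(w))          # illustrative threshold rule
    return np.where(np.abs(w) <= t, 0, np.sign(w)).astype(np.int8)

w = np.random.randn(64, 64).astype(np.float32)
q8, s = quantize_symmetric(w, bits=8)
err = np.mean(np.abs(w - dequantize(q8, s)))         # error grows as bit-width shrinks
w_ternary = ternarize(w)                             # most aggressive 2-bit form
```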
4.1.3. Knowledge Distillation
4.1.4. Tensor Factorization
4.1.5. Hashing
4.1.6. Hybrid Techniques and Automated Search
- Quantization-Aware Pruning (QAP): This hybrid method integrates pruning and Quantization-Aware Training (QAT) into a single process. Instead of pruning a model and then quantizing it (which can aggravate errors), QAP trains the model to be simultaneously sparse (pruned) and robust to low precision. The objective is for both techniques to be complementary. This strategy has proven to be highly efficient, achieving drastic reductions in the BOPs (Bit Operations) computational complexity metric. In [24], it achieved a 50-fold reduction in BOPs on a 6-bit model pruned to 80%, while maintaining the same accuracy as the original 32-bit model.
- Dynamic/Static Pruning with QAT (QADS): A key challenge in combining QAT with dynamic pruning methods (where weights can regrow) is instability during training [25]. This instability is attributed to the effect of a “double approximation” of the gradient, as both QAT and dynamic pruning rely on the same Straight-Through Estimator (STE) for backpropagation. To address this, the QADS method [26] utilizes an intelligent alternation strategy. Initially, it employs dynamic pruning to explore and determine the optimal sparse structure. Subsequently, once a target sparsity rate is reached, it transitions to static pruning. By applying a fixed mask, static pruning eliminates the need to approximate the gradient with STE in that phase [24]. This approach ensures stable training and the achievement of high accuracy.
- Neural Architecture Search (NAS) for Compression: This is the most advanced approach and addresses the problem that the best architecture for a full-precision model is not necessarily the best architecture once compressed [27]. Methods like Joint Search for Network Architecture, Pruning and Quantization Policy (APQ) and Neural Architecture Search for Bert (NAS-BERT) perform a joint search for the architecture, pruning, and quantization policy.
- APQ utilizes a Once-For-All (OFA) network and an accuracy predictor trained with knowledge transfer (Predictor–Transfer) to estimate the performance of a sub-network without the cost of a full retraining. This drastically reduces the search cost and achieves superior accuracy under the same latency constraints [21].
- NAS-BERT applies Neural Architecture Search (NAS) to the compression of language models by training a supernet [27]. To handle the massive search space (~10^34 architectures), the method employs techniques such as Block-wise Search and Progressive Shrinking. These techniques succeed in reducing the search space to a manageable size (~10^20).
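To make the combined pruning-plus-quantization idea concrete, the sketch below shows one possible PyTorch layer that applies a fixed magnitude-based pruning mask together with fake quantization through a straight-through estimator (STE) during training. It is a simplified illustration of the QAP/QADS concept, not the implementations of [24,26]; in practice the mask would be derived from a pretrained model rather than from the random initialization used here.

```python
import torch
import torch.nn as nn

class QAPLinear(nn.Module):
    """Layer trained to be simultaneously sparse and robust to low precision."""
    def __init__(self, in_f: int, out_f: int, bits: int = 6, sparsity: float = 0.8):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.bits = bits
        # Magnitude-based static mask: keep the largest (1 - sparsity) weights.
        k = max(1, int(sparsity * self.linear.weight.numel()))
        thresh = self.linear.weight.abs().flatten().kthvalue(k).values
        self.register_buffer("mask", (self.linear.weight.abs() > thresh).float())

    def fake_quant(self, w: torch.Tensor) -> torch.Tensor:
        qmax = 2 ** (self.bits - 1) - 1
        scale = w.abs().max() / qmax
        wq = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        return w + (wq - w).detach()   # STE: forward uses wq, backward sees identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fake_quant(self.linear.weight * self.mask)  # sparse + low-precision
        return nn.functional.linear(x, w, self.linear.bias)
```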
4.1.7. Comparative Summary of Compression Techniques Across Studies
4.2. Inference Optimization Techniques
4.2.1. Hardware Acceleration
- Dedicated Accelerators: They offer high performance and energy efficiency but can be costly and inflexible.
- Heterogeneous Computing: It utilizes different processing architectures (CPU, GPU, FPGA) to optimize application performance by distributing workloads and leveraging each architecture’s strengths [10].
- Instruction Set Architecture (ISA) Extensions: These are special instructions added to a processor’s architecture to accelerate common neural network operations. They are less complex to implement and more flexible, but may offer limited performance. Modern ARM processors include such extensions, achieving a 74% reduction in clock cycles for OCR tasks [28].
4.2.2. Software-Level Optimization
- Hardware-Specific Compilation: This adapts deep learning models for efficient execution on specific hardware using tools like TensorFlow Lite Converter and XLA (Accelerated Linear Algebra) [29].
- Memory Optimization: This includes techniques to efficiently manage memory during inference, such as quantization, memory compression and buffer reuse.
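As an example of hardware-specific compilation, the sketch below converts a trained model into a fully integer INT8 TensorFlow Lite flatbuffer, the format expected by integer accelerators such as the Edge TPU or an NPU; the SavedModel path, input resolution, and random calibration data are placeholders for the user's own artifacts.

```python
import numpy as np
import tensorflow as tf

def representative_data():
    # Calibration samples used to estimate activation ranges (placeholders here;
    # real deployments should draw them from the training distribution).
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # fully-integer I/O for NPU/Edge TPU
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```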
4.2.3. Emerging Techniques
- Analog In-Memory Computing (AIMC): This offers high energy efficiency by performing calculations directly in memory but has precision limitations [32].
- Processing-In-Memory (PIM): PIM architectures integrate processing and memory on the same chip, reducing the need to transfer data between memory and processor, significantly improving energy efficiency and latency [4].
5. State of the Art: Architectures, Platforms and Frameworks
5.1. Neural Network Architectures for Embedded Systems
5.1.1. Convolutional Neural Networks (CNNs)
- (a) Main Architectures and Optimized Versions
- MobileNet: Uses depth-wise separable convolutions to reduce the number of parameters and operations, achieving greater speed and energy efficiency.
- SqueezeNet: Significantly reduces the number of parameters through “fire” modules, making it lightweight and fast.
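The efficiency gain of MobileNet-style models comes from replacing dense convolutions with depthwise separable ones. The following PyTorch sketch shows a generic block of this kind (layer sizes are illustrative, not MobileNet’s exact configuration) together with a rough parameter comparison against a standard 3 × 3 convolution.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Per-channel (depthwise) 3x3 convolution followed by a 1x1 (pointwise)
    convolution, replacing a single dense KxK convolution."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# Parameter comparison for illustrative channel counts (64 -> 128):
# dense 3x3 conv: 64*128*9 = 73,728 weights
# depthwise separable: 64*9 + 64*128 = 8,768 weights (plus BatchNorm parameters)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(nn.Conv2d(64, 128, 3, padding=1, bias=False)),
      count(DepthwiseSeparableConv(64, 128)))
```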
- (b) Complexity and Accuracy
5.1.2. Recurrent Neural Networks (RNNs)
- Long Short-Term Memory (LSTM): Introduced to solve the vanishing gradient problem, they use gate mechanisms to control the flow of information, allowing them to maintain and update their internal state for long periods. This significantly extends their memory capacity compared to traditional RNNs [47]. Accuracy metrics for RNN tasks are also specific; metrics such as Word Error Rate (WER) for speech recognition, the BLEU score for machine translation, or Root Mean Square Error (RMSE) for time series prediction can be employed [48].
- Gated Recurrent Unit (GRU): This is a simpler variant of LSTMs, which combines the forget and input gates into a single update gate, thus reducing computational complexity while maintaining similar performance [46].
- Transformers: Although they also address sequential tasks, they are not RNNs, as they replace recurrent connections with attention mechanisms that process global relationships in parallel, improving efficiency and the ability to model long-term dependencies. This makes them more suitable for tasks that require large volumes of data and high levels of parallelism [49,50].
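The parameter savings of GRUs over LSTMs can be verified directly. The short PyTorch snippet below compares single-layer modules with an arbitrary hidden size of 128; it illustrates the gate-count argument (four gate matrices versus three) and is not a benchmark of any cited system.

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())
lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
gru = nn.GRU(input_size=128, hidden_size=128, batch_first=True)

# An LSTM cell has 4 gate matrices, a GRU only 3, so the GRU carries roughly
# 25% fewer weights for the same hidden size -- one reason GRUs suit MCU targets.
print(count(lstm), count(gru))   # 132,096 vs. 99,072 parameters
```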
| Architecture | Advantages | Disadvantages |
|---|---|---|
| RNNs (LSTM/GRU) | Optimized variants (GRUs) can be exceptionally lightweight and energy-efficient, rendering them suitable for MCUs. LSTMs may offer a robust trade-off between high accuracy and a manageable RAM footprint (~4.4 GB) for specific tasks. They can achieve real-time performance on MCUs with optimizations (low power consumption and RAM overhead). | Standard (non-optimized) RNNs may exhibit prohibitive RAM requirements (>22 GB), making them unfeasible for edge deployment. Their inherently sequential nature constrains parallelism on hardware accelerators [Inferred]. The irregular memory access patterns (typical of LSTMs) can degrade performance on NPUs reliant on DMA (Direct Memory Access). |
| Transformers | NPUs can demonstrate high efficiency for LLMs (e.g., TinyLlama, 3.2× speedup over GPUs) owing to the predominance of matrix-vector operations. They may exhibit a reduced RAM footprint compared to standard RNNs (1.8–3.7 GB vs. >22 GB in one study). Their parallelizable architecture is well-suited for future compatible accelerators. | Poor compatibility with current mobile accelerators (GPUs, DSPs, NPUs); execution frequently reverts to the CPU, or acceleration is minimal to non-existent. GPU execution may compromise model accuracy. Standard self-attention possesses high computational complexity (quadratic) and significant memory overhead during computation. May demonstrate inferior accuracy compared to LSTMs for certain time-series tasks. |
5.1.3. Compact Neural Networks
5.1.4. FeedForward Neural Networks (FNNs)
5.2. Hardware Platforms for Implementation
- Microcontrollers (MCUs): A popular choice for low-power, cost-effective applications, especially in the TinyML domain. Modern MCUs like the ARM Cortex-M series offer a good balance of processing capability and energy efficiency for simple tasks [14].
- Field-Programmable Gate Arrays (FPGAs): These are highly flexible integrated circuits that can be configured by the user after manufacturing. Their reconfigurability and parallel processing capabilities make them ideal for implementing custom neural network accelerators with high energy efficiency [58].
- Neural Processing Units (NPUs): Designed specifically for neural network operations, these processors deliver high performance and energy efficiency, making them ideal for AI workloads. Their massively parallel architecture enables simultaneous calculations, crucial for deep learning algorithms with extensive matrix operations. NPUs have a tailored memory hierarchy, minimizing data movement and maximizing resource utilization through on-chip memory and dataflow architectures [14].
- Edge TPUs: These are low-power versions of TPUs, designed for neural network inference on embedded devices. They offer a good balance between performance and energy efficiency. Their main advantages are performance, energy efficiency and ease of use [14].
- Deep Learning Accelerators for RISC-V Processors: Known for their open source nature and flexibility, they are being used to optimize the implementation of deep neural networks (DNNs) on edge devices. Their advantages are flexibility and customization, extensibility, and community support [28].
Software Frameworks
- TensorFlow Lite: It is a lightweight version of TensorFlow optimized for mobile and embedded devices. It offers a set of tools for model conversion, performance optimization and implementation on different hardware platforms. TensorFlow Lite is used in a wide variety of applications, such as image recognition, natural language processing and object detection [63]. A specialized version is TensorFlow Lite for Microcontrollers (TFLM). TFLM is explicitly designed to run on microcontrollers and other devices with extremely limited memory (in the kilobyte range) [66]. It can operate without an operating system or a file system, and is considered the standard inference engine for TinyML use cases [67].
- Cortex Microcontroller Software Interface Standard for Neural Networks (CMSIS-NN): It is a library of optimized functions for neural network operations on ARM Cortex-M processors. CMSIS-NN accelerates the execution of neural networks on microcontrollers, allowing the implementation of AI applications on devices with very limited resources [68].
- PyTorch Mobile: It is a lightweight version of PyTorch optimized for mobile and embedded devices. PyTorch Mobile offers tools for model conversion, performance optimization and implementation on different platforms [69].
- OnnxRuntime: It is a high-performance inference engine for models in ONNX (Open Neural Network Exchange) format. ONNX is an open format that allows interoperability between different Deep Learning frameworks. OnnxRuntime can be used to run models from different frameworks on a variety of hardware platforms [64].
- Glow: A Machine Learning compiler designed to optimize and generate code for various hardware architectures, including accelerators. Glow takes computational graphs from frameworks such as PyTorch or ONNX, performs graph-level optimizations, and generates specific code (compiled library packages) [28]. Facebook (Meta) uses it as an intermediate layer in its inference acceleration platform.
- Apache TVM: It is a complete, open-source compilation stack that aims to close the gap between deep learning frameworks and hardware backends [70]. TVM supports model importing from multiple frameworks (PyTorch, TensorFlow, Keras, ONNX, etc.) and can generate optimized code for a wide range of hardware, including CPUs, GPUs, FPGAs, and bare-metal microcontrollers (via microTVM). It incorporates a powerful auto-tuning engine to optimize the execution order and memory access of tensor operations. Frameworks such as MATCH extend TVM to improve compilation on heterogeneous edge devices with custom accelerators [71].
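A short example of framework interoperability through ONNX Runtime is sketched below; the model file, input name, and input shape are placeholders. The list of execution providers determines which available backend (CPU, GPU, or a vendor NPU plug-in) actually executes the exported graph.

```python
import numpy as np
import onnxruntime as ort

# Open the exported graph with whatever execution providers this platform offers.
session = ort.InferenceSession("model.onnx",
                               providers=ort.get_available_providers())

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input tensor
outputs = session.run(None, {input_name: x})            # run all graph outputs
```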
5.3. Comparative Analysis
- CNNs excel in computer vision.
- RNNs/LSTMs/GRUs are more suitable for processing sequences (text, speech, time series).
- Transformers offer great capacity for parallelism and handling long-term dependencies.
- Compact networks, combined with compression strategies, are ideal when very limited resources are available.
5.4. Comparative Positioning Against Existing Surveys
6. A Proposed Methodology for Optimization and Deployment
- Stage 1: Definition of Requirements and Constraints
- Stage 2: Selection of Architecture and Base Model
- Stage 3: Model Optimization Cycle
- Stage 4: Selection of the Deployment Ecosystem
- Stage 5: Benchmarking and Final Validation
6.1. Case Study: Real-Time Object Detection in Edge AI
- Stage 1: Definition of Requirements and Constraints (r)
- ○ Primary Task: Detection of people and objects in a controlled environment.
- ○ Quantitative Constraints (r):
- ○ Power Budget: <5 Watts. Consumption must be low despite being connected to the grid, to avoid the need for active heat dissipation (fans), which increases reliability and reduces cost.
- ○ Maximum Memory: <512 MB (execution RAM). The operating system and other processes consume resources, leaving this strict limit for model operation.
- ○ Minimum Accuracy: >90% mAP. Detection must be highly reliable, especially since the environment is controlled, which raises the required accuracy threshold.
- Stage 2: Selection of Architecture and Base Model
- ○ Neural Network Architecture
- ○ Base Model Selection
- Stage 3: Model Optimization Cycle
- Stage 4: Selection of the Deployment Ecosystem
- ○ Hardware Platform Selection
- ○ Software Ecosystem Selection and Configuration
- Stage 5: Benchmarking and Final Validation
- ○ Deployment and Performance Measurement: The optimized model (YOLOv8n-INT8), compiled with the chosen ecosystem’s toolkit (Stage 4), is deployed on the i.MX 8M Plus board. The system’s end-to-end performance is measured, which includes pre-processing, accelerated inference on the NPU, and post-processing executed on the CPU. The results that meet the success condition (b ≤ r) are summarized in Table 18, demonstrating the complete validation of the methodological design.
6.2. Formal Analysis of the Feedback Loop and the Trade-Off Space
- CPU (FP32): This exhibits the highest accuracy (94% mAP) and the highest consumption (5.1 W) but results in an unviable latency (980 ms), failing to meet the latency requirement.
- GPU (FP32): Similar to the CPU, execution on the GPU also presents excessive latency (975 ms). This demonstrates that, for large models or non-optimized architectures, the runtime overhead and data movement nullify the parallel processing capability of the GPU, making it inefficient for the requirement.
- NPU (Initial INT8): The initial quantization to INT8 on the NPU achieves acceptable latency (132 ms) and efficient consumption (4.1 W). However, the measured accuracy of 89% mAP does not meet the accuracy requirement (≥90% mAP).
- NPU (INT8 QAT—Final): The optimized configuration using QAT achieved the best multi-objective balance. The final operating point reached 132 ms latency, efficient consumption of 4.1 W, and a measured accuracy of 91.5% mAP (see Table 19).
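The feedback loop can be expressed as a simple feasibility check of each benchmarked operating point b against the requirement vector r. In the sketch below, the power and accuracy bounds come from Stage 1 of the case study and the measured values from the configurations above; the 200 ms latency bound is an assumed illustrative threshold, since only the power, memory, and accuracy requirements are fixed numerically, and the GPU configuration is omitted because its power and memory are not listed per configuration.

```python
# Feasibility check for Stage 5: accept an operating point b only if it meets
# every requirement in r (latency and power are upper bounds, accuracy a lower bound).
requirements = {"latency_ms": 200.0, "power_w": 5.0, "map_pct": 90.0}

candidates = {
    "CPU (FP32)":       {"latency_ms": 980.0, "power_w": 5.1, "map_pct": 94.0},
    "NPU (INT8)":       {"latency_ms": 132.0, "power_w": 4.1, "map_pct": 89.0},
    "NPU (INT8 + QAT)": {"latency_ms": 132.0, "power_w": 4.1, "map_pct": 91.5},
}

def satisfies(b: dict, r: dict) -> bool:
    return (b["latency_ms"] <= r["latency_ms"]
            and b["power_w"] <= r["power_w"]
            and b["map_pct"] >= r["map_pct"])

viable = [name for name, b in candidates.items() if satisfies(b, requirements)]
print(viable)  # only the QAT-refined INT8 deployment passes all three constraints
```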
7. Discussion and Future Work
7.1. Applications of Neural Networks in Embedded Systems
- Object detection: In embedded systems, CNNs are used for object detection in real-time applications, such as autonomous driving systems and advanced driver-assistance systems (ADAS), to analyze the environment, identify objects like pedestrians, vehicles and traffic signs, and make decisions to ensure safe navigation [81].
- Image Classification and Segmentation: CNNs are also deployed in medical devices, for example, to classify medical images or segment tumors accurately and efficiently [82].
7.2. Emerging Trends and Future Opportunities
7.3. Open Challenges and Future Research Agenda
- (a) Technical Challenges
- Hardware limitations: It is essential to design models that can operate within the device’s constraints on memory, computing power, and energy consumption.
- Compatibility and Adaptability: Existing frameworks are often not optimized for embedded systems, requiring the development of specialized tools.
- Model optimization: Techniques such as quantization, pruning and knowledge distillation have proven to be effective in reducing model complexity without significantly compromising model accuracy.
- Robustness and Reliability: Systems must be able to operate under adverse conditions and in the face of malicious attacks.
- (b) Ethical and Social Challenges
- Bias and Fairness: AI models can reflect and amplify biases present in their training data, leading to discriminatory outcomes.
- Privacy and Security: Protecting user privacy is a critical challenge when implementing AI in embedded systems. Models must also be secure against malicious attacks.
- Labor Impact: The automation of tasks through AI could have a significant impact on employment, requiring careful planning for workforce transitions.
A Future Research Agenda
- Edge-Native Security and Robustness: Security challenges extend beyond data privacy. Urgent research is needed on the robustness of edge models against adversarial attacks specifically designed to exploit the weaknesses of quantized and pruned models, which may be more fragile than their 32-bit counterparts.
- Energy-Aware Neural Architecture Search (NAS): NAS methods, such as APQ, have proven effective for co-optimizing architecture, pruning, and quantization. However, these methods often optimize for latency or accuracy. The pending challenge is to integrate power consumption (a pillar of edge computing) directly into the search cost function, creating a true “low-power” NAS that discovers architectures efficient in TOPS/W, not just fast ones.
- Holistic Metrics and Benchmarking: As highlighted in Section 3.3, the field lacks standardized evaluation. Efforts like MLPerf Tiny are a first step, but a research agenda is needed to create holistic benchmarks that evaluate performance under multiple simultaneous constraints (accuracy, latency, power, and memory). This is vital for objectively validating co-design methodologies, such as the one proposed in Section 6.
- Automation of Hardware–Software Co-Design: Overcoming hardware limitations and software fragmentation (such as the diversity of toolchains like TensorFlow Lite, eIQ Toolkit, or OpenVINO) requires automated co-design. Future research should focus on compilation frameworks that not only optimize the model for fixed hardware but also allow for flexible co-design. In this approach, the model architecture (like convolution kernels) and the accelerator micro-architecture (like buffer size) are jointly optimized. This research direction represents a deeper and more automated integration of the Model Selection (Stage 2) and Hardware/Ecosystem Selection (Stage 4) components of the proposed five-stage methodology.
- Memory-Efficient On-Device Learning: The trend of on-device learning offers great promise for adaptability and privacy. However, the main technical challenge is performing backpropagation and weight updating on devices with extremely limited RAM (in the KB range). New optimization algorithms are needed to train models with a near-zero memory footprint. This challenge directly aligns with the proposed methodology.
8. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zheng, Z.; Li, Y.; Chen, J.; Zhou, P.; Chen, X.; Liu, Y. Threshold Neuron: A Brain-Inspired Artificial Neuron for Efficient On-Device Inference. arXiv 2025, arXiv:2412.13902. [Google Scholar]
- Mishra, A.; Cha, J.; Park, H.; Kim, S. (Eds.) Artificial Intelligence and Hardware Accelerators; Springer International Publishing: Cham, Switzerland, 2023; ISBN 978-3-031-22169-9. [Google Scholar]
- Zhu, S.; Yu, T.; Xu, T.; Chen, H.; Dustdar, S.; Gigan, S.; Gunduz, D.; Hossain, E.; Jin, Y.; Lin, F.; et al. Intelligent Computing: The Latest Advances, Challenges, and Future. Intell. Comput. 2023, 2, 6. [Google Scholar] [CrossRef]
- Tang, Q.; Yu, F.R.; Xie, R.; Boukerche, A.; Huang, T.; Liu, Y. Internet of Intelligence: A Survey on the Enabling Technologies, Applications, and Challenges. IEEE Commun. Surv. Tutorials 2022, 24, 1394–1434. [Google Scholar] [CrossRef]
- Gorospe, J.; Mulero, R.; Arbelaitz, O.; Muguerza, J.; Antón, M.Á. A Generalization Performance Study Using Deep Learning Networks in Embedded Systems. Sensors 2021, 21, 1031. [Google Scholar] [CrossRef]
- Chen, Y.; Zheng, B.; Zhang, Z.; Wang, Q.; Shen, C.; Zhang, Q. Deep Learning on Mobile and Embedded Devices: State-of-the-Art, Challenges, and Future Directions. ACM Comput. Surv. 2021, 53, 1–37. [Google Scholar] [CrossRef]
- Zheng, H.; Duan, J.; Dong, Y.; Liu, Y. Real-Time Fire Detection Algorithms Running on Small Embedded Devices Based on MobileNetV3 and YOLOv4. Fire Ecol. 2023, 19, 31. [Google Scholar] [CrossRef]
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. Syst. Rev. 2021, 10, 89. [Google Scholar] [CrossRef]
- Gawlikowski, J.; Tassi, C.R.N.; Ali, M.; Lee, J.; Humt, M.; Feng, J.; Kruspe, A.; Triebel, R.; Jung, P.; Roscher, R.; et al. A Survey of Uncertainty in Deep Neural Networks. Artif. Intell. Rev. 2023, 56, 1513–1589. [Google Scholar] [CrossRef]
- Padilla, D.; Rashwan, H.A.; Puig, D.S. On Determining Suitable Embedded Devices for Deep Learning Models. In Frontiers in Artificial Intelligence and Applications; Villaret, M., Alsinet, T., Fernández, C., Valls, A., Eds.; IOS Press: Amsterdam, The Netherlands, 2021; ISBN 978-1-64368-210-5. [Google Scholar]
- Biglari, A.; Tang, W. A Review of Embedded Machine Learning Based on Hardware, Application, and Sensing Scheme. Sensors 2023, 23, 2131. [Google Scholar] [CrossRef]
- Roth, W.; Schindler, G.; Klein, B.; Peharz, R.; Tschiatschek, S.; Fröning, H.; Pernkopf, F.; Ghahramani, Z. Resource-Efficient Neural Networks for Embedded Systems. J. Mach. Learn. Res. 2024, 25, 1–51. [Google Scholar]
- Liang, H.; Fan, Z.; Sarkar, R.; Jiang, Z.; Chen, T.; Zou, K.; Cheng, Y.; Hao, C.; Wang, Z. M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-Task Learning with Model-Accelerator Co-Design. Adv. Neural. Inf. Process. Syst. 2022, 35, 28441–28457. [Google Scholar]
- Abo, M.D. An Efficiency Comparison of NPU, CPU, and GPU When Executing an Object Detection Model YOLOv5. Bachelor’s Thesis, School of Electrical Engineering and Computer Science (EECS), University Park, PA, USA, 2024. [Google Scholar]
- Hot Tech Vision and Analysis (HTVA). AI Accelerators for Machine Vision Competitive Analysis and Review: Developer Experience And Performance Evaluation; Hot Tech Vision and Analysis: Chepachet, RI, USA, 2025; p. 19. [Google Scholar]
- Banbury, C.; Reddi, V.J.; Torelli, P.; Holleman, J.; Jeffries, N.; Kiraly, C.; Montino, P.; Kanter, D.; Ahmed, S.; Pau, D.; et al. MLPerf Tiny Benchmark. arXiv 2021, arXiv:2106.07597. [Google Scholar]
- Baller, S.P.; Jindal, A.; Chadha, M.; Gerndt, M. DeepEdgeBench: Benchmarking Deep Neural Networks on Edge Devices. In Proceedings of the 2021 IEEE International Conference on Cloud Engineering (IC2E), Virtual, 4–8 October 2021. [Google Scholar]
- Cantero, D.; Esnaola-Gonzalez, I.; Miguel-Alonso, J.; Jauregi, E. Benchmarking Object Detection Deep Learning Models in Embedded Devices. Sensors 2022, 22, 4205. [Google Scholar] [CrossRef]
- Wang, T.; Wang, K.; Cai, H.; Lin, J.; Liu, Z.; Wang, H.; Lin, Y.; Han, S. APQ: Joint Search for Network Architecture, Pruning and Quantization Policy. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2075–2084. [Google Scholar]
- Pham, P.; Abraham, J.A.; Chung, J. Training Multi-Bit Quantized and Binarized Networks with a Learnable Symmetric Quantizer. IEEE Access 2021, 9, 47194–47203. [Google Scholar] [CrossRef]
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
- Sandra, C.P.; Pedro, V.-L.; Antonio, T.D.; Leonel; Rocío, C.S. Developing Linear Algebra Codes on Modern Processors: Emerging Research and Opportunities; IGI Global: Hershey, PA, USA, 2022; ISBN 978-1-7998-7084-5. [Google Scholar]
- Luo, X.; Wang, H.; Wu, D.; Chen, C.; Deng, M.; Huang, J.; Hua, X.-S. A Survey on Deep Hashing Methods. ACM Trans. Knowl. Discov. Data 2023, 17, 1–50. [Google Scholar] [CrossRef]
- Hawks, B.; Duarte, J.; Fraser, N.J.; Pappalardo, A.; Tran, N.; Umuroglu, Y. Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference. Front. Artif. Intell. 2021, 4, 676564. [Google Scholar] [CrossRef] [PubMed]
- Fahim, F.; Hawks, B.; Herwig, C.; Hirschauer, J.; Jindariani, S.; Tran, N.; Carloni, L.P.; Guglielmo, G.D.; Harris, P.; Krupa, J.; et al. Hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices. arXiv 2021, arXiv:2103.05579. [Google Scholar]
- An, S.; Shin, J.; Kim, J. Quantization-Aware Training With Dynamic and Static Pruning. IEEE Access 2025, 13, 57476–57484. [Google Scholar] [CrossRef]
- Xu, J.; Tan, X.; Luo, R.; Song, K.; Li, J.; Qin, T.; Liu, T.-Y. NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1933–1943. [Google Scholar]
- Akkad, G.; Mansour, A.; Inaty, E. Embedded Deep Learning Accelerators: A Survey on Recent Advances. IEEE Trans. Artif. Intell. 2024, 5, 1954–1972. [Google Scholar] [CrossRef]
- XLA. Optimización del Compilador Para el Aprendizaje Automático. Available online: https://www.tensorflow.org/xla?hl=es-419 (accessed on 7 February 2025).
- Morera, L.M.M.; Por, T.; Cabrera, J.J.H.; Vieira, A.H. Implementación de APIs Comerciales de OpenAI Mediante Modelos Open Source. Bachelor’s Thesis, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain, 2024. [Google Scholar]
- Park, Y.; Budhathoki, K.; Chen, L.; Kübler, J.M.; Huang, J.; Kleindessner, M.; Huan, J.; Cevher, V.; Wang, Y.; Karypis, G. Inference Optimization of Foundation Models on AI Accelerators. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; ACM: New York, NY, USA, 2024; pp. 6605–6615. [Google Scholar]
- Lepri, F.; Khaddam-Aljameh, R.; Ielmini, D.; Indiveri, G.; Eleftheriou, E. In-Memory Computing for Machine Learning and Deep Learning. IEEE J. Explor. Solid-State Comput. Devices Circuits 2023, 9, 1–17. [Google Scholar] [CrossRef]
- Marwedel, P. Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems, and the Internet of Things; Embedded Systems; Springer International Publishing: Cham, Switzerland, 2021; ISBN 978-3-030-60909-2. [Google Scholar]
- Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big. Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
- Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Make 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
- Techzizou YOLOv4 VS YOLOv4-Tiny. Analytics Vidhya. 2024. Available online: https://medium.com/analytics-vidhya/yolov4-vs-yolov4-tiny-97932b6ec8ec (accessed on 16 September 2025).
- Shuvo, M.M.H.; Islam, S.K.; Cheng, J.; Morshed, B.I. Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review. Proc. IEEE 2023, 111, 42–91. [Google Scholar] [CrossRef]
- Chen, L.; Lin, S.; Lu, X.; Cao, D.; Wu, H.; Guo, C.; Liu, C.; Wang, F.-Y. Deep Neural Network Based Vehicle and Pedestrian Detection for Autonomous Driving: A Survey. IEEE Trans. Intell. Transport. Syst. 2021, 22, 3234–3246. [Google Scholar] [CrossRef]
- Padilla, R.; Netto, S.L.; da Silva, E.A.B. A Survey on Performance Metrics for Object-Detection Algorithms. Remote Sens. 2021, 13, 555. [Google Scholar]
- Alif, M.A.R. YOLOv11 for Vehicle Detection: Advancements, Performance, and Applications in Intelligent Transportation Systems. arXiv 2024, arXiv:2410.22898. [Google Scholar] [CrossRef]
- Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
- Mehta, S.; Rastegari, M. MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. arXiv 2022, arXiv:2110.02178. [Google Scholar]
- Zhao, X.; Song, Y. Improved Ship Detection with YOLOv8 Enhanced with MobileViT and GSConv. Electronics 2023, 12, 4666. [Google Scholar] [CrossRef]
- Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Khan, F.S. EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Application. In Proceedings of the European conference on computer vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Rezk, N.M.; Purnaprajna, M.; Nordstrom, T.; Ul-Abdin, Z. Recurrent Neural Networks: An Embedded Computing Perspective. IEEE Access 2020, 8, 57967–57996. [Google Scholar] [CrossRef]
- Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications. Information 2024, 15, 517. [Google Scholar] [CrossRef]
- Buestán Andrade, P.A.; Carrión Zamora, P.E.; Chamba Lara, A.E.; Pazmiño Piedra, J.P. A comprehensive evaluation of ai techniques for air quality index prediction: RNNs and transformers. Ingenius 2025, 33, 60–75. [Google Scholar] [CrossRef]
- Sanford, C.; Hsu, D.; Telgarsky, M. Representational Strengths and Limitations of Transformers. arXiv 2024, arXiv:2402.09268. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
- Panopoulos, I.; Nikolaidis, S.; Venieris, S.I.; Venieris, I.S. Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices. In Proceedings of the 2023 IEEE Symposium on Computers and Communications (ISCC), Gammarth, Tunisia, 9–12 July 2023. [Google Scholar]
- Lalapura, V.; Joseph, A.; Satheesh, H. Recurrent Neural Networks for Edge Intelligence: A Survey. ACM Comput. Surv. 2021, 54, 1–38. [Google Scholar] [CrossRef]
- Jayanth, R.; Gupta, N.; Prasanna, V. Benchmarking Edge AI Platforms for High-Performance ML Inference. In Proceedings of the 2024 IEEE High Performance Extreme Computing Conference (HPEC), Wakefield, MA, USA, 23–27 September 2024. [Google Scholar]
- Zhang, P.; Zeng, G.; Wang, T.; Lu, W. TinyLlama: An Open-Source Small Language Model. arXiv 2024, arXiv:2401.02385. [Google Scholar]
- Wieder, O.; Kohlbacher, S.; Kuenemann, M.; Garon, A.; Ducrot, P.; Seidel, T.; Langer, T. A Compact Review of Molecular Property Prediction with Graph Neural Networks. Drug Discov. Today Technol. 2020, 37, 1–12. [Google Scholar] [CrossRef] [PubMed]
- Zhang, L.; Bao, C.; Ma, K. Self-Distillation: Towards Efficient and Compact Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4388–4403. [Google Scholar] [CrossRef] [PubMed]
- Yun, S.; Park, J.; Lee, K.; Shin, J. Regularizing Class-Wise Predictions via Self-Knowledge Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 13873–13882. [Google Scholar] [CrossRef]
- Gao, G.; Liu, Z.; Zhang, G.; Li, J.; Qin, A. DANet: Semi-supervised differentiated auxiliaries guided network for video action recognition. Neural Netw. 2023, 158, 121–131. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z.; Feng, F.; Huang, T. FNNS: An Effective Feedforward Neural Network Scheme with Random Weights for Processing Large-Scale Datasets. Appl. Sci. 2022, 12, 12478. [Google Scholar] [CrossRef]
- Novickis, R.; Justs, D.J.; Ozols, K.; Greitāns, M. An Approach of Feed-Forward Neural Network Throughput-Optimized Implementation in FPGA. Electronics 2020, 9, 2193. [Google Scholar] [CrossRef]
- Liu, X.; Xu, W.; Wang, Q.; Zhang, M. Energy-Efficient Computing Acceleration of Unmanned Aerial Vehicles Based on a CPU/FPGA/NPU Heterogeneous System. IEEE Internet Things J. 2024, 11, 27126–27138. [Google Scholar] [CrossRef]
- Sankaran, A.; Mastropietro, O.; Saboori, E.; Idris, Y.; Sawyer, D.; AskariHemmat, M.; Boukli Hacene, G. Deeplite Neutrino™: A BlackBox Framework for Constrained Deep Learning Model Optimization. AAAI 2021, 35, 15166–15174. [Google Scholar] [CrossRef]
- Learn the Architecture–Optimizing C Code with Neon Intrinsics. Available online: https://developer.arm.com/documentation/102467/0201/Why-use-Neon-intrinsics-?lang=en (accessed on 28 February 2025).
- TensorFlow Lite. Available online: https://www.tensorflow.org/lite/guide?hl=es-419 (accessed on 3 December 2024).
- ONNX. Runtime|Home. Available online: https://onnxruntime.ai/ (accessed on 3 December 2024).
- Kwon, Y.; Cha, J.H.; Lee, J.; Yu, M.; Park, J.; Lee, J. ACLTuner: A Profiling-Driven Fast Tuning to Optimized Deep Learning Inference. In Proceedings of the Machine Learning for Systems Workshop at NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
- Manor, E.; Greenberg, S. Custom Hardware Inference Accelerator for TensorFlow Lite for Microcontrollers. IEEE Access 2022, 10, 73484–73493. [Google Scholar] [CrossRef]
- Wulfert, L.; Kühnel, J.; Krupp, L.; Viga, J.; Wiede, C.; Gembaczka, P.; Grabmaier, A. AIfES: A Next-Generation Edge AI Framework. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4519–4533. [Google Scholar] [CrossRef]
- CMSIS-NN. CMSIS NN Software Library. Available online: https://arm-software.github.io/CMSIS-NN/latest/ (accessed on 3 December 2024).
- PyTorch. Available online: https://pytorch.org/ (accessed on 3 December 2024).
- Hamdi, M.A.; Daghero, F.; Sarda, G.M.; Delm, J.V.; Symons, A.; Benini, L.; Verhelst, M.; Pagliari, D.J.; Burrello, A. MATCH: Model-Aware TVM-Based Compilation for Heterogeneous Edge Devices. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024. [Google Scholar]
- Immonen, R.; Hämäläinen, T. Tiny Machine Learning for Resource-Constrained Microcontrollers. J. Sens. 2022, 2022, 1–11. [Google Scholar] [CrossRef]
- Machupalli, R.; Hossain, M.; Mandal, M. Review of ASIC Accelerators for Deep Neural Networks. Microprocess. Microsyst. 2022, 89, 104441. [Google Scholar] [CrossRef]
- Prabu, T.; Srinivasan, K. Design and Implementation of High-Performance FPGA Accelerator for Non-Separable Discrete Fourier Transform Optimizing Real-Time Image and Video Processing. J. Nanoelectron. Optoelectron. 2024, 19, 843–856. [Google Scholar] [CrossRef]
- Messaoud, S.; Bouaafia, S.; Maraoui, A.; Ammari, A.C.; Khriji, L.; Machhout, M. Deep Convolutional Neural Networks-Based Hardware–Software on-Chip System for Computer Vision Application. Comput. Electr. Eng. 2022, 98, 107671. [Google Scholar] [CrossRef]
- Yu, J.; De Antonio, A.; Villalba-Mora, E. Deep Learning (CNN, RNN) Applications for Smart Homes: A Systematic Review. Computers 2022, 11, 26. [Google Scholar] [CrossRef]
- Ali, M.L.; Zhang, Z. The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection. Computers 2024, 13, 336. [Google Scholar] [CrossRef]
- Novac, P.-E.; Boukli Hacene, G.; Pegatoquet, A.; Miramond, B.; Gripon, V. Quantization and Deployment of Deep Neural Networks on Microcontrollers. Sensors 2021, 21, 2984. [Google Scholar] [CrossRef] [PubMed]
- Berthelier, A.; Chateau, T.; Duffner, S.; Garcia, C.; Blanc, C. Deep Model Compression and Architecture Optimization for Embedded Systems: A Survey. J. Sign Process. Syst. 2021, 93, 863–878. [Google Scholar] [CrossRef]
- Dhilleswararao, P.; Boppu, S.; Manikandan, M.S.; Cenkeramaddi, L.R. Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 2022, 10, 131788–131828. [Google Scholar] [CrossRef]
- Alam, S.; Yakopcic, C.; Wu, Q.; Barnell, M.; Khan, S.; Taha, T.M. Survey of Deep Learning Accelerators for Edge and Emerging Computing. Electronics 2024, 13, 2988. [Google Scholar] [CrossRef]
- Nidamanuri, J.; Nibhanupudi, C.; Assfalg, R.; Venkataraman, H. A Progressive Review: Emerging Technologies for ADAS Driven Solutions. IEEE Trans. Intell. Veh. 2022, 7, 326–341. [Google Scholar] [CrossRef]
- Sarmiento-Ramos, J.L. Aplicaciones de las redes neuronales y el deep learning a la ingeniería biomédica. Rev. UIS Ing. 2020, 19, 1–18. [Google Scholar] [CrossRef]
- Vandendriessche, J.; Wouters, N.; Da Silva, B.; Lamrini, M.; Chkouri, M.Y.; Touhafi, A. Environmental Sound Recognition on Embedded Systems: From FPGAs to TPUs. Electronics 2021, 10, 2622. [Google Scholar] [CrossRef]
- Vanting, N.B.; Ma, Z.; Jørgensen, B.N. A Scoping Review of Deep Neural Networks for Electric Load Forecasting. Energy Inf. 2021, 4, 49. [Google Scholar] [CrossRef]
- Delleji, T.; Slimeni, F.; Fekih, H.; Jarray, A.; Boughanmi, W.; Kallel, A.; Chtourou, Z. An Upgraded-YOLO with Object Augmentation: Mini-UAV Detection Under Low-Visibility Conditions by Improving Deep Neural Networks. Oper. Res. Forum 2022, 3, 60. [Google Scholar] [CrossRef]
- Capogrosso, L.; Cunico, F.; Cheng, D.S.; Fummi, F.; Cristani, M. A Machine Learning-Oriented Survey on Tiny Machine Learning. IEEE Access 2024, 12, 23406–23426. [Google Scholar] [CrossRef]
- Saraan, A.; Hussain, N.; Zahara, S.M. Tiny Machine Learning (TinyML) Systems. Preprint, ResearchGate, December 2024. Available online: https://www.researchgate.net/publication/386579238 (accessed on 4 December 2025).
- Park, E.; Yoo, S. PROFIT: A Novel Training Method for Sub-4-Bit MobileNet Models. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 26–28 August 2020. [Google Scholar]
- Liu, H.-I.; Galindo, M.; Xie, H.; Wong, L.-K.; Shuai, H.-H.; Li, Y.-H.; Cheng, W.-H. Lightweight Deep Learning for Resource-Constrained Environments. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
- Delgado, J.R. Sistema Autónomo de Generación de Energía Renovable. In Memorias de las XVII Jornadas de Conferencias de Ingeniería Electrónica; Universidad Politécnica de Cataluña: Terrassa, Spain, 2011. [Google Scholar]
- Dinten, R.; Zorrilla, M. Laredo: Democratización de Análisis de Flujos de Datos Para el Mantenimiento Predictivo; Universidad de Cantabria: Cantabria, Spain, 2023. [Google Scholar]
- Rieyan, S.A.; News, M.R.K.; Rahman, A.B.M.M.; Khan, S.A.; Zaarif, S.T.J.; Alam, M.G.R.; Hassan, M.M.; Ianni, M.; Fortino, G. An Advanced Data Fabric Architecture Leveraging Homomorphic Encryption and Federated Learning. Inf. Fusion 2024, 102, 102004. [Google Scholar] [CrossRef]
- Kuznetsov, M.; Novikova, E.; Kotenko, I.; Doynikova, E. Privacy Policies of IoT Devices: Collection and Analysis. Sensors 2022, 22, 1838. [Google Scholar] [CrossRef] [PubMed]
- Ureña, A.C.; González Calero, P.A. Aprendizaje profundo en IoT: Redes neuronales convolucionales con imágenes aplicadas a un vehículo autónomo. Master’s Thesis, Universidad Complutense de Madrid, Madrid, Spain, 2021. [Google Scholar]
- García, J.G.; Vilda, C.C.; Alonso, D.P. Avances Algorítmicos Aplicados al Procesamiento de Información en el Campo de la Visión Artificial Basada en Eventos Para Sistemas Bioinspirados; Universidad Rey Juan Carlos: Madrid, Spain, 2023. [Google Scholar]




| Metric | ResNet-50 | M3ViT |
|---|---|---|
| Accuracy (mIoU) | 44.2% | 45.6% |
| FLOPs (inference) | 192 G | 100 G |
| Energy (W·s) | 2.145 | 0.845 |
| Hardware Platform | Peak Compute | Task | Throughput (FPS) | Energy Efficiency (Joules/Frame) |
|---|---|---|---|---|
| Intel Core i7-13700K (CPU Only) | 1.1 TFLOPS (FP32) | YOLOv8s | 11.74 | 29.983 |
| NVIDIA GeForce RTX 3060 | 12.7 TFLOPS (FP32) | YOLOv8s | 154.30 | 2.301 |
| Hailo-8 M.2 | 26 TOPS (INT8) | YOLOv8s | 121.50 | 1.276 |
| Axelera Metis AIPU | 214 TOPS (INT8) | YOLOv8s | 178.40 | 1.087 |
| Metric | Google Coral Dev Board (Edge TPU) | Nvidia Jetson Nano (GPU) |
|---|---|---|
| Throughput | ~240.5 FPS (MobileNetV2 Quant.); 417 FPS (MobileNet V1) | ~48.5 FPS (MobileNetV2 Non-Quant.); 80 FPS (MobileNet V1) |
| Latency (approx) | ~4 ms (MobileNetV2 Quant.) | ~20 ms (MobileNetV2 Non-Quant.) |
| Energy Consumption (Active) | 1.543 watt-minutes (MobileNetV2); 5.5 Watts (average execution) | 13.630 watt-minutes (MobileNetV2); ~6.05 Watts (10% more than Coral) |
| Energy Consumption (Idle) | 2.757 Watts (LAN off); 4.8 Watts | 0.903 Watts (LAN off); ~2.4 Watts (less than half of Coral) |
| Memory Usage (DRAM) | 42–131 MB | ~1.2 GB |
| Compression Technique | Key Techniques | Description | Advantages | Limitations |
|---|---|---|---|---|
| Pruning | Unstructured (magnitude) & structured pruning | Removes less important weights or neurons from the network. | Reduces model size and computational complexity. | Requires retraining to mitigate accuracy loss. Determining what to prune can be challenging. |
| Quantization | Standard (e.g., 8-bit) | Reduces the numerical precision of weights and activations (e.g., 8 bits instead of 32 bits). | Reduces model size and speeds up inference. | May affect model accuracy if quantization is too aggressive or introduces rounding errors. |
| Sub-4-bit (e.g., 4, 3 and 2-bits) | Reduces precision to fewer than 4 bits, requiring advanced training to manage instability and accuracy loss. | Significant reduction in model size and memory footprint; enables ultra-efficient hardware acceleration. | Highly susceptible to accuracy degradation; requires complex and computationally expensive retraining (QAT). | |
| Non-Uniform | Uses non-linear quantization steps that match the natural distribution of weights (often clustered near zero). | Preserves accuracy better than uniform quantization for the same bit-width by assigning more precision to common values. | Can be more complex to implement in hardware; not all inference engines offer native support. | |
| Binarization (1-bit) | Constrains weights to two values ({−1, +1}). To mitigate accuracy loss, this is combined with advanced training strategies like Ternary Weight Splitting (initializing the binary network from a pre-trained ternary model), Progressive Quantization (gradually increasing the level of quantization), or distillation (training the binary network to mimic a larger, more accurate model). | Reduces memory usage by up to 32x and replaces multiplications with logical operations (XNOR), boosting computational efficiency. | Significant accuracy loss and complex training if the mitigation strategies mentioned in the description are not used. | |
| Ternarization (2-bit) | Represents weights with three values ({−1, 0, +1}), striking a balance between full-precision fidelity and binarization. | Significantly decreases model size and inference cost. | Preserving accuracy may require distillation or specialized retraining, especially in NLP tasks. | |
| Knowledge distillation | Output-Based Knowledge Distillation (OBKD) & Feature Imitation Knowledge Distillation (FIKD) | Trains a smaller model using knowledge from a larger pretrained model. | Improves performance in smaller models while reducing size. | Requires a pretrained larger model and additional computational resources for training. |
| Tensor factorization | Singular Value Decomposition (SVD) | Decomposes weight tensors into smaller tensors using techniques. | Reduces parameters and computational requirements for inference. | Computationally expensive; not suitable for all models. |
| Hashing | Parameter hashing (HashedNet) | Groups similar data to reduce redundancy. | Saves memory and accelerates model inference. | May trade off compression efficiency with retrieval accuracy. Sensitivity to hyperparameters. |
| Optimization Techniques | Hybrid Techniques and Automated Search (NAS) | Quantization-Aware Pruning (QAP). QAT + Dynamic/Static Pruning (QADS). Architecture, Pruning and Quantization Search (APQ/NAS) | Jointly optimizes pruning and quantization (QAP, QADS), or even the architecture (APQ), in a single training process. | Achieves much higher compression rates and better accuracy than the sequential application of techniques. Optimizes the model for specific hardware metrics (latency, BOPs). |
| Technique | Reported Compression Ratio | Accuracy Impact | Hardware Compatibility Mentioned |
|---|---|---|---|
| Structured/Unstructured Pruning | 10–80% parameter reduction (various CNNs) | Low to moderate drop; requires fine-tuning | CPU, GPU, FPGA, ASIC |
| Knowledge Distillation (ResNet, WRN, ResNeXt) | 5–16% FLOPs and parameter reduction | Often improves accuracy vs. baseline | CPU, GPU |
| Extreme Distillation (BUnit-Net) | Up to 97% FLOPs and 96% parameter reduction | Slight accuracy drop depending on the task | MCU, low-power accelerators |
| INT8 Quantization | 4× model-size reduction | ~1–3% drop (model-dependent) | Edge TPU, NPU, GPU, CPU |
| Sub-4-bit Quantization (4/3/2-bit) | 8–16× reduction | Significant drop unless QAT + KD used | Specialized NPUs, FPGA |
| Binarization/Ternarization (1–2 bits) | 16–32× reduction | High accuracy drops unless distilled | ASIC, FPGA |
| Non-Uniform Quantization | 4–16× effective compression | Lower accuracy loss than uniform | NPUs with LUT support |
| Tensor Factorization (SVD, CP) | Moderate reduction in FLOPs/params | Minimal accuracy loss | CPU, GPU |
| Hashing (HashedNet) | Large memory reduction | Accuracy varies with collision rate | CPU, GPU |
| Joint Pruning + Quantization (QAP, QADS) | Up to 50× BOP reduction; 80% sparsity | Comparable to FP32 in the best cases | NPUs, FPGA, CPU |
| NAS-assisted Compression (APQ, NAS-BERT) | Superior compression–accuracy ratio via search | Often matches or exceeds FP32 | NPU, GPU, CPU |
| Mixed Precision (4/8-bit) | 4–8× model-size reduction | Minimal loss with QAT + KD | Edge TPU, NPU, GPU |
| Technique/Approach | Advantages | Disadvantages/Limitations | Applications/Examples |
|---|---|---|---|
| Hardware Acceleration | High performance, energy efficiency | Costly, less flexible | Computer vision in IoT devices, real-time object detection, autonomous driving, and NLP on mobile devices. |
| Software-Level Optimization | Greater efficiency, lower latency, optimized resource usage | Requires specific tools and knowledge; may affect accuracy | Efficient implementation on mobile devices, memory optimization in NPUs, and attention mechanism acceleration on GPUs. |
| Emerging Techniques | High energy efficiency, potential for low latency | Precision limitations, developing technology | Smart sensors, high-performance applications. |
| Architecture | Task | Complexity (GOP) | Accuracy |
|---|---|---|---|
| AlexNet | Classification | 0.7 | 16% error |
| VGG-16 | Classification | 15 | 7.4% error |
| GoogLeNet | Classification | 1.4 | 6.7% error |
| ResNet50 | Classification | 3.9 | 5.3% error |
| MobileNet v2 | Classification | 0.3 | 9% error |
| ShuffleNet | Classification | 0.26 | 10% error |
| Tiny Yolo | Detection | 6.9 | 60% mAP |
| Yolo V2 | Detection | 39.4 | 76.8% mAP |
| Fast RCNN | Detection | 300 | 78.2% mAP |
| Faster RCNN | Detection | 150 | 76.4% mAP |
| SSD 300 | Detection | 34.9 | 74.3% mAP |
| SSD 512 | Detection | 52 | 76.8% mAP |
| SSD with MobileNet v2 | Detection | 1.2 | 72.7% mAP |
| Architecture | Task | Speed | Accuracy |
|---|---|---|---|
| YOLOv8 | Detection | 200 FPS | mAP@0.5: 73.9% |
| YOLOv10 | Detection | 280 FPS | mAP@0.5: 74.3% |
| YOLOv11 | Detection | 290 FPS | mAP@0.5: 76.8% |
| Model | Type | Parameters | Latency (GPU) | Latency (NPU) | Preferred Unit |
|---|---|---|---|---|---|
| LSTM | RNN | 1.6 M | 1.48 ms | 4.10 ms | GPU (2.7×) |
| TinyLlama | Transformer | 1.1 B | 8.30 s | 2.49 s | NPU (3.2×) |
| Experiment | Architecture | Dataset | FLOPs Reduction | Parameter Reduction |
|---|---|---|---|---|
| Image Classification | ResNet-56 | CIFAR-100 | 16.6% | 14.8% |
| ResNet-110 | CIFAR-100 | 15.9% | 13.5% | |
| ResNet-50 | ImageNet | 10.8% | 9.6% | |
| WideResNet-28-10 | CIFAR-100 | 10.5% | 9.3% | |
| ResNeXt-29-4x64d | CIFAR-100 | 12.1% | 10.7% | |
| SE-ResNet-56 | CIFAR-100 | 14.2% | 12.8% | |
| ResNeSt-50 | ImageNet | 8.5% | 7.9% | |
| MobileNetV2 | ImageNet | 6.3% | 5.8% | |
| ShuffleNetV2-1.0x | ImageNet | 5.1% | 4.7% | |
| Point Cloud Classification | ResGCN | ModelNet40 | 11.2% | 10.1% |
| Dataset | Model | Accuracy (%) | FLOPs Reduction (%) | Parameters Reduction (%) |
|---|---|---|---|---|
| MNIST | MLP | 91.95 | 97 | 96 |
| CIFAR-10 | VGG-16 | 93.25 | 79.87 | 72.90 |
| ResNet-20 | 94.52 | 55.02 | 49.35 | |
| MobileNetV2 | 91.58 | 38.56 | 36.81 | |
| CIFAR-100 | VGG-16 | 72.51 | 73.05 | 70.26 |
| ResNet-20 | 76.48 | 57.46 | 62.56 | |
| MobileNetV2 | 71.54 | 62.61 | 60.48 | |
| ImageNet | ResNet-34 | 71.51 | 42.05 | 35.31 |
| ResNet-50 | 75.33 | 43.80 | 43.67 | |
| Tiny-ImageNet | VGG-19 | 53.54 | 63.42 | 54.63 |
| Model | (Top-1) Accuracy (%) | FLOPs Reduction (%) | Latency Predictor Error (%) |
|---|---|---|---|
| VGG-19 | 83.78 | 83.02 | 6.12 |
| 83.12 | 83.02 | 6.12 | |
| ResNet-50 | 84.80 | 76.50 | 6.12 |
| 85.64 | 23.79 | 6.12 | |
| GoogLeNet | 82.80 | 77.22 | 6.12 |
| 85.12 | 23.18 | 6.12 |
| Implementation | Time per Sample (µs) | Total Time (for 10,000 Samples) | Design Complexity | Accuracy |
|---|---|---|---|---|
| Simple FNN-High Precision 1 | 0.0555 | 555 µs | Low | High |
| Simple FNN-High Precision 2 | 0.0586 | 586 µs | Low | High |
| Simple FNN-Medium Precision 1 | 0.0620 | 620 µs | Low | Medium |
| Simple FNN-Medium Precision 2 | 0.0578 | 578 µs | Low | Medium |
| Complex FNN-High Precision | 1.04 | 10.4 ms | High | High |
| Medium FNN-Medium Precision | 0.08067 | 806.7 µs | Medium | Medium |
| Architecture | Description | Applications | Compression | Hardware Acceleration |
|---|---|---|---|---|
| FNN | Fully connected feedforward neural network. | Classification, regression. | Pruning, quantization. | FPGA, ASIC |
| CNN | Convolutional neural network. | Computer vision, image processing. | Pruning, quantization, knowledge distillation. | GPU, FPGA, ASIC, TPU |
| RNN | Recurrent neural network. | Sequence processing, time series analysis. | Pruning, quantization. | FPGA, ASIC |
| LSTM | Long Short-Term Memory (a type of RNN). | Natural language processing, machine translation. | Pruning, quantization. | FPGA, ASIC |
| GRU | Gated Recurrent Unit (a type of RNN). | Natural language processing, machine translation. | Pruning, quantization. | FPGA, ASIC |
| Transformer | Attention-based architecture. | Natural language processing, machine translation. | Pruning, quantization. | TPU, GPU |
| Compact Nets | Models with a reduced number of parameters. | Resource-constrained edge applications. | Efficient design, quantization. | Various hardware platforms. |
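Most of the compression techniques listed in the table (pruning, quantization, knowledge distillation) are supported by mainstream toolchains. As one hedged example, the sketch below applies post-training dynamic-range quantization to a Keras model with the TensorFlow Lite converter; the toy model and file name are placeholders, not one of the architectures benchmarked in this survey.

```python
# Sketch: post-training dynamic-range quantization with TensorFlow Lite.
# The toy model is a placeholder; substitute the trained network to be deployed.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables dynamic-range quantization
tflite_model = converter.convert()

# Write the quantized flatbuffer for deployment on the target device.
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```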
| Architecture | Hardware Accelerator | Task | Quantization | Energy Efficiency |
|---|---|---|---|---|
| CNN | XCZU7EV (FPGA) | Classification (MobileNetV2) | 12/10 bits | 0.37 GOPS/W |
| CNN | XC7Z020 FPGA (SCE Co-Design) | TSR (LeNet-5) | - | ~14.7 GFLOPS/W |
| RNN (GRU) | XC7Z007S FPGA (EdgeDRNN) | Speech Recognition (TIDIGITS) | INT16/8 | 8.8 GOPS/W |
| RNN (LSTM) | Arria 10 GX1150 FPGA | Speech Recognition | INT16 | 15.9 GOPS/W |
| CNN/RNN | ASIC (LNPU, KAIST) | Inference/Learning (CONV/FC) | Float8 | 25,300 GOPS/W |
| RNN (Delta) | XC7Z100 FPGA | Speech Recognition | 16 bits | 26.3 GOPS/W |
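The energy-efficiency figures above follow the usual definition of sustained operations per second divided by average power draw. A brief sketch of that arithmetic (all numbers illustrative; the table reports figures measured by the cited works):

```python
# Sketch: energy efficiency in GOPS/W from workload size, latency, and power.
# All values are illustrative and not taken from the cited accelerators.

def gops_per_watt(ops_per_inference: float, latency_s: float, power_w: float) -> float:
    """Sustained throughput (GOPS) divided by average power (W)."""
    gops = ops_per_inference / latency_s / 1e9
    return gops / power_w

# e.g. a 0.5 GOP workload finishing in 12 ms at 3.2 W
print(f"{gops_per_watt(0.5e9, 12e-3, 3.2):.1f} GOPS/W")  # -> 13.0 GOPS/W
```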
| Survey | Scope | Limitations | Additional Contribution of This Review |
|---|---|---|---|
| Chen et al., “Deep Learning on Mobile and Embedded Devices: State-of-the-Art, Challenges, and Future Directions” (2021) [6] | Broad examination of DL techniques applied to mobile and embedded environments. | Lacks an operational deployment workflow; limited integration of model optimization with hardware-specific constraints. | Provides a structured five-stage methodology that links optimization, architecture selection, and hardware requirements into a unified deployment process. |
| Chen et al., “DNN-Based Vehicle and Pedestrian Detection for Autonomous Driving: A Survey” (2021) [38] | Comprehensive survey focused on perception tasks for autonomous driving. | Application-specific; does not generalize to Edge AI deployment or resource-constrained optimization. | Extends analysis beyond automotive tasks, offering a generalizable workflow for embedded AI across multiple domains. |
| Ali & Zhang, “The YOLO Framework: A Comprehensive Review” (2024) [76] | Detailed taxonomy of the YOLO family and its evolution. | Architecture-specific; lacks discussion of compression, quantization, or hardware alignment. | Integrates YOLO within a broader system-level methodology that includes model compression and accelerator-aware deployment. |
| Novac et al., “Quantization and Deployment of Deep Neural Networks on Microcontrollers” (2021) [77] | Focused review on TinyML and quantization techniques for MCUs. | Limited to microcontroller-class systems; excludes NPUs, TPUs, and heterogeneous edge hardware. | Covers the full spectrum of hardware platforms, from MCUs to high-performance NPUs, within a common methodological framework. |
| Berthelier et al., “Deep Model Compression and Architecture Optimization for Embedded Systems: A Survey” (2021) [78] | Extensive review of pruning, quantization, and architecture-level optimization. | Does not incorporate hardware benchmarking or present a reproducible deployment workflow. | Provides integrated empirical evaluation and a multi-stage methodology grounded in both literature evidence and real hardware trials. |
| Dhilleswararao et al., “Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey” (2022) [79] | Survey of hardware architectures for DNN acceleration, including FPGA and ASIC platforms. | Hardware-centric; lacks alignment between model-level techniques and accelerator constraints. | Bridges model-level optimization with hardware characteristics, enabling informed co-design decisions within the methodology. |
| Akkad et al., “Embedded Deep Learning Accelerators: A Survey on Recent Advances” (2024) [28] | Review of recent embedded DNN accelerators and their architectural design. | Does not address resource-aware model optimization or end-to-end deployment strategies. | Positions accelerators within a complete deployment workflow, including model selection, quantization, and validation. |
| Alam et al., “Survey of Deep Learning Accelerators for Edge and Emerging Computing” (2024) [80] | Examines accelerator trends in emerging edge-computing hardware. | Lacks methodological guidance for selecting and optimizing models for each hardware platform. | Enhances hardware discussions with a reproducible five-stage decision-making methodology and cross-platform benchmark synthesis. |
| Biglari & Tang, “A Review of Embedded Machine Learning Based on Hardware, Application, and Sensing Scheme” (2023) [11] | Broad overview of embedded ML applications and sensing modalities. | Does not integrate optimization techniques or performance-driven hardware selection. | Provides a holistic framework that unifies optimization, architecture design, and empirical hardware validation. |
| Metric | Requirement (r) | Measurement (b) on NPU | Condition | Result |
|---|---|---|---|---|
| Latency | Between 100 ms and 200 ms | 132 ms (7.5 FPS) | 100 ≤ 132 ≤ 200 | Success |
| Accuracy (mAP) | ≥90% | 91.5% | 91.5 ≥ 90 | Success |
| Power (Active) | <5 W | 4.1 W | 4.1 < 5 | Success |
| Metric | CPU | NPU | GPU |
|---|---|---|---|
| Latency | 980 ms | 132 ms | 975 ms |
| Accuracy (mAP) | 94% | 91.5% | 92% |
| Power (Active) | 5.1 W | 4.1 W | 5 W |
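The decision rule behind both tables is a straightforward requirement check: each measured value b must satisfy its bound r, an interval for latency, a lower bound for accuracy, and an upper bound for power. A minimal Python sketch of that check, with bounds and measurements copied from the tables and all names our own:

```python
# Sketch: verify each processing unit against the deployment requirements of the
# example use case. Requirement bounds and measurements mirror the tables above;
# function and variable names are ours.

requirements = {
    "latency_ms":  lambda b: 100 <= b <= 200,   # target latency window
    "map_percent": lambda b: b >= 90,           # minimum acceptable mAP
    "power_w":     lambda b: b < 5,             # active power budget
}

candidates = {
    "CPU": {"latency_ms": 980, "map_percent": 94.0, "power_w": 5.1},
    "NPU": {"latency_ms": 132, "map_percent": 91.5, "power_w": 4.1},
    "GPU": {"latency_ms": 975, "map_percent": 92.0, "power_w": 5.0},
}

def verify(measured: dict) -> dict:
    """Pass/fail per metric for one hardware candidate."""
    return {m: check(measured[m]) for m, check in requirements.items()}

for name, measured in candidates.items():
    results = verify(measured)
    verdict = "meets all requirements" if all(results.values()) else "rejected"
    print(f"{name}: {results} -> {verdict}")
# Only the NPU satisfies latency, accuracy, and power simultaneously.
```

Under these bounds the NPU is the only unit that passes every check, which is why it is retained despite reporting a slightly lower mAP than the CPU and GPU.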