Deploying LLM Transformers on Edge Computing Devices: A Survey of Strategies, Challenges, and Future Directions
Abstract
1. Introduction
1.1. Scope of the Review
- Model Compression Techniques: We will survey and critically analyze various methods used to reduce the size and computational requirements of LLMs. This includes a deep dive into quantization (e.g., Post-Training Quantization, Quantization-Aware Training), pruning (e.g., unstructured vs. structured), and knowledge distillation.
- Architectural and Algorithmic Optimizations: The paper will explore architectural modifications and algorithmic enhancements to the standard Transformer model that are specifically designed for improved efficiency. This includes a review of efficient attention mechanisms, new model architectures, and inference-time optimizations like speculative decoding and efficient Key-Value (KV) cache management.
- System-Level and Hybrid Approaches: We will examine strategies that go beyond single-model optimization to improve overall system performance. This includes an overview of hybrid edge-cloud systems, on-device fine-tuning techniques (e.g., Parameter-Efficient Fine-Tuning or PEFT), and federated learning applications for LLMs on the edge.
- Hardware and Software Landscape: The review will provide an overview of the current hardware ecosystem for edge AI, including CPUs, GPUs, and specialized accelerators (NPUs, FPGAs). It will also cover the software frameworks and libraries (e.g., TFLite, ONNX Runtime) that enable and facilitate these deployments.
1.2. Rationale and Contributions
- A Comprehensive Review: The paper provides a review of LLM deployment strategies, challenges, and future directions for Transformer models on edge computing.
- Structured Understanding: It organizes the multi-faceted challenges and diverse solutions from academic and industrial literature into a structured framework.
- Exploration of Optimization: The paper examines methods to compress LLMs and optimize their inference capabilities to make them more efficient for edge environments. This approach not only reduces computational costs but also improves user privacy and security.
- Guidance for Organizations: The review provides insights into best practices, successful implementations, and the unique challenges of integrating LLMs with edge computing, which is crucial for organizations looking to leverage AI effectively.
1.3. Paper Organization
2. Methodology and Literature Selection
2.1. Search Strategy and Databases
- IEEE Xplore & ACM Digital Library: Primary sources for architectural innovations (e.g., GQA, PagedAttention) and hardware acceleration.
- ScienceDirect (Elsevier) & SpringerLink: Sources for multi-disciplinary applications, such as structural health monitoring and industrial IoT.
- arXiv: Utilized for the most recent state-of-the-art (SOTA) algorithms (e.g., GPTQ, AWQ, and Llama variants) that are currently the industry standard but may still be in the pre-publication phase.
2.2. Inclusion and Exclusion Criteria
2.3. The Foundations of LLMs, Transformers, and Edge Computing
2.4. Large Language Models (LLMs)
- Model Size: Modern LLMs are characterized by their massive number of parameters, often reaching billions or even trillions [16]. For instance, models like GPT-3 contain 175 billion parameters, while newer iterations may have even more. This vast number of parameters allows LLMs to capture intricate patterns and relationships in the data, enabling them to generate coherent and contextually relevant text.
- Training Data: The complexity of LLMs is further amplified by the sheer volume and diversity of the training data used [17]. These models are trained on extensive datasets that encompass a wide range of topics, languages, and styles. This diversity is crucial for enabling the models to generalize well across different contexts and applications, from casual conversation to technical writing.
- Architectural Innovations: The underlying architecture of LLMs, primarily based on the Transformer model, introduces complexity through innovative mechanisms such as self-attention [16]. Self-attention allows the model to weigh the significance of different words in a sequence and understand context better. However, the mechanism has quadratic complexity with respect to input length, which poses challenges in terms of computational resources and memory usage.
- Computational Requirements: Training and deploying modern LLMs require substantial computational resources. The training process often involves distributed computing across numerous GPUs or TPUs, leading to significant energy consumption and costs [18]. Efficient training algorithms and optimization techniques are therefore crucial to manage these resource demands.
- Inference Complexity: Once trained, the complexity continues during inference, where generating responses can be computationally intensive, especially for long sequences [19]. Techniques such as caching, pruning, and quantization are employed to optimize inference times and reduce latency, which is particularly important in edge computing scenarios.
- Integration with Edge Computing: As LLMs are integrated into edge computing environments, the complexity increases due to the need for model compression and optimization [20]. Techniques like quantization and pruning are essential to deploy these models on resource-constrained devices while maintaining performance. This integration allows for real-time processing and decision-making but adds layers of complexity in terms of ensuring efficiency and responsiveness.
- Ethical and Regulatory Considerations: The scale and complexity of LLMs also raise ethical and regulatory challenges [21]. As these models become more powerful, concerns regarding bias, misinformation, and data privacy become more pronounced. Organizations must navigate these issues while leveraging the capabilities of LLMs, adding another layer of complexity to their deployment.
2.4.1. Transformers
- Input Representation: Each word in the input sequence is transformed into a vector representation (embedding) [25].
- Query, Key, and Value Vectors: For each word, three vectors are created: a query vector, a key vector, and a value vector. These are derived from the input embeddings using learned linear transformations [26].
- Attention Scores: The attention scores are calculated by taking the dot product of the query vector of a word with the key vectors of all words in the sequence. This results in a score that indicates how much focus should be placed on each word when processing a particular word [27].
- Softmax Normalization: The scores are then normalized using the softmax function to create a probability distribution, which determines the weight of each word in the context of the current word [28].
- Weighted Sum: Finally, the output for each word is computed as a weighted sum of the value vectors, using the attention weights derived from the previous step [29].
- Contextual Understanding: Self-attention allows the model to capture long-range dependencies and relationships between words, significantly improving contextual understanding [30].
- Parallelization: Unlike recurrent neural networks (RNNs), which process sequences sequentially, self-attention can process all words in parallel, leading to faster training times [31].
- Quadratic Complexity: The self-attention mechanism has a computational complexity of O(n²), where n is the length of the input sequence, because attention scores must be computed for every pair of words in the sequence [32]. A minimal code sketch of the attention computation, exposing this n × n score matrix, follows this list.
- Memory Usage: The requirement to store and compute these scores for all pairs of words can lead to high memory consumption, especially for long sequences [33].
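To make the steps above concrete, the following minimal NumPy sketch (an illustration only; it omits multi-head projections, masking, and the batching used in real implementations) computes single-head scaled dot-product self-attention and shows where the n × n score matrix responsible for the quadratic cost arises.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence.

    X          : (n, d_model) input embeddings, one row per token
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices
    Returns (n, d_k) context vectors (weighted sums of the value vectors).
    """
    Q = X @ Wq                                  # query vectors
    K = X @ Wk                                  # key vectors
    V = X @ Wv                                  # value vectors

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, n) score matrix -- the O(n^2) term

    # softmax normalization over each row (numerically stabilized)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V                          # weighted sum of value vectors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_model, d_k = 8, 16, 8
    X = rng.normal(size=(n, d_model))
    Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
    out = self_attention(X, Wq, Wk, Wv)
    print(out.shape)  # (8, 8); the intermediate score matrix is (n, n) = (8, 8)
```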
2.4.2. Edge Computing
- Reduced Latency: Edge computing significantly reduces latency by minimizing the distance data must travel for processing. This is particularly beneficial for applications requiring real-time responses, as it allows for quicker data analysis and decision-making. For LLMs, this means that users can receive immediate feedback when interacting with AI systems, enhancing the overall user experience [36].
- Enhanced Privacy: By processing data locally on edge devices, organizations can better protect sensitive information. This approach reduces the need to transmit personal or sensitive data to centralized cloud servers, thereby mitigating risks associated with data breaches and unauthorized access. For LLM applications, maintaining privacy is crucial, especially when handling user-generated content or confidential information [37].
- Bandwidth Conservation: Edge computing helps conserve bandwidth by reducing the volume of data transmitted to and from cloud servers. Since data can be processed locally, only essential information needs to be sent to the cloud, leading to more efficient use of network resources. This is particularly advantageous for LLMs deployed in environments with limited connectivity or high data transfer costs, ensuring that AI applications remain functional and responsive [34].
- Scalability and Flexibility: Edge computing enables organizations to scale their AI applications more flexibly. By distributing processing across multiple edge devices, businesses can adapt to varying workloads and resource availability without over-relying on centralized infrastructure. This flexibility supports the deployment of LLMs in diverse settings, from smart homes to industrial automation [36].
- Limited Power: Edge devices often operate on battery power or have strict energy consumption limits. This limitation necessitates the use of energy-efficient algorithms and models. According to [34], edge computing reduces latency and conserves bandwidth, but it also requires careful management of power consumption to ensure device longevity. The need for power efficiency is particularly critical when deploying resource-intensive LLMs, as their computational demands can quickly deplete battery life.
- Memory Constraints: The memory available on edge devices is typically much lower than that of traditional cloud servers. Modern LLMs, such as GPT-3, can have hundreds of billions of parameters, which require substantial memory for both storage and processing [39]. Deploying LLMs on resource-constrained devices necessitates model compression techniques to fit within the limited memory capacity. Without effective memory management strategies, such as pruning or quantization, running LLMs on edge devices can be impractical.
- Computational Limitations: Edge devices generally possess less computational power compared to centralized cloud servers. The computational intensity of LLMs, particularly during inference, poses a challenge for these devices. As highlighted by [36], the complexity of processing LLMs can lead to high latency and resource consumption, which are unsuitable for the real-time requirements of many edge applications. This necessitates the development of optimized algorithms and architectures that can operate efficiently under these constraints.
- Network Bandwidth: Although not a direct constraint of the devices themselves, the network bandwidth available to edge devices can limit their ability to interact with cloud services for additional processing or data retrieval. As stated by [34], edge computing helps conserve bandwidth by processing data locally, but this also means that edge devices must be capable of handling as much processing as possible on their own. This further emphasizes the need for LLMs to be optimized for edge deployment.
3. Taxonomy of Edge LLM Deployment Strategies
- The Memory Wall (VRAM/Bandwidth): Solved via Model Compression (Quantization, Pruning, KD). These methods prioritize reducing the weight-loading overhead.
- The Quadratic Wall (Context Complexity): Solved via Architecture Optimization (GQA, SWA, Sparse Attention). These methods tame the quadratic scaling of the self-attention mechanism; a sliding-window attention mask, one such mitigation, is sketched after this list.
- The Compute Wall (Latency/FLOPs): Solved via System-Level Management (Speculative Decoding, PagedAttention, Partitioning). These strategies optimize the execution pipeline for real-time responsiveness.
- The Thermal Wall (Power/TDP): Solved via Hardware-Software Co-design (NPU/FPGA acceleration). These ensure sustainable operation in battery-limited environments.
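As a concrete instance of the architectural mitigations for the Quadratic Wall, the sketch below builds a sliding-window causal attention mask of the kind used in SWA-style models. The sequence length and window size are illustrative assumptions, not parameters of any particular model.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where True means 'token i may attend to token j'.

    Each token attends only to itself and the previous (window - 1) tokens,
    so the number of attended pairs grows as O(n * window) instead of O(n^2).
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # no attention to future tokens
    local = (i - j) < window          # restrict attention to a local window
    return causal & local

if __name__ == "__main__":
    mask = sliding_window_causal_mask(seq_len=8, window=3)
    print(mask.astype(int))
    # Row 5 attends only to positions 3, 4, 5 instead of all of 0..5.
```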
3.1. Model Compression Techniques
3.1.1. Quantization Methods for Transformer Models
3.1.2. Pruning in Large Language Models (LLMs)
- Unstructured Pruning in LLMs: Due to the massive size of LLMs, unstructured pruning can be effective in reducing the number of parameters without altering the overall architecture significantly. For example, ref. [50] demonstrated that unstructured pruning could lead to a 90% reduction in model size while retaining competitive performance in tasks such as text generation.
- Structured Pruning in LLMs: Structured pruning is particularly advantageous when deploying LLMs on edge devices, where computational efficiency is crucial. Techniques such as channel pruning can be applied to transformer models, leading to reduced latency and improved responsiveness. For instance, ref. [51] showed that structured pruning could significantly reduce the number of operations required for inference in LLMs, making them more suitable for real-time applications. Minimal sketches of both pruning styles follow this list.
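The following is a minimal PyTorch sketch of the two pruning styles discussed above, under the simplifying assumption of a pure magnitude criterion on a single linear layer. Production LLM pruning pipelines additionally rely on calibration data, layer-wise sensitivity analysis, and sparsity-aware kernels, so this is an illustration of the idea rather than of any published method.

```python
import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float) -> torch.Tensor:
    """Unstructured pruning: zero out the smallest-magnitude weights in place.

    Returns the binary mask so it can be re-applied after fine-tuning steps.
    """
    w = linear.weight.data
    k = int(sparsity * w.numel())                      # number of weights to remove
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    mask = (w.abs() > threshold).to(w.dtype)
    w.mul_(mask)                                       # zero the pruned weights
    return mask

def channel_prune(linear: torch.nn.Linear, keep_ratio: float) -> torch.nn.Linear:
    """Structured pruning: drop whole output channels with the smallest L2 norm,
    producing a genuinely smaller layer (fewer FLOPs, no sparse kernels needed)."""
    norms = linear.weight.data.norm(dim=1)                   # one norm per output row
    keep = int(keep_ratio * linear.out_features)
    idx = torch.topk(norms, keep).indices.sort().values      # channels to keep
    pruned = torch.nn.Linear(linear.in_features, keep, bias=linear.bias is not None)
    pruned.weight.data = linear.weight.data[idx].clone()
    if linear.bias is not None:
        pruned.bias.data = linear.bias.data[idx].clone()
    return pruned

if __name__ == "__main__":
    layer = torch.nn.Linear(512, 512)
    mask = magnitude_prune_(layer, sparsity=0.9)       # ~90% of weights set to zero
    smaller = channel_prune(layer, keep_ratio=0.5)     # 512 -> 256 output channels
    print(mask.mean().item(), smaller.weight.shape)
```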
3.1.3. Knowledge Distillation
- Model Compression: Knowledge distillation effectively reduces the size of neural networks while maintaining performance. The student model learns to mimic the teacher model’s outputs, which allows it to achieve competitive accuracy with significantly fewer parameters. Hinton et al. [54] demonstrated that a small student model could achieve similar performance to a larger teacher model by learning from the soft targets (probabilities) produced by the teacher, rather than just the hard labels. This approach enables the student model to capture more nuanced information about the data distribution. A minimal sketch of this soft-target loss follows this list.
- Task-Specific Adaptation: KD is particularly effective for creating task-specific models. By distilling knowledge from a teacher model trained on a broad dataset, the student model can be fine-tuned for specific tasks or domains. For instance, Sun et al. [55] showed that distillation could be used to adapt a general language model to a specific downstream task, leading to improved performance on that task while retaining the efficiency benefits of a smaller model.
- Improved Generalization: Knowledge distillation can enhance the generalization capabilities of student models. The teacher model often has learned robust features and representations from extensive training data. By transferring this knowledge, the student model can generalize better to unseen data. An empirical study by Fang et al. [56] found that student models trained through distillation exhibited improved performance on various tasks compared to models trained directly on the same dataset.
- Efficiency in Inference: Smaller student models resulting from KD are more efficient during inference, making them suitable for deployment on edge devices. These models require less computational power and memory, which is crucial for real-time applications. KD can significantly reduce inference latency while maintaining high accuracy, making it a valuable strategy for applications in environments with limited resources [57].
- Flexibility in Architecture: Knowledge distillation allows for flexibility in the architecture of student models. Researchers can experiment with different architectures and hyperparameters to find the optimal configuration for specific tasks. This adaptability is highlighted by Cho and Hariharan [58], who demonstrated that various student architectures could successfully learn from a single teacher model, resulting in tailored solutions for different applications.
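The soft-target objective described in the first item above can be written as a weighted sum of a temperature-scaled KL term and the ordinary cross-entropy loss. The sketch below is a minimal PyTorch rendering of that loss; the temperature and weighting values are illustrative defaults, not prescriptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Combine the soft-target KD loss with the ordinary hard-label loss.

    The KL term is scaled by T^2 so its gradient magnitude stays comparable
    when the temperature changes (as discussed by Hinton et al. [54]).
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

if __name__ == "__main__":
    batch, num_classes = 4, 10
    student = torch.randn(batch, num_classes, requires_grad=True)
    teacher = torch.randn(batch, num_classes)          # frozen teacher outputs
    labels = torch.randint(0, num_classes, (batch,))
    loss = distillation_loss(student, teacher, labels)
    loss.backward()
    print(loss.item())
```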
3.2. Architectural and Algorithmic Optimizations
3.2.1. Efficient Transformer Variants: A Survey of New Architectures
3.2.2. Inference Optimization Techniques for Large Language Models
3.2.3. Automated Search and Hardware-Aware Optimization
3.3. System-Level and Hybrid Approaches
3.3.1. Edge-Cloud Collaboration
3.3.2. On-Device Fine-Tuning
3.3.3. Federated Learning
- Overview of Federated Learning in LLMs: Federated Learning facilitates the training of LLMs by allowing each participant to train a local model on their private data and subsequently share only model updates (gradients) with a central server. The server aggregates these updates to improve a global model, which is then distributed back to the participants [92]. This approach ensures that sensitive data remains on local devices, mitigating privacy concerns.
- Applications of Federated Learning in LLMs
- Healthcare: In healthcare applications, federated learning has been employed to train language models on patient records while preserving confidentiality. For instance, researchers have utilized FL to develop LLMs that can analyze medical texts and support clinical decision-making without exposing sensitive patient data [93].
- Natural Language Processing: Federated learning has been applied in NLP tasks, such as sentiment analysis and text classification, where organizations can collaboratively train models on user-generated content without sharing the underlying data. This enables the creation of more robust models that generalize better across different contexts [94].
- Personalized Language Models: FL allows for the development of personalized language models that adapt to individual user preferences while preserving privacy. By training on local data, organizations can create models that reflect unique user interactions without compromising sensitive information [95].
- Benefits of Federated Learning for LLMs
- Data Privacy: One of the primary advantages of FL is its ability to preserve data privacy. Since raw data never leaves the local device, organizations can comply with stringent data protection regulations, such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act) [96].
- Diverse Data Sources: Federated learning enables the aggregation of diverse datasets from multiple sources, which can enhance the generalization capabilities of LLMs. This diversity is particularly beneficial for NLP tasks, where language usage can vary significantly across different demographics and contexts [97].
- Reduced Communication Costs: By transmitting only model updates instead of raw data, FL reduces communication costs and bandwidth usage, making it more efficient, especially in environments with limited connectivity [98].
- Challenges in Implementing Federated Learning for LLMs
- Heterogeneity of Data: One of the significant challenges in FL is the non-IID (Independent and Identically Distributed) nature of data across participants. This heterogeneity can lead to biased model updates and affect the convergence of the global model [99].
- Communication Overhead: While federated learning reduces the amount of data transmitted, the need for frequent communication between devices and the central server can still introduce latency and overhead, particularly when dealing with large models like LLMs [100].
- Security Risks: Although FL enhances data privacy, it is still susceptible to potential attacks, such as model inversion or poisoning attacks, where adversaries may attempt to infer sensitive information from shared gradients [101].
- Communication-Efficient Federated Learning Strategies (illustrated in the sketch after this list)
- Gradient Compression: Reduces update size via quantization (e.g., 1-bit SGD) or sparsification (sending only top-k gradients).
- Integration with PEFT: Instead of full gradients, only updates for small adapter modules (like LoRA) are transmitted.
- Local SGD/FedAvg: Increases local training steps on the device before communicating with the server.
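The sketch below is a minimal, hedged illustration of two of these strategies: top-k sparsification of a client update and FedAvg weighted aggregation on the server. NumPy arrays stand in for model (or adapter) updates; transport, encryption, and the PEFT adapters mentioned above are deliberately omitted.

```python
import numpy as np

def sparsify_topk(update: np.ndarray, k: int):
    """Client-side compression: keep only the k largest-magnitude entries,
    transmitting (indices, values, shape) instead of the dense update."""
    flat = update.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx], update.shape

def densify(idx, values, shape):
    """Server-side reconstruction of a sparsified update."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)

def fedavg(updates, num_examples):
    """FedAvg aggregation: weight each client's update by its local dataset size."""
    total = sum(num_examples)
    return sum((n / total) * u for n, u in zip(num_examples, updates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    client_updates = [rng.normal(size=(4, 4)) for _ in range(3)]   # local updates
    dataset_sizes = [100, 400, 500]                                 # local example counts
    compressed = [sparsify_topk(u, k=4) for u in client_updates]    # 4 of 16 entries sent
    restored = [densify(*c) for c in compressed]
    global_update = fedavg(restored, dataset_sizes)
    print(global_update.shape)
```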
4. Hardware and Software Considerations
4.1. Edge AI Hardware
4.1.1. Review of Hardware Platforms for Running Transformers
4.1.2. Evaluation Metrics for Edge LLMs
4.1.3. Hardware Accelerators for AI Workloads
4.2. Inference Frameworks and Toolkits
4.2.1. Key Software Libraries and Frameworks for Edge Deployment
4.2.2. Toolchains of Model Training and Deployment Gaps
- Streamlined Workflow: Toolchains provide a structured workflow that integrates various stages of the machine learning lifecycle, from data preprocessing and model training to deployment and monitoring. This integration helps reduce the complexity involved in managing multiple tools and ensures that data scientists and engineers can focus on building effective models rather than dealing with disparate systems. A cohesive toolchain can significantly improve productivity by automating repetitive tasks and providing a unified interface for managing the model lifecycle [122].
- Model Optimization: Many toolchains include features for model optimization, such as quantization, pruning, and compression techniques. These optimizations are crucial for deploying models in resource-constrained environments, such as edge devices. For instance, TensorFlow Lite and ONNX Runtime offer built-in support for these optimizations, enabling developers to reduce model size and improve inference speed without sacrificing accuracy [123]. This capability is vital for ensuring that models perform efficiently in real-time applications.
- Cross-Platform Compatibility: Toolchains often support multiple frameworks and platforms, allowing models to be trained in one environment and deployed in another. This flexibility is particularly important in heterogeneous computing environments where different hardware accelerators (e.g., CPUs, GPUs, NPUs) may be utilized. For example, ONNX Runtime enables models trained in various frameworks, such as PyTorch and TensorFlow, to be converted to the ONNX format for deployment, facilitating seamless integration across different systems [124]. A minimal export-and-run sketch follows this list.
- Monitoring and Maintenance: Effective toolchains include monitoring capabilities that allow organizations to track model performance in production. This monitoring is essential for identifying issues such as model drift, where the model’s performance degrades over time due to changes in data distribution. By incorporating monitoring tools, organizations can implement strategies for model retraining and updates, ensuring sustained model accuracy and relevance [125].
- Collaboration and Version Control: Toolchains enhance collaboration among team members by providing version control and reproducibility features. This is particularly important in machine learning projects where multiple stakeholders may be involved in model development and deployment. A well-designed toolchain can facilitate collaboration and ensure that all team members are working with the same model versions and datasets, thereby reducing errors and improving project outcomes [126].
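As a small, hedged illustration of the cross-platform path described above, the sketch below exports a toy PyTorch module to ONNX and runs it with ONNX Runtime. The model, file name, and tensor shapes are placeholders; a real deployment would export the trained model and select an edge-appropriate execution provider instead of the CPU provider shown here.

```python
import numpy as np
import torch
import onnxruntime as ort

# A toy stand-in for a trained model (placeholder architecture).
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
)
model.eval()

# Export to the framework-neutral ONNX format.
dummy = torch.randn(1, 16)
torch.onnx.export(model, dummy, "toy_model.onnx",
                  input_names=["input"], output_names=["logits"])

# Deployment-side inference with ONNX Runtime (an edge build would pick an
# NPU/GPU execution provider where one is available).
session = ort.InferenceSession("toy_model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"input": np.random.randn(1, 16).astype(np.float32)})[0]
print(logits.shape)  # (1, 4)
```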
5. Key Challenges
5.1. Performance vs. Accuracy Trade-Offs in Transformer Models on Edge Devices
- Model Size and Computational Demands: Transformers, especially modern LLMs, often contain billions of parameters, making them computationally intensive. While larger models tend to achieve higher accuracy due to their ability to capture intricate patterns in data, they also require substantial resources for both training and inference. This necessitates a careful consideration of model size and its impact on performance.
- Speed and Latency Considerations: Edge devices are often tasked with real-time processing, which requires low-latency responses. The computational intensity of LLMs can lead to increased inference times, making them unsuitable for applications that demand immediate feedback. Consequently, optimizing for speed may involve sacrificing some degree of accuracy, particularly if model simplifications are made.
- Model Compression Techniques: To address the performance vs. accuracy dilemma, various model compression techniques have been developed. These include quantization, pruning, and knowledge distillation, which aim to reduce the size and computational requirements of models while striving to maintain accuracy. However, these techniques can degrade model performance, especially in layers sensitive to changes in weight distributions.
- Quantization and Its Impact: Quantization methods, such as Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), can significantly reduce model size but may also introduce accuracy trade-offs, highlighting the inherent tension between a smaller, faster model and high accuracy. A minimal PTQ sketch follows this list.
- The Role of Efficient Architectures: Efficient Transformer variants, such as Linformer and Performer, have been proposed to mitigate the computational demands of standard Transformers. These architectures reduce the complexity of the self-attention mechanism, allowing for faster processing without a significant drop in accuracy. Ideally, an efficient architecture achieves linear complexity with respect to sequence length while maintaining competitive performance, directly addressing the performance-accuracy trade-off.
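The following is a minimal sketch of symmetric per-tensor INT8 post-training quantization, using the mean reconstruction error of a random weight matrix as a crude proxy for the accuracy cost. Real PTQ methods such as GPTQ or AWQ use calibration data and per-channel or activation-aware scaling, so this illustrates the trade-off rather than those algorithms.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor PTQ: map float weights onto the signed INT8 grid."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for error measurement."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
    q, scale = quantize_int8(w)                     # 4 bytes/param -> 1 byte/param
    err = np.abs(w - dequantize(q, scale)).mean()   # proxy for the accuracy cost
    print(f"mean abs reconstruction error: {err:.2e}, scale: {scale:.2e}")
```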
5.2. Generalization and Long-Tail Problems in Edge Models
- Limited Training Data and Diversity: Edge models often operate on smaller, domain-specific datasets due to constraints in data availability or privacy concerns. This limited exposure can hinder their ability to generalize across a wide range of scenarios. In contrast, larger cloud models are typically trained on vast and diverse datasets, allowing them to capture complex patterns and variations in language. The lack of diversity in training data for edge models can lead to overfitting, where the model performs well on the training data but fails to generalize to new inputs.
- Long-Tail Distribution Challenges: The long-tail problem arises when models encounter rare events or infrequent categories that are underrepresented in the training data. Edge models, which may be trained on specific tasks or user interactions, often struggle to handle these long-tail scenarios effectively. This is particularly problematic in applications like natural language processing, where certain phrases or topics may appear infrequently but are critical for comprehensive understanding. In contrast, larger models benefit from extensive training on diverse datasets, enabling them to better manage long-tail distributions and provide more robust responses.
- Model Size and Complexity: The size and complexity of edge models are inherently limited by the constraints of the hardware they run on, which affects their capacity to learn and generalize. Smaller models may lack the representational power needed to capture intricate relationships within data, leading to poorer performance in generalization tasks. In contrast, massive cloud-based models, such as GPT-3, leverage billions of parameters to understand subtle nuances in language, resulting in superior generalization capabilities.
- Mitigation Strategies: To address these challenges, several strategies can be employed:
- Data Augmentation: Enhancing the training dataset through techniques such as paraphrasing or synthetic data generation can improve the diversity and robustness of edge models.
- Transfer Learning: Utilizing pre-trained models as a starting point for fine-tuning on edge devices can help leverage the knowledge captured by larger models, improving generalization on specific tasks.
- Ensemble Methods: Combining multiple models or predictions can help mitigate the effects of long-tail distributions, as ensemble approaches can capture a wider range of patterns.
5.3. Data Privacy and Security in On-Device Processing
- Importance of On-Device Processing for Privacy: On-device processing is essential for safeguarding user privacy, as it minimizes the risk of unauthorized data access and ensures compliance with stringent data protection regulations. By keeping sensitive data on local devices, organizations can reduce exposure to potential breaches while still harnessing the power of AI for real-time applications.
- Data Minimization: On-device processing allows for data minimization by reducing the need to transmit sensitive information to centralized servers. By analyzing data locally, only essential information is sent to the cloud, thereby limiting exposure to potential data breaches. This is particularly crucial for applications handling personal or sensitive data, such as health records or financial information.
- User Control: Users retain greater control over their data when processing occurs on their devices. This empowerment fosters trust, as individuals can manage their data without relying on third-party entities. Research indicates that users are more likely to engage with applications that prioritize data privacy.
- Regulatory Compliance: With stringent data protection regulations like the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), on-device processing can facilitate compliance by ensuring that sensitive data remains within the user’s device. This reduces the risk of regulatory penalties associated with data mishandling.
- Security Vulnerabilities of Deploying Models on Physical Devices: Despite the privacy benefits, deploying models on physical devices introduces several security vulnerabilities that must be addressed:
- Physical Access Risks: Devices can be physically accessed by unauthorized individuals, leading to potential data theft or manipulation. Attackers can exploit vulnerabilities in the device’s operating system or hardware to extract sensitive information stored locally.
- Malware and Exploits: Edge devices are often targets for malware attacks, which can compromise the integrity of the models and the data they process. Malicious software can manipulate the model’s behavior or exfiltrate sensitive data, posing significant risks to user privacy.
- Model Inversion Attacks: Attackers can perform model inversion attacks to infer sensitive information about the training data by querying the model with specific inputs. This vulnerability is particularly concerning for models deployed on devices that process sensitive data, as it can lead to unauthorized access to private information.
- Data Leakage through Side Channels: On-device models may inadvertently leak information through side channels, such as timing attacks or power consumption patterns. These attacks exploit the physical characteristics of the device during processing to extract sensitive information about the data being processed.
- Lack of Regular Updates: Unlike cloud-based systems that can be updated centrally, edge devices may lack regular security updates, leaving them vulnerable to newly discovered exploits. This can result in prolonged exposure to known vulnerabilities, increasing the risk of security breaches.
5.4. Energy Efficiency
- Power Constraints: Battery-powered devices often operate under strict energy constraints, necessitating the development of energy-efficient algorithms and models. The need for power efficiency is particularly critical when deploying resource-intensive Large Language Models (LLMs), as their computational demands can quickly deplete battery life.
- Model Compression Techniques: Techniques such as quantization, pruning, and knowledge distillation are essential for reducing the size and computational requirements of models, thereby enhancing energy efficiency. For instance, converting weights from FP32 to INT8 reduces both memory traffic and arithmetic cost, helping to conserve energy during inference.
- Energy-Efficient Architectures: The development of specialized hardware, such as Neural Processing Units (NPUs) and Tensor Processing Units (TPUs), has been driven by the need for energy-efficient processing of AI workloads. These accelerators are designed to optimize energy consumption while delivering high performance for deep learning tasks.
- Dynamic Power Management: Implementing dynamic power management strategies can significantly reduce energy consumption. Techniques such as adaptive voltage scaling and dynamic frequency scaling allow devices to adjust power usage based on workload demands, enhancing overall energy efficiency.
- Algorithmic Optimizations: Efficient algorithms can minimize the number of computations required for model inference, directly impacting energy consumption. For example, the use of efficient attention mechanisms in Transformer models can reduce the computational burden, leading to lower power usage during processing.
- Battery Life Considerations: As AI applications become more prevalent on edge devices, the need for energy-efficient designs is paramount to extend battery life. Research indicates that optimizing AI models for energy efficiency can lead to significant improvements in battery longevity, which is crucial for user satisfaction and device usability.
5.5. Dynamic Resource Management
- Importance of Dynamic Resource Management: Dynamic resource management ensures that applications can efficiently utilize available resources while maintaining performance standards. This adaptability is particularly critical in edge computing, where devices often have limited computational power, memory, and energy resources.
- Adaptive Resource Allocation: Adaptive resource allocation strategies allow systems to dynamically adjust resource distribution based on real-time conditions. For instance, resource allocation can be modified based on the current workload, device capabilities, and network latency. This flexibility enables the deployment of LLMs on edge devices while ensuring that performance remains optimal.
- Context-Aware Resource Management: Context-aware resource management systems leverage information about the current state of the device and network environment to make informed decisions about resource allocation. By analyzing contextual data, such as user behavior and application requirements, these systems can optimize resource usage effectively. This approach can lead to improved responsiveness and user experience in applications reliant on LLMs.
- Proactive vs. Reactive Strategies: Dynamic resource management can be categorized into proactive and reactive strategies. Proactive strategies anticipate changes in resource demands and adjust resources accordingly before issues arise. In contrast, reactive strategies respond to changes after they occur, which may lead to temporary performance degradation. Implementing proactive strategies can enhance the reliability and efficiency of LLM applications on edge devices.
- Load Balancing Techniques: Load balancing is a critical aspect of dynamic resource management, particularly in hybrid edge-cloud systems. By distributing workloads evenly across available resources, load balancing techniques can prevent bottlenecks and ensure that no single device is overwhelmed. This is essential for maintaining performance and responsiveness in applications that utilize LLMs, especially during peak usage times.
- Resource Prediction Models: Resource prediction models use machine learning techniques to forecast future resource requirements based on historical data and current usage patterns. By accurately predicting resource needs, systems can allocate resources more efficiently and reduce the likelihood of performance degradation due to resource shortages. This predictive capability is particularly valuable in dynamic environments where resource demands can fluctuate significantly. A minimal prediction-and-placement sketch follows this list.
- Challenges in Dynamic Resource Management: Despite the benefits, dynamic resource management faces several challenges, including:
- Heterogeneity of Devices: The diversity in device capabilities complicates the management of resources, as different devices may require different approaches to resource allocation.
- Network Variability: Fluctuations in network conditions can impact the performance of edge applications, making it difficult to maintain consistent resource management strategies.
- Security and Privacy Concerns: Dynamic resource management systems must also address security and privacy issues, particularly when handling sensitive data on edge devices.
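As a concrete (and deliberately simple) illustration of the prediction-driven allocation discussed above, the sketch below uses an exponential moving average to forecast device load and a fixed threshold to decide between local execution and offloading. The threshold, units, and policy are illustrative assumptions rather than recommendations.

```python
class EMAResourcePredictor:
    """Forecast near-term resource demand with an exponential moving average."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.estimate = None

    def update(self, observed_load: float) -> float:
        if self.estimate is None:
            self.estimate = observed_load
        else:
            self.estimate = self.alpha * observed_load + (1 - self.alpha) * self.estimate
        return self.estimate

def choose_placement(predicted_load: float, local_budget: float) -> str:
    """Proactive policy: offload before the device saturates (illustrative 80% threshold)."""
    return "local" if predicted_load < 0.8 * local_budget else "offload-to-cloud"

if __name__ == "__main__":
    predictor = EMAResourcePredictor()
    for load in [0.2, 0.3, 0.6, 0.9, 1.1]:          # observed utilization samples
        pred = predictor.update(load)
        print(f"predicted={pred:.2f} -> {choose_placement(pred, local_budget=1.0)}")
```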
5.6. Cross-Disciplinary Applications of Edge Intelligence
- Synchronization vs. Latency: In Bio-inspired Robotics, the challenge is not just the speed of inference but the synchronization with physical hardware (Hardware-in-the-Loop) [127]. While an LLM can afford to be asynchronous, a medical robotic sphincter must respond to tissue pressure changes with zero-latency precision.
- Reliability vs. Power: As substantiated by Bernardini et al. [73], civil engineering applications face a Power-Accuracy Paradox: the more complex the ML model used to filter environmental noise, the faster the edge node’s battery depletes, creating a hard limit on long-term autonomous monitoring.
- Safety vs. Complexity: In Aerospace systems, the complexity of the Graph-Language Model (BiDGCNLLM) must be balanced against the safety-critical need for low-latency state forecasting [9]. This mirrors the challenges in edge LLMs where aggressive quantization can reduce latency but may introduce hallucinations that are unacceptable in a flight-safety context.
6. Future Research Directions
6.1. Hardware-Software Co-Design
- Optimized Performance for AI Workloads: AI workloads, especially those involving LLMs, demand high computational power and efficiency. By collaborating closely, chip designers can create specialized hardware architectures tailored specifically for the computational patterns of AI algorithms. For instance, the development of Neural Processing Units (NPUs) has been driven by the need for architectures optimized for matrix multiplications and deep learning tasks, significantly enhancing performance compared to traditional CPUs and GPUs. This synergy allows for the development of hardware that maximizes the efficiency of AI models while ensuring that software can leverage these optimizations effectively.
- Energy Efficiency: Energy consumption is a critical concern for deploying AI models, particularly on battery-powered edge devices. Hardware-software co-design enables the creation of energy-efficient architectures that reduce power consumption during both training and inference phases. For example, specialized accelerators like Tensor Processing Units (TPUs) are designed to perform tensor operations with minimal energy usage. By collaborating with AI researchers, chip designers can identify the most energy-intensive operations and optimize hardware accordingly, leading to significant improvements in battery life and overall system efficiency.
- Adaptation to Evolving AI Techniques: The field of AI is rapidly evolving, with new algorithms and techniques emerging frequently. A close partnership between chip designers and AI researchers ensures that hardware can quickly adapt to these advancements. For instance, as AI models become more complex and require advanced features such as dynamic memory management or efficient data handling, hardware must be designed to accommodate these needs. The collaboration fosters a feedback loop where hardware capabilities can inform software design and vice versa, leading to more agile and responsive development processes.
- Addressing Scalability Challenges: As AI applications scale, the challenges associated with deploying models on various devices become more pronounced. Hardware-software co-design facilitates the development of scalable solutions that can efficiently handle increased workloads without compromising performance. By jointly exploring architectural innovations and algorithmic efficiencies, teams can create systems that scale effectively across different hardware platforms, ensuring consistent performance in diverse environments.
- Enhanced Security and Privacy: The deployment of AI models, particularly in sensitive applications, raises concerns about data privacy and security. A collaborative approach allows for the integration of security features directly into the hardware design, providing robust protection against potential vulnerabilities. For example, incorporating hardware-level encryption and secure processing units can help safeguard sensitive information while processing data on edge devices. This proactive strategy ensures that security measures are aligned with the operational requirements of AI applications.
- Facilitating Real-Time Processing: Many AI applications require real-time processing capabilities, which can be hindered by traditional hardware-software separations. By collaborating closely, chip designers and AI researchers can develop systems that minimize latency and enhance responsiveness. Techniques such as efficient key-value (KV) cache management and speculative decoding can be better optimized when hardware is designed with these specific AI processing requirements in mind. This co-design approach ensures that both hardware and software are aligned to meet the demands of real-time applications effectively.
6.2. New Architectures of Transformer Variants for Edge Constraints
6.3. On-Device Lifelong Learning
- Incremental Learning Approaches: Incremental learning allows models to update their knowledge base without retraining from scratch. This is particularly useful for adapting to new data while retaining previously learned information. Incremental learning techniques enable models to learn from new data instances as they become available, effectively adapting to changes in the environment or user preferences. This approach helps mitigate catastrophic forgetting, a common issue where new learning interferes with previously acquired knowledge. A minimal rehearsal-based sketch follows this list.
- Federated Learning: Federated learning enables multiple devices to collaboratively learn a shared model while keeping their data decentralized, enhancing privacy and security. In this paradigm, local models are trained on-device and only model updates are sent to a central server for aggregation. This allows for continuous learning from diverse data sources without compromising user privacy.
- Model Distillation: Model distillation is a technique where a smaller, more efficient model (student) is trained to replicate the behavior of a larger, more complex model (teacher), facilitating on-device learning. Distillation allows for the transfer of knowledge from a more complex model to a simpler one, enabling the latter to learn more efficiently and effectively on resource-constrained devices.
- Adaptive Learning Rates: Using adaptive learning rates allows models to adjust their learning speed based on the characteristics of the incoming data. Techniques such as AdaGrad, RMSprop, and Adam can help models converge more quickly and effectively, especially when learning from non-stationary data streams.
- Data Augmentation and Synthetic Data: Data augmentation techniques can enhance the diversity of training data, allowing models to generalize better and adapt to new scenarios. By artificially increasing the size and variability of training datasets through transformations (e.g., rotations, translations), models can learn to be more robust and adaptable to new data distributions.
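The following is a minimal, hedged sketch of rehearsal-based incremental learning: fresh on-device samples are mixed with a small reservoir of earlier examples to limit catastrophic forgetting. The model, buffer size, and batch shapes are illustrative placeholders rather than a prescription for edge LLM fine-tuning.

```python
import random
import torch

class ReplayBuffer:
    """Reservoir-style buffer of past examples used to limit catastrophic forgetting."""

    def __init__(self, capacity: int = 256):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:                                    # reservoir sampling keeps a uniform sample
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k: int):
        return random.sample(self.data, min(k, len(self.data)))

def incremental_step(model, optimizer, new_batch, buffer: ReplayBuffer, replay_k: int = 8):
    """One on-device update: train on fresh data mixed with replayed old data."""
    xs, ys = new_batch
    replay = buffer.sample(replay_k)
    if replay:
        rx, ry = zip(*replay)
        xs = torch.cat([xs, torch.stack(rx)])
        ys = torch.cat([ys, torch.stack(ry)])
    loss = torch.nn.functional.cross_entropy(model(xs), ys)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for x, y in zip(*new_batch):                 # remember the fresh samples for later replay
        buffer.add((x, y))
    return loss.item()

if __name__ == "__main__":
    model = torch.nn.Linear(10, 3)               # toy stand-in for an on-device model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    buf = ReplayBuffer()
    batch = (torch.randn(4, 10), torch.randint(0, 3, (4,)))
    print(incremental_step(model, opt, batch, buf))
```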
6.4. Standardized Benchmarks for Comprehensive Evaluation of Edge LLM Performance
- Consistency in Evaluation: Standardized benchmarks provide a consistent framework for evaluating the performance of edge LLMs across different devices, architectures, and applications. This consistency is crucial for researchers and practitioners to compare results and understand the capabilities of various models. Without standardized metrics, it becomes challenging to ascertain which models perform best under specific conditions or to identify the trade-offs involved in deploying LLMs on resource-constrained devices.
- Facilitating Comparisons: With the rapid development of new architectures and techniques, standardized benchmarks enable meaningful comparisons between different models and approaches. For instance, metrics such as latency, throughput, and accuracy can be uniformly assessed, allowing stakeholders to make informed decisions about which models to adopt for particular applications. This comparative analysis is vital for guiding the selection of models that best meet the performance requirements of edge computing environments.
- Identifying Performance Bottlenecks: Standardized benchmarks help in identifying performance bottlenecks in edge LLMs. By evaluating models against a common set of tasks and conditions, researchers can pinpoint specific areas that require optimization, such as memory usage, computational efficiency, or inference speed. This targeted analysis can drive advancements in model compression techniques, architectural innovations, and algorithmic optimizations that enhance edge LLM performance.
- Supporting Reproducibility: Reproducibility is a cornerstone of scientific research. Standardized benchmarks ensure that evaluations can be replicated across different studies and environments, fostering trust in the reported results. This is particularly important in the field of AI, where the complexity of models and variability in hardware can lead to inconsistent findings. By adhering to standardized evaluation protocols, researchers can contribute to a more reliable body of knowledge regarding edge LLM performance.
- Benchmarking Metrics for Edge LLMs: To provide insights useful for hardware-software co-design, benchmarks for edge LLMs must move beyond general accuracy scores toward metrics that capture the device’s physical limits and real-time interaction. Key indicators include Time-to-First-Token (TTFT), which gauges initial responsiveness, and Tokens-per-Second (TPS), which determines whether generation keeps pace with human reading speed. These rates should also be reported per Watt and interpreted against the device’s Thermal Design Power (TDP) to ensure the model does not cause excessive battery drain or thermal throttling. Finally, memory-efficiency measures such as KV-cache utilization are essential for confirming that architectural optimizations like PagedAttention make the most of the constrained VRAM available on edge hardware. A minimal TTFT/TPS measurement sketch follows.
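The sketch below shows one way TTFT and TPS could be measured around any streaming generation callable. `fake_generator` is a hypothetical stand-in for an on-device inference API, and TPS-per-Watt requires a platform-specific power sensor, so it is noted only as a comment.

```python
import time
from typing import Callable, Iterable

def benchmark_stream(generate_stream: Callable[[str], Iterable[str]], prompt: str) -> dict:
    """Measure Time-to-First-Token (TTFT) and decode Tokens-per-Second (TPS)
    around any callable that yields generated tokens one at a time."""
    t_start = time.perf_counter()
    t_first, n_tokens = None, 0
    for _ in generate_stream(prompt):
        n_tokens += 1
        if t_first is None:
            t_first = time.perf_counter()
    t_end = time.perf_counter()
    ttft = (t_first - t_start) if t_first is not None else float("nan")
    decode_time = t_end - (t_first if t_first is not None else t_start)
    tps = (n_tokens - 1) / decode_time if (n_tokens > 1 and decode_time > 0) else float("nan")
    # TPS-per-Watt would be tps / average_power_watts, with power sampled from a
    # platform-specific sensor (battery fuel gauge, on-board rail monitor, etc.).
    return {"ttft_s": ttft, "decode_tps": tps}

def fake_generator(prompt: str):
    """Hypothetical stand-in for an on-device LLM streaming API."""
    time.sleep(0.05)                      # simulated prefill latency
    for tok in ["Edge", " LLMs", " stream", " tokens", "."]:
        time.sleep(0.01)                  # simulated per-token decode latency
        yield tok

if __name__ == "__main__":
    print(benchmark_stream(fake_generator, "Hello"))
```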
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kamath, U.; Keenan, K.; Somers, G.; Sorenson, S. Large Language Models: A Deep Dive; Springer Nature: Cham, Switzerland, 2024; Volume 10. [Google Scholar]
- Dwivedi, Y.K.; Hughes, L.; Ismagilova, E.; Aarts, G.; Coombs, C.; Crick, T.; Duan, Y.; Dwivedi, R.; Edwards, J.; Eirug, A.; et al. Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy. Int. J. Inf. Manag. 2021, 57, 101994. [Google Scholar] [CrossRef]
- Maity, K.; Chaulwar, A.T.; Vala, V.; Guntur, R.S. NanoBERT: An extremely compact language model. In Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD), Bangalore, India, 4–7 January 2024; pp. 342–349. [Google Scholar]
- Nikdast, M.; Afifi, S.; Pasricha, S. Shedding Light on LLMs: Harnessing Photonic Neural Networks for Accelerating LLMs. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, Newark, NJ, USA, 27–31 October 2024; pp. 1–8. [Google Scholar]
- Thanasi-Boçe, M.; Hoxha, J. From ideas to ventures: Building entrepreneurship knowledge with LLM, prompt engineering, and conversational agents. Educ. Inf. Technol. 2024, 29, 24309–24365. [Google Scholar] [CrossRef]
- Vishwas, B.V.K.; Macharla, S.R. Time Series Forecasting Using Generative AI: Leveraging AI for Precision Forecasting; Springer: Berlin/Heidelberg, Germany, 2025. [Google Scholar]
- Peng, D.; Zheng, L.; Liu, D.; Han, C.; Wang, X.; Yang, Y.; Song, L.; Zhao, M.; Wei, Y.; Li, J.; et al. Large-language models facilitate discovery of the molecular signatures regulating sleep and activity. Nat. Commun. 2024, 15, 3685. [Google Scholar] [CrossRef]
- Géza, G.; Varga, B. Method and Management Node in a Communication Network, for Supporting Management of Network Nodes Based on LLDP Messages. U.S. Patent 11,431,728, 30 August 2022. [Google Scholar]
- Wen, Z.; Zhao, J.; Zhang, A.; Bi, W.; Kuang, B.; Su, Y.; Wang, R. BiDGCNLLM: A Graph–Language Model for Drone State Forecasting and Separation in Urban Air Mobility Using Digital Twin-Augmented Remote ID Data. Drones 2025, 9, 508. [Google Scholar] [CrossRef]
- Zhang, B.; Zhang, J.; Hou, J.; Wang, Y. TensAllo: Adaptive Deployment of LLMs on Resource-Constrained Heterogeneous Edge Devices. In Proceedings of the IEEE INFOCOM 2025-IEEE Conference on Computer Communications, London, UK, 19–22 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–10. [Google Scholar]
- Hu, B.; Zhao, C.; Zhang, P.; Zhou, Z.; Yang, Y.; Xu, Z.; Liu, B. Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach. arXiv 2023, arXiv:2306.03604. [Google Scholar] [CrossRef]
- Qualcomm Technologies. Qualcomm AI Hub: On-Device LLM Benchmarks for Snapdragon 8 Gen 3. 2023. Available online: https://aihub.qualcomm.com/ (accessed on 30 November 2025).
- NVIDIA Corporation. Accelerating LLMs on the Edge with TensorRT-LLM and Jetson Orin; Technical Report, NVIDIA Technical Reports; NVIDIA Corporation: Santa Clara, CA, USA, 2024. [Google Scholar]
- Joshi, P.; Hasanuzzaman, M.; Thapa, C.; Afli, H.; Scully, T. Enabling all in-edge deep learning: A literature review. IEEE Access 2023, 11, 3431–3460. [Google Scholar] [CrossRef]
- Sinh, V.T.; Minh, N.L. A study on self-attention mechanism for AMR-to-text generation. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, Salford, UK, 26–28 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 321–328. [Google Scholar]
- Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
- Patil, R.; Gudivada, V. A review of current trends, techniques, and challenges in large language models (llms). Appl. Sci. 2024, 14, 2074. [Google Scholar] [CrossRef]
- Li, R.; Fu, D.; Shi, C.; Huang, Z.; Lu, G. Efficient LLMs training and inference: An introduction. IEEE Access 2024, 13, 32944–32970. [Google Scholar] [CrossRef]
- Zhang, X.; Nie, J.; Huang, Y.; Xie, G.; Xiong, Z.; Liu, J.; Niyato, D.; Shen, X.S. Beyond the cloud: Edge inference for generative large language models in wireless networks. IEEE Trans. Wirel. Commun. 2024, 24, 643–658. [Google Scholar] [CrossRef]
- Zhang, M.; Shen, X.; Cao, J.; Cui, Z.; Jiang, S. Edgeshard: Efficient llm inference via collaborative edge computing. IEEE Internet Things J. 2024, 12, 13119–13131. [Google Scholar] [CrossRef]
- Ong, J.C.L.; Chang, S.Y.H.; William, W.; Butte, A.J.; Shah, N.H.; Chew, L.S.T.; Liu, N.; Doshi-Velez, F.; Lu, W.; Savulescu, J.; et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit. Health 2024, 6, e428–e432. [Google Scholar] [CrossRef]
- Annepaka, Y.; Pakray, P. Large language models: A survey of their development, capabilities, and applications. Knowl. Inf. Syst. 2025, 67, 2967–3022. [Google Scholar] [CrossRef]
- Kumar, P. Large language models (LLMs): Survey, technical frameworks, and future challenges. Artif. Intell. Rev. 2024, 57, 260. [Google Scholar] [CrossRef]
- Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.P.; Bing, L.; Xu, X.; Poria, S.; Lee, R.K.W. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv 2023, arXiv:2304.01933. [Google Scholar] [CrossRef]
- Pragst, L.; Rach, N.; Minker, W.; Ultes, S. On the vector representation of utterances in dialogue context. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Kopru, S.; Liu, M.; SAWAF, H. Vector Representation of Descriptions and Queries. U.S. Patent 15/192,323, 28 December 2017. [Google Scholar]
- Kouretas, I.; Paliouras, V. Hardware implementation of a softmax-like function for deep learning. Technologies 2020, 8, 46. [Google Scholar] [CrossRef]
- Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. A simple and light-weight attention module for convolutional neural networks. Int. J. Comput. Vis. 2020, 128, 783–798. [Google Scholar] [CrossRef]
- Kobayashi, G.; Kuribayashi, T.; Yokoi, S.; Inui, K. Attention is not only a weight: Analyzing transformers with vector norms. arXiv 2020, arXiv:2004.10102. [Google Scholar] [CrossRef]
- Essam, M.; Eldawlatly, S.; Abbas, H. Contextualized Word Representations for Self-Attention Network. In Proceedings of the 2018 13th International Conference on Computer Engineering and Systems (ICCES), Cairo, Egypt, 18–19 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 116–121. [Google Scholar]
- Wu, E.; Liu, X.; Chen, Y.; Zhang, T. A Self-Attention Based Joint Sequence Labeling Model. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 24–26 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 784–787. [Google Scholar]
- Zheng, Z.; Huang, S.; Weng, R.; Dai, X.Y.; Chen, J. Improving self-attention networks with sequential relations. IEEE/ACM Trans. Audio Speech, Lang. Process. 2020, 28, 1707–1716. [Google Scholar] [CrossRef]
- Lee, S.; Bakker, C.R.; Vitzthum, C.; Alver, B.H.; Park, P.J. Pairs and Pairix: A file format and a tool for efficient storage and retrieval for Hi-C read pairs. Bioinformatics 2022, 38, 1729–1731. [Google Scholar] [CrossRef] [PubMed]
- Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
- Qu, G.; Chen, Q.; Wei, W.; Lin, Z.; Chen, X.; Huang, K. Mobile edge intelligence for large language models: A contemporary survey. IEEE Commun. Surv. & Tutor. 2025, 27, 3820–3860. [Google Scholar]
- Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W. Optimized compression for implementing convolutional neural networks on FPGA. Electronics 2019, 8, 295. [Google Scholar] [CrossRef]
- Kibriya, H.; Khan, W.Z.; Siddiqa, A.; Khan, M.K. Privacy issues in large language models: A survey. Comput. Electr. Eng. 2024, 120, 109698. [Google Scholar] [CrossRef]
- Huang, W.; Deng, X. Real-time tracking railway intruders using multiple-agent cooperated large language models with edge stream processing engine. J. Netw. Comput. Appl. 2025, 242, 104231. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Hasan, J. Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques. arXiv 2024, arXiv:2411.06084. [Google Scholar] [CrossRef]
- Kodali, R.K.; Upreti, Y.P.; Boppana, L. A quantization approach for the reduced size of large language models. In Proceedings of the 2024 16th International Conference on Knowledge and Smart Technology (KST), Krabi, Thailand, 28 February–2 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 144–148. [Google Scholar]
- Pandey, N.P.; Nagel, M.; van Baalen, M.; Huang, Y.; Patel, C.; Blankevoort, T. A practical mixed precision algorithm for post-training quantization. arXiv 2023, arXiv:2302.05397. [Google Scholar] [CrossRef]
- Yu, C.; Yang, S.; Zhang, F.; Ma, H.; Wang, A.; Li, E.P. Improving quantization-aware training of low-precision network via block replacement on full-precision counterpart. arXiv 2024, arXiv:2412.15846. [Google Scholar] [CrossRef]
- Chu, T.; Luo, Q.; Yang, J.; Huang, X. Mixed-precision quantized neural networks with progressively decreasing bitwidth. Pattern Recognit. 2021, 111, 107647. [Google Scholar] [CrossRef]
- Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv 2022, arXiv:2210.17323. [Google Scholar] [CrossRef]
- Lin, J.; Tang, J.; Tang, H.; Yang, S.; Xiao, G.; Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. GetMobile Mob. Comput. Commun. 2025, 28, 12–17. [Google Scholar] [CrossRef]
- Vadera, S.; Ameen, S. Methods for pruning deep neural networks. IEEE Access 2022, 10, 63280–63300. [Google Scholar] [CrossRef]
- Xia, H.; Zheng, Z.; Li, Y.; Zhuang, D.; Zhou, Z.; Qiu, X.; Li, Y.; Lin, W.; Song, S.L. Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv 2023, arXiv:2309.10285. [Google Scholar] [CrossRef]
- Cheng, H.; Zhang, M.; Shi, J.Q. MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models. arXiv 2024, arXiv:2407.11681. [Google Scholar] [CrossRef]
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149. [Google Scholar] [CrossRef]
- Gale, T.; Elsen, E.; Hooker, S. The state of sparsity in deep neural networks. arXiv 2019, arXiv:1902.09574. [Google Scholar] [CrossRef]
- Alkhulaifi, A.; Alsahli, F.; Ahmad, I. Knowledge distillation in deep learning and its applications. PeerJ Comput. Sci. 2021, 7, e474. [Google Scholar] [CrossRef] [PubMed]
- Thrivikram, G.; Ganesh, V.; Sethuraman, T.; Perepu, S.K. Efficient knowledge distillation of teacher model to multiple student models. In Proceedings of the 2021 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), Bandung, Indonesia, 27–28 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 173–179. [Google Scholar]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
- Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for bert model compression. arXiv 2019, arXiv:1908.09355. [Google Scholar] [CrossRef]
- Fang, L.; Yu, X.; Cai, J.; Chen, Y.; Wu, S.; Liu, Z.; Yang, Z.; Lu, H.; Gong, X.; Liu, Y.; et al. Knowledge distillation and dataset distillation of large language models: Emerging trends, challenges, and future directions. arXiv 2025, arXiv:2504.14772. [Google Scholar] [CrossRef]
- Tang, J.; Shivanna, R.; Zhao, Z.; Lin, D.; Singh, A.; Chi, E.H.; Jain, S. Understanding and improving knowledge distillation. arXiv 2020, arXiv:2002.03532. [Google Scholar] [CrossRef]
- Cho, J.H.; Hariharan, B. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4794–4802. [Google Scholar]
- Liu, D. Contemporary model compression on large language models inference. arXiv 2024, arXiv:2409.01990. [Google Scholar] [CrossRef]
- Zhuang, B.; Liu, J.; Pan, Z.; He, H.; Weng, Y.; Shen, C. A survey on efficient training of transformers. arXiv 2023, arXiv:2302.01107. [Google Scholar] [CrossRef]
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar] [CrossRef]
- Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A survey on mixture of experts in large language models. IEEE Trans. Knowl. Data Eng. 2025, 37, 3896–3915. [Google Scholar]
- Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar] [CrossRef]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
- Chinnakonduru, S.S.; Mohapatra, A. Weighted grouped query attention in transformers. arXiv 2024, arXiv:2407.10855. [Google Scholar] [CrossRef]
- Fu, Z.; Song, W.; Wang, Y.; Wu, X.; Zheng, Y.; Zhang, Y.; Xu, D.; Wei, X.; Xu, T.; Zhao, X. Sliding Window Attention Training for Efficient Large Language Models. arXiv 2025, arXiv:2502.18845. [Google Scholar] [CrossRef]
- Dhar, N.; Deng, B.; Lo, D.; Wu, X.; Zhao, L.; Suo, K. An empirical analysis and resource footprint study of deploying large language models on edge devices. In Proceedings of the 2024 ACM Southeast Conference, Marietta, GA, USA, 18–20 April 2024; pp. 69–76. [Google Scholar]
- Barad, H.; Aidova, E.; Gorbachev, Y. Leveraging speculative sampling and kv-cache optimizations together for generative ai using openvino. arXiv 2023, arXiv:2311.04951. [Google Scholar] [CrossRef]
- Spector, B.; Re, C. Accelerating llm inference with staged speculative decoding. arXiv 2023, arXiv:2308.04623. [Google Scholar] [CrossRef]
- Shi, L.; Zhang, H.; Yao, Y.; Li, Z.; Zhao, H. Keep the cost down: A review on methods to optimize LLM’s KV-cache consumption. arXiv 2024, arXiv:2407.18003. [Google Scholar] [CrossRef]
- Joshi, T.; Saini, H.; Dhillon, N.; i Martin, K.V.; Maghraoui, K.E. Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference. arXiv 2025, arXiv:2506.07311. [Google Scholar] [CrossRef]
- Bernardini, L.; Bono, F.M.; Collina, A. Drive-by damage detection based on the use of CWT and sparse autoencoder applied to steel truss railway bridge. Adv. Mech. Eng. 2025, 17, 16878132251339857. [Google Scholar] [CrossRef]
- Bhardwaj, S.; Singh, P.; Pandit, M.K. A survey on the integration and optimization of large language models in edge computing environments. In Proceedings of the 2024 16th International Conference on Computer and Automation Engineering (ICCAE), Melbourne, Australia, 14–16 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 168–172. [Google Scholar]
- Jin, H.; Wu, Y. Ce-collm: Efficient and adaptive large language models through cloud-edge collaboration. arXiv 2024, arXiv:2411.02829. [Google Scholar] [CrossRef]
- Hao, Z.; Jiang, H.; Jiang, S.; Ren, J.; Cao, T. Hybrid slm and llm for edge-cloud collaborative inference. In Proceedings of the Workshop on Edge and Mobile Foundation Models, Tokyo, Minato-ku, Japan, 3–7 June 2024; pp. 36–41. [Google Scholar]
- Ji, C.; Hou, P.; Yu, J.; Wu, Y.; Tai, Y. Novel Adaptive DNN Partitioning Method Based on Image-Stream Pipeline Inference between the Edge and Cloud. In Proceedings of the 2022 3rd International Conference on Computing, Networks and Internet of Things (CNIOT), Qingdao, China, 20–22 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 75–82. [Google Scholar]
- Al Maruf, M.; Azim, A. Optimizing DNNs Model Partitioning for Enhanced Performance on Edge Devices. In Proceedings of the Canadian AI, Montreal, QC, Canada, 5–9 June 2023. [Google Scholar]
- Moon, S.; Kim, J.H.; Kim, J.; Hong, S.; Cha, J.; Kim, M.; Lim, S.; Choi, G.; Seo, D.; Kim, J.; et al. Lpu: A latency-optimized and highly scalable processor for large language model inference. IEEE Micro 2024, 44, 17–33. [Google Scholar] [CrossRef]
- Wang, B.; Wang, C.; Huang, W.; Song, Y.; Qin, X. A survey and taxonomy on task offloading for edge-cloud computing. IEEE Access 2020, 8, 186080–186101. [Google Scholar] [CrossRef]
- Shen, S.; Zhu, T.; Wu, D.; Wang, W.; Zhou, W. From distributed machine learning to federated learning: In the view of data privacy and security. Concurr. Comput. Pract. Exp. 2022, 34, e6002. [Google Scholar] [CrossRef]
- Wagner, N.; Fan, D.; Jaggi, M. Personalized collaborative fine-tuning for on-device large language models. arXiv 2024, arXiv:2404.09753. [Google Scholar] [CrossRef]
- Srihith, I.D.; Donald, A.D.; Srinivas, T.A.S.; Thippanna, G.; Anjali, D. Empowering Privacy-Preserving Machine Learning: A Comprehensive Survey on Federated Learning. Int. J. Adv. Res. Sci. Commun. Technol. 2023, 3, 133–144. [Google Scholar] [CrossRef]
- Chandrasekaran, S.; Athinarayanan, S.; Masthan, M.; Kakkar, A.; Bhatnagar, P.; Samad, A. Edge Intelligence Paradigm Shift on Optimizing the Edge Intelligence Using Artificial Intelligence State-of-the-Art Models. In Advancing Intelligent Networks Through Distributed Optimization; IGI Global: Hershey, PA, USA, 2024; pp. 1–18. [Google Scholar]
- Röbert, K.; Bornholdt, H.; Fischer, M.; Edinger, J. Latency-aware scheduling for real-time application support in edge computing. In Proceedings of the 6th International Workshop on Edge Systems, Analytics and Networking, Rome, Italy, 8 May 2023; pp. 13–18. [Google Scholar]
- Peng, D.; Fu, Z.; Wang, J. Pocketllm: Enabling on-device fine-tuning for personalized llms. arXiv 2024, arXiv:2407.01031. [Google Scholar] [CrossRef]
- Hayou, S.; Ghosh, N.; Yu, B. LoRA+: Efficient Low Rank Adaptation of Large Models. arXiv 2024, arXiv:2402.12354. [Google Scholar] [CrossRef]
- Rücklé, A.; Geigle, G.; Glockner, M.; Beck, T.; Pfeiffer, J.; Reimers, N.; Gurevych, I. AdapterDrop: On the Efficiency of Adapters in Transformers. arXiv 2020, arXiv:2010.11918. [Google Scholar] [CrossRef]
- Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar] [CrossRef]
- Zaken, E.B.; Ravfogel, S.; Goldberg, Y. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. arXiv 2021, arXiv:2106.10199. [Google Scholar] [CrossRef]
- Hu, K.; Li, Y.; Xia, M.; Wu, J.; Lu, M.; Zhang, S.; Weng, L. Federated Learning: A Distributed Shared Machine Learning Method. Complexity 2021, 2021, 8261663. [Google Scholar] [CrossRef]
- Yu, S.; Muñoz, J.P.; Jannesari, A. Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models. arXiv 2023, arXiv:2305.11414. [Google Scholar] [CrossRef]
- Kaur, A.; Kaushal, C.; Hassan, M.M.; Aung, S.T. Federated Deep Learning for Healthcare: A Practical Guide with Challenges and Opportunities; CRC Press: Boca Raton, FL, USA, 2024. [Google Scholar] [CrossRef]
- Prabhu, O.S.; Gupta, P.K.; Shashank, P.; Chandrasekaran, K.; Usha, D. Towards a Federated Learning Approach for NLP Applications. In Applications of Artificial Intelligence and Machine Learning; Springer: Singapore, 2021; pp. 157–167. [Google Scholar] [CrossRef]
- Dasaradharami Reddy, K.; S, A. Security and privacy in federated learning: A survey. Trends Comput. Sci. Inf. Technol. 2023, 8, 29–37. [Google Scholar] [CrossRef]
- Agarwal, A.; Rezagholizadeh, M.; Parthasarathi, P. Practical Takes on Federated Learning with Pretrained Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EACL, Dubrovnik, Croatia, 2–6 May 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 454–471. [Google Scholar] [CrossRef]
- Almanifi, O.R.A.; Chow, C.O.; Tham, M.L.; Chuah, J.H.; Kanesan, J. Communication and computation efficiency in Federated Learning: A survey. Internet Things 2023, 22, 100742. [Google Scholar] [CrossRef]
- Malan, E.; Peluso, V.; Calimera, A.; Macii, E. Communication-Efficient Federated Learning with Gradual Layer Freezing. IEEE Embed. Syst. Lett. 2023, 15, 25–28. [Google Scholar] [CrossRef]
- Gao, D.; Yao, X.; Yang, Q. A Survey on Heterogeneous Federated Learning. arXiv 2022, arXiv:2210.04505. [Google Scholar] [CrossRef]
- Qin, Z.; Chen, D.; Qian, B.; Ding, B.; Li, Y.; Deng, S. Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes. arXiv 2023, arXiv:2312.06353. [Google Scholar] [CrossRef]
- Zhang, J.; Zhu, H.; Wang, F.; Zhao, J.; Xu, Q.; Li, H. Security and Privacy Threats to Federated Learning: Issues, Methods, and Challenges. Secur. Commun. Netw. 2022, 2022, 1–24. [Google Scholar] [CrossRef]
- Kachris, C. A survey on hardware accelerators for large language models. Appl. Sci. 2025, 15, 586. [Google Scholar] [CrossRef]
- Kimm, H.; Paik, I.; Kimm, H. Performance comparision of tpu, gpu, cpu on google colaboratory over distributed deep learning. In Proceedings of the 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, Singapore, 20–23 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 312–319. [Google Scholar]
- He, W. The promise of training deep neural networks on CPUs: A survey. J. Phys. Conf. Ser. 2023, 2649, 012017. [Google Scholar] [CrossRef]
- Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 14–19 November 2021; pp. 1–15. [Google Scholar]
- Tan, T.; Cao, G. Deep learning on mobile devices with neural processing units. Computer 2023, 56, 48–57. [Google Scholar] [CrossRef]
- Babu, P.; Parthasarathy, E. Reconfigurable FPGA architectures: A survey and applications. J. Inst. Eng. Ser. B 2021, 102, 143–156. [Google Scholar] [CrossRef]
- Google Coral. Coral: Efficient On-Device AI with Edge TPU. 2024. Available online: https://coral.ai/ (accessed on 25 October 2024).
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
- Wan, Z.; Liu, C.K.; Yang, H.; Raj, R.; Li, C.; You, H.; Fu, Y.; Wan, C.; Li, S.; Kim, Y.; et al. Towards efficient neuro-symbolic ai: From workload characterization to hardware architecture. IEEE Trans. Circuits Syst. Artif. Intell. 2024, 1, 53–68. [Google Scholar] [CrossRef]
- Han, M.; Sun, X.; Wang, X.; Zhan, W.; Chen, X. Transformer-based Distributed Task Offloading and Resource Management in Cloud-Edge Computing Networks. IEEE J. Sel. Areas Commun. 2025, 43, 2938–2953. [Google Scholar] [CrossRef]
- Yang, X.; Su, T. Efa-trans: An efficient and flexible acceleration architecture for transformers. Electronics 2022, 11, 3550. [Google Scholar] [CrossRef]
- Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12. [Google Scholar]
- Tanurhan, Y.; Paulin, P.; Michiels, T. Generative AI on a Budget: Processing Transformer-based Neural Networks at the Edge. In Proceedings of the 2023 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 9–13 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4. [Google Scholar]
- Surianarayanan, C.; Lawrence, J.J.; Chelliah, P.R.; Prakash, E.; Hewage, C. A survey on optimization techniques for edge artificial intelligence (AI). Sensors 2023, 23, 1279. [Google Scholar] [CrossRef]
- Orăşan, I.L.; Seiculescu, C.; Caleanu, C.D. Benchmarking tensorflow lite quantization algorithms for deep neural networks. In Proceedings of the 2022 IEEE 16th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, 25–28 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 000221–000226. [Google Scholar]
- Lin, W.F.; Tsai, D.Y.; Tang, L.; Hsieh, C.T.; Chou, C.Y.; Chang, P.H.; Hsu, L. Onnc: A compilation framework connecting onnx to proprietary deep learning accelerators. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 18–20 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 214–218. [Google Scholar]
- Ray, P.P.; Pradhan, M.P. Llmedge: A novel framework for localized llm inferencing at resource constrained edge. In Proceedings of the 2024 International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS), Bengaluru, India, 17–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8. [Google Scholar]
- Bevenius, D.; Gerganov, G.; Devesa, D. GGUF. 2025. Available online: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md (accessed on 25 October 2024).
- Turboderp. ExLlamaV2. 2025. Available online: https://github.com/turboderp-org/exllamav2 (accessed on 25 October 2024).
- Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv 2022, arXiv:2205.02302. [Google Scholar] [CrossRef]
- Xia, Y.; Zhang, J.; Jazdi, N.; Weyrich, M. Incorporating large language models into production systems for enhanced task automation and flexibility. arXiv 2024, arXiv:2407.08550. [Google Scholar] [CrossRef]
- Ngo, D.; Park, H.C.; Kang, B. Edge Intelligence: A Review of Deep Neural Network Inference in Resource-Limited Environments. Electronics 2025, 14, 2495. [Google Scholar] [CrossRef]
- Liu, Y.; Chen, C.; Zhang, R.; Qin, T.; Ji, X.; Lin, H.; Yang, M. Enhancing the interoperability between deep learning frameworks by model conversion. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Sacramento, CA, USA, 8–13 November 2020; pp. 1320–1330. [Google Scholar]
- Bodor, A.; Hnida, M.; Najima, D. From development to deployment: An approach to MLOps monitoring for machine learning model operationalization. In Proceedings of the 2023 14th International Conference on Intelligent Systems: Theories and Applications (SITA), Mohammedia, Morocco, 19–20 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7. [Google Scholar]
- Celik, A.; Mahmoud, Q.H. A Review of Large Language Models for Automated Test Case Generation. Mach. Learn. Knowl. Extr. 2025, 7, 97. [Google Scholar] [CrossRef]
- Mao, Z.; Suzuki, S.; Nabae, H.; Miyagawa, S.; Suzumori, K.; Maeda, S. Machine learning-enhanced soft robotic system inspired by rectal functions to investigate fecal incontinence. Nat. Commun. 2024, 15, 482–494. [Google Scholar] [CrossRef]



| Criteria | Inclusion Rules | Exclusion Rules |
|---|---|---|
| Timeframe | Primarily 2017–2025, focusing on the post-Transformer era. | Research published prior to 2017 (pre-Transformer architecture). |
| Topic Relevance | Must explicitly discuss Transformer-based models or edge hardware constraints. | General AI surveys that do not address LLM-specific bottlenecks. |
| Technical Depth | Papers providing quantitative data, novel architectures, or system-level trade-offs. | Short abstracts, non-peer-reviewed blog posts, or purely marketing materials. |
| Language | Documents published in English. | Non-English publications. |
| Quantization Method | Overview | Impact on Model Size, Accuracy, and Challenges |
|---|---|---|
| Post-Training Quantization (PTQ) | PTQ quantizes a pre-trained model without additional training. It’s simple and fast, converting from FP32 to lower-bit formats (e.g., INT8) [42]. | Model Size Reduction: Reduces model size by up to 75% from FP32 to INT8. Accuracy Trade-offs: Can cause slight degradation, especially in layers with high weight variance. Challenges: Layer sensitivity and calibration requirements. |
| Quantization-Aware Training (QAT) | QAT integrates quantization into the training process, using simulated quantization effects. This allows the model to learn to compensate for reduced precision [43]. | Model Size Reduction: Similar to PTQ, QAT provides significant size reductions. Accuracy Preservation: Often leads to better accuracy than PTQ as the model adapts to quantization during training. Challenges: Can be complex for Transformers and other models sensitive to weight perturbations. |
| Mixed-Precision Quantization | This method uses different precision levels for various parts of the model (e.g., critical layers use FP16 while others use INT8) [44]. | Model Size Reduction: Provides substantial reductions while maintaining accuracy. Accuracy Maintenance: Preserves accuracy in sensitive areas. Challenges: Complex to implement, as it requires careful consideration of which layers to quantize. Performance can be inconsistent. |
| GPTQ (Post-Training Quantization for Generative Pre-trained Transformers) | GPTQ quantizes weights layer-by-layer, using second-order information (an approximate Hessian) to adjust the remaining unquantized weights in a layer and compensate for the error introduced by quantizing the others [45]. | Model Size Reduction: Moving from FP16 (2 bytes per parameter) to 4-bit (0.5 bytes per parameter) reduces the memory required to store weights by 75%. Accuracy Maintenance: The error-compensating update avoids most of the performance drop caused by naive rounding. Challenges: Requires a small calibration dataset and a compute-intensive quantization pass, and low-bit inference depends on kernels that dequantize weights during matrix multiplication. |
| Activation-aware Weight Quantization (AWQ) | AWQ identifies critical weights by observing activation magnitudes during a calibration phase. It then scales these salient weights to protect their precision before quantizing, without needing to keep them in mixed precision [46]. | Model Size Reduction: By converting weights from 16-bit to 4-bit, AWQ reduces the model size by 75%. Accuracy Maintenance: AWQ assumes not all weights are equally important and protects roughly the most critical 1% of them, preserving accuracy without a mixed-precision format. Challenges: AWQ requires specialized CUDA kernels that handle the dequantization (4-bit back to 16-bit) during the matrix multiplication. |
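To ground the PTQ row above, the following minimal sketch (a toy illustration on a synthetic weight matrix, not the procedure of any particular toolchain) applies symmetric per-tensor INT8 post-training quantization and reports the storage saving and round-trip error that motivate the 75% figure.

```python
import numpy as np

def quantize_int8_symmetric(weights: np.ndarray):
    """Symmetric per-tensor PTQ: map FP32 weights to INT8 with a single scale factor."""
    scale = np.max(np.abs(weights)) / 127.0                      # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values for computation or error analysis."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # synthetic weight matrix

q, scale = quantize_int8_symmetric(w)
w_hat = dequantize(q, scale)

print(f"FP32: {w.nbytes / 1e6:.1f} MB, INT8: {q.nbytes / 1e6:.1f} MB (75% smaller)")
print(f"Mean absolute quantization error: {np.mean(np.abs(w - w_hat)):.6f}")
```

Real PTQ pipelines additionally calibrate activation ranges per layer, which is where the layer-sensitivity and calibration challenges noted in the table arise.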
| Pruning Category | Overview | Advantages and Challenges |
|---|---|---|
| Unstructured Pruning | Involves removing individual weights from a neural network, often based on their magnitude. This is a fine-grained approach that retains the overall model architecture [48]. | Advantages: High compression rates and performance maintenance by selectively removing less significant weights. Challenges: The resulting sparse matrix can be inefficient for standard hardware. Implementation is complex and may require fine-tuning. |
| Structured Pruning | Involves removing entire structures (e.g., neurons, channels, or layers) from the model. This results in a more regular and compact architecture [49]. | Advantages: Leads to more efficient inference and is simpler to implement on existing hardware due to a dense representation. Challenges: Can result in a more significant drop in performance compared to unstructured pruning. Offers less flexibility in compression. |
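The contrast between the two pruning categories can be seen in a short sketch; the 50% ratio and the synthetic matrix below are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(512, 512)).astype(np.float32)

# Unstructured: zero out the 50% of individual weights with the smallest magnitude.
# The shape is unchanged, so speedups require sparse-aware kernels or hardware.
threshold = np.quantile(np.abs(w), 0.5)
w_unstructured = np.where(np.abs(w) < threshold, 0.0, w)

# Structured: drop the 50% of output neurons (rows) with the smallest L2 norm,
# yielding a smaller dense matrix that standard hardware exploits directly.
row_norms = np.linalg.norm(w, axis=1)
keep = np.sort(np.argsort(row_norms)[len(row_norms) // 2:])
w_structured = w[keep, :]

print(f"Unstructured: {np.mean(w_unstructured == 0):.0%} sparsity, shape unchanged {w_unstructured.shape}")
print(f"Structured: dense matrix of shape {w_structured.shape}")
```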
| Transformer Variant | Overview | Key Features |
|---|---|---|
| Linformer | Reduces the self-attention complexity from O(n²) to O(n) in sequence length n by projecting token embeddings into a lower-dimensional space [61]. | Linear Complexity: Achieves linear complexity with respect to sequence length. Performance: Maintains competitive performance on various NLP tasks. |
| Performer | Uses a kernelized attention mechanism to approximate attention scores, enabling linear time complexity and significant speed improvements [62]. | Kernelized Attention: Leverages kernel methods. Scalability: Can handle longer sequences efficiently, suitable for large datasets. |
| Mixture of Experts (MoE) | A sparsely activated set of experts increases model capacity without a proportional increase in computational cost. Only a subset of experts is active for each input [63]. | Sparse Activation: A few experts are activated. Improved Performance: Demonstrates significant improvements on various benchmarks, especially in language modeling tasks. |
| Reformer | Reduces memory footprint by using locality-sensitive hashing (LSH) for attention and reversible layers [64]. | LSH Attention: Uses LSH to significantly reduce complexity. Reversible Layers: Allows for memory savings during training as activations are not stored. |
| Longformer | Designed for long document processing, it uses a combination of local and global attention mechanisms [65]. | Local and Global Attention: Uses a sliding window for local attention and incorporates global tokens. Efficiency: Efficient for processing documents much longer than standard Transformers. |
| Grouped Query Attention (GQA) | An optimization technique that serves as a middle ground between Multi-Head Attention (MHA) and Multi-Query Attention (MQA) [66]. | Reduced KV Cache: Significantly shrinks the size of the KV cache, allowing the model to handle much longer contexts on the same hardware. Interpolated Quality: Maintains near-MHA levels of performance while being nearly as fast as MQA. |
| Sliding Window Attention (SWA) | A sparse attention mechanism that limits the attention span of each token to a fixed-size local context rather than the entire preceding sequence [67]. | Linear Scaling: Reduces computational complexity to scale with the fixed window size, making the cost of processing each new token constant regardless of total sequence length. Memory Efficiency: Uses a rolling buffer that keeps memory usage constant per layer, enabling the processing of massive documents on consumer-grade GPUs. |
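To illustrate how GQA trades KV-cache size against quality, the toy sketch below (single batch, random weights, causal masking omitted for brevity; all shapes are synthetic) shares each key/value head among a group of query heads; with 8 query heads and 2 KV heads, the KV cache shrinks by 4x.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
    """Toy GQA: n_q_heads query heads share n_kv_heads key/value heads."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads                       # query heads per shared KV head

    q = (x @ wq).reshape(seq, n_q_heads, d_head)          # (seq, 8, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)         # (seq, 2, d_head): 4x smaller KV cache
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)

    out = np.zeros_like(q)
    for h in range(n_q_heads):
        kv = h // group                                   # the shared KV head for this query head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, seq = 64, 16
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(d_model, d_model))
wk = rng.normal(size=(d_model, d_model // 4))             # KV projection is 4x narrower than Q
wv = rng.normal(size=(d_model, d_model // 4))
print(grouped_query_attention(x, wq, wk, wv).shape)       # (16, 64)
```

Setting n_kv_heads equal to n_q_heads recovers MHA, while a single KV head recovers MQA, which is why GQA is described as an interpolation between the two.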
| Inference Optimization | Overview | Key Features and Challenges |
|---|---|---|
| Speculative Decoding | An inference optimization technique that reduces generation latency by letting a small draft model propose several candidate tokens, which the large target model then verifies in a single parallel pass [70]. | Key Features: Achieves parallel verification and reduced latency, making it ideal for real-time applications. Challenges: Increased computational resource consumption, requiring a balance between speed and efficiency. |
| Efficient Key-Value (KV) Cache Management | Optimizes inference in LLMs, particularly during autoregressive generation tasks, by storing and reusing key-value pairs from previously generated tokens [71]. | Key Features: The cache mechanism allows the model to access relevant information quickly without recomputing. It also improves memory efficiency, crucial for edge devices. Challenges: Requires careful management of the cache size to avoid excessive memory consumption, especially in long sequence tasks. |
| PagedAttention | A specialized memory management algorithm designed to solve the inefficiencies of storing Key-Value (KV) caches in Large Language Model (LLM) inference, particularly on memory-constrained hardware [72]. | Key Features: By maximizing VRAM utilization, edge and server hardware can increase the concurrent batch size, leading to significantly higher token throughput. Challenges: Integrating PagedAttention requires deep modifications to the attention kernels of a model, making it more difficult to implement compared to standard contiguous cache management. |
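The KV-cache mechanism in the table can be reduced to a minimal single-head decoding loop; the projections and token embeddings below are random stand-ins, and production engines layer batching, masking, and eviction (or paging, as in PagedAttention) on top of this pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per generated token

def decode_step(x_t):
    """Attend the newest token against all cached keys/values; past tokens are never recomputed."""
    q = x_t @ wq
    k_cache.append(x_t @ wk)              # cache K/V for the new token only
    v_cache.append(x_t @ wv)
    K, V = np.stack(k_cache), np.stack(v_cache)          # (t, d)
    scores = K @ q / np.sqrt(d)                          # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                         # context vector for the next-token prediction

for _ in range(8):                        # toy autoregressive loop
    ctx = decode_step(rng.normal(size=d)) # stand-in for the newest token's embedding
print(f"Cache holds {len(k_cache)} K/V pairs; per-step cost grows linearly with sequence length.")
```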
| Optimization Strategy | Overview | Key Features and Challenges |
|---|---|---|
| Neural Architecture Search (NAS) | Automates the design of the model architecture itself, such as layer depth and width, rather than manually tuning individual hyperparameters. | Key Features: Can be strictly constrained to search for architectures meeting a specific latency budget (e.g., <50 ms/token). Challenges: The search process is computationally expensive and requires significant initial resources to find the optimal architecture. |
| Hardware-Aware Hyperparameter Optimization (HPO) | Integrates hardware-specific metrics, such as energy per inference and peak VRAM, directly into the objective function. | Key Features: Optimizes for accuracy per watt to ensure device longevity. Challenges: Requires accurate hardware-in-the-loop measurement; as seen in SHM applications [73], poor tuning can lead to failure in detecting critical signals against environmental noise. |
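A hardware-aware objective of the kind described in the HPO row can be sketched as follows; `measure_on_device` is a hypothetical stand-in for profiler or power-meter readings, and its proxy formulas (as well as the 50 ms/token budget and energy weight) are purely illustrative.

```python
LATENCY_BUDGET_MS = 50.0        # per-token budget, matching the NAS example above
ENERGY_WEIGHT = 0.2             # accuracy-vs-energy trade-off, tuned per deployment

def objective(accuracy, latency_ms, energy_j):
    """Hardware-aware score: hard latency constraint, soft energy penalty."""
    if latency_ms > LATENCY_BUDGET_MS:
        return float("-inf")
    return accuracy - ENERGY_WEIGHT * energy_j

def measure_on_device(config):
    """Placeholder for hardware-in-the-loop measurement of a candidate (depth, width)."""
    depth, width = config
    accuracy = 0.70 + 0.02 * depth + 0.0001 * width       # illustrative proxy only
    latency_ms = 2.0 * depth + 0.01 * width
    energy_j = 0.05 * depth + 0.0005 * width
    return accuracy, latency_ms, energy_j

search_space = [(d, w) for d in range(4, 17, 2) for w in (512, 1024, 2048)]
best = max(search_space, key=lambda c: objective(*measure_on_device(c)))
print(f"Best config under {LATENCY_BUDGET_MS} ms/token: depth={best[0]}, width={best[1]}")
```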
| Workload Distribution | Overview | Methods and Key Concepts |
|---|---|---|
| Model Partitioning | Divides an LLM into segments that can be run on different devices (edge/cloud) to leverage the strengths of each [77]. | Vertical Partitioning: Different layers of the model are allocated to either the edge or the cloud [78]. Horizontal Partitioning: The model is divided based on input data or task type, balancing the workload [79]. |
| Task Offloading | A strategy where certain computational tasks are performed on edge devices while others are sent to the cloud [80]. | Static Offloading: Predefined tasks are assigned to the edge or the cloud based on resource requirements. Dynamic Offloading: Tasks are offloaded dynamically based on current conditions (e.g., network bandwidth). |
| Federated Learning | A collaborative machine learning approach where models are trained across multiple edge devices without sharing raw data. Only model updates are shared with the cloud to improve the global model [81]. | Collaborative Model Training: Different devices contribute to the training of a single global model [82]. Personalized Model Updates: Users can adapt a general model with personalized updates on their devices [83]. |
| Edge-Cloud Hybrid Architectures | Combines both edge and cloud devices into a single, seamless workflow for deployment. These architectures dynamically allocate tasks [76]. | Edge-Cloud Synergy: Emphasizes collaboration between edge and cloud systems [84]. Resource-Aware Scheduling: Intelligently distributes workloads based on resource availability and task requirements to reduce latency [85]. |
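As a sketch of dynamic task offloading from the table above, the routing rule below compares estimated completion times on the edge and in the cloud; the cost model (4 bytes per prompt token, fixed token rates) is a deliberately rough assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class Conditions:
    prompt_tokens: int
    bandwidth_mbps: float       # current uplink bandwidth
    edge_tps: float             # tokens/s sustained on the edge device
    cloud_tps: float            # tokens/s of the cloud backend
    rtt_ms: float               # network round-trip time

def choose_target(c: Conditions, gen_tokens: int = 128) -> str:
    """Dynamic offloading: route the request to whichever side is estimated to finish sooner."""
    edge_s = gen_tokens / c.edge_tps
    payload_bits = c.prompt_tokens * 4 * 8                        # rough 4-bytes-per-token estimate
    network_s = c.rtt_ms / 1000.0 + payload_bits / (c.bandwidth_mbps * 1e6)
    cloud_s = network_s + gen_tokens / c.cloud_tps
    return "edge" if edge_s <= cloud_s else "cloud"

# Poor connectivity favors local inference; a weak device with good connectivity favors the cloud.
print(choose_target(Conditions(512, bandwidth_mbps=0.5, edge_tps=60, cloud_tps=40, rtt_ms=200)))   # edge
print(choose_target(Conditions(512, bandwidth_mbps=100.0, edge_tps=5, cloud_tps=120, rtt_ms=20)))  # cloud
```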
| Mitigation Technique | Mechanism | Key Benefits and Challenges |
|---|---|---|
| Differential Privacy (DP) | Adds mathematical noise to intermediate activations before transmission to the cloud. | Benefits: Provides a formal guarantee against input reconstruction. Challenges: Can degrade model accuracy if noise levels are too high. |
| Secure Multi-Party Computation (SMPC) | Computes the partitioned layers jointly across edge and cloud using cryptographic secret shares. | Benefits: Neither party sees the actual latent representations. Challenges: Introduces significant communication and computational latency. |
| Adversarial Training | Trains the edge “stub” to minimize the information available for reconstruction attacks. | Benefits: Reduces the “leakage” of sensitive user attributes in the latent space. Challenges: Requires complex re-training of the base model. |
| Homomorphic Encryption (HE) | Performs cloud-side Transformer layers directly on encrypted activations. | Benefits: Data remains encrypted even during processing in the cloud. Challenges: Extremely high overhead; often too slow for real-time edge responses. |
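The differential-privacy row can be illustrated by clipping and perturbing the activations at the partition point before they leave the device; the clip norm and noise scale below are arbitrary choices, and a real deployment would also track the cumulative privacy budget across queries.

```python
import numpy as np

def clip_and_noise(activations: np.ndarray, clip_norm: float = 1.0, sigma: float = 0.5):
    """Clip each activation vector to a maximum L2 norm, then add Gaussian noise
    before transmitting it to the cloud half of a partitioned model."""
    norms = np.linalg.norm(activations, axis=-1, keepdims=True)
    clipped = activations * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = np.random.default_rng().normal(0.0, sigma * clip_norm, size=clipped.shape)
    return clipped + noise

hidden = np.random.default_rng(0).normal(size=(16, 768))   # activations at the partition point
protected = clip_and_noise(hidden)
print(f"Mean perturbation per element: {np.mean(np.abs(protected - hidden)):.3f}")
```

A larger noise scale strengthens the protection against reconstruction but, as the table notes, degrades the accuracy of the cloud-side layers.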
| Fine-Tuning Method | Overview | Key Features |
|---|---|---|
| Low-Rank Adaptation (LoRA) | A technique that enables fine-tuning by injecting low-rank matrices into the model’s layers, rather than updating all parameters [87]. | Efficiency: Allows for fine-tuning with a small number of trainable parameters. Performance Preservation: Maintains performance while reducing computational cost. |
| PEFT: Adapter Layers | Adapter layers are small, task-specific neural network layers inserted into a pre-trained model. Only the adapter layers are trained during fine-tuning [88]. | Modularity: Adapter layers can be easily added or removed for different tasks. Resource Efficiency: Reduces computational and memory requirements. |
| PEFT: Prompt Tuning | Involves optimizing a soft prompt or a small set of continuous, task-specific vectors that are prepended to the input. The main model weights remain frozen [89]. | No Weight Updates: The original model weights are not updated, preserving the pre-trained knowledge. Task-Specific Guidance: The optimized prompt guides the model’s behavior for a specific task. |
| PEFT: BitFit | A lightweight fine-tuning technique that only updates the bias parameters of the model’s layers while freezing all other weights [90]. | Minimal Parameter Updates: Only updates a small fraction of the total parameters (biases). Effective Adaptation: Despite minimal updates, it is effective for adapting to new tasks. |
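A minimal LoRA forward pass (synthetic dimensions, rank 8; a sketch of the general technique rather than any particular library's implementation) shows why so few parameters need to be trained:

```python
import numpy as np

d_in, d_out, r, alpha = 768, 768, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(0.0, 0.02, size=(d_in, d_out))   # frozen pre-trained weight
A = rng.normal(0.0, 0.01, size=(d_in, r))       # trainable low-rank factor
B = np.zeros((r, d_out))                        # zero-initialized so the update starts as a no-op

def lora_forward(x):
    """Frozen path plus scaled low-rank update; only A and B are trained (~2% of W's size here)."""
    return x @ W + (x @ A @ B) * (alpha / r)

x = rng.normal(size=(4, d_in))
print(lora_forward(x).shape)                    # (4, 768)
print(f"Trainable: {int(A.size + B.size):,} params vs frozen: {int(W.size):,}")
```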
| Hardware Platform | Strengths | Weaknesses |
|---|---|---|
| Central Processing Unit (CPU) | Versatility: CPUs are general-purpose processors suitable for a wide range of tasks [104]. Ease of Programming: The software ecosystem is mature and well-supported. | Limited Parallelism: CPUs have fewer cores, limiting their ability to handle highly parallel tasks. Lower Throughput: Less efficient for large-scale matrix operations common in AI. |
| Graphics Processing Unit (GPU) | High Parallelism: GPUs are designed for massive parallel computations [105]. Optimized Libraries: Many deep learning frameworks are highly optimized for GPUs. | Power Consumption: GPUs consume significant power. Cost: High-performance GPUs can be expensive. |
| Neural Processing Unit (NPU) | Specialized for AI Tasks: NPUs are custom-designed for efficient AI/ML computations [106]. Energy Efficiency: NPUs generally consume less power than GPUs for similar AI tasks. | Limited General-Purpose Use: Less versatile than CPUs or GPUs. Development Complexity: The software ecosystem is less mature. |
| Field-Programmable Gate Array (FPGA) | Customization: FPGAs can be configured for specific tasks, offering high efficiency [107]. Low Latency: Can achieve extremely low latency for real-time applications. | Development Time: Programming FPGAs is complex and time-consuming. Performance Variability: Performance heavily depends on the specific design and implementation. |
| Mobile System-on-Chip (SoC) | Heterogeneous Integration: Combines CPU, GPU, and NPU on a single die with unified memory (e.g., Snapdragon 8 Gen 3, Apple A17 Pro) [12]. Efficiency: Optimized for INT4/INT8 quantization, reaching 15–20 TPS on 7B models. | Thermal Throttling: Compact form factors lead to heat accumulation, causing performance drops during sustained inference. Ecosystem Lock-in: Maximum performance often requires vendor-specific SDKs. |
| Edge GPU Accelerators | High Throughput: Dedicated hardware like NVIDIA Jetson Orin utilizes Tensor Cores to achieve 30–50 TPS on 7B models via AWQ [13]. Software Maturity: Leverages established CUDA and TensorRT ecosystems. | Power Envelope: High TDP (up to 60W) may exceed the capabilities of battery-powered or solar-powered edge nodes. Physical Size: Larger footprint compared to integrated SoCs. |
| Application-Specific Accelerators | Niche Efficiency: Devices like Google Coral (Edge TPU) offer high TOPS/Watt for specific vision tasks [108]. Cost-Effective: Generally lower price point for low-power IoT applications. | Memory Bottleneck: Limited on-chip SRAM and lack of dynamic attention support result in extremely low LLM throughput (<1 TPS). Not suitable for Transformer architectures. |
| Unified Memory Architecture (UMA) | Zero-Copy Transfer: Found in Apple M-series Silicon; allows the NPU and GPU to share the same high-speed RAM pool (up to 400 GB/s) [104]. Context Capacity: Enables running massive models that exceed traditional discrete VRAM limits. | Hardware Cost: High entry cost for professional-grade unified memory configurations. Fixed Hardware: RAM is non-upgradeable, limiting the long-term flexibility of the edge node. |
| Optimization Strategy | Hardware Platform | Throughput (TPS) | Memory Footprint | Energy Efficiency |
|---|---|---|---|---|
| FP16 (Baseline) [13] | NVIDIA Jetson Orin | 12–15 TPS | 14.0 GB | 2.4 J/token |
| INT4 (GPTQ/AWQ) [12] | Snapdragon 8 Gen 3 | 18–22 TPS | 3.8 GB | 0.45 J/token |
| Pruning (50%) [104] | Apple M3 (MLX) | 25–30 TPS | 7.2 GB | 0.8 J/token |
| KD (NanoBERT) [3] | Embedded CPU | 5–8 TPS | <1.0 GB | 1.2 J/token |
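The memory and power figures in the table follow from simple arithmetic for a 7B-parameter model (the assumption behind the rows above); the small gap between the computed 3.5 GB and the reported 3.8 GB is attributable to quantization scales, activations, and runtime overheads.

```python
params = 7e9                                  # 7B-parameter model, as assumed in the table

fp16_gb = params * 2 / 1e9                    # 2 bytes per weight
int4_gb = params * 0.5 / 1e9                  # 0.5 bytes per weight, before runtime overheads
print(f"Weights: {fp16_gb:.1f} GB at FP16 vs {int4_gb:.1f} GB at INT4")   # 14.0 GB vs 3.5 GB

tps, joules_per_token = 20, 0.45              # INT4 row on the mobile SoC
print(f"Sustained power draw: {tps * joules_per_token:.1f} W")            # ~9 W
```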
| Hardware Accelerator | Overview | Key Features |
|---|---|---|
| Neural Processing Unit (NPU) | NPUs are specialized hardware designed to accelerate machine learning and AI tasks. They are often integrated into mobile devices and edge computing platforms [106]. | Energy Efficiency: NPUs are generally more energy-efficient than CPUs and GPUs for AI workloads. Optimized for AI: They are built specifically for the demands of neural networks and are highly efficient at common AI operations. |
| Tensor Processing Unit (TPU) | TPUs, developed by Google, are application-specific integrated circuits (ASICs) optimized specifically for machine learning workloads, particularly in a data center environment [113]. | High Throughput: TPUs are capable of performing a massive number of matrix multiplications per second. Optimized for Tensor Operations: Their architecture is custom-built to handle tensor computations with extreme efficiency. |
| Field-Programmable Gate Array (FPGA) | FPGAs are integrated circuits whose hardware can be reconfigured after manufacturing. They offer a unique balance of flexibility and performance [107]. | Customization: FPGAs can be configured to create custom hardware circuits for specific applications and algorithms, offering great flexibility. Low Latency: The ability to achieve very low latency is a key advantage, making them suitable for real-time applications and specialized tasks. |
| Library or Framework | Overview | Key Features |
|---|---|---|
| TensorFlow Lite (TFLite) | A lightweight version of TensorFlow designed for on-device machine learning inference, specifically for mobile, embedded, and IoT devices [116]. | Model Optimization: Supports various techniques like quantization and pruning to reduce model size and latency. Cross-Platform Compatibility: Works across a wide range of devices and operating systems (e.g., Android, iOS, embedded Linux). |
| ONNX Runtime | An open-source inference engine for the Open Neural Network Exchange (ONNX) format, designed to accelerate machine learning models across different hardware and software [117]. | Interoperability: Allows models trained in various frameworks (e.g., PyTorch, TensorFlow) to be run on a single platform. Performance Optimization: Provides a set of optimizations to achieve high performance on various hardware. |
| llama.cpp | A C++ library designed for efficient inference of large language models (LLMs) on consumer hardware, particularly CPUs [118]. | Memory Efficiency: Highly optimized to run large models with limited memory resources. Speed: Engineered to provide fast inference speeds, even on less powerful hardware, through techniques like quantization. |
| GPT-Generated Unified Format (GGUF) | A single-file binary format for quantized model weights and metadata used by llama.cpp-compatible runtimes, offering high CPU compatibility and memory mapping (mmap) [119]. | Device Suitability: Ideal for devices like Apple Silicon or laptops where VRAM and RAM are shared. Universal Quantization: Supports a wide range of quantization levels (Q2_K, Q4_K_M, etc.), allowing a user to pick a file that fits exactly within their device’s specific RAM constraints. |
| ExLlamaV2 (EXL2) | A library for high-speed GPU inference with variable bit-rate quantization [120]. | GPU-Dedicated: Best for edge GPUs with dedicated VRAM where maximizing tokens per second is the priority. Speed: Utilizes specialized CUDA kernels that minimize the overhead of dequantization during the forward pass, significantly reducing the time to first token compared to general-purpose formats. |
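As a usage illustration of the GGUF/llama.cpp path (a sketch assuming the llama-cpp-python bindings are installed; the model filename is hypothetical, and the context size and thread count would be tuned to the target device):

```python
# pip install llama-cpp-python        (Python bindings around llama.cpp)
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",   # hypothetical local 4-bit GGUF checkpoint
    n_ctx=2048,                                 # context window sized to fit device RAM
    n_threads=4,                                # match the edge CPU's physical cores
)

out = llm("Summarize the benefits of on-device inference in one sentence.", max_tokens=48)
print(out["choices"][0]["text"].strip())
```

An equivalent EXL2 deployment would instead load the quantized weights onto the edge GPU and rely on the dedicated CUDA kernels described in the table.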

