Deploying LLM Transformers on Edge Computing Devices: A Survey of Strategies, Challenges, and Future Directions
Abstract
1. Introduction
1.1. Scope of the Review
- Model Compression Techniques: We will survey and critically analyze various methods used to reduce the size and computational requirements of LLMs. This includes a deep dive into quantization (e.g., Post-Training Quantization, Quantization-Aware Training), pruning (e.g., unstructured vs. structured), and knowledge distillation.
- Architectural and Algorithmic Optimizations: The paper will explore architectural modifications and algorithmic enhancements to the standard Transformer model that are specifically designed for improved efficiency. This includes a review of efficient attention mechanisms, new model architectures, and inference-time optimizations like speculative decoding and efficient Key-Value (KV) cache management.
- System-Level and Hybrid Approaches: We will examine strategies that go beyond single-model optimization to improve overall system performance. This includes an overview of hybrid edge-cloud systems, on-device fine-tuning techniques (e.g., Parameter-Efficient Fine-Tuning or PEFT), and federated learning applications for LLMs on the edge.
- Hardware and Software Landscape: The review will provide an overview of the current hardware ecosystem for edge AI, including CPUs, GPUs, and specialized accelerators (NPUs, FPGAs). It will also cover the software frameworks and libraries (e.g., TFLite, ONNX Runtime) that enable and facilitate these deployments.
1.2. Rationale and Contributions
- A Comprehensive Review: The paper provides a review of LLM deployment strategies, challenges, and future directions for Transformer models on edge computing.
- Structured Understanding: It organizes the multi-faceted challenges and diverse solutions from academic and industrial literature into a structured framework.
- Exploration of Optimization: The paper examines methods to compress LLMs and optimize their inference capabilities to make them more efficient for edge environments. This approach not only reduces computational costs but also improves user privacy and security.
- Guidance for Organizations: The review provides insights into best practices, successful implementations, and the unique challenges of integrating LLMs with edge computing, which is crucial for organizations looking to leverage AI effectively.
1.3. Paper Organization
2. Methodology and Literature Selection
2.1. Search Strategy and Databases
- IEEE Xplore & ACM Digital Library: Primary sources for architectural innovations (e.g., GQA, PagedAttention) and hardware acceleration.
- ScienceDirect (Elsevier) & SpringerLink: Sources for multi-disciplinary applications, such as structural health monitoring and industrial IoT.
- arXiv: Utilized for the most recent state-of-the-art (SOTA) algorithms (e.g., GPTQ, AWQ, and Llama variants) that are currently the industry standard but may still be in the pre-publication phase.
2.2. Inclusion and Exclusion Criteria
2.3. The Foundations of LLMs, Transformers, and Edge Computing
2.4. Large Language Models (LLMs)
- Model Size: Modern LLMs are characterized by their massive number of parameters, often reaching billions or even trillions [16]. For instance, models like GPT-3 contain 175 billion parameters, while newer iterations may have even more. This vast number of parameters allows LLMs to capture intricate patterns and relationships in the data, enabling them to generate coherent and contextually relevant text.
- Training Data: The complexity of LLMs is further amplified by the sheer volume and diversity of the training data used [17]. These models are trained on extensive datasets that encompass a wide range of topics, languages, and styles. This diversity is crucial for enabling the models to generalize well across different contexts and applications, from casual conversation to technical writing.
- Architectural Innovations: The underlying architecture of LLMs, primarily based on the Transformer model, introduces complexity through innovative mechanisms such as self-attention [16]. Self-attention allows the model to weigh the significance of different words in a sequence and understand context better. However, the mechanism has quadratic complexity with respect to input length, which poses challenges in terms of computational resources and memory usage.
- Computational Requirements: Training and deploying modern LLMs require substantial computational resources. The training process often involves distributed computing across numerous GPUs or TPUs, leading to significant energy consumption and costs [18]. Efficient training algorithms and optimization techniques are therefore crucial to manage these resource demands.
- Inference Complexity: Once trained, the complexity continues during inference, where generating responses can be computationally intensive, especially for long sequences [19]. Techniques such as caching, pruning, and quantization are employed to optimize inference times and reduce latency, which is particularly important in edge computing scenarios.
- Integration with Edge Computing: As LLMs are integrated into edge computing environments, the complexity increases due to the need for model compression and optimization [20]. Techniques like quantization and pruning are essential to deploy these models on resource-constrained devices while maintaining performance. This integration allows for real-time processing and decision-making but adds layers of complexity in terms of ensuring efficiency and responsiveness.
- Ethical and Regulatory Considerations: The scale and complexity of LLMs also raise ethical and regulatory challenges [21]. As these models become more powerful, concerns regarding bias, misinformation, and data privacy become more pronounced. Organizations must navigate these issues while leveraging the capabilities of LLMs, adding another layer of complexity to their deployment.
2.4.1. Transformers
- Input Representation: Each word in the input sequence is transformed into a vector representation (embedding) [25].
- Query, Key, and Value Vectors: For each word, three vectors are created: a query vector, a key vector, and a value vector. These are derived from the input embeddings using learned linear transformations [26].
- Attention Scores: The attention scores are calculated by taking the dot product of the query vector of a word with the key vectors of all words in the sequence. This results in a score that indicates how much focus should be placed on each word when processing a particular word [27].
- Softmax Normalization: The scores are then normalized using the softmax function to create a probability distribution, which determines the weight of each word in the context of the current word [28].
- Weighted Sum: Finally, the output for each word is computed as a weighted sum of the value vectors, using the attention weights derived from the previous step [29].
- Contextual Understanding: Self-attention allows the model to capture long-range dependencies and relationships between words, significantly improving contextual understanding [30].
- Parallelization: Unlike recurrent neural networks (RNNs), which process sequences sequentially, self-attention can process all words in parallel, leading to faster training times [31].
- Quadratic Complexity: The self-attention mechanism has a computational complexity of O(n²), where n is the length of the input sequence, because attention scores must be computed for every pair of words in the sequence [32]. A minimal code sketch of the attention computation, exposing this n × n score matrix, follows this list.
- Memory Usage: The requirement to store and compute these scores for all pairs of words can lead to high memory consumption, especially for long sequences [33].
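To make the steps above concrete, the following minimal NumPy sketch (an illustration only; it omits multi-head projections, masking, and the batching used in real implementations) computes single-head scaled dot-product self-attention and shows where the n × n score matrix responsible for the quadratic cost arises.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence.

    X          : (n, d_model) input embeddings, one row per token
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices
    Returns (n, d_k) context vectors (weighted sums of the value vectors).
    """
    Q = X @ Wq                                  # query vectors
    K = X @ Wk                                  # key vectors
    V = X @ Wv                                  # value vectors

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, n) score matrix -- the O(n^2) term

    # softmax normalization over each row (numerically stabilized)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V                          # weighted sum of value vectors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_model, d_k = 8, 16, 8
    X = rng.normal(size=(n, d_model))
    Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
    out = self_attention(X, Wq, Wk, Wv)
    print(out.shape)  # (8, 8); the intermediate score matrix is (n, n) = (8, 8)
```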
2.4.2. Edge Computing
- Reduced Latency: Edge computing significantly reduces latency by minimizing the distance data must travel for processing. This is particularly beneficial for applications requiring real-time responses, as it allows for quicker data analysis and decision-making. For LLMs, this means that users can receive immediate feedback when interacting with AI systems, enhancing the overall user experience [36].
- Enhanced Privacy: By processing data locally on edge devices, organizations can better protect sensitive information. This approach reduces the need to transmit personal or sensitive data to centralized cloud servers, thereby mitigating risks associated with data breaches and unauthorized access. For LLM applications, maintaining privacy is crucial, especially when handling user-generated content or confidential information [37].
- Bandwidth Conservation: Edge computing helps conserve bandwidth by reducing the volume of data transmitted to and from cloud servers. Since data can be processed locally, only essential information needs to be sent to the cloud, leading to more efficient use of network resources. This is particularly advantageous for LLMs deployed in environments with limited connectivity or high data transfer costs, ensuring that AI applications remain functional and responsive [34].
- Scalability and Flexibility: Edge computing enables organizations to scale their AI applications more flexibly. By distributing processing across multiple edge devices, businesses can adapt to varying workloads and resource availability without over-relying on centralized infrastructure. This flexibility supports the deployment of LLMs in diverse settings, from smart homes to industrial automation [36].
- Limited Power: Edge devices often operate on battery power or have strict energy consumption limits. This limitation necessitates the use of energy-efficient algorithms and models. According to [34], edge computing reduces latency and conserves bandwidth, but it also requires careful management of power consumption to ensure device longevity. The need for power efficiency is particularly critical when deploying resource-intensive LLMs, as their computational demands can quickly deplete battery life.
- Memory Constraints: The memory available on edge devices is typically much lower than that of traditional cloud servers. Modern LLMs, such as GPT-3, can have hundreds of billions of parameters, which require substantial memory for both storage and processing [39]. Deploying LLMs on resource-constrained devices necessitates model compression techniques to fit within the limited memory capacity. Without effective memory management strategies, such as pruning or quantization, running LLMs on edge devices can be impractical.
- Computational Limitations: Edge devices generally possess less computational power compared to centralized cloud servers. The computational intensity of LLMs, particularly during inference, poses a challenge for these devices. As highlighted by [36], the complexity of processing LLMs can lead to high latency and resource consumption, which are unsuitable for the real-time requirements of many edge applications. This necessitates the development of optimized algorithms and architectures that can operate efficiently under these constraints.
- Network Bandwidth: Although not a direct constraint of the devices themselves, the network bandwidth available to edge devices can limit their ability to interact with cloud services for additional processing or data retrieval. As stated by [34], edge computing helps conserve bandwidth by processing data locally, but this also means that edge devices must be capable of handling as much processing as possible on their own. This further emphasizes the need for LLMs to be optimized for edge deployment.
3. Taxonomy of Edge LLM Deployment Strategies
- The Memory Wall (VRAM/Bandwidth): Solved via Model Compression (Quantization, Pruning, KD). These methods prioritize reducing the weight-loading overhead.
- The Quadratic Wall (Context Complexity): Solved via Architecture Optimization (GQA, SWA, Sparse Attention). These methods tame the quadratic scaling of the self-attention mechanism; a sliding-window attention mask, one such mitigation, is sketched after this list.
- The Compute Wall (Latency/FLOPs): Solved via System-Level Management (Speculative Decoding, PagedAttention, Partitioning). These strategies optimize the execution pipeline for real-time responsiveness.
- The Thermal Wall (Power/TDP): Solved via Hardware-Software Co-design (NPU/FPGA acceleration). These ensure sustainable operation in battery-limited environments.
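As a concrete instance of the architectural mitigations for the Quadratic Wall, the sketch below builds a sliding-window causal attention mask of the kind used in SWA-style models. The sequence length and window size are illustrative assumptions, not parameters of any particular model.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where True means 'token i may attend to token j'.

    Each token attends only to itself and the previous (window - 1) tokens,
    so the number of attended pairs grows as O(n * window) instead of O(n^2).
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # no attention to future tokens
    local = (i - j) < window          # restrict attention to a local window
    return causal & local

if __name__ == "__main__":
    mask = sliding_window_causal_mask(seq_len=8, window=3)
    print(mask.astype(int))
    # Row 5 attends only to positions 3, 4, 5 instead of all of 0..5.
```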
3.1. Model Compression Techniques
3.1.1. Quantization Methods for Transformer Models
3.1.2. Pruning in Large Language Models (LLMs)
- Unstructured Pruning in LLMs: Due to the massive size of LLMs, unstructured pruning can be effective in reducing the number of parameters without altering the overall architecture significantly. For example, ref. [50] demonstrated that unstructured pruning could lead to a 90% reduction in model size while retaining competitive performance in tasks such as text generation.
- Structured Pruning in LLMs: Structured pruning is particularly advantageous when deploying LLMs on edge devices, where computational efficiency is crucial. Techniques such as channel pruning can be applied to transformer models, leading to reduced latency and improved responsiveness. For instance, ref. [51] showed that structured pruning could significantly reduce the number of operations required for inference in LLMs, making them more suitable for real-time applications. Minimal sketches of both pruning styles follow this list.
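The following is a minimal PyTorch sketch of the two pruning styles discussed above, under the simplifying assumption of a pure magnitude criterion on a single linear layer. Production LLM pruning pipelines additionally rely on calibration data, layer-wise sensitivity analysis, and sparsity-aware kernels, so this is an illustration of the idea rather than of any published method.

```python
import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float) -> torch.Tensor:
    """Unstructured pruning: zero out the smallest-magnitude weights in place.

    Returns the binary mask so it can be re-applied after fine-tuning steps.
    """
    w = linear.weight.data
    k = int(sparsity * w.numel())                      # number of weights to remove
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    mask = (w.abs() > threshold).to(w.dtype)
    w.mul_(mask)                                       # zero the pruned weights
    return mask

def channel_prune(linear: torch.nn.Linear, keep_ratio: float) -> torch.nn.Linear:
    """Structured pruning: drop whole output channels with the smallest L2 norm,
    producing a genuinely smaller layer (fewer FLOPs, no sparse kernels needed)."""
    norms = linear.weight.data.norm(dim=1)                   # one norm per output row
    keep = int(keep_ratio * linear.out_features)
    idx = torch.topk(norms, keep).indices.sort().values      # channels to keep
    pruned = torch.nn.Linear(linear.in_features, keep, bias=linear.bias is not None)
    pruned.weight.data = linear.weight.data[idx].clone()
    if linear.bias is not None:
        pruned.bias.data = linear.bias.data[idx].clone()
    return pruned

if __name__ == "__main__":
    layer = torch.nn.Linear(512, 512)
    mask = magnitude_prune_(layer, sparsity=0.9)       # ~90% of weights set to zero
    smaller = channel_prune(layer, keep_ratio=0.5)     # 512 -> 256 output channels
    print(mask.mean().item(), smaller.weight.shape)
```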
3.1.3. Knowledge Distillation
- Model Compression: Knowledge distillation effectively reduces the size of neural networks while maintaining performance. The student model learns to mimic the teacher model’s outputs, which allows it to achieve competitive accuracy with significantly fewer parameters. Hinton et al. [54] demonstrated that a small student model could achieve similar performance to a larger teacher model by learning from the soft targets (probabilities) produced by the teacher, rather than just the hard labels. This approach enables the student model to capture more nuanced information about the data distribution. A minimal sketch of this soft-target loss follows this list.
- Task-Specific Adaptation: KD is particularly effective for creating task-specific models. By distilling knowledge from a teacher model trained on a broad dataset, the student model can be fine-tuned for specific tasks or domains. For instance, Sun et al. [55] showed that distillation could be used to adapt a general language model to a specific downstream task, leading to improved performance on that task while retaining the efficiency benefits of a smaller model.
- Improved Generalization: Knowledge distillation can enhance the generalization capabilities of student models. The teacher model often has learned robust features and representations from extensive training data. By transferring this knowledge, the student model can generalize better to unseen data. An empirical study by Fang et al. [56] found that student models trained through distillation exhibited improved performance on various tasks compared to models trained directly on the same dataset.
- Efficiency in Inference: Smaller student models resulting from KD are more efficient during inference, making them suitable for deployment on edge devices. These models require less computational power and memory, which is crucial for real-time applications. KD can significantly reduce inference latency while maintaining high accuracy, making it a valuable strategy for applications in environments with limited resources [57].
- Flexibility in Architecture: Knowledge distillation allows for flexibility in the architecture of student models. Researchers can experiment with different architectures and hyperparameters to find the optimal configuration for specific tasks. This adaptability is highlighted by Cho and Hariharan [58], who demonstrated that various student architectures could successfully learn from a single teacher model, resulting in tailored solutions for different applications.
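The soft-target objective described in the first item above can be written as a weighted sum of a temperature-scaled KL term and the ordinary cross-entropy loss. The sketch below is a minimal PyTorch rendering of that loss; the temperature and weighting values are illustrative defaults, not prescriptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Combine the soft-target KD loss with the ordinary hard-label loss.

    The KL term is scaled by T^2 so its gradient magnitude stays comparable
    when the temperature changes (as discussed by Hinton et al. [54]).
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

if __name__ == "__main__":
    batch, num_classes = 4, 10
    student = torch.randn(batch, num_classes, requires_grad=True)
    teacher = torch.randn(batch, num_classes)          # frozen teacher outputs
    labels = torch.randint(0, num_classes, (batch,))
    loss = distillation_loss(student, teacher, labels)
    loss.backward()
    print(loss.item())
```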
3.2. Architectural and Algorithmic Optimizations
3.2.1. Efficient Transformer Variants: A Survey of New Architectures
3.2.2. Inference Optimization Techniques for Large Language Models
3.2.3. Automated Search and Hardware-Aware Optimization
3.3. System-Level and Hybrid Approaches
3.3.1. Edge-Cloud Collaboration
3.3.2. On-Device Fine-Tuning
3.3.3. Federated Learning
- Overview of Federated Learning in LLMs: Federated Learning facilitates the training of LLMs by allowing each participant to train a local model on their private data and subsequently share only model updates (gradients) with a central server. The server aggregates these updates to improve a global model, which is then distributed back to the participants [92]. This approach ensures that sensitive data remains on local devices, mitigating privacy concerns.
- Applications of Federated Learning in LLMs
- Healthcare: In healthcare applications, federated learning has been employed to train language models on patient records while preserving confidentiality. For instance, researchers have utilized FL to develop LLMs that can analyze medical texts and support clinical decision-making without exposing sensitive patient data [93].
- Natural Language Processing: Federated learning has been applied in NLP tasks, such as sentiment analysis and text classification, where organizations can collaboratively train models on user-generated content without sharing the underlying data. This enables the creation of more robust models that generalize better across different contexts [94].
- Personalized Language Models: FL allows for the development of personalized language models that adapt to individual user preferences while preserving privacy. By training on local data, organizations can create models that reflect unique user interactions without compromising sensitive information [95].
- Benefits of Federated Learning for LLMs
- Data Privacy: One of the primary advantages of FL is its ability to preserve data privacy. Since raw data never leaves the local device, organizations can comply with stringent data protection regulations, such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act) [96].
- Diverse Data Sources: Federated learning enables the aggregation of diverse datasets from multiple sources, which can enhance the generalization capabilities of LLMs. This diversity is particularly beneficial for NLP tasks, where language usage can vary significantly across different demographics and contexts [97].
- Reduced Communication Costs: By transmitting only model updates instead of raw data, FL reduces communication costs and bandwidth usage, making it more efficient, especially in environments with limited connectivity [98].
- Challenges in Implementing Federated Learning for LLMs
- Heterogeneity of Data: One of the significant challenges in FL is the non-IID (Independent and Identically Distributed) nature of data across participants. This heterogeneity can lead to biased model updates and affect the convergence of the global model [99].
- Communication Overhead: While federated learning reduces the amount of data transmitted, the need for frequent communication between devices and the central server can still introduce latency and overhead, particularly when dealing with large models like LLMs [100].
- Security Risks: Although FL enhances data privacy, it is still susceptible to potential attacks, such as model inversion or poisoning attacks, where adversaries may attempt to infer sensitive information from shared gradients [101].
- Communication-Efficient Federated Learning Strategies (illustrated in the sketch after this list)
- Gradient Compression: Reduces update size via quantization (e.g., 1-bit SGD) or sparsification (sending only top-k gradients).
- Integration with PEFT: Instead of full gradients, only updates for small adapter modules (like LoRA) are transmitted.
- Local SGD/FedAvg: Increases local training steps on the device before communicating with the server.
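The sketch below is a minimal, hedged illustration of two of these strategies: top-k sparsification of a client update and FedAvg weighted aggregation on the server. NumPy arrays stand in for model (or adapter) updates; transport, encryption, and the PEFT adapters mentioned above are deliberately omitted.

```python
import numpy as np

def sparsify_topk(update: np.ndarray, k: int):
    """Client-side compression: keep only the k largest-magnitude entries,
    transmitting (indices, values, shape) instead of the dense update."""
    flat = update.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx], update.shape

def densify(idx, values, shape):
    """Server-side reconstruction of a sparsified update."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)

def fedavg(updates, num_examples):
    """FedAvg aggregation: weight each client's update by its local dataset size."""
    total = sum(num_examples)
    return sum((n / total) * u for n, u in zip(num_examples, updates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    client_updates = [rng.normal(size=(4, 4)) for _ in range(3)]   # local updates
    dataset_sizes = [100, 400, 500]                                 # local example counts
    compressed = [sparsify_topk(u, k=4) for u in client_updates]    # 4 of 16 entries sent
    restored = [densify(*c) for c in compressed]
    global_update = fedavg(restored, dataset_sizes)
    print(global_update.shape)
```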
4. Hardware and Software Considerations
4.1. Edge AI Hardware
4.1.1. Review of Hardware Platforms for Running Transformers
4.1.2. Evaluation Metrics for Edge LLMs
4.1.3. Hardware Accelerators for AI Workloads
4.2. Inference Frameworks and Toolkits
4.2.1. Key Software Libraries and Frameworks for Edge Deployment
4.2.2. Toolchains of Model Training and Deployment Gaps
- Streamlined Workflow: Toolchains provide a structured workflow that integrates various stages of the machine learning lifecycle, from data preprocessing and model training to deployment and monitoring. This integration helps reduce the complexity involved in managing multiple tools and ensures that data scientists and engineers can focus on building effective models rather than dealing with disparate systems. A cohesive toolchain can significantly improve productivity by automating repetitive tasks and providing a unified interface for managing the model lifecycle [122].
- Model Optimization: Many toolchains include features for model optimization, such as quantization, pruning, and compression techniques. These optimizations are crucial for deploying models in resource-constrained environments, such as edge devices. For instance, TensorFlow Lite and ONNX Runtime offer built-in support for these optimizations, enabling developers to reduce model size and improve inference speed without sacrificing accuracy [123]. This capability is vital for ensuring that models perform efficiently in real-time applications.
- Cross-Platform Compatibility: Toolchains often support multiple frameworks and platforms, allowing models to be trained in one environment and deployed in another. This flexibility is particularly important in heterogeneous computing environments where different hardware accelerators (e.g., CPUs, GPUs, NPUs) may be utilized. For example, ONNX Runtime enables models trained in various frameworks, such as PyTorch and TensorFlow, to be converted to the ONNX format for deployment, facilitating seamless integration across different systems [124]. A minimal export-and-run sketch follows this list.
- Monitoring and Maintenance: Effective toolchains include monitoring capabilities that allow organizations to track model performance in production. This monitoring is essential for identifying issues such as model drift, where the model’s performance degrades over time due to changes in data distribution. By incorporating monitoring tools, organizations can implement strategies for model retraining and updates, ensuring sustained model accuracy and relevance [125].
- Collaboration and Version Control: Toolchains enhance collaboration among team members by providing version control and reproducibility features. This is particularly important in machine learning projects where multiple stakeholders may be involved in model development and deployment. A well-designed toolchain can facilitate collaboration and ensure that all team members are working with the same model versions and datasets, thereby reducing errors and improving project outcomes [126].
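As a small, hedged illustration of the cross-platform path described above, the sketch below exports a toy PyTorch module to ONNX and runs it with ONNX Runtime. The model, file name, and tensor shapes are placeholders; a real deployment would export the trained model and select an edge-appropriate execution provider instead of the CPU provider shown here.

```python
import numpy as np
import torch
import onnxruntime as ort

# A toy stand-in for a trained model (placeholder architecture).
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
)
model.eval()

# Export to the framework-neutral ONNX format.
dummy = torch.randn(1, 16)
torch.onnx.export(model, dummy, "toy_model.onnx",
                  input_names=["input"], output_names=["logits"])

# Deployment-side inference with ONNX Runtime (an edge build would pick an
# NPU/GPU execution provider where one is available).
session = ort.InferenceSession("toy_model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"input": np.random.randn(1, 16).astype(np.float32)})[0]
print(logits.shape)  # (1, 4)
```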
5. Key Challenges
5.1. Performance vs. Accuracy Trade-Offs in Transformer Models on Edge Devices
- Model Size and Computational Demands: Transformers, especially modern LLMs, often contain billions of parameters, making them computationally intensive. While larger models tend to achieve higher accuracy due to their ability to capture intricate patterns in data, they also require substantial resources for both training and inference. This necessitates a careful consideration of model size and its impact on performance.
- Speed and Latency Considerations: Edge devices are often tasked with real-time processing, which requires low-latency responses. The computational intensity of LLMs can lead to increased inference times, making them unsuitable for applications that demand immediate feedback. Consequently, optimizing for speed may involve sacrificing some degree of accuracy, particularly if model simplifications are made.
- Model Compression Techniques: To address the performance vs. accuracy dilemma, various model compression techniques have been developed. These include quantization, pruning, and knowledge distillation, which aim to reduce the size and computational requirements of models while striving to maintain accuracy. However, these techniques can degrade model performance, especially in layers sensitive to changes in weight distributions.
- Quantization and Its Impact: Quantization methods, such as Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), can significantly reduce model size but may also introduce accuracy trade-offs, highlighting the inherent tension between a smaller, faster model and high accuracy. A minimal PTQ sketch follows this list.
- The Role of Efficient Architectures: Efficient Transformer variants, such as Linformer and Performer, have been proposed to mitigate the computational demands of standard Transformers. These architectures reduce the complexity of the self-attention mechanism, allowing for faster processing without a significant drop in accuracy. Ideally, an efficient architecture achieves linear complexity with respect to sequence length while maintaining competitive performance, directly addressing the performance-accuracy trade-off.
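The following is a minimal sketch of symmetric per-tensor INT8 post-training quantization, using the mean reconstruction error of a random weight matrix as a crude proxy for the accuracy cost. Real PTQ methods such as GPTQ or AWQ use calibration data and per-channel or activation-aware scaling, so this illustrates the trade-off rather than those algorithms.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor PTQ: map float weights onto the signed INT8 grid."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for error measurement."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
    q, scale = quantize_int8(w)                     # 4 bytes/param -> 1 byte/param
    err = np.abs(w - dequantize(q, scale)).mean()   # proxy for the accuracy cost
    print(f"mean abs reconstruction error: {err:.2e}, scale: {scale:.2e}")
```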
5.2. Generalization and Long-Tail Problems in Edge Models
- Limited Training Data and Diversity: Edge models often operate on smaller, domain-specific datasets due to constraints in data availability or privacy concerns. This limited exposure can hinder their ability to generalize across a wide range of scenarios. In contrast, larger cloud models are typically trained on vast and diverse datasets, allowing them to capture complex patterns and variations in language. The lack of diversity in training data for edge models can lead to overfitting, where the model performs well on the training data but fails to generalize to new inputs.
- Long-Tail Distribution Challenges: The long-tail problem arises when models encounter rare events or infrequent categories that are underrepresented in the training data. Edge models, which may be trained on specific tasks or user interactions, often struggle to handle these long-tail scenarios effectively. This is particularly problematic in applications like natural language processing, where certain phrases or topics may appear infrequently but are critical for comprehensive understanding. In contrast, larger models benefit from extensive training on diverse datasets, enabling them to better manage long-tail distributions and provide more robust responses.
- Model Size and Complexity: The size and complexity of edge models are inherently limited by the constraints of the hardware they run on, which affects their capacity to learn and generalize. Smaller models may lack the representational power needed to capture intricate relationships within data, leading to poorer performance in generalization tasks. In contrast, massive cloud-based models, such as GPT-3, leverage billions of parameters to understand subtle nuances in language, resulting in superior generalization capabilities.
- Mitigation Strategies: To address these challenges, several strategies can be employed:
- Data Augmentation: Enhancing the training dataset through techniques such as paraphrasing or synthetic data generation can improve the diversity and robustness of edge models.
- Transfer Learning: Utilizing pre-trained models as a starting point for fine-tuning on edge devices can help leverage the knowledge captured by larger models, improving generalization on specific tasks.
- Ensemble Methods: Combining multiple models or predictions can help mitigate the effects of long-tail distributions, as ensemble approaches can capture a wider range of patterns.
5.3. Data Privacy and Security in On-Device Processing
- Importance of On-Device Processing for Privacy: On-device processing is essential for safeguarding user privacy, as it minimizes the risk of unauthorized data access and ensures compliance with stringent data protection regulations. By keeping sensitive data on local devices, organizations can reduce exposure to potential breaches while still harnessing the power of AI for real-time applications.
- Data Minimization: On-device processing allows for data minimization by reducing the need to transmit sensitive information to centralized servers. By analyzing data locally, only essential information is sent to the cloud, thereby limiting exposure to potential data breaches. This is particularly crucial for applications handling personal or sensitive data, such as health records or financial information.
- User Control: Users retain greater control over their data when processing occurs on their devices. This empowerment fosters trust, as individuals can manage their data without relying on third-party entities. Research indicates that users are more likely to engage with applications that prioritize data privacy.
- Regulatory Compliance: With stringent data protection regulations like the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), on-device processing can facilitate compliance by ensuring that sensitive data remains within the user’s device. This reduces the risk of regulatory penalties associated with data mishandling.
- Security Vulnerabilities of Deploying Models on Physical Devices: Despite the privacy benefits, deploying models on physical devices introduces several security vulnerabilities that must be addressed:
- Physical Access Risks: Devices can be physically accessed by unauthorized individuals, leading to potential data theft or manipulation. Attackers can exploit vulnerabilities in the device’s operating system or hardware to extract sensitive information stored locally.
- Malware and Exploits: Edge devices are often targets for malware attacks, which can compromise the integrity of the models and the data they process. Malicious software can manipulate the model’s behavior or exfiltrate sensitive data, posing significant risks to user privacy.
- Model Inversion Attacks: Attackers can perform model inversion attacks to infer sensitive information about the training data by querying the model with specific inputs. This vulnerability is particularly concerning for models deployed on devices that process sensitive data, as it can lead to unauthorized access to private information.
- Data Leakage through Side Channels: On-device models may inadvertently leak information through side channels, such as timing attacks or power consumption patterns. These attacks exploit the physical characteristics of the device during processing to extract sensitive information about the data being processed.
- Lack of Regular Updates: Unlike cloud-based systems that can be updated centrally, edge devices may lack regular security updates, leaving them vulnerable to newly discovered exploits. This can result in prolonged exposure to known vulnerabilities, increasing the risk of security breaches.
5.4. Energy Efficiency
- Power Constraints: Battery-powered devices often operate under strict energy constraints, necessitating the development of energy-efficient algorithms and models. The need for power efficiency is particularly critical when deploying resource-intensive Large Language Models (LLMs), as their computational demands can quickly deplete battery life.
- Model Compression Techniques: Techniques such as quantization, pruning, and knowledge distillation are essential for reducing the size and computational requirements of models, thereby enhancing energy efficiency. For instance, converting weights from FP32 to INT8 reduces both memory traffic and arithmetic cost, helping to conserve energy during inference.
- Energy-Efficient Architectures: The development of specialized hardware, such as Neural Processing Units (NPUs) and Tensor Processing Units (TPUs), has been driven by the need for energy-efficient processing of AI workloads. These accelerators are designed to optimize energy consumption while delivering high performance for deep learning tasks.
- Dynamic Power Management: Implementing dynamic power management strategies can significantly reduce energy consumption. Techniques such as adaptive voltage scaling and dynamic frequency scaling allow devices to adjust power usage based on workload demands, enhancing overall energy efficiency.
- Algorithmic Optimizations: Efficient algorithms can minimize the number of computations required for model inference, directly impacting energy consumption. For example, the use of efficient attention mechanisms in Transformer models can reduce the computational burden, leading to lower power usage during processing.
- Battery Life Considerations: As AI applications become more prevalent on edge devices, the need for energy-efficient designs is paramount to extend battery life. Research indicates that optimizing AI models for energy efficiency can lead to significant improvements in battery longevity, which is crucial for user satisfaction and device usability.
5.5. Dynamic Resource Management
- Importance of Dynamic Resource Management: Dynamic resource management ensures that applications can efficiently utilize available resources while maintaining performance standards. This adaptability is particularly critical in edge computing, where devices often have limited computational power, memory, and energy resources.
- Adaptive Resource Allocation: Adaptive resource allocation strategies allow systems to dynamically adjust resource distribution based on real-time conditions. For instance, resource allocation can be modified based on the current workload, device capabilities, and network latency. This flexibility enables the deployment of LLMs on edge devices while ensuring that performance remains optimal.
- Context-Aware Resource Management: Context-aware resource management systems leverage information about the current state of the device and network environment to make informed decisions about resource allocation. By analyzing contextual data, such as user behavior and application requirements, these systems can optimize resource usage effectively. This approach can lead to improved responsiveness and user experience in applications reliant on LLMs.
- Proactive vs. Reactive Strategies: Dynamic resource management can be categorized into proactive and reactive strategies. Proactive strategies anticipate changes in resource demands and adjust resources accordingly before issues arise. In contrast, reactive strategies respond to changes after they occur, which may lead to temporary performance degradation. Implementing proactive strategies can enhance the reliability and efficiency of LLM applications on edge devices.
- Load Balancing Techniques: Load balancing is a critical aspect of dynamic resource management, particularly in hybrid edge-cloud systems. By distributing workloads evenly across available resources, load balancing techniques can prevent bottlenecks and ensure that no single device is overwhelmed. This is essential for maintaining performance and responsiveness in applications that utilize LLMs, especially during peak usage times.
- Resource Prediction Models: Resource prediction models use machine learning techniques to forecast future resource requirements based on historical data and current usage patterns. By accurately predicting resource needs, systems can allocate resources more efficiently and reduce the likelihood of performance degradation due to resource shortages. This predictive capability is particularly valuable in dynamic environments where resource demands can fluctuate significantly. A minimal prediction-and-placement sketch follows this list.
- Challenges in Dynamic Resource Management: Despite the benefits, dynamic resource management faces several challenges, including:
- Heterogeneity of Devices: The diversity in device capabilities complicates the management of resources, as different devices may require different approaches to resource allocation.
- Network Variability: Fluctuations in network conditions can impact the performance of edge applications, making it difficult to maintain consistent resource management strategies.
- Security and Privacy Concerns: Dynamic resource management systems must also address security and privacy issues, particularly when handling sensitive data on edge devices.
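As a concrete (and deliberately simple) illustration of the prediction-driven allocation discussed above, the sketch below uses an exponential moving average to forecast device load and a fixed threshold to decide between local execution and offloading. The threshold, units, and policy are illustrative assumptions rather than recommendations.

```python
class EMAResourcePredictor:
    """Forecast near-term resource demand with an exponential moving average."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.estimate = None

    def update(self, observed_load: float) -> float:
        if self.estimate is None:
            self.estimate = observed_load
        else:
            self.estimate = self.alpha * observed_load + (1 - self.alpha) * self.estimate
        return self.estimate

def choose_placement(predicted_load: float, local_budget: float) -> str:
    """Proactive policy: offload before the device saturates (illustrative 80% threshold)."""
    return "local" if predicted_load < 0.8 * local_budget else "offload-to-cloud"

if __name__ == "__main__":
    predictor = EMAResourcePredictor()
    for load in [0.2, 0.3, 0.6, 0.9, 1.1]:          # observed utilization samples
        pred = predictor.update(load)
        print(f"predicted={pred:.2f} -> {choose_placement(pred, local_budget=1.0)}")
```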
5.6. Cross-Disciplinary Applications of Edge Intelligence
- Synchronization vs. Latency: In Bio-inspired Robotics, the challenge is not just the speed of inference but the synchronization with physical hardware (Hardware-in-the-Loop) [127]. While an LLM can afford to be asynchronous, a medical robotic sphincter must respond to tissue pressure changes with zero-latency precision.
- Reliability vs. Power: As substantiated by Bernardini et al. [73], civil engineering applications face a Power-Accuracy Paradox: the more complex the ML model used to filter environmental noise, the faster the edge node’s battery depletes, creating a hard limit on long-term autonomous monitoring.
- Safety vs. Complexity: In Aerospace systems, the complexity of the Graph-Language Model (BiDGCNLLM) must be balanced against the safety-critical need for low-latency state forecasting [9]. This mirrors the challenges in edge LLMs where aggressive quantization can reduce latency but may introduce hallucinations that are unacceptable in a flight-safety context.
6. Future Research Directions
6.1. Hardware-Software Co-Design
- Optimized Performance for AI Workloads: AI workloads, especially those involving LLMs, demand high computational power and efficiency. By collaborating closely, chip designers can create specialized hardware architectures tailored specifically for the computational patterns of AI algorithms. For instance, the development of Neural Processing Units (NPUs) has been driven by the need for architectures optimized for matrix multiplications and deep learning tasks, significantly enhancing performance compared to traditional CPUs and GPUs. This synergy allows for the development of hardware that maximizes the efficiency of AI models while ensuring that software can leverage these optimizations effectively.
- Energy Efficiency: Energy consumption is a critical concern for deploying AI models, particularly on battery-powered edge devices. Hardware-software co-design enables the creation of energy-efficient architectures that reduce power consumption during both training and inference phases. For example, specialized accelerators like Tensor Processing Units (TPUs) are designed to perform tensor operations with minimal energy usage. By collaborating with AI researchers, chip designers can identify the most energy-intensive operations and optimize hardware accordingly, leading to significant improvements in battery life and overall system efficiency.
- Adaptation to Evolving AI Techniques: The field of AI is rapidly evolving, with new algorithms and techniques emerging frequently. A close partnership between chip designers and AI researchers ensures that hardware can quickly adapt to these advancements. For instance, as AI models become more complex and require advanced features such as dynamic memory management or efficient data handling, hardware must be designed to accommodate these needs. The collaboration fosters a feedback loop where hardware capabilities can inform software design and vice versa, leading to more agile and responsive development processes.
- Addressing Scalability Challenges: As AI applications scale, the challenges associated with deploying models on various devices become more pronounced. Hardware-software co-design facilitates the development of scalable solutions that can efficiently handle increased workloads without compromising performance. By jointly exploring architectural innovations and algorithmic efficiencies, teams can create systems that scale effectively across different hardware platforms, ensuring consistent performance in diverse environments.
- Enhanced Security and Privacy: The deployment of AI models, particularly in sensitive applications, raises concerns about data privacy and security. A collaborative approach allows for the integration of security features directly into the hardware design, providing robust protection against potential vulnerabilities. For example, incorporating hardware-level encryption and secure processing units can help safeguard sensitive information while processing data on edge devices. This proactive strategy ensures that security measures are aligned with the operational requirements of AI applications.
- Facilitating Real-Time Processing: Many AI applications require real-time processing capabilities, which can be hindered by traditional hardware-software separations. By collaborating closely, chip designers and AI researchers can develop systems that minimize latency and enhance responsiveness. Techniques such as efficient key-value (KV) cache management and speculative decoding can be better optimized when hardware is designed with these specific AI processing requirements in mind. This co-design approach ensures that both hardware and software are aligned to meet the demands of real-time applications effectively.
6.2. New Architectures of Transformer Variants for Edge Constraints
6.3. On-Device Lifelong Learning
- Incremental Learning Approaches: Incremental learning allows models to update their knowledge base without retraining from scratch. This is particularly useful for adapting to new data while retaining previously learned information. Incremental learning techniques enable models to learn from new data instances as they become available, effectively adapting to changes in the environment or user preferences. This approach helps mitigate catastrophic forgetting, a common issue where new learning interferes with previously acquired knowledge. A minimal rehearsal-based sketch follows this list.
- Federated Learning: Federated learning enables multiple devices to collaboratively learn a shared model while keeping their data decentralized, enhancing privacy and security. In this paradigm, local models are trained on-device and only model updates are sent to a central server for aggregation. This allows for continuous learning from diverse data sources without compromising user privacy.
- Model Distillation: Model distillation is a technique where a smaller, more efficient model (student) is trained to replicate the behavior of a larger, more complex model (teacher), facilitating on-device learning. Distillation allows for the transfer of knowledge from a more complex model to a simpler one, enabling the latter to learn more efficiently and effectively on resource-constrained devices.
- Adaptive Learning Rates: Using adaptive learning rates allows models to adjust their learning speed based on the characteristics of the incoming data. Techniques such as AdaGrad, RMSprop, and Adam can help models converge more quickly and effectively, especially when learning from non-stationary data streams.
- Data Augmentation and Synthetic Data: Data augmentation techniques can enhance the diversity of training data, allowing models to generalize better and adapt to new scenarios. By artificially increasing the size and variability of training datasets through transformations (e.g., rotations, translations), models can learn to be more robust and adaptable to new data distributions.
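The following is a minimal, hedged sketch of rehearsal-based incremental learning: fresh on-device samples are mixed with a small reservoir of earlier examples to limit catastrophic forgetting. The model, buffer size, and batch shapes are illustrative placeholders rather than a prescription for edge LLM fine-tuning.

```python
import random
import torch

class ReplayBuffer:
    """Reservoir-style buffer of past examples used to limit catastrophic forgetting."""

    def __init__(self, capacity: int = 256):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:                                    # reservoir sampling keeps a uniform sample
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k: int):
        return random.sample(self.data, min(k, len(self.data)))

def incremental_step(model, optimizer, new_batch, buffer: ReplayBuffer, replay_k: int = 8):
    """One on-device update: train on fresh data mixed with replayed old data."""
    xs, ys = new_batch
    replay = buffer.sample(replay_k)
    if replay:
        rx, ry = zip(*replay)
        xs = torch.cat([xs, torch.stack(rx)])
        ys = torch.cat([ys, torch.stack(ry)])
    loss = torch.nn.functional.cross_entropy(model(xs), ys)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for x, y in zip(*new_batch):                 # remember the fresh samples for later replay
        buffer.add((x, y))
    return loss.item()

if __name__ == "__main__":
    model = torch.nn.Linear(10, 3)               # toy stand-in for an on-device model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    buf = ReplayBuffer()
    batch = (torch.randn(4, 10), torch.randint(0, 3, (4,)))
    print(incremental_step(model, opt, batch, buf))
```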
6.4. Standardized Benchmarks for Comprehensive Evaluation of Edge LLM Performance
- Consistency in Evaluation: Standardized benchmarks provide a consistent framework for evaluating the performance of edge LLMs across different devices, architectures, and applications. This consistency is crucial for researchers and practitioners to compare results and understand the capabilities of various models. Without standardized metrics, it becomes challenging to ascertain which models perform best under specific conditions or to identify the trade-offs involved in deploying LLMs on resource-constrained devices.
- Facilitating Comparisons: With the rapid development of new architectures and techniques, standardized benchmarks enable meaningful comparisons between different models and approaches. For instance, metrics such as latency, throughput, and accuracy can be uniformly assessed, allowing stakeholders to make informed decisions about which models to adopt for particular applications. This comparative analysis is vital for guiding the selection of models that best meet the performance requirements of edge computing environments.
- Identifying Performance Bottlenecks: Standardized benchmarks help in identifying performance bottlenecks in edge LLMs. By evaluating models against a common set of tasks and conditions, researchers can pinpoint specific areas that require optimization, such as memory usage, computational efficiency, or inference speed. This targeted analysis can drive advancements in model compression techniques, architectural innovations, and algorithmic optimizations that enhance edge LLM performance.
- Supporting Reproducibility: Reproducibility is a cornerstone of scientific research. Standardized benchmarks ensure that evaluations can be replicated across different studies and environments, fostering trust in the reported results. This is particularly important in the field of AI, where the complexity of models and variability in hardware can lead to inconsistent findings. By adhering to standardized evaluation protocols, researchers can contribute to a more reliable body of knowledge regarding edge LLM performance.
- Benchmarking Metrics for Edge LLMs: To provide insights useful for hardware-software co-design, benchmarks for edge LLMs must move beyond general accuracy scores toward metrics that capture the device’s physical limits and real-time interaction. Key indicators include Time-to-First-Token (TTFT), which gauges initial responsiveness, and Tokens-per-Second (TPS), which determines whether generation keeps pace with human reading speed. These rates should also be reported per Watt and interpreted against the device’s Thermal Design Power (TDP) to ensure the model does not cause excessive battery drain or thermal throttling. Finally, memory-efficiency measures such as KV-cache utilization are essential for confirming that architectural optimizations like PagedAttention make the most of the constrained VRAM available on edge hardware. A minimal TTFT/TPS measurement sketch follows.
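The sketch below shows one way TTFT and TPS could be measured around any streaming generation callable. `fake_generator` is a hypothetical stand-in for an on-device inference API, and TPS-per-Watt requires a platform-specific power sensor, so it is noted only as a comment.

```python
import time
from typing import Callable, Iterable

def benchmark_stream(generate_stream: Callable[[str], Iterable[str]], prompt: str) -> dict:
    """Measure Time-to-First-Token (TTFT) and decode Tokens-per-Second (TPS)
    around any callable that yields generated tokens one at a time."""
    t_start = time.perf_counter()
    t_first, n_tokens = None, 0
    for _ in generate_stream(prompt):
        n_tokens += 1
        if t_first is None:
            t_first = time.perf_counter()
    t_end = time.perf_counter()
    ttft = (t_first - t_start) if t_first is not None else float("nan")
    decode_time = t_end - (t_first if t_first is not None else t_start)
    tps = (n_tokens - 1) / decode_time if (n_tokens > 1 and decode_time > 0) else float("nan")
    # TPS-per-Watt would be tps / average_power_watts, with power sampled from a
    # platform-specific sensor (battery fuel gauge, on-board rail monitor, etc.).
    return {"ttft_s": ttft, "decode_tps": tps}

def fake_generator(prompt: str):
    """Hypothetical stand-in for an on-device LLM streaming API."""
    time.sleep(0.05)                      # simulated prefill latency
    for tok in ["Edge", " LLMs", " stream", " tokens", "."]:
        time.sleep(0.01)                  # simulated per-token decode latency
        yield tok

if __name__ == "__main__":
    print(benchmark_stream(fake_generator, "Hello"))
```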
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kamath, U.; Keenan, K.; Somers, G.; Sorenson, S. Large Language Models: A Deep Dive; Springer Nature: Cham, Switzerland, 2024; Volume 10. [Google Scholar]
- Dwivedi, Y.K.; Hughes, L.; Ismagilova, E.; Aarts, G.; Coombs, C.; Crick, T.; Duan, Y.; Dwivedi, R.; Edwards, J.; Eirug, A.; et al. Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy. Int. J. Inf. Manag. 2021, 57, 101994. [Google Scholar] [CrossRef]
- Maity, K.; Chaulwar, A.T.; Vala, V.; Guntur, R.S. NanoBERT: An extremely compact language model. In Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD), Bangalore, India, 4–7 January 2024; pp. 342–349. [Google Scholar]
- Nikdast, M.; Afifi, S.; Pasricha, S. Shedding Light on LLMs: Harnessing Photonic Neural Networks for Accelerating LLMs. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, Newark, NJ, USA, 27–31 October 2024; pp. 1–8. [Google Scholar]
- Thanasi-Boçe, M.; Hoxha, J. From ideas to ventures: Building entrepreneurship knowledge with LLM, prompt engineering, and conversational agents. Educ. Inf. Technol. 2024, 29, 24309–24365. [Google Scholar] [CrossRef]
- Vishwas, B.V.K.; Macharla, S.R. Time Series Forecasting Using Generative AI: Leveraging AI for Precision Forecasting; Springer: Berlin/Heidelberg, Germany, 2025. [Google Scholar]
- Peng, D.; Zheng, L.; Liu, D.; Han, C.; Wang, X.; Yang, Y.; Song, L.; Zhao, M.; Wei, Y.; Li, J.; et al. Large-language models facilitate discovery of the molecular signatures regulating sleep and activity. Nat. Commun. 2024, 15, 3685. [Google Scholar] [CrossRef]
- Géza, G.; Varga, B. Method and Management Node in a Communication Network, for Supporting Management of Network Nodes Based on LLDP Messages. U.S. Patent 11,431,728, 30 August 2022. [Google Scholar]
- Wen, Z.; Zhao, J.; Zhang, A.; Bi, W.; Kuang, B.; Su, Y.; Wang, R. BiDGCNLLM: A Graph–Language Model for Drone State Forecasting and Separation in Urban Air Mobility Using Digital Twin-Augmented Remote ID Data. Drones 2025, 9, 508. [Google Scholar] [CrossRef]
- Zhang, B.; Zhang, J.; Hou, J.; Wang, Y. TensAllo: Adaptive Deployment of LLMs on Resource-Constrained Heterogeneous Edge Devices. In Proceedings of the IEEE INFOCOM 2025-IEEE Conference on Computer Communications, London, UK, 19–22 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–10. [Google Scholar]
- Hu, B.; Zhao, C.; Zhang, P.; Zhou, Z.; Yang, Y.; Xu, Z.; Liu, B. Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach. arXiv 2023, arXiv:2306.03604. [Google Scholar] [CrossRef]
- Qualcomm Technologies. Qualcomm AI Hub: On-Device LLM Benchmarks for Snapdragon 8 Gen 3. 2023. Available online: https://aihub.qualcomm.com/ (accessed on 30 November 2025).
- NVIDIA Corporation. Accelerating LLMs on the Edge with TensorRT-LLM and Jetson Orin; Technical Report, NVIDIA Technical Reports; NVIDIA Corporation: Santa Clara, CA, USA, 2024. [Google Scholar]
- Joshi, P.; Hasanuzzaman, M.; Thapa, C.; Afli, H.; Scully, T. Enabling all in-edge deep learning: A literature review. IEEE Access 2023, 11, 3431–3460. [Google Scholar] [CrossRef]
- Sinh, V.T.; Minh, N.L. A study on self-attention mechanism for AMR-to-text generation. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, Salford, UK, 26–28 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 321–328. [Google Scholar]
- Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
- Patil, R.; Gudivada, V. A review of current trends, techniques, and challenges in large language models (llms). Appl. Sci. 2024, 14, 2074. [Google Scholar] [CrossRef]
- Li, R.; Fu, D.; Shi, C.; Huang, Z.; Lu, G. Efficient LLMs training and inference: An introduction. IEEE Access 2024, 13, 32944–32970. [Google Scholar] [CrossRef]
- Zhang, X.; Nie, J.; Huang, Y.; Xie, G.; Xiong, Z.; Liu, J.; Niyato, D.; Shen, X.S. Beyond the cloud: Edge inference for generative large language models in wireless networks. IEEE Trans. Wirel. Commun. 2024, 24, 643–658. [Google Scholar] [CrossRef]
- Zhang, M.; Shen, X.; Cao, J.; Cui, Z.; Jiang, S. Edgeshard: Efficient llm inference via collaborative edge computing. IEEE Internet Things J. 2024, 12, 13119–13131. [Google Scholar] [CrossRef]
- Ong, J.C.L.; Chang, S.Y.H.; William, W.; Butte, A.J.; Shah, N.H.; Chew, L.S.T.; Liu, N.; Doshi-Velez, F.; Lu, W.; Savulescu, J.; et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit. Health 2024, 6, e428–e432. [Google Scholar] [CrossRef]
- Annepaka, Y.; Pakray, P. Large language models: A survey of their development, capabilities, and applications. Knowl. Inf. Syst. 2025, 67, 2967–3022. [Google Scholar] [CrossRef]
- Kumar, P. Large language models (LLMs): Survey, technical frameworks, and future challenges. Artif. Intell. Rev. 2024, 57, 260. [Google Scholar] [CrossRef]
- Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.P.; Bing, L.; Xu, X.; Poria, S.; Lee, R.K.W. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv 2023, arXiv:2304.01933. [Google Scholar] [CrossRef]
- Pragst, L.; Rach, N.; Minker, W.; Ultes, S. On the vector representation of utterances in dialogue context. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Kopru, S.; Liu, M.; SAWAF, H. Vector Representation of Descriptions and Queries. U.S. Patent 15/192,323, 28 December 2017. [Google Scholar]
- Kouretas, I.; Paliouras, V. Hardware implementation of a softmax-like function for deep learning. Technologies 2020, 8, 46. [Google Scholar] [CrossRef]
- Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. A simple and light-weight attention module for convolutional neural networks. Int. J. Comput. Vis. 2020, 128, 783–798. [Google Scholar] [CrossRef]
- Kobayashi, G.; Kuribayashi, T.; Yokoi, S.; Inui, K. Attention is not only a weight: Analyzing transformers with vector norms. arXiv 2020, arXiv:2004.10102. [Google Scholar] [CrossRef]
- Essam, M.; Eldawlatly, S.; Abbas, H. Contextualized Word Representations for Self-Attention Network. In Proceedings of the 2018 13th International Conference on Computer Engineering and Systems (ICCES), Cairo, Egypt, 18–19 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 116–121. [Google Scholar]
- Wu, E.; Liu, X.; Chen, Y.; Zhang, T. A Self-Attention Based Joint Sequence Labeling Model. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 24–26 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 784–787. [Google Scholar]
- Zheng, Z.; Huang, S.; Weng, R.; Dai, X.Y.; Chen, J. Improving self-attention networks with sequential relations. IEEE/ACM Trans. Audio Speech, Lang. Process. 2020, 28, 1707–1716. [Google Scholar] [CrossRef]
- Lee, S.; Bakker, C.R.; Vitzthum, C.; Alver, B.H.; Park, P.J. Pairs and Pairix: A file format and a tool for efficient storage and retrieval for Hi-C read pairs. Bioinformatics 2022, 38, 1729–1731. [Google Scholar] [CrossRef] [PubMed]
- Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
- Qu, G.; Chen, Q.; Wei, W.; Lin, Z.; Chen, X.; Huang, K. Mobile edge intelligence for large language models: A contemporary survey. IEEE Commun. Surv. & Tutor. 2025, 27, 3820–3860. [Google Scholar]
- Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W. Optimized compression for implementing convolutional neural networks on FPGA. Electronics 2019, 8, 295. [Google Scholar] [CrossRef]
- Kibriya, H.; Khan, W.Z.; Siddiqa, A.; Khan, M.K. Privacy issues in large language models: A survey. Comput. Electr. Eng. 2024, 120, 109698. [Google Scholar] [CrossRef]
- Huang, W.; Deng, X. Real-time tracking railway intruders using multiple-agent cooperated large language models with edge stream processing engine. J. Netw. Comput. Appl. 2025, 242, 104231. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Hasan, J. Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques. arXiv 2024, arXiv:2411.06084. [Google Scholar] [CrossRef]
- Kodali, R.K.; Upreti, Y.P.; Boppana, L. A quantization approach for the reduced size of large language models. In Proceedings of the 2024 16th International Conference on Knowledge and Smart Technology (KST), Krabi, Thailand, 28 February–2 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 144–148. [Google Scholar]
- Pandey, N.P.; Nagel, M.; van Baalen, M.; Huang, Y.; Patel, C.; Blankevoort, T. A practical mixed precision algorithm for post-training quantization. arXiv 2023, arXiv:2302.05397. [Google Scholar] [CrossRef]
- Yu, C.; Yang, S.; Zhang, F.; Ma, H.; Wang, A.; Li, E.P. Improving quantization-aware training of low-precision network via block replacement on full-precision counterpart. arXiv 2024, arXiv:2412.15846. [Google Scholar] [CrossRef]
- Chu, T.; Luo, Q.; Yang, J.; Huang, X. Mixed-precision quantized neural networks with progressively decreasing bitwidth. Pattern Recognit. 2021, 111, 107647. [Google Scholar] [CrossRef]
- Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv 2022, arXiv:2210.17323. [Google Scholar] [CrossRef]
- Lin, J.; Tang, J.; Tang, H.; Yang, S.; Xiao, G.; Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. GetMobile Mob. Comput. Commun. 2025, 28, 12–17. [Google Scholar] [CrossRef]
- Vadera, S.; Ameen, S. Methods for pruning deep neural networks. IEEE Access 2022, 10, 63280–63300. [Google Scholar] [CrossRef]
- Xia, H.; Zheng, Z.; Li, Y.; Zhuang, D.; Zhou, Z.; Qiu, X.; Li, Y.; Lin, W.; Song, S.L. Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv 2023, arXiv:2309.10285. [Google Scholar] [CrossRef]
- Cheng, H.; Zhang, M.; Shi, J.Q. MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models. arXiv 2024, arXiv:2407.11681. [Google Scholar] [CrossRef]
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149. [Google Scholar] [CrossRef]
- Gale, T.; Elsen, E.; Hooker, S. The state of sparsity in deep neural networks. arXiv 2019, arXiv:1902.09574. [Google Scholar] [CrossRef]
- Alkhulaifi, A.; Alsahli, F.; Ahmad, I. Knowledge distillation in deep learning and its applications. PeerJ Comput. Sci. 2021, 7, e474. [Google Scholar] [CrossRef] [PubMed]
- Thrivikram, G.; Ganesh, V.; Sethuraman, T.; Perepu, S.K. Efficient knowledge distillation of teacher model to multiple student models. In Proceedings of the 2021 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), Bandung, Indonesia, 27–28 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 173–179. [Google Scholar]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
- Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for bert model compression. arXiv 2019, arXiv:1908.09355. [Google Scholar] [CrossRef]
- Fang, L.; Yu, X.; Cai, J.; Chen, Y.; Wu, S.; Liu, Z.; Yang, Z.; Lu, H.; Gong, X.; Liu, Y.; et al. Knowledge distillation and dataset distillation of large language models: Emerging trends, challenges, and future directions. arXiv 2025, arXiv:2504.14772. [Google Scholar] [CrossRef]
- Tang, J.; Shivanna, R.; Zhao, Z.; Lin, D.; Singh, A.; Chi, E.H.; Jain, S. Understanding and improving knowledge distillation. arXiv 2020, arXiv:2002.03532. [Google Scholar] [CrossRef]
- Cho, J.H.; Hariharan, B. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4794–4802. [Google Scholar]
- Liu, D. Contemporary model compression on large language models inference. arXiv 2024, arXiv:2409.01990. [Google Scholar] [CrossRef]
- Zhuang, B.; Liu, J.; Pan, Z.; He, H.; Weng, Y.; Shen, C. A survey on efficient training of transformers. arXiv 2023, arXiv:2302.01107. [Google Scholar] [CrossRef]
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar] [CrossRef]
- Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A survey on mixture of experts in large language models. IEEE Trans. Knowl. Data Eng. 2025, 37, 3896–3915. [Google Scholar]
- Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar] [CrossRef]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
- Chinnakonduru, S.S.; Mohapatra, A. Weighted grouped query attention in transformers. arXiv 2024, arXiv:2407.10855. [Google Scholar] [CrossRef]
- Fu, Z.; Song, W.; Wang, Y.; Wu, X.; Zheng, Y.; Zhang, Y.; Xu, D.; Wei, X.; Xu, T.; Zhao, X. Sliding Window Attention Training for Efficient Large Language Models. arXiv 2025, arXiv:2502.18845. [Google Scholar] [CrossRef]
- Dhar, N.; Deng, B.; Lo, D.; Wu, X.; Zhao, L.; Suo, K. An empirical analysis and resource footprint study of deploying large language models on edge devices. In Proceedings of the 2024 ACM Southeast Conference, Marietta, GA, USA, 18–20 April 2024; pp. 69–76. [Google Scholar]
- Barad, H.; Aidova, E.; Gorbachev, Y. Leveraging speculative sampling and kv-cache optimizations together for generative ai using openvino. arXiv 2023, arXiv:2311.04951. [Google Scholar] [CrossRef]
- Spector, B.; Re, C. Accelerating llm inference with staged speculative decoding. arXiv 2023, arXiv:2308.04623. [Google Scholar] [CrossRef]
- Shi, L.; Zhang, H.; Yao, Y.; Li, Z.; Zhao, H. Keep the cost down: A review on methods to optimize LLM’s KV-cache consumption. arXiv 2024, arXiv:2407.18003. [Google Scholar] [CrossRef]
- Joshi, T.; Saini, H.; Dhillon, N.; i Martin, K.V.; Maghraoui, K.E. Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference. arXiv 2025, arXiv:2506.07311. [Google Scholar] [CrossRef]
- Bernardini, L.; Bono, F.M.; Collina, A. Drive-by damage detection based on the use of CWT and sparse autoencoder applied to steel truss railway bridge. Adv. Mech. Eng. 2025, 17, 16878132251339857. [Google Scholar] [CrossRef]
- Bhardwaj, S.; Singh, P.; Pandit, M.K. A survey on the integration and optimization of large language models in edge computing environments. In Proceedings of the 2024 16th International Conference on Computer and Automation Engineering (ICCAE), Melbourne, Australia, 14–16 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 168–172. [Google Scholar]
- Jin, H.; Wu, Y. Ce-collm: Efficient and adaptive large language models through cloud-edge collaboration. arXiv 2024, arXiv:2411.02829. [Google Scholar] [CrossRef]
- Hao, Z.; Jiang, H.; Jiang, S.; Ren, J.; Cao, T. Hybrid slm and llm for edge-cloud collaborative inference. In Proceedings of the Workshop on Edge and Mobile Foundation Models, Tokyo, Minato-ku, Japan, 3–7 June 2024; pp. 36–41. [Google Scholar]
- Ji, C.; Hou, P.; Yu, J.; Wu, Y.; Tai, Y. Novel Adaptive DNN Partitioning Method Based on Image-Stream Pipeline Inference between the Edge and Cloud. In Proceedings of the 2022 3rd International Conference on Computing, Networks and Internet of Things (CNIOT), Qingdao, China, 20–22 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 75–82. [Google Scholar]
- Al Maruf, M.; Azim, A. Optimizing DNNs Model Partitioning for Enhanced Performance on Edge Devices. In Proceedings of the Canadian AI, Montreal, QC, Canada, 5–9 June 2023. [Google Scholar]
- Moon, S.; Kim, J.H.; Kim, J.; Hong, S.; Cha, J.; Kim, M.; Lim, S.; Choi, G.; Seo, D.; Kim, J.; et al. Lpu: A latency-optimized and highly scalable processor for large language model inference. IEEE Micro 2024, 44, 17–33. [Google Scholar] [CrossRef]
- Wang, B.; Wang, C.; Huang, W.; Song, Y.; Qin, X. A survey and taxonomy on task offloading for edge-cloud computing. IEEE Access 2020, 8, 186080–186101. [Google Scholar] [CrossRef]
- Shen, S.; Zhu, T.; Wu, D.; Wang, W.; Zhou, W. From distributed machine learning to federated learning: In the view of data privacy and security. Concurr. Comput. Pract. Exp. 2022, 34, e6002. [Google Scholar] [CrossRef]
- Wagner, N.; Fan, D.; Jaggi, M. Personalized collaborative fine-tuning for on-device large language models. arXiv 2024, arXiv:2404.09753. [Google Scholar] [CrossRef]
- Srihith, I.D.; Donald, A.D.; Srinivas, T.A.S.; Thippanna, G.; Anjali, D. Empowering Privacy-Preserving Machine Learning: A Comprehensive Survey on Federated Learning. Int. J. Adv. Res. Sci. Commun. Technol. 2023, 3, 133–144. [Google Scholar] [CrossRef]
- Chandrasekaran, S.; Athinarayanan, S.; Masthan, M.; Kakkar, A.; Bhatnagar, P.; Samad, A. Edge Intelligence Paradigm Shift on Optimizing the Edge Intelligence Using Artificial Intelligence State-of-the-Art Models. In Advancing Intelligent Networks Through Distributed Optimization; IGI Global: Hershey, PA, USA, 2024; pp. 1–18. [Google Scholar]
- Röbert, K.; Bornholdt, H.; Fischer, M.; Edinger, J. Latency-aware scheduling for real-time application support in edge computing. In Proceedings of the 6th International Workshop on Edge Systems, Analytics and Networking, Rome, Italy, 8 May 2023; pp. 13–18. [Google Scholar]
- Peng, D.; Fu, Z.; Wang, J. Pocketllm: Enabling on-device fine-tuning for personalized llms. arXiv 2024, arXiv:2407.01031. [Google Scholar] [CrossRef]
- Hayou, S.; Ghosh, N.; Yu, B. LoRA+: Efficient Low Rank Adaptation of Large Models. arXiv 2024, arXiv:2402.12354. [Google Scholar] [CrossRef]
- Rücklé, A.; Geigle, G.; Glockner, M.; Beck, T.; Pfeiffer, J.; Reimers, N.; Gurevych, I. AdapterDrop: On the Efficiency of Adapters in Transformers. arXiv 2020, arXiv:2010.11918. [Google Scholar] [CrossRef]
- Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar] [CrossRef]
- Zaken, E.B.; Ravfogel, S.; Goldberg, Y. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. arXiv 2021, arXiv:2106.10199. [Google Scholar] [CrossRef]
- Hu, K.; Li, Y.; Xia, M.; Wu, J.; Lu, M.; Zhang, S.; Weng, L. Federated Learning: A Distributed Shared Machine Learning Method. Complexity 2021, 2021, 8261663. [Google Scholar] [CrossRef]
- Yu, S.; Muñoz, J.P.; Jannesari, A. Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models. arXiv 2023, arXiv:2305.11414. [Google Scholar] [CrossRef]
- Kaur, A.; Kaushal, C.; Hassan, M.M.; Aung, S.T. Federated Deep Learning for Healthcare: A Practical Guide with Challenges and Opportunities; CRC Press: Boca Raton, FL, USA, 2024. [Google Scholar] [CrossRef]
- Prabhu, O.S.; Gupta, P.K.; Shashank, P.; Chandrasekaran, K.; Usha, D. Towards a Federated Learning Approach for NLP Applications. In Applications of Artificial Intelligence and Machine Learning; Springer: Singapore, 2021; pp. 157–167. [Google Scholar] [CrossRef]
- Dasaradharami Reddy, K.; S, A. Security and privacy in federated learning: A survey. Trends Comput. Sci. Inf. Technol. 2023, 8, 29–37. [Google Scholar] [CrossRef]
- Agarwal, A.; Rezagholizadeh, M.; Parthasarathi, P. Practical Takes on Federated Learning with Pretrained Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EACL, Dubrovnik, Croatia, 2–6 May 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 454–471. [Google Scholar] [CrossRef]
- Almanifi, O.R.A.; Chow, C.O.; Tham, M.L.; Chuah, J.H.; Kanesan, J. Communication and computation efficiency in Federated Learning: A survey. Internet Things 2023, 22, 100742. [Google Scholar] [CrossRef]
- Malan, E.; Peluso, V.; Calimera, A.; Macii, E. Communication-Efficient Federated Learning with Gradual Layer Freezing. IEEE Embed. Syst. Lett. 2023, 15, 25–28. [Google Scholar] [CrossRef]
- Gao, D.; Yao, X.; Yang, Q. A Survey on Heterogeneous Federated Learning. arXiv 2022, arXiv:2210.04505. [Google Scholar] [CrossRef]
- Qin, Z.; Chen, D.; Qian, B.; Ding, B.; Li, Y.; Deng, S. Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes. arXiv 2023, arXiv:2312.06353. [Google Scholar] [CrossRef]
- Zhang, J.; Zhu, H.; Wang, F.; Zhao, J.; Xu, Q.; Li, H. Security and Privacy Threats to Federated Learning: Issues, Methods, and Challenges. Secur. Commun. Netw. 2022, 2022, 1–24. [Google Scholar] [CrossRef]
- Kachris, C. A survey on hardware accelerators for large language models. Appl. Sci. 2025, 15, 586. [Google Scholar] [CrossRef]
- Kimm, H.; Paik, I.; Kimm, H. Performance comparision of tpu, gpu, cpu on google colaboratory over distributed deep learning. In Proceedings of the 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, Singapore, 20–23 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 312–319. [Google Scholar]
- He, W. The promise of training deep neural networks on CPUs: A survey. J. Phys. Conf. Ser. 2023, 2649, 012017. [Google Scholar] [CrossRef]
- Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 14–19 November 2021; pp. 1–15. [Google Scholar]
- Tan, T.; Cao, G. Deep learning on mobile devices with neural processing units. Computer 2023, 56, 48–57. [Google Scholar] [CrossRef]
- Babu, P.; Parthasarathy, E. Reconfigurable FPGA architectures: A survey and applications. J. Inst. Eng. Ser. B 2021, 102, 143–156. [Google Scholar] [CrossRef]
- Google Coral. Coral: Efficient On-Device AI with Edge TPU. 2024. Available online: https://coral.ai/ (accessed on 25 October 2024).
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
- Wan, Z.; Liu, C.K.; Yang, H.; Raj, R.; Li, C.; You, H.; Fu, Y.; Wan, C.; Li, S.; Kim, Y.; et al. Towards efficient neuro-symbolic ai: From workload characterization to hardware architecture. IEEE Trans. Circuits Syst. Artif. Intell. 2024, 1, 53–68. [Google Scholar] [CrossRef]
- Han, M.; Sun, X.; Wang, X.; Zhan, W.; Chen, X. Transformer-based Distributed Task Offloading and Resource Management in Cloud-Edge Computing Networks. IEEE J. Sel. Areas Commun. 2025, 43, 2938–2953. [Google Scholar] [CrossRef]
- Yang, X.; Su, T. Efa-trans: An efficient and flexible acceleration architecture for transformers. Electronics 2022, 11, 3550. [Google Scholar] [CrossRef]
- Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12. [Google Scholar]
- Tanurhan, Y.; Paulin, P.; Michiels, T. Generative AI on a Budget: Processing Transformer-based Neural Networks at the Edge. In Proceedings of the 2023 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 9–13 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4. [Google Scholar]
- Surianarayanan, C.; Lawrence, J.J.; Chelliah, P.R.; Prakash, E.; Hewage, C. A survey on optimization techniques for edge artificial intelligence (AI). Sensors 2023, 23, 1279. [Google Scholar] [CrossRef]
- Orăşan, I.L.; Seiculescu, C.; Caleanu, C.D. Benchmarking tensorflow lite quantization algorithms for deep neural networks. In Proceedings of the 2022 IEEE 16th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, 25–28 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 000221–000226. [Google Scholar]
- Lin, W.F.; Tsai, D.Y.; Tang, L.; Hsieh, C.T.; Chou, C.Y.; Chang, P.H.; Hsu, L. Onnc: A compilation framework connecting onnx to proprietary deep learning accelerators. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 18–20 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 214–218. [Google Scholar]
- Ray, P.P.; Pradhan, M.P. Llmedge: A novel framework for localized llm inferencing at resource constrained edge. In Proceedings of the 2024 International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS), Bengaluru, India, 17–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8. [Google Scholar]
- Bevenius, D.; Gerganov, G.; Devesa, D. GGUF. 2025. Available online: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md (accessed on 25 October 2024).
- Turboderp. ExLlamaV2. 2025. Available online: https://github.com/turboderp-org/exllamav2 (accessed on 25 October 2024).
- Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv 2022, arXiv:2205.02302. [Google Scholar] [CrossRef]
- Xia, Y.; Zhang, J.; Jazdi, N.; Weyrich, M. Incorporating large language models into production systems for enhanced task automation and flexibility. arXiv 2024, arXiv:2407.08550. [Google Scholar] [CrossRef]
- Ngo, D.; Park, H.C.; Kang, B. Edge Intelligence: A Review of Deep Neural Network Inference in Resource-Limited Environments. Electronics 2025, 14, 2495. [Google Scholar] [CrossRef]
- Liu, Y.; Chen, C.; Zhang, R.; Qin, T.; Ji, X.; Lin, H.; Yang, M. Enhancing the interoperability between deep learning frameworks by model conversion. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Sacramento, CA, USA, 8–13 November 2020; pp. 1320–1330. [Google Scholar]
- Bodor, A.; Hnida, M.; Najima, D. From development to deployment: An approach to MLOps monitoring for machine learning model operationalization. In Proceedings of the 2023 14th International Conference on Intelligent Systems: Theories and Applications (SITA), Mohammedia, Morocco, 19–20 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7. [Google Scholar]
- Celik, A.; Mahmoud, Q.H. A Review of Large Language Models for Automated Test Case Generation. Mach. Learn. Knowl. Extr. 2025, 7, 97. [Google Scholar] [CrossRef]
- Mao, Z.; Suzuki, S.; Nabae, H.; Miyagawa, S.; Suzumori, K.; Maeda, S. Machine learning-enhanced soft robotic system inspired by rectal functions to investigate fecal incontinence. Nat. Commun. 2024, 15, 482–494. [Google Scholar] [CrossRef]



| Criteria | Inclusion Rules | Exclusion Rules |
|---|---|---|
| Timeframe | Primarily 2017–2025, focusing on the post-Transformer era. | Research published prior to 2017 (pre-Transformer architecture). |
| Topic Relevance | Must explicitly discuss Transformer-based models or edge hardware constraints. | General AI surveys that do not address LLM-specific bottlenecks. |
| Technical Depth | Papers providing quantitative data, novel architectures, or system-level trade-offs. | Short abstracts, non-peer-reviewed blog posts, or purely marketing materials. |
| Language | Documents published in English. | Non-English publications. |
| Quantization Method | Overview | Impact on Model Size, Accuracy, and Challenges |
|---|---|---|
| Post-Training Quantization (PTQ) | PTQ quantizes a pre-trained model without additional training. It’s simple and fast, converting from FP32 to lower-bit formats (e.g., INT8) [42]. | Model Size Reduction: Reduces model size by up to 75% from FP32 to INT8. Accuracy Trade-offs: Can cause slight degradation, especially in layers with high weight variance. Challenges: Layer sensitivity and calibration requirements. |
| Quantization-Aware Training (QAT) | QAT integrates quantization into the training process, using simulated quantization effects. This allows the model to learn to compensate for reduced precision [43]. | Model Size Reduction: Similar to PTQ, QAT provides significant size reductions. Accuracy Preservation: Often leads to better accuracy than PTQ as the model adapts to quantization during training. Challenges: Can be complex for Transformers and other models sensitive to weight perturbations. |
| Mixed-Precision Quantization | This method uses different precision levels for various parts of the model (e.g., critical layers use FP16 while others use INT8) [44]. | Model Size Reduction: Provides substantial reductions while maintaining accuracy. Accuracy Maintenance: Preserves accuracy in sensitive areas. Challenges: Complex to implement, as it requires careful consideration of which layers to quantize. Performance can be inconsistent. |
| GPTQ (Post-Training Quantization for Generative Pre-trained Transformers) | GPTQ quantizes weights layer-by-layer, using second-order information (an approximate Hessian) to adjust the remaining unquantized weights in a layer and compensate for the error introduced by quantizing the others [45]. | Model Size Reduction: Moving from FP16 (2 bytes per parameter) to 4-bit (0.5 bytes per parameter) reduces the memory required to store weights by 75%. Accuracy Maintenance: The error-compensating update avoids most of the performance drop caused by naive rounding. Challenges: Requires a small calibration dataset and a compute-intensive quantization pass, and low-bit inference depends on kernels that dequantize weights during matrix multiplication. |
| Activation-aware Weight Quantization (AWQ) | AWQ identifies critical weights by observing activation magnitudes during a calibration phase. It then scales these salient weights to protect their precision before quantizing, without needing to keep them in mixed precision [46]. | Model Size Reduction: By converting weights from 16-bit to 4-bit, AWQ reduces the model size by 75%. Accuracy Maintenance: AWQ assumes not all weights are equally important and protects roughly the most critical 1% of them, preserving accuracy without a mixed-precision format. Challenges: AWQ requires specialized CUDA kernels that handle the dequantization (4-bit back to 16-bit) during the matrix multiplication. |
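To ground the PTQ row above, the following minimal sketch (a toy illustration on a synthetic weight matrix, not the procedure of any particular toolchain) applies symmetric per-tensor INT8 post-training quantization and reports the storage saving and round-trip error that motivate the 75% figure.

```python
import numpy as np

def quantize_int8_symmetric(weights: np.ndarray):
    """Symmetric per-tensor PTQ: map FP32 weights to INT8 with a single scale factor."""
    scale = np.max(np.abs(weights)) / 127.0                      # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values for computation or error analysis."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # synthetic weight matrix

q, scale = quantize_int8_symmetric(w)
w_hat = dequantize(q, scale)

print(f"FP32: {w.nbytes / 1e6:.1f} MB, INT8: {q.nbytes / 1e6:.1f} MB (75% smaller)")
print(f"Mean absolute quantization error: {np.mean(np.abs(w - w_hat)):.6f}")
```

Real PTQ pipelines additionally calibrate activation ranges per layer, which is where the layer-sensitivity and calibration challenges noted in the table arise.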
| Pruning Category | Overview | Advantages and Challenges |
|---|---|---|
| Unstructured Pruning | Involves removing individual weights from a neural network, often based on their magnitude. This is a fine-grained approach that retains the overall model architecture [48]. | Advantages: High compression rates and performance maintenance by selectively removing less significant weights. Challenges: The resulting sparse matrix can be inefficient for standard hardware. Implementation is complex and may require fine-tuning. |
| Structured Pruning | Involves removing entire structures (e.g., neurons, channels, or layers) from the model. This results in a more regular and compact architecture [49]. | Advantages: Leads to more efficient inference and is simpler to implement on existing hardware due to a dense representation. Challenges: Can result in a more significant drop in performance compared to unstructured pruning. Offers less flexibility in compression. |
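The contrast between the two pruning categories can be seen in a short sketch; the 50% ratio and the synthetic matrix below are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(512, 512)).astype(np.float32)

# Unstructured: zero out the 50% of individual weights with the smallest magnitude.
# The shape is unchanged, so speedups require sparse-aware kernels or hardware.
threshold = np.quantile(np.abs(w), 0.5)
w_unstructured = np.where(np.abs(w) < threshold, 0.0, w)

# Structured: drop the 50% of output neurons (rows) with the smallest L2 norm,
# yielding a smaller dense matrix that standard hardware exploits directly.
row_norms = np.linalg.norm(w, axis=1)
keep = np.sort(np.argsort(row_norms)[len(row_norms) // 2:])
w_structured = w[keep, :]

print(f"Unstructured: {np.mean(w_unstructured == 0):.0%} sparsity, shape unchanged {w_unstructured.shape}")
print(f"Structured: dense matrix of shape {w_structured.shape}")
```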
| Transformer Variant | Overview | Key Features |
|---|---|---|
| Linformer | Reduces the self-attention complexity from O(n²) to O(n) in sequence length n by projecting token embeddings into a lower-dimensional space [61]. | Linear Complexity: Achieves linear complexity with respect to sequence length. Performance: Maintains competitive performance on various NLP tasks. |
| Performer | Uses a kernelized attention mechanism to approximate attention scores, enabling linear time complexity and significant speed improvements [62]. | Kernelized Attention: Leverages kernel methods. Scalability: Can handle longer sequences efficiently, suitable for large datasets. |
| Mixture of Experts (MoE) | A sparsely activated set of experts increases model capacity without a proportional increase in computational cost. Only a subset of experts is active for each input [63]. | Sparse Activation: A few experts are activated. Improved Performance: Demonstrates significant improvements on various benchmarks, especially in language modeling tasks. |
| Reformer | Reduces memory footprint by using locality-sensitive hashing (LSH) for attention and reversible layers [64]. | LSH Attention: Uses LSH to significantly reduce complexity. Reversible Layers: Allows for memory savings during training as activations are not stored. |
| Longformer | Designed for long document processing, it uses a combination of local and global attention mechanisms [65]. | Local and Global Attention: Uses a sliding window for local attention and incorporates global tokens. Efficiency: Efficient for processing documents much longer than standard Transformers. |
| Grouped Query Attention (GQA) | An optimization technique that serves as a middle ground between Multi-Head Attention (MHA) and Multi-Query Attention (MQA) [66]. | Reduced KV Cache: Significantly shrinks the size of the KV cache, allowing the model to handle much longer contexts on the same hardware. Interpolated Quality: Maintains near-MHA levels of performance while being nearly as fast as MQA. |
| Sliding Window Attention (SWA) | A sparse attention mechanism that limits the attention span of each token to a fixed-size local context rather than the entire preceding sequence [67]. | Linear Scaling: Reduces computational complexity to scale with the fixed window size, making the cost of processing each new token constant regardless of total sequence length. Memory Efficiency: Uses a rolling buffer that keeps memory usage constant per layer, enabling the processing of massive documents on consumer-grade GPUs. |
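To illustrate how GQA trades KV-cache size against quality, the toy sketch below (single batch, random weights, causal masking omitted for brevity; all shapes are synthetic) shares each key/value head among a group of query heads; with 8 query heads and 2 KV heads, the KV cache shrinks by 4x.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
    """Toy GQA: n_q_heads query heads share n_kv_heads key/value heads."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads                       # query heads per shared KV head

    q = (x @ wq).reshape(seq, n_q_heads, d_head)          # (seq, 8, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)         # (seq, 2, d_head): 4x smaller KV cache
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)

    out = np.zeros_like(q)
    for h in range(n_q_heads):
        kv = h // group                                   # the shared KV head for this query head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, seq = 64, 16
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(d_model, d_model))
wk = rng.normal(size=(d_model, d_model // 4))             # KV projection is 4x narrower than Q
wv = rng.normal(size=(d_model, d_model // 4))
print(grouped_query_attention(x, wq, wk, wv).shape)       # (16, 64)
```

Setting n_kv_heads equal to n_q_heads recovers MHA, while a single KV head recovers MQA, which is why GQA is described as an interpolation between the two.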
| Inference Optimization | Overview | Key Features and Challenges |
|---|---|---|
| Speculative Decoding | An inference optimization technique that reduces generation latency by letting a small draft model propose several candidate tokens, which the large target model then verifies in a single parallel pass [70]. | Key Features: Achieves parallel verification and reduced latency, making it ideal for real-time applications. Challenges: Increased computational resource consumption, requiring a balance between speed and efficiency. |
| Efficient Key-Value (KV) Cache Management | Optimizes inference in LLMs, particularly during autoregressive generation tasks, by storing and reusing key-value pairs from previously generated tokens [71]. | Key Features: The cache mechanism allows the model to access relevant information quickly without recomputing. It also improves memory efficiency, crucial for edge devices. Challenges: Requires careful management of the cache size to avoid excessive memory consumption, especially in long sequence tasks. |
| PagedAttention | A specialized memory management algorithm designed to solve the inefficiencies of storing Key-Value (KV) caches in Large Language Model (LLM) inference, particularly on memory-constrained hardware [72]. | Key Features: By maximizing VRAM utilization, edge and server hardware can increase the concurrent batch size, leading to significantly higher token throughput. Challenges: Integrating PagedAttention requires deep modifications to the attention kernels of a model, making it more difficult to implement compared to standard contiguous cache management. |
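The KV-cache mechanism in the table can be reduced to a minimal single-head decoding loop; the projections and token embeddings below are random stand-ins, and production engines layer batching, masking, and eviction (or paging, as in PagedAttention) on top of this pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per generated token

def decode_step(x_t):
    """Attend the newest token against all cached keys/values; past tokens are never recomputed."""
    q = x_t @ wq
    k_cache.append(x_t @ wk)              # cache K/V for the new token only
    v_cache.append(x_t @ wv)
    K, V = np.stack(k_cache), np.stack(v_cache)          # (t, d)
    scores = K @ q / np.sqrt(d)                          # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                         # context vector for the next-token prediction

for _ in range(8):                        # toy autoregressive loop
    ctx = decode_step(rng.normal(size=d)) # stand-in for the newest token's embedding
print(f"Cache holds {len(k_cache)} K/V pairs; per-step cost grows linearly with sequence length.")
```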
| Optimization Strategy | Overview | Key Features and Challenges |
|---|---|---|
| Neural Architecture Search (NAS) | Automates the design of the model architecture itself, such as layer depth and width, rather than manually tuning individual hyperparameters. | Key Features: Can be strictly constrained to search for architectures meeting a specific latency budget (e.g., <50 ms/token). Challenges: The search process is computationally expensive and requires significant initial resources to find the optimal architecture. |
| Hardware-Aware Hyperparameter Optimization (HPO) | Integrates hardware-specific metrics, such as energy per inference and peak VRAM, directly into the objective function. | Key Features: Optimizes for accuracy per watt to ensure device longevity. Challenges: Requires accurate hardware-in-the-loop measurement; as seen in SHM applications [73], poor tuning can lead to failure in detecting critical signals against environmental noise. |
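A hardware-aware objective of the kind described in the HPO row can be sketched as follows; `measure_on_device` is a hypothetical stand-in for profiler or power-meter readings, and its proxy formulas (as well as the 50 ms/token budget and energy weight) are purely illustrative.

```python
LATENCY_BUDGET_MS = 50.0        # per-token budget, matching the NAS example above
ENERGY_WEIGHT = 0.2             # accuracy-vs-energy trade-off, tuned per deployment

def objective(accuracy, latency_ms, energy_j):
    """Hardware-aware score: hard latency constraint, soft energy penalty."""
    if latency_ms > LATENCY_BUDGET_MS:
        return float("-inf")
    return accuracy - ENERGY_WEIGHT * energy_j

def measure_on_device(config):
    """Placeholder for hardware-in-the-loop measurement of a candidate (depth, width)."""
    depth, width = config
    accuracy = 0.70 + 0.02 * depth + 0.0001 * width       # illustrative proxy only
    latency_ms = 2.0 * depth + 0.01 * width
    energy_j = 0.05 * depth + 0.0005 * width
    return accuracy, latency_ms, energy_j

search_space = [(d, w) for d in range(4, 17, 2) for w in (512, 1024, 2048)]
best = max(search_space, key=lambda c: objective(*measure_on_device(c)))
print(f"Best config under {LATENCY_BUDGET_MS} ms/token: depth={best[0]}, width={best[1]}")
```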
| Workload Distribution | Overview | Methods and Key Concepts |
|---|---|---|
| Model Partitioning | Divides an LLM into segments that can be run on different devices (edge/cloud) to leverage the strengths of each [77]. | Vertical Partitioning: Different layers of the model are allocated to either the edge or the cloud [78]. Horizontal Partitioning: The model is divided based on input data or task type, balancing the workload [79]. |
| Task Offloading | A strategy where certain computational tasks are performed on edge devices while others are sent to the cloud [80]. | Static Offloading: Predefined tasks are assigned to the edge or the cloud based on resource requirements. Dynamic Offloading: Tasks are offloaded dynamically based on current conditions (e.g., network bandwidth). |
| Federated Learning | A collaborative machine learning approach where models are trained across multiple edge devices without sharing raw data. Only model updates are shared with the cloud to improve the global model [81]. | Collaborative Model Training: Different devices contribute to the training of a single global model [82]. Personalized Model Updates: Users can adapt a general model with personalized updates on their devices [83]. |
| Edge-Cloud Hybrid Architectures | Combines both edge and cloud devices into a single, seamless workflow for deployment. These architectures dynamically allocate tasks [76]. | Edge-Cloud Synergy: Emphasizes collaboration between edge and cloud systems [84]. Resource-Aware Scheduling: Intelligently distributes workloads based on resource availability and task requirements to reduce latency [85]. |
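As a sketch of dynamic task offloading from the table above, the routing rule below compares estimated completion times on the edge and in the cloud; the cost model (4 bytes per prompt token, fixed token rates) is a deliberately rough assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class Conditions:
    prompt_tokens: int
    bandwidth_mbps: float       # current uplink bandwidth
    edge_tps: float             # tokens/s sustained on the edge device
    cloud_tps: float            # tokens/s of the cloud backend
    rtt_ms: float               # network round-trip time

def choose_target(c: Conditions, gen_tokens: int = 128) -> str:
    """Dynamic offloading: route the request to whichever side is estimated to finish sooner."""
    edge_s = gen_tokens / c.edge_tps
    payload_bits = c.prompt_tokens * 4 * 8                        # rough 4-bytes-per-token estimate
    network_s = c.rtt_ms / 1000.0 + payload_bits / (c.bandwidth_mbps * 1e6)
    cloud_s = network_s + gen_tokens / c.cloud_tps
    return "edge" if edge_s <= cloud_s else "cloud"

# Poor connectivity favors local inference; a weak device with good connectivity favors the cloud.
print(choose_target(Conditions(512, bandwidth_mbps=0.5, edge_tps=60, cloud_tps=40, rtt_ms=200)))   # edge
print(choose_target(Conditions(512, bandwidth_mbps=100.0, edge_tps=5, cloud_tps=120, rtt_ms=20)))  # cloud
```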
| Mitigation Technique | Mechanism | Key Benefits and Challenges |
|---|---|---|
| Differential Privacy (DP) | Adds mathematical noise to intermediate activations before transmission to the cloud. | Benefits: Provides a formal guarantee against input reconstruction. Challenges: Can degrade model accuracy if noise levels are too high. |
| Secure Multi-Party Computation (SMPC) | Computes the partitioned layers jointly across edge and cloud using cryptographic secret shares. | Benefits: Neither party sees the actual latent representations. Challenges: Introduces significant communication and computational latency. |
| Adversarial Training | Trains the edge “stub” to minimize the information available for reconstruction attacks. | Benefits: Reduces the “leakage” of sensitive user attributes in the latent space. Challenges: Requires complex re-training of the base model. |
| Homomorphic Encryption (HE) | Performs cloud-side Transformer layers directly on encrypted activations. | Benefits: Data remains encrypted even during processing in the cloud. Challenges: Extremely high overhead; often too slow for real-time edge responses. |
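The differential-privacy row can be illustrated by clipping and perturbing the activations at the partition point before they leave the device; the clip norm and noise scale below are arbitrary choices, and a real deployment would also track the cumulative privacy budget across queries.

```python
import numpy as np

def clip_and_noise(activations: np.ndarray, clip_norm: float = 1.0, sigma: float = 0.5):
    """Clip each activation vector to a maximum L2 norm, then add Gaussian noise
    before transmitting it to the cloud half of a partitioned model."""
    norms = np.linalg.norm(activations, axis=-1, keepdims=True)
    clipped = activations * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = np.random.default_rng().normal(0.0, sigma * clip_norm, size=clipped.shape)
    return clipped + noise

hidden = np.random.default_rng(0).normal(size=(16, 768))   # activations at the partition point
protected = clip_and_noise(hidden)
print(f"Mean perturbation per element: {np.mean(np.abs(protected - hidden)):.3f}")
```

A larger noise scale strengthens the protection against reconstruction but, as the table notes, degrades the accuracy of the cloud-side layers.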
| Fine-Tuning Method | Overview | Key Features |
|---|---|---|
| Low-Rank Adaptation (LoRA) | A technique that enables fine-tuning by injecting low-rank matrices into the model’s layers, rather than updating all parameters [87]. | Efficiency: Allows for fine-tuning with a small number of trainable parameters. Performance Preservation: Maintains performance while reducing computational cost. |
| PEFT: Adapter Layers | Adapter layers are small, task-specific neural network layers inserted into a pre-trained model. Only the adapter layers are trained during fine-tuning [88]. | Modularity: Adapter layers can be easily added or removed for different tasks. Resource Efficiency: Reduces computational and memory requirements. |
| PEFT: Prompt Tuning | Involves optimizing a soft prompt or a small set of continuous, task-specific vectors that are prepended to the input. The main model weights remain frozen [89]. | No Weight Updates: The original model weights are not updated, preserving the pre-trained knowledge. Task-Specific Guidance: The optimized prompt guides the model’s behavior for a specific task. |
| PEFT: BitFit | A lightweight fine-tuning technique that only updates the bias parameters of the model’s layers while freezing all other weights [90]. | Minimal Parameter Updates: Only updates a small fraction of the total parameters (biases). Effective Adaptation: Despite minimal updates, it is effective for adapting to new tasks. |
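A minimal LoRA forward pass (synthetic dimensions, rank 8; a sketch of the general technique rather than any particular library's implementation) shows why so few parameters need to be trained:

```python
import numpy as np

d_in, d_out, r, alpha = 768, 768, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(0.0, 0.02, size=(d_in, d_out))   # frozen pre-trained weight
A = rng.normal(0.0, 0.01, size=(d_in, r))       # trainable low-rank factor
B = np.zeros((r, d_out))                        # zero-initialized so the update starts as a no-op

def lora_forward(x):
    """Frozen path plus scaled low-rank update; only A and B are trained (~2% of W's size here)."""
    return x @ W + (x @ A @ B) * (alpha / r)

x = rng.normal(size=(4, d_in))
print(lora_forward(x).shape)                    # (4, 768)
print(f"Trainable: {int(A.size + B.size):,} params vs frozen: {int(W.size):,}")
```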
| Hardware Platform | Strengths | Weaknesses |
|---|---|---|
| Central Processing Unit (CPU) | Versatility: CPUs are general-purpose processors suitable for a wide range of tasks [104]. Ease of Programming: The software ecosystem is mature and well-supported. | Limited Parallelism: CPUs have fewer cores, limiting their ability to handle highly parallel tasks. Lower Throughput: Less efficient for large-scale matrix operations common in AI. |
| Graphics Processing Unit (GPU) | High Parallelism: GPUs are designed for massive parallel computations [105]. Optimized Libraries: Many deep learning frameworks are highly optimized for GPUs. | Power Consumption: GPUs consume significant power. Cost: High-performance GPUs can be expensive. |
| Neural Processing Unit (NPU) | Specialized for AI Tasks: NPUs are custom-designed for efficient AI/ML computations [106]. Energy Efficiency: NPUs generally consume less power than GPUs for similar AI tasks. | Limited General-Purpose Use: Less versatile than CPUs or GPUs. Development Complexity: The software ecosystem is less mature. |
| Field-Programmable Gate Array (FPGA) | Customization: FPGAs can be configured for specific tasks, offering high efficiency [107]. Low Latency: Can achieve extremely low latency for real-time applications. | Development Time: Programming FPGAs is complex and time-consuming. Performance Variability: Performance heavily depends on the specific design and implementation. |
| Mobile System-on-Chip (SoC) | Heterogeneous Integration: Combines CPU, GPU, and NPU on a single die with unified memory (e.g., Snapdragon 8 Gen 3, Apple A17 Pro) [12]. Efficiency: Optimized for INT4/INT8 quantization, reaching 15–20 TPS on 7B models. | Thermal Throttling: Compact form factors lead to heat accumulation, causing performance drops during sustained inference. Ecosystem Lock-in: Maximum performance often requires vendor-specific SDKs. |
| Edge GPU Accelerators | High Throughput: Dedicated hardware like NVIDIA Jetson Orin utilizes Tensor Cores to achieve 30–50 TPS on 7B models via AWQ [13]. Software Maturity: Leverages established CUDA and TensorRT ecosystems. | Power Envelope: High TDP (up to 60W) may exceed the capabilities of battery-powered or solar-powered edge nodes. Physical Size: Larger footprint compared to integrated SoCs. |
| Application-Specific Accelerators | Niche Efficiency: Devices like Google Coral (Edge TPU) offer high TOPS/Watt for specific vision tasks [108]. Cost-Effective: Generally lower price point for low-power IoT applications. | Memory Bottleneck: Limited on-chip SRAM and lack of dynamic attention support result in extremely low LLM throughput (<1 TPS). Not suitable for Transformer architectures. |
| Unified Memory Architecture (UMA) | Zero-Copy Transfer: Found in Apple M-series Silicon; allows the NPU and GPU to share the same high-speed RAM pool (up to 400 GB/s) [104]. Context Capacity: Enables running massive models that exceed traditional discrete VRAM limits. | Hardware Cost: High entry cost for professional-grade unified memory configurations. Fixed Hardware: RAM is non-upgradeable, limiting the long-term flexibility of the edge node. |
| Optimization Strategy | Hardware Platform | Throughput (TPS) | Memory Footprint | Energy Efficiency |
|---|---|---|---|---|
| FP16 (Baseline) [13] | NVIDIA Jetson Orin | 12–15 TPS | 14.0 GB | 2.4 J/token |
| INT4 (GPTQ/AWQ) [12] | Snapdragon 8 Gen 3 | 18–22 TPS | 3.8 GB | 0.45 J/token |
| Pruning (50%) [104] | Apple M3 (MLX) | 25–30 TPS | 7.2 GB | 0.8 J/token |
| KD (NanoBERT) [3] | Embedded CPU | 5–8 TPS | <1.0 GB | 1.2 J/token |
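The memory and power figures in the table follow from simple arithmetic for a 7B-parameter model (the assumption behind the rows above); the small gap between the computed 3.5 GB and the reported 3.8 GB is attributable to quantization scales, activations, and runtime overheads.

```python
params = 7e9                                  # 7B-parameter model, as assumed in the table

fp16_gb = params * 2 / 1e9                    # 2 bytes per weight
int4_gb = params * 0.5 / 1e9                  # 0.5 bytes per weight, before runtime overheads
print(f"Weights: {fp16_gb:.1f} GB at FP16 vs {int4_gb:.1f} GB at INT4")   # 14.0 GB vs 3.5 GB

tps, joules_per_token = 20, 0.45              # INT4 row on the mobile SoC
print(f"Sustained power draw: {tps * joules_per_token:.1f} W")            # ~9 W
```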
| Hardware Accelerator | Overview | Key Features |
|---|---|---|
| Neural Processing Unit (NPU) | NPUs are specialized hardware designed to accelerate machine learning and AI tasks. They are often integrated into mobile devices and edge computing platforms [106]. | Energy Efficiency: NPUs are generally more energy-efficient than CPUs and GPUs for AI workloads. Optimized for AI: They are built specifically for the demands of neural networks and are highly efficient at common AI operations. |
| Tensor Processing Unit (TPU) | TPUs, developed by Google, are application-specific integrated circuits (ASICs) optimized specifically for machine learning workloads, particularly in a data center environment [113]. | High Throughput: TPUs are capable of performing a massive number of matrix multiplications per second. Optimized for Tensor Operations: Their architecture is custom-built to handle tensor computations with extreme efficiency. |
| Field-Programmable Gate Array (FPGA) | FPGAs are integrated circuits whose hardware can be reconfigured after manufacturing. They offer a unique balance of flexibility and performance [107]. | Customization: FPGAs can be configured to create custom hardware circuits for specific applications and algorithms, offering great flexibility. Low Latency: The ability to achieve very low latency is a key advantage, making them suitable for real-time applications and specialized tasks. |
| Library or Framework | Overview | Key Features |
|---|---|---|
| TensorFlow Lite (TFLite) | A lightweight version of TensorFlow designed for on-device machine learning inference, specifically for mobile, embedded, and IoT devices [116]. | Model Optimization: Supports various techniques like quantization and pruning to reduce model size and latency. Cross-Platform Compatibility: Works across a wide range of devices and operating systems (e.g., Android, iOS, embedded Linux). |
| ONNX Runtime | An open-source inference engine for the Open Neural Network Exchange (ONNX) format, designed to accelerate machine learning models across different hardware and software [117]. | Interoperability: Allows models trained in various frameworks (e.g., PyTorch, TensorFlow) to be run on a single platform. Performance Optimization: Provides a set of optimizations to achieve high performance on various hardware. |
| llama.cpp | A C++ library designed for efficient inference of large language models (LLMs) on consumer hardware, particularly CPUs [118]. | Memory Efficiency: Highly optimized to run large models with limited memory resources. Speed: Engineered to provide fast inference speeds, even on less powerful hardware, through techniques like quantization. |
| GPT-Generated Unified Format (GGUF) | A single-file binary format for quantized model weights and metadata used by llama.cpp-compatible runtimes, offering high CPU compatibility and memory mapping (mmap) [119]. | Device Suitability: Ideal for devices like Apple Silicon or laptops where VRAM and RAM are shared. Universal Quantization: Supports a wide range of quantization levels (Q2_K, Q4_K_M, etc.), allowing a user to pick a file that fits exactly within their device’s specific RAM constraints. |
| ExLlamaV2 (EXL2) | A library for high-speed GPU inference with variable bit-rate quantization [120]. | GPU-Dedicated: Best for edge GPUs with dedicated VRAM where maximizing tokens per second is the priority. Speed: Utilizes specialized CUDA kernels that minimize the overhead of dequantization during the forward pass, significantly reducing the time to first token compared to general-purpose formats. |
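As a usage illustration of the GGUF/llama.cpp path (a sketch assuming the llama-cpp-python bindings are installed; the model filename is hypothetical, and the context size and thread count would be tuned to the target device):

```python
# pip install llama-cpp-python        (Python bindings around llama.cpp)
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",   # hypothetical local 4-bit GGUF checkpoint
    n_ctx=2048,                                 # context window sized to fit device RAM
    n_threads=4,                                # match the edge CPU's physical cores
)

out = llm("Summarize the benefits of on-device inference in one sentence.", max_tokens=48)
print(out["choices"][0]["text"].strip())
```

An equivalent EXL2 deployment would instead load the quantized weights onto the edge GPU and rely on the dedicated CUDA kernels described in the table.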

