Article

A Unified and Resource-Aware Framework for Adaptive Inference Acceleration on Edge and Embedded Platforms

by Yiyang Wang 1,* and Jing Zhao 2
1 Bell Honors School, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2188; https://doi.org/10.3390/electronics14112188
Submission received: 5 May 2025 / Revised: 23 May 2025 / Accepted: 26 May 2025 / Published: 28 May 2025

Abstract: Efficient and scalable inference is essential for deploying large-scale generative models across diverse hardware platforms, especially in real-time or resource-constrained scenarios. To address this, we propose a novel unified and resource-aware inference optimization framework that uniquely integrates three complementary techniques: sensitivity-aware mixed-precision quantization, heterogeneous sparse attention for reducing attention complexity, and capacity-aware dynamic expert routing for input-adaptive computation. This framework distinctively achieves fine-grained adaptivity by dynamically adjusting computation paths based on token complexity and hardware conditions, offering substantial performance gains and execution flexibility across diverse platforms, including edge devices like Jetson Orin. Implemented using PyTorch 1.13 and ONNX Runtime, our framework demonstrates significant reductions in inference latency and memory usage, alongside substantial throughput improvements in language and image generation tasks, outperforming existing baselines even under constrained GPU environments. Qualitative analyses reveal its fine-grained adaptivity, while robustness tests confirm stable behavior under resource fluctuation and input noise, offering an interpretable optimization approach suitable for heterogeneous deployments. Future work will explore reinforcement-based routing and multimodal inference.

1. Introduction

Generative models have achieved remarkable progress in recent years, enabling breakthroughs in open-domain dialogue systems, image and video synthesis, and code generation. They include large language models (LLMs), spanning foundational works, specific architectures, and explorations into combined modalities [1,2,3], as well as diffusion-based image generators, which have rapidly advanced from core underlying techniques to a diverse range of sophisticated applications [4,5,6]. Such models, exemplified by influential open-source foundation models like LLaMA [7], cutting-edge proprietary systems such as GPT-4 [8], and widely adopted generative frameworks like Stable Diffusion [9], rely on billions of parameters and massive computing resources during both training and inference. They are typically deployed in data centers equipped with high-end GPUs, where advanced optimization techniques for quantization, prompt compression, and efficient serving are extensively researched and applied [10,11,12]. At the same time, there is increasing demand for deploying generative AI in edge environments, a trend supported by crucial innovations in efficient attention mechanisms, specialized quantization methods, and evolving evaluation methodologies [13,14,15]. Such environments include mobile devices benefiting from context adaptation strategies and rapid generation capabilities [16], embedded systems utilizing training-free optimization for acceleration [17], and industrial sensors where memory, compute, and power budgets are severely constrained [18].
Despite their impressive generative capabilities, current large-scale models suffer from critical limitations when used in real-time [19,20] or resource-constrained settings [21,22]. First, their inference latency remains high due to autoregressive decoding and deep attention stacks [23]. Second, the memory footprint required to store weights, activations, and key-value (KV) caches makes it difficult for them to run on edge accelerators [24]. Third, uniform computational scheduling across all tokens or regions leads to inefficiency [25], as different inputs require varying levels of complexity. Addressing these limitations is crucial for broadening the practical applicability of generative AI in real-world electronics and embedded systems [26,27,28,29].
Recent works have explored several isolated directions to tackle these bottlenecks. Mixed-precision quantization techniques [30] aim to compress model weights and activations using low-bit formats while preserving accuracy [31]. Sparse attention and local window-based compression reduce the quadratic complexity of self-attention operations [32,33]. Mixture-of-Experts (MoE) architectures [34,35,36,37,38] attempt to scale capacity while reducing per-token computation by activating only a subset of expert layers. However, most of these techniques have been studied independently, and their combined effect under edge hardware constraints has not been systematically evaluated [39]. Furthermore, dynamic scheduling based on input complexity remains underexplored, especially for diffusion models and multimodal generation tasks.
To bridge this gap, we propose a unified and adaptive inference optimization framework for generative models on edge devices. Our approach jointly integrates (1) layer-wise mixed-precision quantization, (2) heterogeneous sparse attention, and (3) capacity-aware expert routing to address multiple dimensions of inference cost—compute, memory, and adaptivity. We apply our method to both transformer-based LLMs and latent diffusion models and evaluate its performance on representative edge hardware platforms. The key contributions of this work are summarized as follows:
  • A unified inference optimization framework is presented that systematically integrates mixed-precision quantization, heterogeneous sparse attention, and dynamic expert routing. This framework is tailored for edge deployment scenarios and jointly addresses computational, memory, and adaptability challenges. Unlike prior approaches that focus on isolated optimization aspects, this work demonstrates the synergistic benefits of combining these strategies within a coherent architecture.
  • A token-aware adaptive execution strategy is introduced within the unified framework, enabling dynamic adjustment of bit-width, attention sparsity, and expert activation based on token complexity or generation stage. This mechanism supports fine-grained control of inference cost and allows flexible trade-offs between latency and output quality across diverse tasks and hardware profiles.
  • The proposed approach is implemented and validated across both autoregressive and diffusion-based generative models on real-world platforms. Experimental results show that the method achieves up to 2.4× end-to-end speed-up and 1.8× memory reduction on devices such as NVIDIA A100 and NVIDIA Jetson AGX Orin, with generation quality comparable to or better than full-precision baselines.
Therefore, the primary motivation of this work is to overcome the critical efficiency and adaptability bottlenecks that currently hinder the widespread deployment of large generative models on resource-constrained edge platforms. We address this by proposing a novel, unified framework that, unlike prior efforts that have largely focused on isolated optimization techniques, synergistically integrates the aforementioned strategies of mixed-precision quantization, heterogeneous sparse attention, and dynamic expert routing. This allows the framework to make dynamic, fine-grained adjustments to the computational graph based on both token-level input complexity and real-time hardware conditions, a level of dynamic scheduling and combined optimization that remains underexplored, especially for edge deployments. This holistic methodology facilitates flexible trade-offs between inference latency, memory footprint, and output quality. The main contributions of this work, detailed above, collectively demonstrate a practical and effective solution that significantly improves inference speed and memory efficiency while maintaining robust model performance across diverse generative tasks and heterogeneous edge hardware.
The experimental results on diverse benchmark datasets across both language and vision tasks demonstrate the effectiveness and generalizability of the proposed inference framework. The method consistently achieves substantial improvements in latency, memory usage, and throughput, while maintaining or exceeding the output quality of state-of-the-art baselines. These findings highlight the framework’s potential to support the efficient, scalable deployment of generative models across heterogeneous hardware environments, including embedded and edge platforms. This work not only advances the design of unified inference systems but also opens new perspectives for applications in fields such as autonomous robotics, intelligent terminals, and real-time content generation under constrained resources.

2. Related Work

2.1. Model Compression and Static Optimizations

Recent advances in efficient inference for generative models have focused on reducing latency [40,41,42] and computational cost [43], particularly in large language models (LLMs) and diffusion architectures [44,45]. Zhou et al. [46] categorize mainstream acceleration strategies and highlight the effectiveness of model-level optimizations such as quantization and expert sparsity, yet stop short of proposing integrated solutions. Yin et al. [47] and Ma et al. [48] improve diffusion efficiency via one-step distillation and feature caching, respectively, but both approaches are model-specific and lack generality across tasks and hardware. Xia et al. [31] review speculative decoding techniques that enable token-level parallelism in LLMs, though their scope remains limited to autoregressive models. Yuan and Qiao [49] introduce Diffusion-TS, enhancing interpretability in time series generation, but their focus lies in data decomposition rather than inference optimization. While these studies offer valuable building blocks, few efforts have attempted to unify multiple acceleration strategies—such as quantization, sparse attention, and dynamic expert routing—into a coherent and hardware-aware framework. This work addresses that gap by proposing an adaptive, multi-strategy inference optimization framework that generalizes across generative model types and is tailored for edge deployment constraints.

2.2. Efficient Architectures and Dynamic Execution Approaches

In addition to model-side optimizations, recent works have explored structural, routing, and memory efficiency strategies for accelerating generative model inference. Shi et al. [50] propose DiffMoE, a scalable diffusion transformer framework with dynamic expert routing based on noise level and input complexity. By leveraging a global token pool and a learned capacity predictor, DiffMoE significantly improves computation allocation in diffusion processes. Complementary to routing efficiency, Zhang et al. [51] introduce SageAttention2, which combines per-thread INT4 quantization and outlier smoothing to accelerate attention while preserving output quality, achieving up to 3× speedup over FlashAttention2. From a memory perspective, Prabhu et al. [52] design vAttention, a virtual memory-aware KV cache management system that avoids the fragmentation problems of PagedAttention, improving LLM throughput without modifying attention kernels. Liu et al. [53] tackle fine-grained preference optimization in LLM alignment with TIS-DPO, which introduces token-level importance sampling into DPO, revealing the value of reward-aware adaptive learning in language generation. Sun et al. [54] propose ReDeEP, a hallucination detection method in RAG that disentangles the effects of parametric and external knowledge using mechanistic interpretability, enabling more accurate detection under confounded reasoning paths. Finally, Yao et al. [40] address the reconstruction–generation trade-off in latent diffusion models by aligning tokenizers with pretrained vision foundation models (VA-VAE), enabling faster convergence and improved image fidelity in high-dimensional latent spaces.

2.3. System-Level Optimizations for Edge AI

Furthermore, complementing efforts in model-level and architectural optimizations, significant research has addressed system-level approaches to enhance AI performance at the edge, particularly for minimizing latency. A notable direction involves intelligent computational offloading in various edge computing paradigms. This includes strategies for optimizing overall task completion time, for instance, in wireless-powered mobile edge–cloud computing networks [55]. In these complex, often resource-constrained systems, sophisticated decision-making algorithms are crucial. Among these, Deep Reinforcement Learning (DRL) has particularly gained traction for dynamic task offloading, with studies demonstrating its effectiveness in minimizing computation delay, especially within wireless-powered multi-access edge computing scenarios [56]. These system-level orchestration techniques, while distinct from the on-device unified acceleration focus of our current work, are vital for achieving comprehensive end-to-end efficiency and low latency in broader edge AI services [57,58].
Despite meaningful progress in accelerating the individual components of generative models, these existing approaches are largely modular and task-specific. Most solutions target either language models or diffusion models in isolation, lack generalizability across modalities, and rarely consider how compute, memory, and adaptability constraints interact under real-world deployment conditions. Furthermore, few of these methods are designed to be jointly integrated, resulting in duplicated effort and suboptimal coordination between acceleration strategies. The absence of a unified, resource-aware framework capable of combining quantization, sparsity, and dynamic computation routing represents a systematic gap in the current literature. This work directly addresses that gap by proposing a novel and modular inference framework that harmonizes these complementary techniques into a single architecture explicitly optimized for edge-level generative tasks. Unlike prior studies, our approach is adaptable across model types and hardware profiles and demonstrates that coordinated optimization across multiple inference bottlenecks yields superior performance-efficiency trade-offs. In doing so, this study not only extends existing acceleration paradigms but also opens a new direction for scalable, adaptive, and deployable generative AI.

3. Materials and Methods

3.1. System Architecture

To address the efficiency bottlenecks of large-scale generative models in edge deployment scenarios, this work proposes a unified inference optimization framework that integrates three complementary techniques: mixed-precision quantization, adaptive sparse attention, and dynamic expert routing. Rather than treating these strategies in isolation, the framework is designed to support a modular yet coordinated execution flow, allowing for flexible adaptation across different model types (autoregressive LLMs and diffusion-based image generators) and heterogeneous hardware platforms (NVIDIA Jetson Orin and A100).
Figure 1a shows the high-level framework design: The framework begins by assessing token complexity to guide the selection of three core optimization techniques: mixed-precision quantization, heterogeneous sparse attention, and dynamic expert routing. These modules are designed to adapt to diverse hardware conditions—mixed-precision targeting edge deployment and sparse attention and routing unified under a shared control scheme. Together, they enable the construction of an adaptive inference model capable of delivering robust performance across varied deployment environments.
Figure 1b shows the technical breakdown of optimization modules: This subfigure details the internal mechanisms of each optimization component. Mixed-precision quantization assigns different bit-widths per layer based on sensitivity analysis, balancing accuracy and efficiency. Sparse attention reduces unnecessary computation via custom attention masks. Dynamic routing uses lightweight capacity-aware checks to assign each input token to a suitable expert, reducing redundant computation while preserving task relevance.
Figure 1c shows the fine-grained expert routing strategy: The rightmost panel presents the structure of our Dynamic Expert Routing module. Tokens are dynamically routed through a DiffMoE layer, with routing scores computed based on token complexity and system constraints. A capacity predictor ensures that expert activations remain within resource budgets while still providing task-aligned expressivity. This mechanism enables fine-grained, token-level adaptivity, which is essential for efficient and stable inference under fluctuating runtime conditions.
The overall architecture consists of a three-stage inference pipeline:
  • Precision Optimization Stage: Model weights, activations, and KV caches are quantized in a layer-wise manner, guided by sensitivity profiling to preserve generation quality while reducing memory and computation cost.
  • Computation Reduction Stage: Sparse attention modules with heterogeneous masking strategies are applied to reduce attention overhead. This step dynamically adjusts the attention pattern per head or layer, depending on task type and sequence length.
  • Computation Allocation Stage: A capacity-aware expert routing mechanism is introduced to adaptively activate experts based on token-level or timestep-level complexity, enabling input-aware computation balancing.
The framework supports both offline configuration (static quantization planning and routing table initialization) and online adaptive execution (runtime sparsity and expert selection). These three modules are loosely coupled but tightly aligned, meaning that they can be independently optimized yet work synergistically during inference. For instance, tokens routed to lighter expert paths may also be served by lower-precision configurations and sparser attention, while more complex tokens receive higher-compute treatment.
In contrast to prior works that focus on one-dimensional optimization (e.g., only quantization or routing), this framework offers a holistic solution that addresses latency, memory, and adaptability simultaneously. Furthermore, it enables smooth trade-offs between generation quality and resource consumption, which are critical for real-world deployment in resource-constrained environments.

3.2. Mixed-Precision Quantization with Layer-Wise Sensitivity

Mixed-precision quantization is adopted in this framework as a fundamental strategy to reduce both the memory footprint and computational complexity during inference. Unlike uniform quantization schemes that apply a fixed bit-width across the entire model, this work leverages a layer-wise sensitivity-aware quantization method, which allocates bit-widths adaptively based on each layer’s impact on overall model fidelity.

3.2.1. Layer-Wise Sensitivity Profiling

Prior to deployment, a sensitivity analysis is conducted to determine the numerical robustness of each layer. Two profiling modes are supported:
  • Static profiling, where calibration is performed on a representative validation set using post-training quantization (PTQ).
  • Dynamic profiling, where layer sensitivity is inferred online during training or fine-tuning by monitoring quantization-induced degradation in intermediate outputs or end-task accuracy.
Layers are then grouped into precision tiers:
  • Non-critical layers (e.g., intermediate feed-forward blocks) can be quantized to 2.8-bit or 3.6-bit using logarithmic or non-uniform quantization grids.
  • High-impact layers (e.g., embedding layers, layer norms, and output heads) retain 8-bit or mixed FP16 precision.

3.2.2. Weight and Activation Quantization

Let $x \in \mathbb{R}^{n}$ denote the original full-precision tensor, and $\tilde{x}$ its quantized version. Quantization follows the affine transformation

$$\tilde{x} = \mathrm{clip}\!\left(\left\lfloor \frac{x - \mu}{s} \right\rceil,\; q_{\min},\; q_{\max}\right) \cdot s + \mu$$

where $s$ is the scale factor, $\mu$ is the zero-point offset, and $[q_{\min}, q_{\max}]$ is the quantized integer range determined by the bit-width $b$. For sub-4-bit representations (e.g., 2.8-bit), non-linear quantization schemes such as log-domain quantization or per-channel asymmetric quantization are used to preserve distribution fidelity. The details are shown in Algorithm A1 in Appendix A.
To determine an appropriate bit-width b i for each layer i , we first perform sensitivity profiling via either static calibration or dynamic loss monitoring. Each layer is assigned a sensitivity score S i 0 , 1 , which reflects the relative performance degradation incurred when quantizing only that layer.
Based on the empirically chosen thresholds $\theta_1 = 0.05$ and $\theta_2 = 0.01$, bit-widths are assigned as follows: high-sensitivity layers ($S_i > \theta_1$) are quantized to 8-bit, medium-sensitivity layers ($\theta_2 < S_i \le \theta_1$) to 3.6-bit or 4-bit, and low-sensitivity layers ($S_i \le \theta_2$) to 2.8-bit, i.e.,

$$b_i = \begin{cases} 8, & \text{if } S_i > \theta_1 \text{ (high-sensitivity layer)} \\ 3.6 \text{ or } 4, & \text{if } \theta_2 < S_i \le \theta_1 \text{ (medium sensitivity)} \\ 2.8, & \text{if } S_i \le \theta_2 \text{ (low-sensitivity layer)} \end{cases}$$

This tiered assignment allows the quantization budget to be concentrated on layers that have a greater impact on final output quality. For dynamic profiling, we optionally define the sensitivity score as the relative loss deviation between the full-precision model $M$ and the model $M_i$ with only the $i$-th layer quantized, computed as follows:

$$S_i = \frac{\left|\mathcal{L}\!\left(M_i(x)\right) - \mathcal{L}\!\left(M(x)\right)\right|}{\mathcal{L}\!\left(M(x)\right)}$$
where L · denotes the loss function. This metric enables a fine-grained evaluation of each layer’s contribution to overall performance and facilitates automated bit-width tuning. The resulting mixed-precision configuration is then applied to all relevant components and exported using ONNX or TensorRT format for deployment.
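To make the sensitivity-driven assignment concrete, the sketch below simulates an affine quantize-dequantize step at a fractional bit-width and applies the tiered rule above. Fractional bit-widths are approximated here simply by the number of quantization levels (e.g., 2.8-bit corresponds to roughly 7 levels); the paper's log-domain and per-channel grids are not reproduced, so this is an illustrative approximation rather than the authors' implementation.

```python
import torch

def affine_quantize(x: torch.Tensor, bits: float) -> torch.Tensor:
    """Simulated affine quantize-dequantize (Section 3.2.2).

    Fractional bit-widths (e.g. 2.8) are approximated by the number of
    quantization levels; log-domain / per-channel grids are not reproduced.
    """
    n_levels = int(round(2 ** bits))            # e.g. 2.8-bit -> ~7 levels
    q_min, q_max = 0, n_levels - 1
    mu = x.min()                                # zero-point offset
    s = (x.max() - mu).clamp(min=1e-8) / (q_max - q_min)   # scale factor
    q = torch.clamp(torch.round((x - mu) / s), q_min, q_max)
    return q * s + mu                           # dequantized low-bit tensor

def sensitivity_score(loss_full: float, loss_layer_quant: float) -> float:
    """S_i = |L(M_i(x)) - L(M(x))| / L(M(x))."""
    return abs(loss_layer_quant - loss_full) / loss_full

def assign_bit_width(s_i: float, theta1: float = 0.05, theta2: float = 0.01) -> float:
    """Tiered assignment: 8-bit, 3.6-bit (or 4-bit), or 2.8-bit."""
    if s_i > theta1:
        return 8.0       # high-sensitivity layer
    if s_i > theta2:
        return 3.6       # medium-sensitivity layer
    return 2.8           # low-sensitivity layer
```

In a profiling loop, `sensitivity_score` would be evaluated once per layer (quantizing only that layer) and its output fed to `assign_bit_width` to build the per-layer precision map.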
In addition to weights and activations, key-value (KV) cache quantization is applied to reduce runtime memory consumption in autoregressive generation. Cache tensors are quantized using low-bit grouped quantization, with minimal degradation in decoding accuracy.

3.2.3. Deployment and Compatibility

The quantized models are exported in ONNX format with embedded quantization parameters. The framework is compatible with TensorRT and ONNX Runtime for efficient deployment on both edge- (Jetson Orin) and server-class GPUs (A100). Bit-level tuning is exposed as a configurable parameter to support user-defined trade-offs between latency and generation quality. Compared to conventional full-precision or uniform 8-bit inference, the proposed layer-wise quantization achieves substantial reductions in both memory footprint and arithmetic operations. Moreover, when integrated with sparse attention and expert routing, the bit-width configuration can be tuned jointly with computational resource allocation strategies, enabling precision–compute co-optimization during runtime.
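A minimal deployment sketch is shown below, assuming a PyTorch module named `model` that consumes token ids. It exports the model to ONNX and creates an ONNX Runtime session that prefers the TensorRT execution provider when available (e.g., on Jetson Orin), falling back to CUDA or CPU. The exact opset, provider options, and the embedding of sub-byte quantization parameters are deployment-specific and omitted here.

```python
import torch
import onnxruntime as ort

# Export the (quantization-simulated) PyTorch model to ONNX.
dummy_input = torch.randint(0, 32000, (1, 128))        # placeholder token ids
torch.onnx.export(model, (dummy_input,), "model_quant.onnx", opset_version=17,
                  input_names=["input_ids"], output_names=["logits"])

# Run with ONNX Runtime; TensorRT is preferred, CUDA and CPU are fallbacks.
session = ort.InferenceSession(
    "model_quant.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
outputs = session.run(None, {"input_ids": dummy_input.numpy()})
```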

3.3. Sparse Attention with Adaptive Heterogeneous Masking

Self-attention mechanisms in transformer-based models exhibit quadratic computational complexity, which becomes a significant bottleneck during inference, especially for long-sequence generation tasks. To address this challenge, this work introduces an adaptive heterogeneous sparse attention mechanism, inspired by recent developments such as the Mixture of Attention (MoA). Unlike uniform sparse attention schemes, the proposed method dynamically adjusts attention patterns across different attention heads and layers, effectively reducing computational cost without significantly affecting generation quality.

3.3.1. Adaptive Sparse Mask Generation

In our framework, each attention head can independently adopt a sparse attention pattern tailored to its specific computational and representational characteristics. Specifically, attention heads are classified into multiple sparsity tiers based on their contribution to model outputs:
  • Local attention heads, which apply sparse attention within a predefined sliding window (typically $w = 8$ or $w = 16$ depending on sequence length), significantly reducing complexity:

    $$M^{\text{local}}_{ij} = \begin{cases} 0, & \text{if } |i - j| \le w \\ -\infty, & \text{otherwise} \end{cases}$$

  • Sparse global heads (top-$k$ routing), which selectively attend to global context via learned sparse patterns, ensuring long-range dependency modeling. In the following formula, $\mathrm{TopK}_j(\cdot)$ selects the top-$k$ attention scores for each query $q_i$; in our experiments, we set $k = 16$ as the default value:

    $$\mathcal{L}_i = \mathrm{TopK}_j\!\left(\frac{q_i k_j^{T}}{\sqrt{d_k}}\right), \qquad M^{\text{global}}_{ij} = \begin{cases} 0, & \text{if } j \in \mathcal{L}_i \\ -\infty, & \text{otherwise} \end{cases}$$

  • Dense heads, which fully attend to the entire input sequence and are reserved for critical layers or final generation stages requiring high precision and a full information context:

    $$M^{\text{dense}}_{ij} = 0, \quad \forall\, i, j$$
Sparse attention masks $M$ are dynamically generated based on task-specific and sequence-specific information, following the general form

$$A(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}} + M\right) V$$

where $Q$, $K$, $V$ denote the query, key, and value matrices, and $d_k$ is the key dimension. The mask matrix $M$ is adaptively learned or constructed, with entry $M_{ij} = 0$ if position $j$ is attended by query $i$ and $M_{ij} = -\infty$ otherwise. The process is shown in Algorithm 1.

3.3.2. Masking Strategy and Learning Mechanism

Two mechanisms are explored for determining sparse attention masks:
  • Rule-based construction: Masks are defined using deterministic heuristics, such as fixed sliding windows or block-wise patterns, providing predictable latency and efficiency gains.
  • Learnable gating mechanism: Attention sparsity is learned dynamically during fine-tuning via a small auxiliary network that predicts mask patterns, allowing the model to adapt attention sparsity to input content dynamically.
In practice, a hybrid strategy is adopted. During offline deployment preparation, rule-based initialization provides a strong baseline, and online adaptation via gating mechanisms refines sparsity patterns based on input data distributions observed during inference.
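As an illustration of the learnable gating option, the sketch below shows one possible auxiliary head that maps a pooled input summary to a sparsity tier per attention head. The paper does not specify the gating architecture, so the layer sizes and tier encoding here are assumptions.

```python
import torch
import torch.nn as nn

class SparsityGate(nn.Module):
    """Tiny auxiliary head that scores each attention head into a sparsity tier
    (0 = local window, 1 = sparse-global top-k, 2 = dense)."""
    def __init__(self, d_model: int, n_heads: int, n_tiers: int = 3):
        super().__init__()
        self.proj = nn.Linear(d_model, n_heads * n_tiers)
        self.n_heads, self.n_tiers = n_heads, n_tiers

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> pooled summary of the current input
        summary = hidden.mean(dim=1)                          # (batch, d_model)
        logits = self.proj(summary).view(-1, self.n_heads, self.n_tiers)
        return logits.argmax(dim=-1)                          # tier index per head
```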

3.3.3. Implementation and Compatibility

Sparse attention mechanisms are implemented to be compatible with existing transformer backbones and inference frameworks such as PyTorch, ONNX Runtime, and TensorRT. Attention sparsity patterns are stored efficiently as binary or low-bit masks, ensuring minimal memory overhead during inference. Additionally, the sparse attention module can be seamlessly integrated with the quantization scheme (as discussed in Section 3.2), further compounding inference efficiency.
The proposed adaptive heterogeneous sparse attention method addresses computational inefficiencies in transformer-based generative models. It introduces significant inference speedups by reducing the quadratic complexity of self-attention, while maintaining adaptability to different generative tasks and data distributions. When combined with mixed-precision quantization and dynamic expert routing, sparse attention not only optimizes computational efficiency but also enhances the overall adaptability and scalability of the unified inference optimization framework.
Algorithm 1 The pseudo-code of the Adaptive Sparse Attention Algorithm.
Initialization:
Set attention type T ∈ {Local, Sparse_Global, Dense}; set window_size and top_k.
Upon Receiving Inputs Q, K, V:
1. Calculate raw attention scores: scores = QK^T / √(d_k)
2. Generate mask based on T:
  • If T = Local: construct the local sliding-window mask M_local(i, j) with width window_size.
  • Else if T = Sparse_Global: select the top_k keys for each query to form the sparse mask M_sparse(i, j).
  • Else if T = Dense: set the mask M_dense(i, j) = 0 for all i, j.
3. Apply mask to scores: scores = scores + M_type
4. Compute attention probabilities: attn = softmax(scores)
5. Compute output: output = attn × V
Output:
Return output
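A compact PyTorch rendering of Algorithm 1 is given below. The tensor shapes, batched top-k masking, and default window/top-k values follow the description above, but kernel-level optimizations (e.g., the modified FlashAttention2 mentioned in Section 4.2) are not reflected; this is a reference sketch rather than the deployed kernel.

```python
import math
import torch

def adaptive_sparse_attention(Q, K, V, mode="local", window_size=16, top_k=16):
    """Mask-based local / sparse-global / dense attention (Algorithm 1 sketch).
    Q, K, V: (batch, heads, seq, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)          # raw scores (b, h, n, n)
    n = scores.size(-1)
    mask = torch.zeros_like(scores)

    if mode == "local":
        idx = torch.arange(n, device=scores.device)
        far = (idx[None, :] - idx[:, None]).abs() > window_size   # |i - j| > w
        mask = mask.masked_fill(far, float("-inf"))
    elif mode == "sparse_global":
        topk_idx = scores.topk(min(top_k, n), dim=-1).indices      # keep top-k keys per query
        keep = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk_idx, True)
        mask = mask.masked_fill(~keep, float("-inf"))
    # mode == "dense": mask stays all zeros

    attn = torch.softmax(scores + mask, dim=-1)                 # attention probabilities
    return attn @ V                                             # weighted values
```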

3.4. Dynamic Expert Routing with Capacity-Aware Scheduling

Transformer-based generative models employing Mixture-of-Experts (MoE) architectures have significantly advanced model scalability and performance by dynamically allocating computation. However, traditional MoE methods typically use static gating mechanisms, such as uniform top-k selection, which fail to account for varying input complexities and dynamic inference conditions. To address this limitation, this paper proposes an innovative dynamic expert routing approach equipped with a novel capacity-aware scheduling mechanism. Distinct from conventional methods, our approach enables fine-grained, adaptive computation by dynamically adjusting the allocation of computational resources based on token-level complexity, global expert capacity constraints, and runtime inference context.

3.4.1. Design Principles and Motivation

The fundamental innovation of the proposed expert routing mechanism lies in its explicit incorporation of token- and timestep-level complexity, along with a global capacity control system. Existing methods, such as the standard top-k gating used by methods like DiffMoE, typically ignore token-specific computational requirements, leading to inefficient utilization of computational resources—over-provisioning for simple tokens and under-provisioning for complex ones. Our design overcomes these challenges through the following distinctive principles:
  • Complexity-Aware Expert Assignment: Tokens are routed dynamically based on their learned complexity profiles, rather than static gating probabilities alone. This complexity estimation incorporates contextual embeddings, enabling intelligent selection of experts aligned with actual token complexity.
  • Global Capacity Scheduling: We introduce a global scheduling constraint, termed the capacity predictor, that dynamically monitors and balances the total computational load across experts. Unlike traditional gating, which operates purely at the individual token level, our global capacity predictor ensures balanced expert utilization and prevents computational hotspots or overloads.
  • Adaptive Runtime Optimization: The routing strategy dynamically interacts with the mixed-precision quantization and adaptive sparse-attention strategies outlined earlier. This integrative design allows for real-time joint optimization of expert selection, precision configuration, and attention sparsity, significantly enhancing inference flexibility and efficiency.

3.4.2. Routing Module and Capacity Predictor

Formally, each input token embedding $x_i$ is processed by a lightweight gating network, defined as

$$p_i = \mathrm{Softmax}\!\left(W_r x_i + b_r\right)$$

where $W_r \in \mathbb{R}^{E \times d}$ and $b_r \in \mathbb{R}^{E}$ are learnable parameters, and $E$ denotes the total number of experts. Subsequently, the top-$k$ experts for each token are determined by

$$\varepsilon_i = \mathrm{TopK}(p_i, k)$$

The global capacity predictor calculates the expected load per expert as

$$C_e = \sum_{i} \mathbb{1}\!\left[e \in \varepsilon_i\right], \qquad e \in \{1, \dots, E\}$$
In our implementation, the total number of experts E is set to 8, and each input token dynamically selects the top-2 experts ( k = 2) during inference to enable fine-grained computation allocation. The input embedding dimension d is configured as 256 to strike a balance between expressiveness and computational efficiency in the gating network. The expert capacity threshold C max is defined as 15% of the total number of tokens, serving as the upper limit for load balancing; when exceeded, tokens with lower routing logits are reassigned to less loaded experts to maintain stable runtime behavior. Additionally, a temperature parameter τ = 0.5 is applied in the softmax gating function to enhance differentiation among expert scores. All hyperparameters are empirically tuned to ensure robust performance across both edge devices and server-class hardware platforms.
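The sketch below illustrates the gating network, top-k selection, and a greedy capacity-aware rebalancing step using the hyperparameters stated above (E = 8, k = 2, τ = 0.5, 15% capacity). The eviction order and the choice of fallback expert are illustrative assumptions; routing scores of reassigned tokens are not re-normalized in this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CapacityAwareRouter(nn.Module):
    """Gating network with greedy capacity-aware rebalancing (Section 3.4.2 sketch)."""
    def __init__(self, d_model=256, num_experts=8, top_k=2, tau=0.5, capacity_frac=0.15):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)            # W_r, b_r
        self.num_experts, self.top_k = num_experts, top_k
        self.tau, self.capacity_frac = tau, capacity_frac

    def forward(self, x):                                      # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x) / self.tau, dim=-1)     # p_i with temperature tau
        scores, experts = probs.topk(self.top_k, dim=-1)       # top-k experts per token

        # Global capacity check: C_e = number of tokens currently routed to expert e.
        c_max = max(1, int(self.capacity_frac * x.size(0)))
        load = torch.bincount(experts.flatten(), minlength=self.num_experts)

        # Greedily move the lowest-scoring assignments of overloaded experts
        # to their next-best, under-loaded expert.
        for e in (load > c_max).nonzero(as_tuple=True)[0].tolist():
            tok, slot = (experts == e).nonzero(as_tuple=True)
            order = scores[tok, slot].argsort()                # evict lowest logits first
            for t, s in zip(tok[order], slot[order]):
                if load[e] <= c_max:
                    break
                for cand in probs[t].argsort(descending=True).tolist():
                    if cand != e and load[cand] < c_max and cand not in experts[t].tolist():
                        experts[t, s] = cand
                        load[e] -= 1
                        load[cand] += 1
                        break
        return experts, scores
```

For example, `CapacityAwareRouter()(torch.randn(64, 256))` returns, for each of 64 tokens, the indices and gating scores of its two selected experts after rebalancing.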
In summary, the proposed dynamic expert routing mechanism introduces a capacity-aware rebalancing strategy that addresses key limitations of static gating methods used in prior MoE architectures such as GShard or Switch Transformer. By incorporating a global capacity predictor and dynamic reassignment mechanism, our approach ensures stable and balanced expert utilization, which is particularly critical for edge and embedded deployment scenarios with strict inference constraints. The routing system detects expert overload conditions—when the number of tokens assigned to a particular expert exceeds a threshold C max —and adaptively reroutes lower-priority tokens based on learned routing logits, thereby preserving runtime stability. Furthermore, this capacity-aware design is inherently compatible with our token-level sparse attention- and precision-aware execution strategies, forming a unified and extensible framework. It also lays the foundation for future enhancements, such as time-aware routing, task-specific adaptation, and multimodal expert scheduling.

3.4.3. Implementation and Technical Novelty

The proposed dynamic expert routing module is implemented as a lightweight and backend-compatible component, supporting both TensorRT and ONNX Runtime. Unlike conventional static MoE routing, which relies on fixed top-k selection, our method introduces a learnable token complexity estimator that uses token embeddings and intermediate attention statistics to guide routing decisions. This enables input-adaptive expert activation, aligned with the real-time computational demands of each token.
To further improve runtime efficiency and avoid expert overload, a global capacity constraint is incorporated, which dynamically monitors expert utilization and redistributes tokens when necessary. This mechanism ensures stable throughput and prevents resource contention—issues commonly seen in standard MoE systems. What sets our design apart is its seamless integration with mixed-precision quantization and sparse attention modules. Routing-aware decisions are jointly optimized with bit-width selection and attention sparsity, forming a coherent inference pipeline. This results in significantly improved computation efficiency, reduced latency, and enhanced adaptability across diverse tasks and hardware environments.

3.5. Integrated Framework and Execution Strategy

To address the multifaceted demands of efficient inference in real-world generative model deployment, we present a unified optimization framework that integrates three core techniques—layer-wise mixed-precision quantization (Section 3.2), adaptive sparse attention (Section 3.3), and dynamic expert routing with capacity-aware scheduling (Section 3.4). Unlike prior methods that apply these strategies independently, our framework coordinates them into a cohesive, runtime-adaptive inference pipeline designed to optimize performance across diverse hardware platforms and input conditions.

3.5.1. Coordinated Three-Stage Architecture

The integrated pipeline operates in three interdependent stages:
  • Precision Assignment: Prior to deployment, sensitivity profiling assigns optimal bit-widths to individual layers, yielding a quantized model with minimal quality degradation. These configurations are exported using formats such as ONNX and TensorRT for cross-platform deployment.
  • Sparse Computation: During inference, attention modules select heterogeneous sparse masking patterns based on token context and model precision settings, enabling dynamic trade-offs between computation, memory, and accuracy.
  • Dynamic Routing: Input tokens are routed through expert modules based on real-time complexity scores and global capacity constraints. Lightweight routing heads ensure balanced expert utilization without introducing significant overheads.
These stages are coordinated through a shared complexity-aware controller, allowing decisions made at one stage (e.g., expert assignment) to influence others (e.g., sparsity or quantization level), resulting in token-level execution paths that are jointly optimized across multiple dimensions.
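The following sketch shows one way the shared complexity-aware controller could map a scalar token-complexity score and a device budget to a joint execution plan. The thresholds, the scalar complexity score, and the plan fields are illustrative assumptions rather than the framework's exact control logic.

```python
from dataclasses import dataclass

@dataclass
class ExecutionPlan:
    bits: float          # precision hint for the token's layers
    attn_mode: str       # "local", "sparse_global", or "dense"
    num_experts: int     # experts to activate for this token

def plan_for_token(complexity: float, device_budget: str = "edge") -> ExecutionPlan:
    """Shared complexity-aware controller (Section 3.5.1) as a simple rule table."""
    if complexity < 0.3:
        plan = ExecutionPlan(bits=2.8, attn_mode="local", num_experts=1)
    elif complexity < 0.7:
        plan = ExecutionPlan(bits=3.6, attn_mode="sparse_global", num_experts=2)
    else:
        plan = ExecutionPlan(bits=8.0, attn_mode="dense", num_experts=2)

    # On server-class GPUs, relax the plan toward denser execution paths.
    if device_budget == "server" and plan.attn_mode == "local":
        plan.attn_mode = "sparse_global"
    return plan
```

In the full system the plan would parameterize the quantized kernels, the sparse-attention mask mode, and the router's top-k, rather than being consumed directly.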

3.5.2. Adaptive Inference Flow

At runtime, the system loads pre-profiled quantization, sparsity, and routing configurations. Incoming token embeddings are analyzed to estimate complexity, informing precision adjustment, mask selection, and expert assignment dynamically. Tokens with low complexity are processed using low-precision weights and sparse attention heads routed to lightweight experts. In contrast, tokens requiring high fidelity are processed with denser attention, higher-precision layers, and more expressive experts. This enables token-level control of compute allocation while preserving output quality.
The framework supports adaptive behavior across deployment scenarios. On edge devices like NVIDIA Jetson Orin, it prioritizes aggressive quantization and sparse routing to reduce latency and memory usage. On high-end GPUs such as A100, it seamlessly adapts to higher compute budgets by activating denser paths and full expert capacity. These adaptations occur automatically based on runtime resource detection and input statistics.

3.5.3. Innovation and Impact

The central innovation lies in the tight integration of three orthogonal optimization strategies within a single, runtime-adaptive execution pipeline. Rather than statically applying precision, sparsity, or routing in isolation, our system leverages shared token-level complexity estimations to coordinate them. For instance, a token predicted as “simple” is simultaneously routed to low-compute experts, evaluated under a sparse mask, and quantized at sub-4-bit precision. This joint decision-making avoids conflicting optimization behaviors and ensures global efficiency.
Additionally, the incorporation of a capacity-aware expert scheduler mitigates the bottlenecks and instabilities often observed in traditional MoE frameworks. By ensuring a balanced computation load across experts while retaining adaptive flexibility, the system achieves robust performance under variable conditions.
In summary, our integrated framework translates theoretical optimization principles into a practical, deployable engine that adapts to hardware constraints, input variability, and task complexity. It delivers consistent improvements in inference latency, memory usage, and throughput across both language and vision tasks, representing a substantial advancement over prior disjointed approaches.

3.6. Complexity Analysis and Theoretical Efficiency Gains

To quantify the computational efficiency improvements achieved by the proposed integrated framework, a detailed theoretical complexity analysis is performed. This analysis explicitly evaluates the computational and memory reduction brought about by each optimization module—mixed-precision quantization, adaptive sparse attention, and dynamic expert routing—and then assesses their integrated synergy in a practical inference scenario.

3.6.1. Complexity Analysis for Mixed-Precision Quantization

First, the computational complexity benefits of mixed-precision quantization are formally analyzed. Given a transformer model with $L$ layers, each containing parameter matrices $W \in \mathbb{R}^{n \times d}$, the standard full-precision complexity per layer typically scales as $O(nd)$. In contrast, the layer-wise mixed-precision approach reduces the effective arithmetic complexity by adjusting the bit-width $b_l$ adaptively for each layer $l$. Specifically, the complexity reduction can be quantified by the arithmetic operations saved through lower-precision multiplications and additions, computed as follows:

$$C_{\text{quant}} = \sum_{l=1}^{L} \frac{b_l}{32} \times C_{\text{full-precision}}$$
For instance, layers with minimal sensitivity are quantized to 2.8-bit or 3.6-bit precision, significantly decreasing arithmetic intensity, while sensitive layers remain at 8-bit precision. Consequently, practical tests reveal approximately 1.5× to 3× computational throughput gains depending on bit-width distributions.

3.6.2. Complexity Reduction from Adaptive Sparse Attention

The complexity of standard dense attention scales quadratically with the sequence length $N$, expressed as $O(N^2 d)$, making it computationally expensive for long sequences. The adaptive sparse attention module proposed in this framework addresses this quadratic bottleneck by introducing sparse masking strategies tailored dynamically to token-level complexity and attention-head characteristics. Specifically, local-window sparse attention reduces the complexity from $O(N^2 d)$ to approximately $O(N w d)$, where $w \ll N$ is the local window size. Meanwhile, selective sparse-global patterns scale roughly as $O(N k d)$, where $k \ll N$. By dynamically switching between these sparsity modes according to runtime complexity predictions, the proposed mechanism achieves a substantial complexity reduction, typically 2× to 4×, compared to standard dense attention operations:

$$C_{\text{sparse-attn}} = \begin{cases} O(N w d), & \text{local sparse} \\ O(N k d), & \text{sparse global} \end{cases} \qquad k, w \ll N$$

3.6.3. Complexity Benefits of Dynamic Expert Routing

Traditional Mixture-of-Experts (MoE) methods using static top-$k$ gating have complexity scaling linearly with the number of experts $E$, often resulting in significant computational waste due to suboptimal expert assignments. The proposed dynamic expert routing approach introduces global capacity-aware scheduling, dramatically reducing unnecessary computations by dynamically adjusting the number of active experts based on real-time token complexity estimation and global computational constraints. Formally, assuming that each token $i$ activates only the top-$k_i$ experts selected dynamically, the total computational complexity scales as

$$C_{\text{routing}} = O\!\left(\sum_{i=1}^{N} k_i \times \frac{d}{E}\right), \qquad k_i \le k_{\max}$$
Through global capacity constraints, tokens identified with lower complexity activate fewer and lighter experts (low k i ), substantially reducing total inference complexity compared to traditional static expert gating methods. Empirical evaluations confirm complexity reductions ranging from 1.5× to 2.5×, depending on runtime token distribution.

3.6.4. Integrated Complexity and Efficiency Gains

Integrating the complexity analysis results from the three optimization modules above yields a comprehensive theoretical complexity reduction model:
$$C_{\text{integrated}} = C_{\text{quant}} \times C_{\text{sparse-attn}} \times C_{\text{routing}}$$
In our experiments, we set the number of transformer layers to $L = 32$. Based on sensitivity profiling, low-sensitivity layers are quantized to 2.8-bit, medium-sensitivity layers to 3.6-bit, and high-sensitivity layers (e.g., embeddings, layer norms) retain 8-bit precision. For sparse attention, we use input sequence length $N = 128$, hidden dimension $d = 4096$, local window size $w = 32$, and top-$k$ sparse global heads with $k = 8$. In the expert routing module, each token activates up to $k_{\max} = 2$ experts out of a total of $E = 16$, with each expert handling $d / E = 256$ dimensions. These hyperparameters are applied consistently across both language and vision tasks for all experiments and profiling.
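For reference, the cost terms of Sections 3.6.1, 3.6.2, and 3.6.3 can be encoded directly as functions of the stated hyperparameters; the example below evaluates only the attention term for N = 128, d = 4096, w = 32. These are arithmetic-operation counts, not wall-clock predictions, and the practical speedups reported in this section also depend on kernel efficiency and memory traffic.

```python
def c_quant(bit_widths, c_layer_fp32):
    # C_quant = sum_l (b_l / 32) * C_full-precision  (Section 3.6.1, per-layer FP32 reference cost)
    return sum(b / 32.0 for b in bit_widths) * c_layer_fp32

def c_sparse_attn(n, d, mode="local", w=32, k=8):
    # O(N*w*d) for local windows, O(N*k*d) for sparse-global heads (Section 3.6.2)
    return n * (w if mode == "local" else k) * d

def c_routing(k_per_token, d, num_experts):
    # C_routing = sum_i k_i * (d / E), with k_i <= k_max (Section 3.6.3)
    return sum(k_i * d / num_experts for k_i in k_per_token)

# Attention term for the settings above: N = 128, d = 4096, w = 32.
dense_attn = 128 ** 2 * 4096                               # O(N^2 d) dense reference
local_attn = c_sparse_attn(128, 4096, mode="local", w=32)
print(dense_attn / local_attn)                             # 4.0, within the 2-4x range above
```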
Due to the orthogonal and complementary nature of quantization, sparse attention, and dynamic expert routing, our framework achieves multiplicative rather than additive complexity reductions. This synergy results in a theoretical computational speedup of approximately 3× to 6× compared to full-precision, dense-attention, and static MoE baselines. Complexity is analyzed through detailed FLOPs and memory profiling using tools such as NVIDIA Nsight Systems and PyTorch Profiler. These tools capture layer-wise arithmetic cost, memory access, and latency, enabling accurate modeling of runtime behavior. The profiling data are directly used by adaptive modules to dynamically adjust bit-widths, attention sparsity, and expert allocation during inference.
Unlike traditional approaches, which optimize each component independently, our framework jointly models and co-optimizes complexity across all modules. This coordinated strategy captures interaction effects—such as how token complexity impacts both expert selection and sparsity level—and leads to more efficient and adaptive execution. In summary, the complexity-aware co-design of all inference components provides substantial efficiency gains in practice. It serves not only as a theoretical justification for our methods but also as a principled guide for real-time adaptation in diverse hardware environments.

4. Experimental Results and Discussion

4.1. Experimental Setting

The proposed framework is implemented using the PyTorch 2.1.0 backend, combined with ONNX Runtime 1.16.1 and TensorRT 8.6.1, for efficient deployment. Experiments are conducted on two hardware environments: (1) a high-performance server equipped with an Intel(R) Xeon Gold 6226R CPU @ 2.90 GHz, an NVIDIA A100 GPU with 40 GB VRAM, and 256 GB system memory; (2) a resource-constrained embedded device—NVIDIA Jetson Orin NX—with 8-core ARM Cortex-A78AE CPU, 1024-core Ampere GPU, and 8 GB shared RAM. All models are tested under CUDA 12.2 with cuDNN 8.9 acceleration.
For fine-tuning and routing module training, the AdamW optimizer is used with a base learning rate of 6 × 10−5 for transformer encoder layers and 6 × 10−4 for expert routing heads, and a weight decay coefficient of 0.01. Mixed-precision training (AMP) is enabled during expert initialization to accelerate convergence.
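A minimal sketch of this optimizer configuration is shown below; `model.encoder` and `model.router` are placeholder attribute names for the transformer encoder layers and expert routing heads, respectively.

```python
import torch

# Parameter groups matching the stated learning rates; attribute names are illustrative.
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 6e-5},  # transformer encoder layers
        {"params": model.router.parameters(), "lr": 6e-4},   # expert routing heads
    ],
    weight_decay=0.01,
)
scaler = torch.cuda.amp.GradScaler()  # AMP enabled during expert initialization
```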
For evaluation, accuracy and BLEU scores are used for text generation tasks, while FID and LPIPS metrics are used for image generation tasks. Latency (ms/token or ms/image), memory usage (peak GPU memory), and FLOPs are recorded using NVIDIA Nsight Systems and PyTorch Profiler tools to quantify inference efficiency across model variants and deployment scenarios.

4.2. Datasets and Models

All experiments are conducted to evaluate the performance of the proposed unified inference optimization framework across both natural language and image generation tasks. For language modeling, we select LLaMA-7B and GPT-NeoX 20B as representative large language models. For diffusion-based image generation, Stable Diffusion v1.5 and DiT-XL-2 are employed as backbone architectures. These models are tested under various optimization configurations described in previous sections, including mixed-precision quantization, adaptive sparse attention, and dynamic expert routing.
Benchmark datasets are selected according to the model type and task. For language tasks, OpenBookQA, TriviaQA, and GSM8K are used to evaluate reasoning accuracy and generation consistency. For diffusion-based generation, image fidelity is assessed using MS-COCO Captions, CelebA-HQ, and FFHQ datasets. All experiments are executed on two types of hardware platforms to validate cross-device efficiency and deployment scalability. The first is an edge-class device—NVIDIA Jetson Orin NX—with 8 GB shared memory and integrated Ampere GPU cores. The second is a high-performance server equipped with an NVIDIA A100 GPU (40 GB VRAM) paired with Intel Xeon CPUs and 256 GB system memory.
Model inference is implemented with PyTorch 2.1.0 and further compiled into ONNX format for deployment. Execution is accelerated using ONNX Runtime 1.16.1, with support for quantized neural networks and sparse kernel optimizations. On server platforms, TensorRT 8.6.1 is utilized for low-level inference kernel optimization. For quantization, INT8 inference with custom sub-byte resolution (2.8-bit, 3.6-bit) is implemented on top of TensorRT. Sparse attention is deployed using a modified version of FlashAttention2 that supports heterogeneous attention masking per head. The dynamic expert routing module is implemented using PyTorch JIT and embedded within the ONNX graph via custom operator nodes, enabling runtime routing decisions during inference.
This setup ensures a standardized and reproducible evaluation environment across tasks, allowing fair comparison of individual module effects, combined optimization benefits, and performance scalability across hardware profiles.

4.3. Ablation Study

To quantify the individual and combined contributions of our three optimization components—quantization (Q), sparse attention (S), and expert routing (R)—we perform systematic ablation experiments using the LLaMA-7B model on both A100 GPU and Jetson Orin NX. As summarized in Table 1, each module independently improves efficiency, with minimal impact on accuracy. On A100, quantization alone reduces latency by 28% (from 100 ms to 72 ms) and memory by 32%, with accuracy dropping slightly to 86.8%. Sparse attention and routing yield similar moderate gains, maintaining accuracies above 86.9%.
When combining modules, improvements become more pronounced. For instance, Q + S lowers latency to 56.2 ms and memory to 5210 MB, while Q + R and S + R achieve further gains. The full Q + S + R configuration yields the best performance—latency drops to 41.3 ms (−58.7%) and memory to 3900 MB (−57.2%), with accuracy restored to 87.0%, matching the baseline.
Results for Jetson Orin NX show consistent trends. Latency reduces from 85.0 ms to 37.5 ms, and memory usage drops from 7902 MB to 3050 MB under Q + S + R. Notably, baseline memory is lower on Orin due to TensorRT’s built-in mixed-precision and kernel fusion optimizations. These findings confirm that each component is effective in isolation and that their combination leads to synergistic improvements—delivering substantial speedups and memory savings without sacrificing model performance. The framework proves to be robust and beneficial across diverse hardware setups.
Due to architectural and runtime differences, baseline memory usage varies across platforms. On A100, the full FP16 model runs with minimal backend optimization, leading to a footprint of 9119 MB. In contrast, Jetson Orin applies automatic mixed-precision and kernel fusion via TensorRT, reducing baseline memory to 7902 MB, despite using the same model logic.
We conduct comparative ablation studies on LLaMA-7B and GPT-NeoX-20B. As summarized in Table 2, this experiment is designed to assess whether the observed gains in efficiency hold consistently across both medium- and large-scale language models. Results show that all three modules individually reduce latency and memory consumption with minimal accuracy degradation (<0.4%), while combined configurations (Q + S, Q + R, S + R) yield further improvements due to their complementary nature. Notably, the full integration (Q + S + R) delivers the most significant acceleration—reducing latency by over 58% and memory by over 57% for LLaMA-7B and by 65.6% and 60.9%, respectively, for GPT-NeoX-20B—while maintaining near-baseline accuracy. These findings confirm that the proposed framework scales effectively across model sizes and that its modular design offers both robustness and efficiency in large model inference.

4.4. Comparison with Existing Methods

To further validate the effectiveness and generalizability of our unified inference optimization framework, we evaluate image generation using Stable Diffusion v1.5 as the baseline. Table 3 presents a comprehensive evaluation of various inference acceleration techniques applied to the Stable Diffusion v1.5 model across two hardware platforms: NVIDIA A100 GPU (high-performance server) and Jetson Orin NX (edge device). The metrics considered include latency, memory usage, CLIP score (semantic alignment), FID (visual quality), and throughput (images per second).
On the A100 platform, our proposed method (Q + S + R, denoted Ours in Table 3) achieves the best overall performance, with the lowest latency (52.8 ms), smallest memory footprint (3180 MB), and highest throughput (18.9 images/s). Despite aggressive optimization, it maintains high output quality, with a CLIP score of 0.815 and the lowest FID of 11.7, outperforming other methods such as DiffMoE, GPTQ, and SparseGPT. This confirms the synergy of quantization, sparse attention, and dynamic routing when applied jointly.
On Jetson Orin NX, a resource-constrained edge device, the trend remains consistent. Our method still achieves the best latency (865.3 ms), the lowest memory consumption (3182 MB), and the highest throughput (1.16 images/s), while preserving competitive quality (CLIP score 0.814, FID 12.1). Compared to the baseline, this represents a ~60% reduction in latency and a ~2.5× throughput increase, demonstrating the effectiveness of the framework in edge deployments.
Notably, while methods like SparseGPT and FlashAttention2 offer moderate speedups, they suffer from visual degradation (FID > 13) and lower semantic consistency (CLIP < 0.80), especially on the Orin device. DiffMoE shows a better balance, but is still outperformed by our method in all aspects. These results strongly support the idea that the proposed unified inference framework (Q + S + R) provides a robust and scalable solution, achieving significant acceleration and efficiency gains without sacrificing output quality, and generalizes well across both cloud and edge environments. The results are shown in Table 3.
To ensure fair and reproducible comparisons in Section 4.4, we provide detailed settings for evaluating each baseline method. The results of GPTQ are reproduced using the official open-source library with PyTorch 1.13 and CUDA 11.7. The quantization scheme uses per-channel 4-bit weights with group-wise optimization, and inference is performed on the LLaMA-7B model with a batch size of 4 and an input length of 128 tokens. For SparseGPT, we follow the settings of Frantar et al. [43] and jointly apply structured pruning (50%) and 4-bit quantization. All results are obtained by local reproduction under matching preprocessing and runtime conditions. FlashAttention2 is integrated into the baseline Transformer using the Triton 2.0 kernel and evaluated on the corresponding hardware with FP16 precision, a token length of 128, and a batch size of 4. The DiffMoE method is re-implemented based on the original design with 4-bit experts and top-2 gating, and deployed with ONNX-TensorRT compatibility to support unified inference measurement; the batch size of the DiffMoE test is reduced to 2 due to memory overhead. Our method (Q + S + R) is deployed on the A100 and Jetson Orin NX platforms using the same LLaMA-7B model, integrating layer-wise mixed-precision quantization (ranging from 3.6 bits to 8 bits), the heterogeneous sparse attention mechanism, and dynamic expert routing. All measurements are averaged over 100 inference runs with a standard deviation below 5%, and greedy decoding is used without temperature scaling. This setting ensures consistency across methods and accurately reflects the actual performance of the proposed unified inference framework.
To comprehensively evaluate the effectiveness and generalizability of various optimization methods on diffusion-based image generation tasks, we visualize their performance on both the A100 and Jetson Orin NX platforms using three key axes: quality-latency trade-off, memory-efficiency vs. fidelity, and throughput-latency characteristics.
On the A100 server-class GPU, as shown in Figure 2a–c, our proposed method (Q + S + R) achieves the best balance across all metrics. In Figure 2a, it achieves a CLIP score of 0.815—comparable to the baseline (0.819)—while reducing latency from 114.0 ms to 52.8 ms. Figure 2b demonstrates a significant drop in memory consumption (from 5402 MB to 3180 MB) along with the lowest FID (11.7), indicating improved structural realism. Moreover, Figure 2c shows a dramatic boost in inference efficiency, with throughput increasing from 8.8 to 18.9 images/s, outperforming all other baselines, including FlashAttention2, DiffMoE, and SparseGPT.
On the Jetson Orin NX edge device, Figure 2d–f illustrate consistent trends under resource-constrained conditions. As shown in Figure 2d, our framework again achieves a near-optimal CLIP score (0.811) with the lowest latency (865.3 ms), representing a 60.6% reduction from the baseline. In Figure 2e, our method maintains the best memory-FID balance with only 3182 MB memory usage and a FID of 12.1, closely matching the A100 results. Figure 2f confirms that our method sustains the highest throughput (1.16 images/s) while maintaining latency well below that of the other methods.
Together, these results validate that the Q + S + R framework is not only performant on high-end GPUs but also robust and efficient on low-power hardware. Its ability to retain generation quality while significantly improving latency and resource consumption underlines its practical viability for deployment across diverse hardware platforms.

4.5. Generalization Across Hardware Platforms

To further examine the portability and robustness of our unified inference framework, we conduct comprehensive cross-platform experiments spanning both high-performance and resource-constrained devices. While Section 4.3 and Section 4.4 validate performance gains under fixed hardware configurations, this section focuses on evaluating generalization under divergent deployment environments, model types, and batch workloads.
Specifically, we benchmark the Stable Diffusion v1.5 (image generation) and LLaMA-7B (language modeling) models on two representative hardware platforms: an NVIDIA A100 GPU server and a Jetson Orin NX edge device. By analyzing runtime behavior across varying batch sizes and input complexities, we highlight how our modular framework adapts to the available compute and memory budget while retaining high generation quality. This cross-hardware study not only complements prior ablation analyses but also demonstrates the practical deployability of the proposed framework across a wide spectrum of application scenarios—from real-time edge inference to large-scale cloud generation pipelines.
The experimental results summarized in Table 4 and visualized in Figure 3 comprehensively demonstrate the scalability and hardware adaptability of our proposed unified inference framework across both high-end and edge deployment platforms. Specifically, we evaluate the performance of Stable Diffusion v1.5 on an NVIDIA A100 GPU and a Jetson Orin NX under varying batch sizes.
As shown in Figure 3a, latency increases with batch size on both platforms, which is expected due to higher computational workload. However, the A100 maintains consistently low latency (e.g., 152 ms at batch size 8), while the Orin NX, though significantly slower (4571 ms), still enables practical inference, particularly in batch-1 or low-throughput scenarios. This highlights the adaptability of our framework, even under limited computational resources. Figure 3b shows that throughput scales efficiently with increasing batch sizes on both platforms. The A100 reaches a peak throughput of 44.2 images/s at batch size 8, while the Orin NX achieves 1.92 images/s, which is sufficient for many edge-level applications. Notably, throughput on both devices demonstrates near-linear growth up to the hardware saturation point, confirming that our method supports effective batching without incurring exponential latency penalties. In terms of memory usage (Figure 3c), both platforms exhibit steady, predictable growth with larger batch sizes. Importantly, memory consumption remains within acceptable bounds—under 6 GB on Orin NX and under 8.2 GB on A100—ensuring deployment feasibility across constrained and unconstrained environments.
Taken together, these results validate that our optimization strategy (Q + S + R) not only achieves substantial efficiency gains but also generalizes well across diverse hardware architectures. The framework supports adaptive scaling, balances throughput and latency, and adheres to memory constraints, making it highly suitable for real-world applications spanning both cloud-based services and edge deployment scenarios.
Based on the results summarized in Table 5, we observe a clear correlation between input complexity and both expert activation and inference latency across modalities and hardware platforms. For the LLaMA-7B model, as text length increases from 64 to 1024 tokens, the number of activated experts on the A100 rises from 1.4 to 3.4, with latency increasing from 4.2 ms to 14.3 ms and accuracy slightly decreasing from 87.3% to 85.8%. Similar trends are observed on the Jetson Orin NX, albeit with higher latency due to limited computational resources. For Stable Diffusion v1.5, increasing image resolution engages more experts (from 1.1 at 224 × 224 to 2.6 at 512 × 512 on the A100) and increases latency, while maintaining competitive FID scores. These patterns highlight the adaptive routing behavior of our framework: more computational capacity is dynamically allocated to complex inputs to preserve quality, demonstrating the model's input-aware scalability and resource efficiency.
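The behavior in Table 5, where harder inputs activate more experts, can be illustrated with a complexity-conditioned top-k gate. The following is a simplified PyTorch sketch under assumed interfaces; the complexity score, maximum expert count, and per-token loop are placeholders for exposition, not our exact routing implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CapacityAwareRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, max_k: int = 3):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # per-token routing logits
        self.max_k = max_k

    def forward(self, hidden: torch.Tensor, complexity: torch.Tensor):
        # hidden: (num_tokens, d_model); complexity: (num_tokens,) scores in [0, 1],
        # e.g., derived from token entropy or attention statistics (an assumption here).
        probs = F.softmax(self.gate(hidden), dim=-1)
        # Harder tokens receive more experts: k grows from 1 up to max_k with complexity.
        k_per_token = 1 + torch.round(complexity.clamp(0, 1) * (self.max_k - 1)).long()
        routes = []
        for token_probs, k in zip(probs, k_per_token):
            top = torch.topk(token_probs, int(k))
            routes.append((top.indices, top.values / top.values.sum()))
        return routes  # per token: (selected expert indices, renormalized weights)

In a full mixture-of-experts layer, each token's hidden state would then be dispatched only to its selected experts, so simple tokens traverse a single expert while complex tokens use up to max_k.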

4.6. Qualitative Case Studies

To complement the quantitative evaluations, we conduct a series of qualitative case studies on both language generation and image synthesis tasks. These visual and interpretive analyses provide intuitive insights into how the proposed framework enhances inference quality, efficiency, and robustness in real-world scenarios.
Figure 4a,c,e,g highlight the key structural differences among the model variants. In Figure 4a, the Q + S + R output preserves the cat's outline more clearly than Q + S. In Figure 4c, the rooftop antenna in the cityscape remains sharp under Q + S + R, whereas the corresponding details under Q + S become blurred. In Figure 4e, the shadow transitions on the sofa cushions are more consistent and realistic under Q + S + R. In Figure 4g, the ground reflections and the subject's lower-body details are noticeably sharper. These examples confirm that the proposed Q + S + R framework maintains high structural fidelity in visually complex regions while significantly reducing inference cost, achieving a better quality-efficiency trade-off than previous baselines.
For text generation, we test the LLaMA-7B model on multi-hop reasoning prompts from the GSM8K dataset. As shown in Figure 5a-h, the token-level visualizations demonstrate that our inference framework adaptively allocates computational resources based on token complexity. Arithmetic operators and numerals consistently trigger more expert activations and lower attention sparsity, while semantically simpler tokens exhibit sparser attention and activate fewer experts. For example, across these subfigures, we observe a recurring trend: tokens related to computation or reasoning steps (e.g., “5”, “3”, “+”, “=”) typically activate 2-3 experts and maintain sparsity in the range of 0.6-0.75, whereas common or function words (e.g., “is”, “Q:”, “Ans”) activate only 1 expert with attention sparsity exceeding 0.85. This pattern is stable across various prompts, confirming that our method generalizes well and maintains the efficiency-accuracy trade-off under different reasoning conditions.
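The attention sparsity values cited above can be obtained by counting, for each query token, the fraction of post-softmax attention weights that fall below a small threshold. The following is a minimal sketch of such a metric; the threshold value and tensor layout are assumptions for illustration, not the exact instrumentation behind Figure 5.

import torch

def attention_sparsity(attn_weights: torch.Tensor, threshold: float = 1e-3) -> torch.Tensor:
    # attn_weights: (num_heads, num_query_tokens, num_key_tokens), rows sum to 1 after softmax.
    # Returns one sparsity value per query token: the fraction of (head, key) entries
    # whose attention weight is effectively zero under the chosen threshold.
    near_zero = (attn_weights < threshold).float()
    return near_zero.mean(dim=(0, 2))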
These visualizations and metrics highlight a key advantage of our framework: structured adaptivity. Rather than applying uniform compression or routing, our method adjusts computation dynamically at the token and region levels, ensuring that resources are allocated where they are most impactful. This results in inference that is not only faster and more memory-efficient but also better aligned with the structural demands of the data, a property not achieved by previous static or heuristic-based acceleration techniques.

4.7. Robustness, Adaptivity, and Stability

To further evaluate the real-world deployability of the proposed framework, we conduct a series of robustness and adaptivity experiments, focusing on dynamic runtime perturbations and input-level noise sensitivity. These tests aim to assess whether the system can maintain high performance when subjected to unstable execution environments or non-ideal input conditions—both common in practical deployment.

4.7.1. Adaptivity Under Resource Fluctuation

We simulate hardware-level interference by artificially introducing memory pressure and GPU resource contention during inference: (1) capping available GPU memory by pre-allocating 40-50% of total memory with background CUDA tensors, (2) reducing the GPU core frequency to 60% of its default using nvidia-smi, and (3) injecting lightweight concurrent GPU workloads (matrix multiplication kernels) to simulate scheduling contention. Under these conditions, static optimization methods such as GPTQ and FlashAttention2 exhibit significant performance degradation, with latency variance exceeding 28% and throughput drops of up to 40%. In contrast, our framework dynamically triggers fallback behaviors: it reduces expert activation width, increases sparsity levels, and switches to lower bit-width quantization (from 8-bit down to 3.6-bit), depending on available resources. As shown in Table 6, under constrained GPU conditions, our method retains 85.2% of baseline throughput and keeps latency variance within ±9.3%, showcasing superior runtime self-adjustment.
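Condition (1) and the resource-dependent fallback can be approximated with a short PyTorch sketch like the one below, which pre-allocates a fraction of GPU memory to create pressure and then picks a configuration based on the remaining headroom; the fraction, thresholds, and fallback table are illustrative assumptions rather than the exact controller used in our framework.

import torch

def apply_memory_pressure(fraction: float = 0.45):
    # Reserve roughly `fraction` of total GPU memory with a dummy tensor to emulate
    # contention from co-located workloads (assumed simulation setup, not a real workload).
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    n_floats = int(total_bytes * fraction) // 4
    return torch.empty(n_floats, dtype=torch.float32, device="cuda")

def select_fallback_config():
    # Hypothetical policy: shrink bit-width, expert width, and attention density as headroom drops.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    headroom = free_bytes / total_bytes
    if headroom > 0.5:
        return {"weight_bits": 8, "max_experts": 3, "attention_sparsity": 0.6}
    if headroom > 0.3:
        return {"weight_bits": 6, "max_experts": 2, "attention_sparsity": 0.75}
    return {"weight_bits": 3.6, "max_experts": 1, "attention_sparsity": 0.85}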

4.7.2. Robustness to Input Perturbation

To evaluate input-level robustness, we inject controlled noise into input prompts and visual conditions, including the following (a minimal sketch of the text-side perturbations is given after the list):
  • Text perturbation: introducing irrelevant clauses, numerical distractors, or reordering tokens in arithmetic questions.
  • Image prompt degradation: low-resolution blur, occlusion masks, and mismatched conditioning embeddings.
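As a concrete illustration of the text-side perturbations, the sketch below appends an irrelevant clause containing a numerical distractor and swaps two token positions; the clause template and swap policy are assumptions for illustration, not the exact perturbation generator used in our experiments.

import random

def perturb_arithmetic_prompt(prompt: str, seed: int = 0) -> str:
    # Apply two of the listed text perturbations: token reordering and an
    # irrelevant clause with a numerical distractor (illustrative forms).
    rng = random.Random(seed)
    tokens = prompt.split()
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    distractor = f" Note that the store also stocks {rng.randint(10, 99)} unrelated items."
    return " ".join(tokens) + distractor

print(perturb_arithmetic_prompt("Q: Tom has 7 apples and gives away 4. How many remain?"))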
The results in Table 7 demonstrate that our proposed framework (Q + S + R) exhibits superior robustness under both textual and visual input perturbations. While static baselines like GPTQ and FlashAttention2 suffer from failure modes such as hallucinations, misalignments, or structural distortions, our method maintains high text accuracy (96.3%) and the lowest FID score (11.7) under compound noise. This is enabled by dynamic expert activation and selective attention expansion, which adaptively allocate computation to ambiguous or noisy regions. Overall, the framework consistently delivers accurate and stable outputs, confirming its resilience and practicality for real-world deployment scenarios.

4.7.3. Token-Level Stability Visualization

To further substantiate the robustness analysis in Section 4.7.2, we conduct fine-grained token-level stability experiments under noisy conditions. As shown in Figure 6a, the left panel compares overall accuracy, visual fidelity, and expert activation across methods under input perturbations, confirming that our Q + S + R framework consistently outperforms the baselines. Figure 6b illustrates the variance in expert activation for each token when exposed to input noise. While the baseline and Q + S configurations exhibit high instability on ambiguous tokens (e.g., “Noise”, “Distractor”), our method maintains significantly lower variance (<0.13), especially on semantically critical tokens. These findings highlight that, beyond average accuracy, our model preserves stable, task-aligned expert pathways, a key factor in achieving reliable inference in dynamic real-world environments.
Figure 7 illustrates the token-level expert activation variance across a range of perturbed inputs. Across all 10 test cases, we observe a consistent trend: noisy or semantically ambiguous tokens (e.g., “Noise”, “Distractor”, “Mistake”) exhibit significantly higher variance under the baseline and Q + S configurations, often exceeding 0.25. In contrast, our Q + S + R method maintains low variance (<0.15) across these difficult tokens, while still preserving minimal variance on core reasoning tokens such as numbers and operators.
These visualizations collectively demonstrate that our method exhibits robust routing consistency under perturbation, selectively increasing attention capacity on ambiguous inputs while keeping activation stable elsewhere. The token-level evidence confirms the framework’s structural adaptivity and localized resilience, validating its suitability for real-world applications where input noise and ambiguity are prevalent.

5. Conclusions

Extensive experiments confirm that our unified inference framework—combining mixed-precision quantization, adaptive sparse attention, and dynamic expert routing—achieves significant and consistent improvements across efficiency, adaptability, and robustness.
Compared to the baseline, our full configuration (Q + S + R) reduces inference latency by 58.7% and memory usage by 57.2%, while maintaining near-original accuracy (87.0%). It outperforms state-of-the-art methods such as GPTQ, SparseGPT, and DiffMoE in throughput (+35-55%), FID (↓1.0-4.2), and runtime stability (roughly 3× lower latency variance). On edge devices such as the Jetson Orin NX, our method enables real-time inference with latency reduced to 865 ms per image. Visual and token-level analyses demonstrate the framework's fine-grained control and interpretability, as well as its robustness under noisy inputs and dynamic system constraints, retaining 85.2% of baseline throughput and stable output quality.
Looking ahead, future research will focus on further enhancing expert routing granularity via reinforcement learning, integrating model compression-aware training for better co-optimization, and expanding the framework to support multimodal and instruction-following tasks under streaming or online deployment constraints. These directions aim to strengthen the framework’s ability to support increasingly diverse real-world applications with guaranteed efficiency and reliability.

Author Contributions

Conceptualization, Y.W.; data curation, Y.W.; investigation, Y.W. and J.Z.; methodology, Y.W. and J.Z.; project administration, Y.W.; software, Y.W.; writing—original draft, Y.W. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Jiangsu Province University Innovation Training Program (202410293087Z), Jiangsu Provincial Department of Education.

Data Availability Statement

The data presented in this study are derived from publicly available resources, and we are grateful to the institutions and research teams that maintain them. Specifically: 1. OpenBookQA: https://huggingface.co/datasets/allenai/openbookqa. 2. TriviaQA: https://nlp.cs.washington.edu/triviaqa/. 3. GSM8K: https://huggingface.co/datasets/openai/gsm8k (all accessed on 25 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FID   Fréchet Inception Distance
GPTQ   Post-Training Quantization for Generative Pre-trained Transformers
CLIP   Contrastive Language-Image Pre-Training
DPO   Direct Preference Optimization
KV cache   Key-Value cache
LLM   Large Language Model

Appendix A

Practical Deployment of Custom Sub-Byte Quantization and Modules

To practically deploy our framework's advanced features, such as custom sub-byte quantization (e.g., 2.8-bit, 3.6-bit) and specialized modules like dynamic expert routing, a sequenced integration with ONNX and TensorRT is performed:
  1. Sub-byte quantized model weights are packed into standard integer data types such as INT8 for efficient memory handling.
  2. The core logic for these sub-byte operations, including non-linear dequantization (e.g., logarithmic mapping, potentially using look-up tables) and the primary mathematical computations, is implemented within custom TensorRT plugins developed in CUDA; these plugins manage runtime data unpacking, conversion to a proxy computational precision, execution of the operation, and any necessary requantization.
  3. Custom architectural components, such as the dynamic expert routing module, are exported from PyTorch to the ONNX graph as unique custom operator nodes.
  4. For each distinct custom ONNX operator, a corresponding custom TensorRT plugin is developed to replicate its functionality on the GPU.
  5. These plugins are registered with TensorRT, allowing its ONNX parser to identify the custom nodes during model import and delegate their execution to the appropriate plugin, thereby enabling the construction of an end-to-end accelerated inference engine.
Critical considerations throughout this process include meticulous bit-level manipulation for data packing/unpacking, ensuring the robustness and correctness of the custom plugin implementations, and precisely matching custom operator names between the ONNX graph and the registered TensorRT plugins.
Algorithm A1. Conceptual N-bit Sub-byte Dequantization.
Input:
     packed_int8_data: Byte array containing packed N-bit integer values.
     scales: Array of floating-point scale factors (per-tensor or per-channel).
     zero_points: Array of integer zero-points (for asymmetric quantization, per-tensor or per-channel).
     num_total_elements: Integer, total number of N-bit elements to dequantize.
     N: Integer, the bit-width of the sub-byte elements (e.g., 3 for 2.8-bit, 4 for 3.6-bit).
Output:
dequantized_fp_data: Floating-point array to store dequantized elements.
Procedure Dequantize_SubByte_Elements(packed_int8_data, scales, zero_points,
dequantized_fp_data, num_total_elements, N):
     element_idx = 0
     BITS_PER_BYTE = 8
     for byte_pos from 0 to (length of packed_int8_data) − 1:
          current_byte = packed_int8_data[byte_pos]
          num_elements_in_this_byte = BITS_PER_BYTE / N // Assuming N divides 8 for simplicity
          for k from 0 to num_elements_in_this_byte − 1:
               if element_idx >= num_total_elements:
                    return // All elements processed
               // 1. Extract the k-th N-bit integer value from current_byte.
               //    This requires bitwise shift and mask operations.
               //    Example: shift = N × k; mask = (1 << N) − 1;
               //    (actual extraction order depends on the LSB/MSB packing convention)
               n_bit_integer = (current_byte >> (BITS_PER_BYTE − (k + 1) × N)) & ((1 << N) − 1)
               // 2. Interpret the N-bit integer (e.g., map to a signed range if necessary).
               //    Example for signed: if n_bit_integer >= (1 << (N − 1)):
               //        effective_int_value = n_bit_integer − (1 << N)
               //    else: effective_int_value = n_bit_integer
               effective_int_value = interpret_N_bit_value(n_bit_integer, N) // placeholder for interpretation logic
               // 3. Apply the dequantization formula (affine example).
               //    For non-linear (e.g., logarithmic) schemes: effective_float_value = apply_inverse_log_map(effective_int_value)
               //    current_scale and current_zp may be indexed per channel.
               current_scale = scales[get_scale_index_for_element(element_idx)]
               current_zp = zero_points[get_zp_index_for_element(element_idx)]
               dequantized_fp_data[element_idx] =
                    current_scale × (float(effective_int_value) − float(current_zp))
               element_idx = element_idx + 1
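For concreteness, the following is a minimal NumPy sketch of the packing and dequantization flow outlined in Algorithm A1, assuming LSB-first packing, a bit-width N that divides 8, and simple per-tensor affine parameters; it is illustrative only and does not reproduce the CUDA plugin implementation described above.

import numpy as np

def pack_subbyte(values, n_bits):
    # Pack unsigned n_bits-wide integers into a uint8 array, LSB-first within each byte.
    per_byte = 8 // n_bits
    packed = np.zeros((len(values) + per_byte - 1) // per_byte, dtype=np.uint8)
    for idx, v in enumerate(values):
        byte_pos, slot = divmod(idx, per_byte)
        packed[byte_pos] |= (int(v) & ((1 << n_bits) - 1)) << (slot * n_bits)
    return packed

def dequantize_subbyte(packed, n_bits, num_elements, scale, zero_point):
    # Unpack N-bit integers and apply the affine mapping x = scale * (q - zero_point).
    per_byte = 8 // n_bits
    mask = (1 << n_bits) - 1
    out = np.empty(num_elements, dtype=np.float32)
    for idx in range(num_elements):
        byte_pos, slot = divmod(idx, per_byte)
        q = (int(packed[byte_pos]) >> (slot * n_bits)) & mask
        out[idx] = scale * (float(q) - float(zero_point))
    return out

# Illustrative usage with hypothetical 4-bit values and per-tensor parameters.
q_vals = np.array([0, 3, 7, 15, 8, 1], dtype=np.uint8)
packed = pack_subbyte(q_vals, n_bits=4)
recovered = dequantize_subbyte(packed, n_bits=4, num_elements=len(q_vals), scale=0.05, zero_point=8)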

References

  1. Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; Catanzaro, B.; et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv 2022, arXiv:2211.01324. [Google Scholar]
  2. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. Opt: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
  3. Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; Teh, Y.W. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, Long Beach, CA, USA, 9–15 June 2019; pp. 3744–3753. [Google Scholar]
  4. Chang, H.; Zhang, H.; Barber, J.; Maschinot, A.; Lezama, J.; Jiang, L.; Yang, M.-H.; Murphy, K.P.; Freeman, W.T.; Rubinstein, M.; et al. Muse: Text-to-image generation via masked generative transformers. In Proceedings of the 40th International Conference on Machine Learning, ICML, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  5. Esser, P.; Chiu, J.; Atighehchian, P.; Germanidis, A. Structure and content-guided video synthesis with diffusion models. In Proceedings of the 2023 CVPR, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  6. Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the design space of diffusion-based generative models. In Proceedings of the 2022 NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  7. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  8. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  9. Fang, G.; Ma, X.; Wang, X. Structural pruning for diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  10. Li, S.; Ning, X.; Hong, K.; Liu, T.; Wang, L.; Li, X.; Zhong, K.; Dai, G.; Yang, H.; Wang, Y. Llm-mq: Mixed-Precision Quantization for Efficient llm Deployment. In Proceedings of the NeurIPS 2023 Efficient Natural Language and Speech Processing Workshop, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  11. Jiang, H.; Wu, Q.; Lin, C.-Y.; Yang, Y.; Qiu, L. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
  12. Griggs, T.; Liu, X.; Yu, J.; Kim, D.; Chiang, W.-L.; Cheung, A.; Stoica, I. Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. arXiv 2024, arXiv:2404.14527. [Google Scholar]
  13. Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Re, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
  14. He, Y.; Liu, L.; Liu, J.; Wu, W.; Zhou, H.; Zhuang, B. Ptqd: Accurate post-training quantization for diffusion models. arXiv 2023, arXiv:2305.10657. [Google Scholar]
  15. Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. arXiv 2021, arXiv:2104.08718. [Google Scholar]
  16. Chevalier, A.; Wettig, A.; Ajith, A.; Chen, D. Adapting language models to compress contexts. arXiv 2023, arXiv:2305.14788. [Google Scholar]
  17. Li, L.; Li, H.; Zheng, X.; Wu, J.; Xiao, X.; Zheng, M.; Pan, X.; Chao, F.; Ji, R. Autodiffusion: Training-free optimization of time steps and architectures for automated diffusion model acceleration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7105–7114. [Google Scholar]
  18. Li, Y.; Wang, H.; Jin, Q.; Hu, J.; Chemerys, P.; Fu, Y.; Wang, Y.; Tulyakov, S.; Ren, J. Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. arXiv 2023, arXiv:2306.00980. [Google Scholar]
  19. Wingate, D.; Shoeybi, M.; Sorensen, T. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. arXiv 2022, arXiv:2210.03162. [Google Scholar]
  20. Wang, X.; Zheng, Y.; Wan, Z.; Zhang, M. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. arXiv 2024, arXiv:2403.07378. [Google Scholar]
  21. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  22. Lyu, Z.; Xu, X.; Yang, C.; Lin, D.; Dai, B. Accelerating diffusion models via early stop of the diffusion process. arXiv 2022, arXiv:2205.12524. [Google Scholar]
  23. Li, S.; Ning, X.; Wang, Y.; Lin, Z. LLM-MQ: Mixed-Precision Quantization for Efficient LLM Deployment. Available online: https://nicsefc.ee.tsinghua.edu.cn/nics_file/pdf/5c805adc-b555-499f-9882-5ca35ce674b5.pdf (accessed on 25 May 2025).
  24. Zhao, T.; Ning, X.; Fang, T.; Liu, E.; Huang, G.; Lin, Z.; Yan, S.; Dai, G.; Wang, Y. MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization. arXiv 2024, arXiv:2405.17873. [Google Scholar] [CrossRef]
  25. Mu, J.; Li, X.L.; Goodman, N. Learning to compress prompts with gist tokens. arXiv 2023, arXiv:2304.08467. [Google Scholar]
  26. Yuan, Z.; Lu, P.; Zhang, H.; Ning, X. DiTFastAttn: Attention Compression for Diffusion Transformer Models. arXiv 2024, arXiv:2406.08552. [Google Scholar] [CrossRef]
  27. Fu, T.; Huang, H.; Ning, X.; Zhang, G.; Chen, B.; Wu, T.; Wang, H.; Huang, Z.; Li, S.; Yan, S.; et al. MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression. arXiv 2024, arXiv:2406.14909. [Google Scholar] [CrossRef]
  28. Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. Available online: https://vicuna.lmsys.org (accessed on 14 April 2023).
  29. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, Z.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv 2020, arXiv:2006.16668. [Google Scholar] [CrossRef]
  30. Shang, Y.; Yuan, Z.; Xie, B.; Wu, B.; Yan, Y. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1972–1981. [Google Scholar]
  31. Xia, H.; Yang, Z.; Dong, Q.; Wang, P.; Li, Y.; Ge, T.; Liu, T.; Li, W.; Sui, Z. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. arXiv 2024, arXiv:2401.07851. [Google Scholar] [CrossRef]
  32. Liang, C.; Zuo, S.; Zhang, Q.; He, P.; Chen, W.; Zhao, T. Less is more: Task-aware layer-wise distillation for language model compression. In Proceedings of the 40th International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 20852–20867. [Google Scholar]
  33. Xu, J.; Tan, X.; Luo, R.; Song, K.; Li, J.; Qin, T.; Liu, T.-Y. Nasbert: Task-agnostic and adaptive-size bert compression with neural architecture search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 1933–1943. [Google Scholar]
  34. Gao, Z.-F.; Liu, P.; Zhao, W.X.; Lu, Z.-Y.; Wen, J.-R. Parameter-efficient mixture-of-experts architecture for pre-trained language models. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 3263–3273. [Google Scholar]
  35. Zoph, B.; Bello, I.; Kumar, S.; Du, N.; Huang, Y.; Dean, J.; Shazeer, N.; Fedus, W. St-moe: Designing stable and transferable sparse expert models. arXiv 2022, arXiv:2202.08906. [Google Scholar]
  36. Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O.; et al. GLaM: Efficient scaling of language models with mixture-of-experts. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 5547–5569. [Google Scholar]
  37. Hwang, C.; Cui, W.; Xiong, Y.; Yang, Z.; Liu, Z.; Hu, H.; Wang, Z.; Salas, R.; Jose, J.; Ram, P.; et al. Tutel: Adaptive mixture-of-experts at scale. Proc. Mach. Learn. Syst. 2023, 5, 269–287. [Google Scholar]
  38. Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Chaplot, D.S.; Casas, D.D.L.; Hanna, E.B.; Bressand, F.; Lengyel, G. Mixtral of experts. arXiv 2024, arXiv:2401.04088. [Google Scholar]
  39. Kong, J.; Wang, J.; Yu, L.-C.; Zhang, X. Accelerating inference for pretrained language models by unified multi-perspective early exiting. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4677–4686. [Google Scholar]
  40. Yao, J.; Yang, B.; Wang, X. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. arXiv 2024, arXiv:2501.01423. [Google Scholar] [CrossRef]
  41. Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 2021, 9, 53–68. [Google Scholar] [CrossRef]
  42. Agrawal, A.; Kedia, N.; Panwar, A.; Mohan, J.; Kwatra, N.; Gulavani, B.; Tumanov, A.; Ramjee, R. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), Santa Clara, CA, USA, 10–12 July 2024; pp. 117–134. Available online: https://www.usenix.org/conference/osdi24/presentation/agrawal (accessed on 25 May 2025).
  43. Frantar, E.; Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot. arXiv 2023, arXiv:2301.00774. [Google Scholar]
  44. Wei, X.; Zhang, Y.; Li, Y.; Zhang, X.; Gong, R.; Guo, J.; Liu, X. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv 2023, arXiv:2304.09145. [Google Scholar]
  45. Kurtic, E.; Campos, D.; Nguyen, T.; Frantar, E.; Kurtz, M.; Fineran, B.; Goin, M.; Alistarh, D. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4163–4181. [Google Scholar]
  46. Zhou, Z.; Ning, X.; Hong, K.; Lin, Z.; Wang, J.; Han, H.; Liu, J.; Yang, D. A Survey on Efficient Inference for Large Language Models. arXiv 2024, arXiv:2404.14294. [Google Scholar] [CrossRef]
  47. Yin, T.; Gharbi, M.; Zhang, R.; Sinha, U.; Puri, R.; Anandkumar, A.; Vincent, P. One-step Diffusion with Distribution Matching Distillation. arXiv 2023, arXiv:2311.18828. [Google Scholar] [CrossRef]
  48. Ma, X.; Fang, G.; Wang, X. DeepCache: Accelerating Diffusion Models for Free. arXiv 2023, arXiv:2312.00858. [Google Scholar] [CrossRef]
  49. Yuan, X.; Qiao, Y. Diffusion-TS: Interpretable Diffusion for General Time Series Generation. arXiv 2024, arXiv:2403.01742. [Google Scholar] [CrossRef]
  50. Shi, M.; Yuan, Z.; Yang, H.; Wang, X.; Zheng, M.; Tao, X.; Zhao, W.; Zheng, W.; Zhou, J.; Lu, J.; et al. DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers. arXiv 2025, arXiv:2503.14487. [Google Scholar] [CrossRef]
  51. Zhang, J.; Huang, H.; Zhang, P.; Wei, J.; Zhu, J.; Chen, J. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization. arXiv 2024, arXiv:2411.10958. [Google Scholar] [CrossRef]
  52. Prabhu, R.; Nayak, A.; Mohan, J.; Ramjee, R.; Panwar, A. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. In Proceedings of the 29th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘25), Rotterdam, The Netherlands, 30 March–3 April 2025. [Google Scholar] [CrossRef]
  53. Liu, A.; Bai, H.; Lu, Z.; Sun, Y.; Kong, X.; Wang, S.; Shan, J.; Jose, A.M.; Liu, X.; Wen, L.; et al. TIS-DPO: Token-Level Importance Sampling for Direct Preference Optimization with Estimated Weights. arXiv 2024, arXiv:2410.04350. [Google Scholar] [CrossRef]
  54. Sun, Z.; Zang, X.; Zheng, K.; Xu, J.; Zhang, X.; Yu, W.; Song, Y.; Li, H. ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability. arXiv 2024, arXiv:2410.11414. [Google Scholar] [CrossRef]
  55. Zheng, K.; Ye, Q.; Chi, K.; Liu, X.; Saad, A.; Yu, K. Minimization of Task Completion Time in Wireless Powered Mobile Edge–Cloud Computing Networks. IEEE Internet Things J. 2024, 11, 38068–38085. [Google Scholar] [CrossRef]
  56. Zheng, K.; Jiang, G.; Liu, X.; Chi, K.; Yao, X.; Liu, J. DRL-Based Offloading for Computation Delay Minimization in Wireless-Powered Multi-Access Edge Computing. IEEE Trans. Commun. 2023, 71, 1755–1770. [Google Scholar] [CrossRef]
  57. Zheng, K.; Luo, R.; Liu, X.; Qiu, J.; Liu, J. Distributed DDPG-Based Resource Allocation for Age of Information Minimization in Mobile Wireless-Powered Internet of Things. IEEE Internet Things J. 2024, 11, 29102–29115. [Google Scholar] [CrossRef]
  58. Mai do, H.; Tran, T.P.; Yoo, M. Quality of Experience Optimization for AR Service in an MEC Federation System. IEEE Access 2025, 13, 69821–69839. [Google Scholar] [CrossRef]
  59. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2022, arXiv:2210.17323. [Google Scholar]
  60. Dao, T.; Fu, D.Y.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv 2022, arXiv:2205.14135. [Google Scholar]
Figure 1. Unified inference optimization framework for adaptive inference acceleration on edge AI. (a) Adaptive Inference Optimization Framework; (b) Core Optimization Techniques Explained; (c) Token Complexity-based Dynamic Expert Routing.
Figure 2. Evaluation of optimization methods on the A100 GPU and Jetson Orin NX platform for Stable Diffusion v1.5: (a) Latency vs. CLIP score on A100: Measures generation quality versus inference delay. (b) Memory vs. FID on A100: Assesses memory efficiency and visual fidelity. (c) Throughput vs. Latency on A100: Compares generation speed and responsiveness. (d) Latency vs. CLIP score on Jetson Orin NX: Performance-quality trade-off under edge constraints. (e) Memory vs. FID on Jetson Orin NX: Visual quality and memory footprint on resource-limited hardware. (f) Throughput vs. latency on Jetson Orin NX: Efficiency evaluation of image generation on edge devices.
Figure 3. Scaling analysis of Stable Diffusion v1.5 across different batch sizes and hardware platforms: (a) Latency vs. Batch Size: Evaluates inference time growth with increasing batch size on A100 GPU and Jetson Orin NX. (b) Throughput vs. Batch Size: Measures the scalability of image generation throughput as batch size increases. (c) Memory Usage vs. Batch Size: Analyzes memory footprint variation to assess platform-specific memory scalability.
Figure 4. Structural quality and latency. (a) Optimization Stage Visual Comparison: Cat; (b) Performance Evaluation (Cat): Structure Quality & Latency; (c) Optimization Stage Visual Comparison: Cyberpunk Streetscape; (d) Performance Evaluation (Cyberpunk Streetscape): Structure Quality & Latency; (e) Optimization Stage Visual Comparison: Cat and Dog; (f) Performance Evaluation (Cat and Dog): Structure Quality & Latency; (g) Optimization Stage Visual Comparison: Man in Suit on a Rainy Night; (h) Performance Evaluation (Man in Suit on a Rainy Night): Structure Quality & Latency.
Figure 5. Token-level visualizations of expert activation and attention sparsity. (a) Addition Problem (“...5 + 3?”): Attention Sparsity & Active Experts; (b) Subtraction Application (“Tom has 7-4 = ?”): Attention Sparsity & Active Experts; (c) Mixed Operation (“3 × 2 + 1 = ?”): Attention Sparsity & Active Experts; (d) Division Problem (“...6/2 = ?”): Attention Sparsity & Active Experts; (e) Remaining Quantity Follow-up (“...remain?”): Attention Sparsity & Active Experts; (f) Subtraction Question (“...9-6?”): Attention Sparsity & Active Experts; (g) Multiplication Operation (“2*2 = Ans”): Attention Sparsity & Active Experts; (h) Conditional Sentence Start (“If Tom has...”): Attention Sparsity & Active Experts.
Figure 6. Evaluation of robustness (a) and stability (b) under input perturbations.
Figure 7. Comparison of Attention Distribution across Optimization Stages. (a) Arithmetic Question with trailing Noise & Distractor words; (b) Arithmetic Command with trailing Extra Text; (c) Arithmetic Question with trailing Distractor word; (d) Conditional Statement with trailing Noise; (e) Arithmetic Command with trailing Irrelevant word; (f) Arithmetic Problem with trailing Random word; (g) Arithmetic Command with trailing Distractor word; (h) Arithmetic Command with trailing Note word; (i) Arithmetic Command with embedded Noise word; (j) Arithmetic Question with trailing Mistake word.
Table 1. Cross-platform ablation study of Q/S/R modules on A100 and Jetson Orin NX.

Configuration (LLaMA-7B) | Quantization (Q) | Sparse Attention (S) | Routing (R) | A100: Latency (ms) | A100: Memory (MB) | A100: Accuracy (%) | Jetson Orin NX: Latency (ms) | Jetson Orin NX: Memory (MB) | Jetson Orin NX: Accuracy (%)
Baseline | × | × | × | 100.0 | 9119 | 87.2 | 85.0 | 7902 | 87.0
Q | ✓ | × | × | 72.0 (↓28.0%) | 6218 (↓31.8%) | 86.8 (↓0.5%) | 61.5 (↓27.6%) | 5314 (↓32.8%) | 86.7 (↓0.3%)
S | × | ✓ | × | 78.5 (↓21.5%) | 7125 (↓21.9%) | 86.9 (↓0.3%) | 66.2 (↓22.1%) | 6116 (↓22.6%) | 86.8 (↓0.2%)
R | × | × | ✓ | 80.0 (↓20.0%) | 7073 (↓22.4%) | 87.1 (↓0.1%) | 67.5 (↓20.6%) | 6153 (↓22.1%) | 86.9 (↓0.1%)
Q + S | ✓ | ✓ | × | 56.2 (↓43.8%) | 5210 (↓42.9%) | 86.5 (↓0.8%) | 49.0 (↓42.4%) | 4147 (↓47.5%) | 86.3 (↓0.8%)
Q + R | ✓ | × | ✓ | 54.1 (↓45.9%) | 5012 (↓45.0%) | 86.7 (↓0.6%) | 47.8 (↓43.8%) | 4057 (↓48.7%) | 86.6 (↓0.5%)
S + R | × | ✓ | ✓ | 53.8 (↓46.2%) | 4931 (↓45.9%) | 86.6 (↓0.7%) | 46.0 (↓45.9%) | 4080 (↓48.4%) | 86.5 (↓0.6%)
Q + S + R | ✓ | ✓ | ✓ | 41.3 (↓58.7%) | 3900 (↓57.2%) | 87.0 (↓0.2%) | 37.5 (↓55.9%) | 3050 (↓61.4%) | 86.8 (↓0.2%)
Table 2. Ablation results of optimization modules on LLaMA-7B and GPT-NeoX-20B (A100 platform).

Configuration | Quantization (Q) | Sparse Attention (S) | Routing (R) | LLaMA-7B: Latency (ms) | LLaMA-7B: Memory (MB) | LLaMA-7B: Accuracy (%) | GPT-NeoX-20B: Latency (ms) | GPT-NeoX-20B: Memory (MB) | GPT-NeoX-20B: Accuracy (%)
Baseline | × | × | × | 100.0 | 9119 | 87.2 | 280.0 | 19130 | 86.8
Q | ✓ | × | × | 72.0 (↓28.0%) | 6218 (↓31.8%) | 86.8 (↓0.5%) | 202.1 (↓27.8%) | 13293 (↓30.5%) | 85.3 (↓1.7%)
S | × | ✓ | × | 78.5 (↓21.5%) | 7125 (↓21.9%) | 86.9 (↓0.3%) | 225.5 (↓19.5%) | 14583 (↓23.8%) | 85.5 (↓1.5%)
R | × | × | ✓ | 80.0 (↓20.0%) | 7073 (↓22.4%) | 87.1 (↓0.1%) | 232.1 (↓17.1%) | 14381 (↓24.8%) | 85.6 (↓1.4%)
Q + S | ✓ | ✓ | × | 56.2 (↓43.8%) | 5210 (↓42.9%) | 86.5 (↓0.8%) | 172.4 (↓38.4%) | 10943 (↓42.8%) | 85.1 (↓2.0%)
Q + R | ✓ | × | ✓ | 54.1 (↓45.9%) | 5012 (↓45.0%) | 86.7 (↓0.6%) | 169.3 (↓39.5%) | 10753 (↓43.8%) | 85.3 (↓1.7%)
S + R | × | ✓ | ✓ | 53.8 (↓46.2%) | 4931 (↓45.9%) | 86.6 (↓0.7%) | 175.3 (↓37.4%) | 10890 (↓43.1%) | 85.2 (↓1.8%)
Q + S + R | ✓ | ✓ | ✓ | 41.3 (↓58.7%) | 3900 (↓57.2%) | 87.0 (↓0.2%) | 96.3 (↓65.6%) | 7489 (↓60.9%) | 86.2 (↓0.7%)
Table 3. Evaluation of Stable Diffusion v1.5 across optimization methods and hardware platforms.

A100
# | Method | Latency (ms) | Memory (MB) | CLIP Score | FID | Throughput (images/s)
1 | Baseline [21] | 114.0 | 5402 | 0.819 | 12.0 | 8.8
2 | GPTQ [59] | 82.5 | 4120 | 0.801 | 13.5 | 12.1
3 | SparseGPT [43] | 71.1 | 3951 | 0.794 | 16.2 | 13.6
4 | FlashAttention2 [60] | 85.3 | 4213 | 0.808 | 13.1 | 11.3
5 | DiffMoE [50] | 66.4 | 3871 | 0.813 | 12.5 | 14.2
6 | Ours (Q + S + R) | 52.8 | 3180 | 0.815 | 11.7 | 18.9

NVIDIA Jetson Orin NX
# | Method | Latency (ms) | Memory (MB) | CLIP Score | FID | Throughput (images/s)
7 | Baseline | 2200.1 | 7901 | 0.812 | 12.4 | 0.47
8 | GPTQ | 1450.2 | 5311 | 0.796 | 13.9 | 0.76
9 | SparseGPT | 1250.1 | 5031 | 0.761 | 16.8 | 0.82
10 | FlashAttention2 | 1500.2 | 5470 | 0.772 | 13.8 | 0.71
11 | DiffMoE | 1150.1 | 4721 | 0.798 | 12.9 | 0.93
12 | Ours (Q + S + R) | 865.3 | 3182 | 0.811 | 12.1 | 1.16
Table 4. Inference results for Stable Diffusion v1.5 on A100 and Jetson Orin NX.

# | Platform | Model | Batch Size | Latency (ms) | Throughput (Images/s or Tokens/s) | Memory (MB)
1 | Jetson Orin NX | Stable Diffusion v1.5 | 1 | 865 | 1.16 | 3180
2 | Jetson Orin NX | Stable Diffusion v1.5 | 2 | 1530 | 1.31 | 3504
3 | Jetson Orin NX | Stable Diffusion v1.5 | 4 | 3200 | 1.64 | 4300
4 | Jetson Orin NX | Stable Diffusion v1.5 | 8 | 4571 | 1.92 | 5721
5 | A100 GPU | Stable Diffusion v1.5 | 1 | 114 | 8.8 | 5421
6 | A100 GPU | Stable Diffusion v1.5 | 2 | 132 | 24.8 | 6728
7 | A100 GPU | Stable Diffusion v1.5 | 4 | 147 | 37.3 | 7631
8 | A100 GPU | Stable Diffusion v1.5 | 8 | 152 | 44.2 | 8134
Table 5. Expert activation behavior under varying input complexity on A100 and Jetson Orin NX.

# | Input Complexity | Model | Platform | Avg Active Experts | Avg Latency (ms) | FID/Acc
1 | Text Length 64 | LLaMA-7B | A100 | 1.4 | 4.2 | 87.3%
2 | Text Length 256 | LLaMA-7B | A100 | 2.3 | 8.5 | 86.4%
3 | Text Length 1024 | LLaMA-7B | A100 | 3.4 | 14.3 | 85.8%
4 | Text Length 64 | LLaMA-7B | Jetson Orin NX | 1.2 | 11.5 | 86.1%
5 | Text Length 256 | LLaMA-7B | Jetson Orin NX | 2.1 | 25.7 | 85.3%
6 | Text Length 1024 | LLaMA-7B | Jetson Orin NX | 3.1 | 41.2 | 86.2%
7 | Resolution 224 × 224 | Stable Diffusion v1.5 | A100 | 1.1 | 96.5 | 11.9
8 | Resolution 256 × 256 | Stable Diffusion v1.5 | A100 | 1.7 | 112.4 | 11.6
9 | Resolution 512 × 512 | Stable Diffusion v1.5 | A100 | 2.6 | 129.4 | 11.3
10 | Resolution 224 × 224 | Stable Diffusion v1.5 | Jetson Orin NX | 1.2 | 865.7 | 12.5
11 | Resolution 256 × 256 | Stable Diffusion v1.5 | Jetson Orin NX | 2.2 | 1231.4 | 12.1
12 | Resolution 512 × 512 | Stable Diffusion v1.5 | Jetson Orin NX | 3.1 | 1610.3 | 11.8
Table 6. Inference robustness comparison under simulated resource constraints.

# | Method | Baseline Throughput (Tokens/s) | Constrained Throughput (Tokens/s) | Throughput Retention (%) | Latency Variance (%)
1 | GPTQ | 34.3 | 20.5 | 59.8 | 28.4
2 | FlashAttention2 | 33.5 | 19.7 | 58.8 | 32.1
3 | DiffMoE | 39.2 | 29.8 | 76.0 | 18.7
4 | Ours (Q + S + R) | 52.8 | 45.0 | 85.2 | 9.3
Table 7. Detailed robustness evaluation under text and image input perturbations.

# | Method | Perturbation Type | Task Type | Accuracy (%) | FID (↓) | Avg. Expert Activation | Recovery Strategy | Observed Failure Mode
1 | GPTQ | Irrelevant clause | Arithmetic | 80.5 | N/A | 1.3 | None | Wrong numeric reference
2 | GPTQ | Token reorder | Arithmetic | 83.6 | N/A | 1.5 | N/A | Early stop
3 | GPTQ | Low-res blur | Image | N/A | 16.1 | 1.4 | N/A | Blurred edges, broken shapes
4 | FlashAttention2 | Numeric distractor insertion | Arithmetic | 84.0 | N/A | 1.2 | N/A | Random hallucination
5 | FlashAttention2 | Occlusion mask | Image | N/A | 17.0 | 1.3 | N/A | Misses occluded object parts
6 | DiffMoE | Token reorder | Arithmetic | 87.1 | N/A | 2.4 | Re-balances MoE routing | Minor errors
7 | DiffMoE | Blur + occlusion | Image | N/A | 14.3 | 2.5 | Partial recovery via fallback experts | Shape distortion
8 | Ours (Q + S + R) | Distractor clause + reorder | Arithmetic | 96.3 | N/A | 2.9 | Selective routing + sparse expansion | Rare misalignment on punctuation
9 | Ours (Q + S + R) | Blur + occlusion + mismatch | Image | N/A | 11.7 | 3.1 | Enhanced expert activation on details | Structure preserved
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
