Orion: A Collaborative Edge Inference Framework for Large Language Models Processing Multi-Sensor Data in UAV Swarms

Yang, Tianchou; Guo, Hongjie; Zhao, Zhengyu; Zhu, Donglin

doi:10.3390/drones10060410


School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(6), 410; https://doi.org/10.3390/drones10060410

Submission received: 17 April 2026 / Revised: 15 May 2026 / Accepted: 21 May 2026 / Published: 26 May 2026

(This article belongs to the Special Issue Distributed Control, Optimization, and Game of UAV Swarm Systems (2nd Edition))


Highlights

What are the main findings?

Orion reduces LLM prefill latency by 78-81% on heterogeneous UAV swarm nodes compared to cloud-UAV baselines, and is the only framework that successfully runs a 70B-parameter LLM entirely on memory-constrained UAV onboard computers.
Adaptive sequence partitioning and predictive decoding eliminate pipeline bubbles and load imbalance, enabling near-linear scaling of inference latency with sensor sequence length.

What are the implications of the main findings?

Real-time, privacy-preserving LLM inference becomes feasible for autonomous UAV swarms in bandwidth-limited or disconnected environments (e.g., disaster response, surveillance).
The proposed collaborative edge framework provides a practical pathway to deploy large models on heterogeneous UAV fleets without cloud dependency, enhancing mission robustness and responsiveness.

Abstract

Unmanned aerial vehicle (UAV) swarms generate massive multi-modal sensor data streams from onboard payloads such as RGB cameras, LiDAR, and thermal sensors. Large language models (LLMs) can interpret these data for natural language-based swarm coordination. However, deploying LLMs directly on resource-constrained UAV nodes faces a critical bottleneck: long-context textual sensor logs (e.g., continuous status reports with GPS, altitude, and detection events) lead to high prefill latency. Existing distributed inference frameworks suffer from load imbalance and pipeline bubbles, violating real-time mission requirements. To address these issues, we propose Orion, an edge-only collaborative inference framework for LLM-based sensor data processing in heterogeneous UAV swarms. Orion incorporates three innovations: (1) optimal model partitioning via dynamic programming, (2) adaptive sequence partitioning that balances causal attention load across pipeline stages, and (3) a predictive decoding mechanism that speculatively generates the first token during idle intervals. Experiments on a comprehensive simulation framework (using Meta’s Llama-2 (Large Language Model Meta AI) 7B/13B/70B and simulated UAV swarm sensor traces) show that Orion reduces end-to-end latency by 81% (7B) and 78% (13B) compared to the best cloud–UAV baseline. Orion is the only framework capable of running the full 70B model on memory-constrained UAV nodes, enabling real-time sensor-aware LLM inference.

Keywords:

collaborative inference; edge computing; large language models; unmanned aerial vehicle

1. Introduction

Unmanned aerial vehicle (UAV) swarms are increasingly deployed for real-time surveillance, disaster response, and precision agriculture, generating massive multi-modal sensor data streams from onboard payloads such as RGB cameras, LiDAR, infrared/thermal sensors [1], inertial measurement units (IMUs), and acoustic arrays [2]. Large language models (LLMs) are renowned for their capabilities in natural language understanding and generation. They offer novel opportunities for interpreting and reasoning over heterogeneous sensor data. This enables natural language-based swarm coordination, autonomous decision-making [3], and human–swarm interaction [4]. For example, an LLM-powered command interpreter can process a high-level operator instruction such as “inspect the northern perimeter and report any anomalies” and translate it into a sequence of low-level control actions (waypoint navigation, camera zoom, thermal anomaly detection) across multiple UAVs [5,6].

Conventional LLM deployments rely heavily on cloud computing. This approach introduces significant challenges, including sensor-to-cloud latency and bandwidth saturation from continuous data uploads. It also raises privacy risks concerning raw, mission-sensitive measurements [7,8,9,10]. Edge computing has emerged as a promising alternative by processing sensor data streams closer to their source, thereby reducing response time and preserving the confidentiality of in situ sensing information [11]. Furthermore, recent surveys indicate that over 80% of industry experts believe personal or mission-critical LLMs should be fully or primarily hosted at the edge to ensure privacy-preserving inference on sensitive sensor data [12,13,14]. UAV swarms represent a typical edge environment, comprising various underutilized onboard computing nodes (e.g., NVIDIA Jetson modules, Raspberry Pi-class devices, and custom FPGA accelerators) interconnected via ad hoc wireless networks [15]. These devices can be aggregated into a collaborative edge resource pool to support efficient, in situ LLM inference on streaming sensor data. As illustrated in Figure 1, such a system enables seamless task coordination and execution: when a network of visual and thermal sensors detects a potential threat, an LLM processes the sensor event sequence and issues coordinated responses across the swarm.

However, deploying computationally intensive LLMs on resource-constrained UAV nodes remains challenging. These nodes suffer from limited computational power and memory capacity. Furthermore, real-world sensor data streams are inherently bursty and variable in length. Unlike cloud-based text inputs, sensor sequences in UAV missions can accumulate long contexts over time (e.g., continuous status reports with GPS, altitude, and detection events), requiring the LLM to process lengthy sensor traces. Existing solutions typically follow two paradigms: single-device deployment, which is constrained by onboard capabilities, and distributed frameworks such as edge–cloud collaboration [16,17]. Yet, these methods often suffer from high communication overhead and inefficient resource utilization, especially when processing long sensor input sequences, which significantly degrades user experience in real-time sensing applications.

To overcome these limitations, researchers have proposed various optimization techniques. Early efforts focused primarily on model compression and quantization, but these often sacrifice model accuracy on complex sensor reasoning tasks [18,19]. Subsequent distributed inference frameworks (e.g., EdgeShard [20] and QoS-aware routing approaches [21]) partially alleviated computational burdens but showed limited efficiency for single-sequence sensor data requests, which are a dominant pattern in UAV swarm interactions (e.g., processing a continuous sensor trace as one input sequence). More recently, systems like Jupiter [22] introduced intra-sequence pipeline parallelism but suffer from “prefill-to-decode transition bubbles” due to rigid phase segregation.

Motivated by these gaps, we propose Orion, a novel collaborative inference framework for sensor-aware LLM deployment at the edge. Named after the Orion constellation for its ability to harmonize scattered stars, our system integrates heterogeneous edge devices, including UAV onboard computers, ground control stations, and smart gateways, into a coherent and efficient inference network specifically optimized for real-time processing of multi-sensor data. Orion is designed to optimize the prefill stage of LLM inference, which is computationally demanding and critical for overall performance. By incorporating an optimal model partitioning strategy, an adaptive sequence splitting technique, and a predictive decoding mechanism, Orion significantly accelerates inference while maintaining model performance. Experiments on a comprehensive simulation framework (using Llama-2 7B/13B/70B and simulated UAV swarm sensor traces) demonstrate that Orion significantly outperforms existing cloud–edge and collaborative baselines, achieving major latency reductions while being the only framework capable of running the full 70B model on memory-constrained UAV nodes.

The main contributions of this paper are summarized as follows:

We propose Orion, an end-to-end collaborative inference framework for heterogeneous edge environments. It is the first to effectively resolve the computational idling and load imbalance issues during the LLM prefill phase caused by conventional pipeline architectures, specifically tailored to the characteristics of single-sequence sensor requests in UAV swarms.
We design a theoretically sound adaptive sequence partitioning algorithm and a predictive decoding mechanism, completely eliminating the inherent prefill pipeline bubbles found in state-of-the-art systems, thereby achieving a high degree of overlap between the computational resources of the prefill and decoding phases.
Extensive experiments on a comprehensive simulation framework (using Llama-2 7B/13B/70B and simulated UAV sensor traces) demonstrate Orion’s superior efficiency and scalability. Orion achieves end-to-end latency reductions of 81% (7B) and 78% (13B) over the best baseline, and uniquely supports the 70B model on resource-constrained UAV nodes.

2. Background and Motivation

In this section, we provide the necessary technical background to understand the challenges of deploying LLMs at the edge. We first deconstruct the core architecture of generative LLMs and their two-stage inference process in Section 2.1 and Section 2.2, respectively. Based on this foundation, in Section 2.3, we provide a detailed summary and analysis of the specific optimization challenges and motivations that arise in resource-constrained edge environments, thereby clearly justifying the need for our proposed Orion framework.

2.1. Core Architecture of Generative LLMs

The remarkable capabilities of modern generative LLMs are underpinned by the Transformer decoder architecture [23,24], which typically consists of tens or even hundreds of identical layers stacked upon one another. As shown in Figure 2, each decoder layer is a complex module designed to model long-range dependencies within sequential data and integrates several fundamental components:

QKV Projection: The input tokens are first projected into three distinct vector representations: Query (Q), Key (K), and Value (V). These vectors form the foundation of the self-attention mechanism. They allow the model to compute contextual relationships and relevance scores among all tokens in the sequence.
Masked Multi-Head Self-Attention: The attention scores are calculated between each token and all preceding tokens. A crucial masking mechanism is applied to prevent the model from attending to future tokens, preserving the autoregressive property essential for text generation. The “multi-head” aspect enables the model to simultaneously attend to information from different representation subspaces.
Feed-Forward Network (FFN): A position-wise feed-forward network applies a non-linear transformation (e.g., SwiGLU) to each token independently. This network often constitutes a large portion of the model’s parameters and is responsible for refining attended representations.
Residual Connections and Layer Normalization: Each sub-layer is wrapped with a residual connection and followed by layer normalization. This stabilizes training and enables the construction of very deep networks.
KV cache: A critical performance optimization during inference. The Key and Value states generated for all previous tokens are stored in a dynamic cache, avoiding extremely costly recomputation for every new generated token.

Collectively, these components impose immense computational (FLOPs) and memory (bandwidth and capacity) demands, which create a significant tension when deploying these models on resource-constrained edge devices.
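To make the memory side of this tension concrete, the following minimal sketch estimates the KV cache footprint of a decoder that uses plain multi-head attention. The function name and the Llama-2 7B shape used in the example (32 layers, hidden size 4096, FP16) are assumptions made for illustration and are not taken from the experiments in this paper.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for seq_len tokens (plain multi-head attention)."""
    # Two tensors (K and V) per layer, each holding hidden_size values per token.
    return 2 * num_layers * hidden_size * seq_len * bytes_per_value


# Assumed Llama-2 7B shape: 32 decoder layers, hidden size 4096, FP16 values.
per_token = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=1)
cache_4k = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=4096)
print(per_token / 2**20, "MiB per token")        # 0.5 MiB
print(cache_4k / 2**30, "GiB for 4096 tokens")   # 2.0 GiB
```

Under these assumptions a 4096-token sensor trace already needs 2 GiB for the cache alone, before any weights are loaded. Models that use grouped-query attention cache fewer values per token, so this formula is an upper bound for them.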

2.2. Generative Inference Process

The inference process of autoregressive LLMs can be distinctly divided into two stages, as visualized in Figure 2:

Prefill Stage: This stage begins once a user’s input prompt (represented as initial inputs X) is provided. The model performs a forward pass through all L layers for every token in the prompt to produce the first output token and populate the KV cache. The prefill stage is compute-bound. Due to the causal self-attention mechanism, its computational cost scales quadratically with the prompt length, making it exceptionally demanding for processing long-context sensor traces.
Autoregressive Decoding Stage: After prefill, the model enters a generation loop, producing tokens one by one. In each iteration, the latest generated token $y_{t}$ is fed back as input for the next step, illustrated by the autoregressive feedback loop $y_{t} \to y_{t+1}$. During this forward pass, the current hidden state $h_{\mathrm{in}}^{(l-1)}$ is projected to calculate the new query $Q_{t}^{(l)}$, while the newly generated key $K_{t}^{(l)}$ and value $V_{t}^{(l)}$ are appended to the KV cache at position t. Utilizing this dynamic KV cache avoids costly recomputation for historical tokens. This stage is primarily memory-bound. Its latency is dominated by the time required to read model parameters and the continuously growing KV cache from memory, rather than by raw computation.

While KV caching is indispensable, two bottlenecks persist: high prefill latency for long prompts and the inherently sequential nature of decoding. Both are severely magnified in edge environments with scarce resources and heterogeneous devices.
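The difference between the two stages can be illustrated by counting the query-key pairs that causal attention has to score. The sketch below is a simplified count per head and per layer, with hypothetical function names; it ignores the linear projections and the FFN.

```python
def prefill_attention_pairs(prompt_len: int) -> int:
    """Query-key pairs scored by causal attention over a whole prompt."""
    # Token t (1-indexed) attends to tokens 1..t, so the total is 1 + 2 + ... + n.
    return prompt_len * (prompt_len + 1) // 2


def decode_attention_pairs(cache_len: int) -> int:
    """Pairs scored when decoding one token against cache_len cached tokens plus itself."""
    return cache_len + 1


for n in (512, 1024, 2048):
    print(n, prefill_attention_pairs(n), decode_attention_pairs(n))
```

Doubling the prompt length roughly quadruples the prefill count, while a single decoding step grows only linearly with the cache length. This is why long sensor traces hurt the prefill stage in particular.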

2.3. Summary and Optimization Motivation

The analysis above shows that the prefill stage is the dominant contributor to latency for long-context sensor traces, which conflicts with the requirement for real-time mission responsiveness in autonomous UAV swarms. Furthermore, the heterogeneous nature of edge device pools exacerbates these inefficiencies. Traditional strategies either fail to utilize aggregate capacity or lead to severe load imbalance and crippling communication overhead. The limitations of existing solutions are clear:

Model Compression: Sacrifices generation quality, which is unacceptable for core user interactions [18,19].
Cloud Offloading: Introduces significant latency and raises severe privacy concerns [10].
Distributed Frameworks: Existing solutions like EdgeShard [20] are designed for batch processing and are ill-suited for single-sequence, heterogeneous edge environments.

2.4. Relation to Alternative Inference Paradigms in Distributed Sensing

While our work focuses on LLM-based inference for UAV swarm coordination, we acknowledge that the broader literature on distributed sensing systems offers alternative paradigms that do not rely on language models. For instance, in visual sensor networks, probabilistic methods such as Gaussian mixture models combined with deep embedded features have been successfully applied to stimulation model identification [25]. These approaches excel at handling low-dimensional, structured sensor data (e.g., pixel intensities or feature vectors) and provide well-calibrated uncertainty estimates. Similarly, in networked multi-agent systems, opinion dynamics models have been used to infer latent states from biased or indirect observations [26], offering insights into how decentralized agents reach consensus under social pressure or noisy communication.

However, these alternative methods share a common limitation: they assume that sensor observations can be represented as fixed-dimensional vectors or predefined features, and that the inference task can be cast as a statistical estimation or classification problem. In contrast, our UAV swarm scenario requires processing heterogeneous, multi-modal sensor logs (e.g., free-text descriptions of thermal anomalies, GPS tracks, and camera detections) and generating natural language commands for human–swarm interaction. LLMs are uniquely suited to this task because they (1) accept variable-length, unstructured text inputs that can serialize arbitrary sensor modalities, (2) perform complex reasoning across long contexts, and (3) produce human-readable instructions. Therefore, rather than competing with probabilistic or opinion-dynamics methods, our LLM-based framework addresses a complementary class of problems where natural language understanding is essential. Hybrid approaches that combine LLMs with uncertainty-aware probabilistic inference remain an exciting direction for future work.

3. Proposed Solution: The Orion Framework

To address the aforementioned challenges, we propose Orion, an edge-only collaborative inference framework specifically optimized for UAV swarms. The overall architecture and workflow of Orion are illustrated in Figure 3. The framework consists of three synergistic components:

Optimal LLM Partitioning (Section 3.1), which determines the layer-to-device mapping to minimize stage bottlenecks;
Adaptive Sequence Partitioning (Section 3.2), which flattens the latency curve of causal attention through dynamic programming;
Predictive Decoding (Section 3.3), which hides prefill-decode bubbles via speculative execution.

Together, these modules enable real-time, sensor-aware LLM inference by maximizing the utilization of heterogeneous edge resources.

3.1. Optimal LLM Partitioning Strategy

Deploying LLMs in heterogeneous edge environments faces severe memory walls and computational bottlenecks. To achieve efficient inference while preserving user privacy, this paper proposes a dynamic programming algorithm that jointly optimizes device selection and model layer partitioning. This strategy aims to divide the LLM layers into contiguous blocks and map them to an optimal combination of edge devices, thereby minimizing the pipeline bottleneck latency while strictly satisfying memory constraints.

3.1.1. System Model and Problem Formulation

(a) Edge-Only Device Filtering: Mission-critical UAV and edge scenarios have strict data security requirements. Therefore, model inference must be executed entirely locally. Data offloading to the cloud is strictly prohibited. Assuming an initial mixed device set $D_{\mathrm{all}}$ within the network, we first apply a rigorous filtering strategy to exclude all high-performance cloud nodes, thereby constructing a pure-edge candidate device set $D = \{d_{1}, d_{2}, \dots, d_{K}\}$. Each device $d_{j} \in D$ possesses a limited available memory capacity $C_{j}$ and a heterogeneous computational performance profile $p_{j}$.

(b) Inference Cost Model: Suppose the given LLM consists of N layers, denoted as $L = \{l_{0}, l_{1}, \dots, l_{N-1}\}$. For any contiguous layer interval $[u, v]$ (where $0 \leq u \leq v < N$), the total required memory $M_{\mathrm{req}}(u, v)$ and the computation latency $T_{\mathrm{comp}}(u, v, d_{j})$ when deployed on device $d_{j}$ can be formalized as:

$$M_{\mathrm{req}}(u, v) = \sum_{l=u}^{v} m_{l}, \tag{1}$$

$$T_{\mathrm{comp}}(u, v, d_{j}) = \sum_{l=u}^{v} \tau(l, p_{j}), \tag{2}$$

where $m_{l}$ represents the memory requirement of the l-th layer, and $\tau(l, p_{j})$ denotes the single-layer computation time on device $d_{j}$. To accelerate the algorithm, we leverage the prefix sum technique to compute these costs in $O(1)$ time.
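The prefix sum technique mentioned here can be sketched as follows. The helper names and the per-layer numbers are hypothetical; only the idea of answering interval queries for Eqs. (1) and (2) in constant time comes from the text.

```python
from itertools import accumulate


def build_prefix(values):
    """Return prefix sums P with P[0] = 0 and P[i + 1] = values[0] + ... + values[i]."""
    return [0.0] + list(accumulate(values))


def interval_sum(prefix, u, v):
    """Sum of values[u..v] (inclusive) in O(1), as needed by Eqs. (1) and (2)."""
    return prefix[v + 1] - prefix[u]


# Hypothetical per-layer memory (MB) and per-layer compute time (ms) on one device.
layer_mem_mb = [400.0, 400.0, 400.0, 400.0]
layer_time_ms = [5.0, 5.5, 5.0, 6.0]

mem_prefix = build_prefix(layer_mem_mb)
time_prefix = build_prefix(layer_time_ms)

print(interval_sum(mem_prefix, 1, 3))   # M_req(1, 3)
print(interval_sum(time_prefix, 0, 2))  # T_comp(0, 2, d_j)
```

One prefix array is needed for memory and one per device for computation time, since the per-layer time depends on the device profile.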

Furthermore, when the pipeline is partitioned between devices at the i-th layer, let $a_{i}$ (measured in megabytes) denote the size of the output activation tensor produced by layer i. This tensor must be transmitted to the next device. If the layer interval $[u, i]$ is assigned to device $d_{k}$ and $[i+1, v]$ is assigned to $d_{j}$, the network communication latency between them is expressed as:

$$T_{\mathrm{comm}}(d_{k}, d_{j}, a_{i}) = \frac{a_{i}}{B(d_{k}, d_{j})}, \tag{3}$$

where $B(d_{k}, d_{j})$ denotes the effective network bandwidth between the two specific devices.

(c) Optimization Objective: Based on the principle of Pipeline Parallelism (PP), the overall system throughput is dictated by the slowest stage (i.e., the bottleneck stage) [27]. Therefore, our objective is to determine a device allocation and layer partitioning scheme $P$ that minimizes the maximum latency (computation plus communication) across all participating stages, subject to the memory capacity constraints of each individual device. Let K denote the total number of pipeline stages, where the s-th stage executes layers $[u_{s}, v_{s}]$ on device $d_{s}$. The optimization problem can be formally defined as:

$$\min_{P} \max_{s \in \{1, \dots, K\}} \left( T_{\mathrm{comp}}(u_{s}, v_{s}, d_{s}) + T_{\mathrm{comm}}(d_{s-1}, d_{s}, a_{u_{s}-1}) \right), \tag{4}$$

$$\text{s.t.} \quad M_{\mathrm{req}}(u_{s}, v_{s}) \leq C_{s}, \quad \forall s \in \{1, \dots, K\}, \tag{5}$$

where $T_{\mathrm{comp}}(u_{s}, v_{s}, d_{s})$ is the computation latency of the layer interval on device $d_{s}$, $T_{\mathrm{comm}}(d_{s-1}, d_{s}, a_{u_{s}-1})$ is the transmission latency of the activation tensor from the preceding stage (which is 0 for $s = 1$), $M_{\mathrm{req}}(u_{s}, v_{s})$ represents the memory requirement of the assigned layers, and $C_{s}$ is the memory capacity constraint of the assigned device.
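As a minimal sketch of how Eqs. (3) to (5) fit together, the code below evaluates the bottleneck latency of one given partitioning plan. The device names, layer costs, bandwidth values, and the unit choice (MB, MB/s, ms) are assumptions made for this illustration.

```python
def comm_ms(activation_mb: float, bandwidth_mb_per_s: float) -> float:
    """Eq. (3), expressed in milliseconds: activation size over link bandwidth."""
    return activation_mb * 1000.0 / bandwidth_mb_per_s


def bottleneck_ms(stages, layer_time_ms, layer_mem_mb, capacity_mb, act_mb, bandwidth):
    """Evaluate the objective of Eq. (4) for one plan; return None if Eq. (5) fails.

    stages: list of (u, v, device) with contiguous, inclusive layer ranges.
    layer_time_ms[device][l]: per-layer time; bandwidth[(src, dst)]: MB/s.
    """
    worst, prev_dev = 0.0, None
    for u, v, dev in stages:
        if sum(layer_mem_mb[u:v + 1]) > capacity_mb[dev]:
            return None  # memory constraint (5) violated
        t = sum(layer_time_ms[dev][u:v + 1])
        if prev_dev is not None:
            # Receive the activation produced by layer u - 1 on the previous device.
            t += comm_ms(act_mb[u - 1], bandwidth[(prev_dev, dev)])
        worst = max(worst, t)
        prev_dev = dev
    return worst


# Hypothetical 4-layer model on two hypothetical devices.
layer_mem_mb = [400.0] * 4
act_mb = [8.0] * 4
layer_time_ms = {"orin": [5.0] * 4, "nano": [12.0] * 4}
capacity_mb = {"orin": 1300.0, "nano": 500.0}
bandwidth = {("orin", "nano"): 100.0, ("nano", "orin"): 100.0}

good_plan = [(0, 2, "orin"), (3, 3, "nano")]
bad_plan = [(0, 1, "nano"), (2, 3, "orin")]  # 800 MB does not fit on "nano"
print(bottleneck_ms(good_plan, layer_time_ms, layer_mem_mb, capacity_mb, act_mb, bandwidth))
print(bottleneck_ms(bad_plan, layer_time_ms, layer_mem_mb, capacity_mb, act_mb, bandwidth))
```

In the feasible plan the second stage is the bottleneck because the 8 MB activation transfer over a 100 MB/s link costs more than its single layer of computation. This shows why the objective has to account for communication as well as computation.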

3.1.2. Joint Optimization via Dynamic Programming

Given the heterogeneity of the devices and the prerequisite that a single device should not be recurrently assigned within the pipeline (to avoid complex micro-batch scheduling and potential deadlocks), this problem exhibits an optimal substructure. It can thus be solved utilizing dynamic programming with state compression (State-Compressed DP).

(a) State Representation: We define $O(i, S, d_{k})$ as the minimum achievable pipeline bottleneck latency when the prefix layers of the LLM, from $l_{0}$ to $l_{i}$, are partitioned and assigned to a device subset $S \subseteq D$, with the final contiguous block of layers allocated to device $d_{k} \in S$.

(b) Base Case: When only a single device $d_{j}$ is utilized to process the layers from $l_{0}$ to $l_{i}$, the latency is purely the computation time if the memory is sufficient; otherwise, it is set to infinity (indicating an infeasible state):

$$O(i, \{d_{j}\}, d_{j}) = \begin{cases} T_{\mathrm{comp}}(0, i, d_{j}), & \text{if } M_{\mathrm{req}}(0, i) \leq C_{j}, \\ \infty, & \text{otherwise.} \end{cases} \tag{6}$$

Note: in implementation, $\infty$ is replaced by a large sentinel value, e.g., $10^{9}$ ms.

(c) State Transition: Assume we have computed the optimal value for the prefix ending at layer $l_{i}$. We now consider allocating the subsequent layer interval $[i+1, m]$ to a new, previously unused device $d_{j}$ (where $d_{j} \in D \setminus S$). Let $T_{\mathrm{stage}}$ denote the local latency of this newly formed stage, which consists of the computation time on $d_{j}$ and the communication time required to receive the activation tensor from the predecessor $d_{k}$:

$$T_{\mathrm{stage}} = T_{\mathrm{comp}}(i+1, m, d_{j}) + T_{\mathrm{comm}}(d_{k}, d_{j}, a_{i}). \tag{7}$$

The state transition equation is formulated as a “Min-Max” problem:

$$O(m, S \cup \{d_{j}\}, d_{j}) = \min_{\substack{i < m \\ d_{k} \in S}} \max \{ O(i, S, d_{k}),\; T_{\mathrm{stage}} \}. \tag{8}$$

Constraint: The memory requirement must be strictly satisfied, i.e., $M_{\mathrm{req}}(i+1, m) \leq C_{j}$. This equation signifies that the new pipeline bottleneck is the greater of two values: the previous maximum bottleneck latency $O(i, S, d_{k})$, and the local latency of the current new stage $T_{\mathrm{stage}}$.

(d) Termination and Backtracking: Upon the allocation of all N layers, the globally optimal bottleneck latency $T_{\mathrm{best}}$ can be derived by iterating over all possible device subsets and their corresponding terminal devices:

$$T_{\mathrm{best}} = \min_{S \subseteq D,\; d_{k} \in S} O(N-1, S, d_{k}). \tag{9}$$

To retrieve the exact partitioning scheme, we maintain a tracking array $\Psi(i, S, d_{k}) = (\text{prev\_i}, \text{prev\_k})$ during the DP execution. After the optimization concludes, reverse backtracking is employed to output the precise sequence of (start_layer, end_layer, device_index) mappings.

3.1.3. Summary of Algorithm 1

The complete execution process of the optimal joint device selection and LLM partitioning is detailed in Algorithm 1. The algorithm operates through three distinct phases. First, it filters out cloud nodes to ensure strict edge-only deployment (Lines 1–4), then performs prefix sum precomputation for layer-wise memory and computational costs and initializes the tables (Lines 5–6). Next, during the forward dynamic programming phase (Lines 7–33), it systematically populates the state-compressed DP table $O(i, S, d_{k})$ by evaluating all valid device subsets and memory-compliant layer intervals. The optimal routing decisions for each state transition are simultaneously recorded in the tracking table $\Psi$. Finally, it executes a termination and backtracking phase (Lines 34–35) to evaluate the minimum global bottleneck latency across all terminal states and reconstruct the precise layer-to-device mapping sequence R.

Algorithm 1 Optimal Joint Device Selection and LLM Partitioning

Require: Number of LLM layers N, mixed device set $D_{\mathrm{all}}$, network profile $net$
Ensure: Optimal layer-to-device mapping sequence R
1: $D \leftarrow$ Filter out cloud devices from $D_{\mathrm{all}}$
2: if $D$ is empty then
3:     return $\emptyset$
4: end if
5: Precompute prefix sums for $M_{\mathrm{req}}$ and $T_{\mathrm{comp}}$ across all $d_{j} \in D$
6: Initialize DP table $O(i, S, d_{k}) \leftarrow \infty$ and choice table $\Psi \leftarrow \emptyset$
7: for $i = 0$ to $N - 1$ do
8:     for $d_{j} \in D$ do
9:         if $M_{\mathrm{req}}(0, i) \leq C_{j}$ then
10:             $O(i, \{d_{j}\}, d_{j}) \leftarrow T_{\mathrm{comp}}(0, i, d_{j})$
11:         end if
12:     end for
13: end for
14: for $i = 0$ to $N - 2$ do
15:     for each valid subset $S \subseteq D$ and $d_{k} \in S$ do
16:         $cur \leftarrow O(i, S, d_{k})$
17:         if $cur = \infty$ then
18:             continue
19:         end if
20:         for $m = i + 1$ to $N - 1$ do
21:             for $d_{j} \in D \setminus S$ do
22:                 if $M_{\mathrm{req}}(i + 1, m) \leq C_{j}$ then
23:                     $T_{\mathrm{stage}} \leftarrow T_{\mathrm{comp}}(i + 1, m, d_{j}) + T_{\mathrm{comm}}(d_{k}, d_{j}, a_{i})$
24:                     $T_{\mathrm{max}} \leftarrow \max(cur, T_{\mathrm{stage}})$
25:                     if $T_{\mathrm{max}} < O(m, S \cup \{d_{j}\}, d_{j})$ then
26:                         $O(m, S \cup \{d_{j}\}, d_{j}) \leftarrow T_{\mathrm{max}}$
27:                         Update tracking table $\Psi$ for backtracking
28:                     end if
29:                 end if
30:             end for
31:         end for
32:     end for
33: end for
34: Execute backtracking via $\Psi$ starting from $\min O(N - 1, S, d_{k})$
35: Reverse the sequence and return R
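A compact Python sketch of Algorithm 1 is given below, assuming cloud nodes have already been filtered out. Device subsets are encoded as bitmasks and the DP table is a dictionary in which a missing key stands for infinity; these encodings, the function name, and the toy inputs are illustrative choices rather than the authors' implementation. Activation sizes and bandwidths are assumed to use units whose ratio matches the unit of the layer times.

```python
import math


def partition_layers(layer_mem, layer_time, capacity, act, bandwidth):
    """Min-max layer partitioning over distinct devices (Eqs. 6-9).

    layer_mem[l]: memory of layer l; layer_time[j][l]: time of layer l on device j;
    capacity[j]: memory of device j; act[l]: activation size after layer l;
    bandwidth[k][j]: link bandwidth from device k to device j.
    Returns (bottleneck, [(start_layer, end_layer, device_index), ...]).
    """
    n, n_dev = len(layer_mem), len(capacity)
    # Prefix sums give O(1) interval queries for memory and compute time.
    mem_pre = [0.0]
    for m_l in layer_mem:
        mem_pre.append(mem_pre[-1] + m_l)
    time_pre = []
    for j in range(n_dev):
        row = [0.0]
        for t in layer_time[j]:
            row.append(row[-1] + t)
        time_pre.append(row)

    def mem(u, v):
        return mem_pre[v + 1] - mem_pre[u]

    def comp(u, v, j):
        return time_pre[j][v + 1] - time_pre[j][u]

    best = {}    # (i, mask, k) -> bottleneck latency; a missing key means infinity
    parent = {}  # (i, mask, k) -> previous state, or None for a base case

    # Base case (Eq. 6): a single device runs layers 0..i.
    for i in range(n):
        for j in range(n_dev):
            if mem(0, i) <= capacity[j]:
                best[(i, 1 << j, j)] = comp(0, i, j)
                parent[(i, 1 << j, j)] = None

    # Transition (Eqs. 7-8). Every transition increases the layer index, so all
    # states ending at layer i are final by the time they are expanded.
    for i in range(n - 1):
        for (_, mask, k), cur in [s for s in best.items() if s[0][0] == i]:
            for m in range(i + 1, n):
                for j in range(n_dev):
                    if mask & (1 << j) or mem(i + 1, m) > capacity[j]:
                        continue  # device already used, or block does not fit
                    stage = comp(i + 1, m, j) + act[i] / bandwidth[k][j]
                    cand = max(cur, stage)
                    key = (m, mask | (1 << j), j)
                    if cand < best.get(key, math.inf):
                        best[key] = cand
                        parent[key] = (i, mask, k)

    # Termination (Eq. 9) and backtracking.
    finals = [s for s in best if s[0] == n - 1]
    if not finals:
        return math.inf, []
    state = min(finals, key=best.get)
    bottleneck, plan = best[state], []
    while state is not None:
        prev = parent[state]
        start = 0 if prev is None else prev[0] + 1
        plan.append((start, state[0], state[2]))
        state = prev
    return bottleneck, plan[::-1]


# Toy instance: 4 equal layers, a fast device with room for 3 layers and a
# slower device with room for 2 (all numbers hypothetical).
toy_bottleneck, toy_plan = partition_layers(
    layer_mem=[1.0] * 4,
    layer_time=[[1.0] * 4, [2.0] * 4],
    capacity=[3.0, 2.0],
    act=[0.5] * 4,
    bandwidth=[[1.0, 1.0], [1.0, 1.0]],
)
print(toy_bottleneck, toy_plan)
```

In this toy instance neither device can hold all four layers, so the solver places three layers on the faster device and the last layer on the slower one.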

Unlike prior frameworks relying on heuristic or greedy allocation rules, this joint optimization algorithm mathematically guarantees the minimum pipeline bottleneck latency under heterogeneous memory constraints. A critical bottleneck in naive DP approaches is the repetitive calculation of layer-wise costs; by incorporating prefix sum precomputation, the query complexity for $T_{\mathrm{comp}}$ and $M_{\mathrm{req}}$ is reduced from $O(N)$ to $O(1)$. This optimization bounds the overall algorithmic time complexity to $O(N^{2} \cdot 3^{|D|})$. Because the number of collaborative devices $|D|$ in a practical aerial edge group is inherently limited to a compact cluster (e.g., $|D| \leq 8$) due to FANET physical constraints [28,29], the algorithm’s search space remains tractable. Empirical flight tests show that IEEE 802.11 [30] protocols suffer severe channel contention beyond 5–8 UAVs for tightly coupled tasks [29]; beyond this point, communication penalties outweigh computational gains. Constrained by this bound, our DP solver outputs the optimal partitioning strategy within milliseconds, satisfying real-time requirements. For swarms larger than 8, we discuss scalable extensions in Section 5.5.

3.2. Adaptive Sequence Partitioning Strategy

In the prefill phase, a typical UAV sensor reasoning request usually accumulates a long contextual sequence of continuous status reports (length S). Although the layer-wise partitioning strategy balances static memory and base computational loads across heterogeneous devices, it still processes the input sequence sequentially. As pointed out by existing works such as Jupiter [22], the serial processing of long sequences severely restricts the concurrency of the pipeline. However, due to the inherent algorithmic complexity of LLMs, simply dividing the sequence into uniform chunks introduces severe load imbalance. To address this critical pain point, we propose an Adaptive Sequence Partitioning strategy based on a precise Analytical Cost Model.

3.2.1. Quadratic Inference Cost Model for Causal Attention

(a) Non-constant Complexity of the Attention Mechanism: In Transformer-based LLMs, the computation time for a specific sequence chunk is not constant. Suppose the model processes a new chunk of length x with an existing historical KV cache of length y. The computation process consists of two main parts. First, linear transformations (e.g., Q, K, V projections and FFNs) exhibit computational complexity that grows linearly with x. Second, the causal self-attention mechanism computes attention within the new chunk as well as against the historical KV cache. Its complexity is proportional to $x^{2}$ and the cross-term $x \cdot y$.

(b) Latency Function of Sequence Chunks: By conducting hardware performance profiling of these underlying operators on real edge devices, we formulate the inference latency of a sequence chunk of length x with history y using a quadratic function $q(x, y)$:

$$q(x, y) = a \cdot x + b \cdot x \cdot y + c \cdot x^{2} + d. \tag{10}$$

The coefficients a, b, and c are device- and model-specific, while d represents the constant framework overhead. They can be obtained in two ways.

Theoretical derivation (used in our simulation): Based on the FLOPs count of each operation and the device’s peak throughput (TFLOPS). For a Transformer decoder layer, the linear terms (QKV projections, FFN) contribute to a, the cross-attention between new tokens and the KV cache contributes to $b \cdot x \cdot y$, and the intra-chunk causal attention contributes to $c \cdot x^{2}$. For example, on an NVIDIA Jetson AGX Orin (1.88 TFLOPS) running Llama-2 7B (FP16), the per-layer coefficients are derived as $a = 2.56$ ms, $b = 1.32 \times 10^{-4}$ ms, $c = 7.50 \times 10^{-5}$ ms, and $d = 2.0$ ms. In pipeline execution, a device hosting L layers multiplies these per-layer coefficients by L.

Empirical fitting: Profile the target device with a grid of $(x, y)$ pairs (e.g., 100 random combinations) and perform least-squares regression on the measured latencies to obtain calibrated coefficients for that specific hardware.
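The empirical fitting route can be sketched in plain Python as an ordinary least-squares fit of the four coefficients of Eq. (10). The function name, the column scaling, and the synthetic noise-free "measurements" below are assumptions for illustration; a real calibration would use profiled latencies, and the samples must vary in both x and y for the system to be solvable.

```python
def fit_cost_model(samples):
    """Least-squares fit of q(x, y) = a*x + b*x*y + c*x^2 + d.

    samples: iterable of (x, y, latency_ms). Returns (a, b, c, d).
    """
    rows = [([float(x), float(x * y), float(x * x), 1.0], float(t))
            for x, y, t in samples]
    # Scale each feature column to unit maximum to keep the system well conditioned.
    scale = [max(abs(r[0][j]) for r in rows) or 1.0 for j in range(4)]
    # Normal equations (A^T A) theta = A^T t, stored as an augmented 4 x 5 matrix.
    aug = [[0.0] * 5 for _ in range(4)]
    for feats, t in rows:
        f = [feats[j] / scale[j] for j in range(4)]
        for i in range(4):
            for j in range(4):
                aug[i][j] += f[i] * f[j]
            aug[i][4] += f[i] * t
    # Gaussian elimination with partial pivoting.
    for col in range(4):
        pivot = max(range(col, 4), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 4):
            factor = aug[r][col] / aug[col][col]
            for cc in range(col, 5):
                aug[r][cc] -= factor * aug[col][cc]
    # Back substitution, then undo the column scaling.
    theta = [0.0] * 4
    for i in range(3, -1, -1):
        s = aug[i][4] - sum(aug[i][j] * theta[j] for j in range(i + 1, 4))
        theta[i] = s / aug[i][i]
    return tuple(theta[j] / scale[j] for j in range(4))


# Synthetic check: generate latencies from known coefficients and recover them.
true_coeffs = (2.56, 1.32e-4, 7.50e-5, 2.0)
grid = [(x, y) for x in (32, 64, 128, 256) for y in (0, 256, 512, 1024)]
samples = [(x, y, true_coeffs[0] * x + true_coeffs[1] * x * y
            + true_coeffs[2] * x * x + true_coeffs[3]) for x, y in grid]
fitted = fit_cost_model(samples)
print(fitted)
```

With measured data the recovered values will differ from any theoretical derivation, which is the point of calibrating on the target hardware.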

In pipeline parallelism, the system’s overall throughput is governed by the slowest bottleneck stage. Therefore, the cost function $q(x, y)$ employed in our subsequent sequence partitioning strategy is fitted using the hardware performance parameters of the bottleneck device, conditional on the layer partitioning results established in Algorithm 1. Because the length of the historical KV cache y monotonically increases during inference, adopting a uniform static partitioning (i.e., fixed chunk length x) will inevitably cause the latency of subsequent chunks to escalate, triggering structural load imbalance.
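The effect of a fixed chunk length can be checked numerically with Eq. (10). The sketch below plugs in the per-layer coefficients quoted above for the Jetson AGX Orin and Llama-2 7B; the sequence length of 4096 and the chunk length of 512 are arbitrary values chosen for illustration.

```python
def q(x, y, a=2.56, b=1.32e-4, c=7.50e-5, d=2.0):
    """Eq. (10) with the per-layer coefficients quoted in the text (result in ms)."""
    return a * x + b * x * y + c * x * x + d


def uniform_chunk_latencies(seq_len, chunk_len):
    """Latency of each fixed-size chunk; the KV history y grows by chunk_len per chunk."""
    return [q(min(chunk_len, seq_len - y), y) for y in range(0, seq_len, chunk_len)]


lat = uniform_chunk_latencies(seq_len=4096, chunk_len=512)
for i, h in enumerate(lat, start=1):
    print(f"chunk {i}: {h:.1f} ms")
print(f"last/first ratio: {lat[-1] / lat[0]:.3f}")
```

Each chunk is slower than the one before it because the cross-term $b \cdot x \cdot y$ grows with the accumulated history. With these coefficients the last of the eight chunks takes roughly 18% longer than the first.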

(c) Optimization Objective: Our goal is to partition the input sequence of total length S into K non-uniform sub-sequences (chunks) $P = \{x_{1}, x_{2}, \dots, x_{K}\}$. Let $h_{i} = q(x_{i}, y_{i-1})$ denote the latency of the i-th chunk, where $y_{i-1} = \sum_{j < i} x_{j}$ is the history accumulated before that chunk (so $y_{0} = 0$). In a pipeline comprising D device stages, the theoretical end-to-end prefill latency can be approximated as the sum of the sequential processing time on a single device and the pipeline flush time:

T_{pipeline}(P) \approx \sum_{i = 1}^{K} h_{i} + (D - 1) \cdot \max_{1 \leq i \leq K} h_{i} .

(11)

Our ultimate objective is to find the optimal partition sequence $P^{*}$ that minimizes $T_{pipeline}$.
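As a concrete reading of Equations (10) and (11), the following sketch evaluates the pipeline latency of a given partition. It assumes the per-layer coefficients quoted for the Jetson AGX Orin and Llama-2 7B; the helper names and the `layers` parameter (how many layers the bottleneck device hosts) are illustrative choices, not part of Orion's code.

```python
from itertools import accumulate

# Per-layer coefficients in ms, as quoted in the text for AGX Orin / Llama-2 7B.
A, B, C, D_OVERHEAD = 2.56, 1.32e-4, 7.50e-5, 2.0


def q(x, y, layers=1):
    """Eq. (10): latency of a chunk of length x with history y, scaled by layer count."""
    return layers * (A * x + B * x * y + C * x * x + D_OVERHEAD)


def chunk_latencies(partition, layers=1):
    """h_i = q(x_i, y_{i-1}), where y_{i-1} is the sum of all earlier chunk lengths."""
    histories = [0, *accumulate(partition)][:-1]
    return [q(x, y, layers) for x, y in zip(partition, histories)]


def pipeline_latency(partition, num_stages, layers=1):
    """Eq. (11): sequential time on one device plus (D - 1) times the bottleneck chunk."""
    h = chunk_latencies(partition, layers)
    return sum(h) + (num_stages - 1) * max(h)


# Hypothetical comparison: uniform vs. shrinking chunks for S = 64 and D = 4.
uniform = pipeline_latency([16, 16, 16, 16], num_stages=4, layers=8)
shrinking = pipeline_latency([18, 17, 15, 14], num_stages=4, layers=8)
```

Under a uniform split the later chunks are always slower than the first, because the history term $b \cdot x \cdot y$ grows while x stays fixed; this is the imbalance the partitioning step targets.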

3.2.2. Min-Max DP for Sequence Partitioning

To completely eliminate the load imbalance within the sequence, we design a Min-Max Dynamic Programming (DP) algorithm.

(a) State Representation: We define the state $W[y][k]$ as the minimum bottleneck latency (i.e., $\max_{i} h_{i}$) achievable when exactly k chunks are used to process a sequence prefix of length y.

(b) Base Case: When both the sequence length and the number of chunks are 0, the latency is 0; otherwise, the state is initialized to infinity:

W[y][k] = \begin{cases} 0, & \text{if } y = 0 \text{ and } k = 0, \\ \infty, & \text{otherwise}. \end{cases}

(12)

(c) State Transition: Assuming we have computed the optimal bottleneck latency for processing a prefix of length l, we now evaluate the k-th chunk (with length $y - l$). The new bottleneck latency is the greater of the “maximum bottleneck of the first $k - 1$ chunks” and the “local latency of the current k-th chunk”. The state transition equation is expressed as:

W[y][k] = \min_{0 \leq l < y} \max \{ W[l][k - 1], \; q(y - l, l) \} .

(13)

Constraint: To prevent the generation of very small chunks that introduce excessive system scheduling overhead, we enforce a chunk length constraint $y - l \geq min\_len$ (except for the final residual chunk at the sequence end).

(d) Termination and Global Search: Since $W[y][k]$ only optimizes the bottleneck latency $\max(h_{i})$, the algorithm does not preset the total number of chunks K. Instead, after traversing all prefixes, it iterates over all feasible chunk counts $k \in [1, max\_seq]$ (where $max\_seq$ is the largest chunk count considered), backtracks to calculate the corresponding total cost $T_{pipeline}$, and outputs the partition scheme that minimizes the global cost:

T_{best} = \min_{1 \leq k \leq max\_seq} \left\{ \sum_{i = 1}^{k} h_{i} + (D - 1) \cdot W[S][k] \right\} .

(14)

3.2.3. Summary of Algorithm 2

The complete execution process of the aforementioned optimization is summarized in Algorithm 2. The algorithm first performs forward computation (Lines 4–21) to populate the DP table $W[y][k]$ and records the optimal split points via the tracking array $\Phi[y][k]$. Subsequently, it executes a global search and backtracking phase (Lines 22–34) to evaluate every feasible chunk count k and reconstruct the corresponding chunk sequence $P_{k}$. Unlike prior frameworks relying on rigid uniform splitting rules, this adaptive algorithm is designed so that the chunk length $x_{i}$ shrinks as the historical KV cache grows, flattening the latency curve across the prefill phase.

Algorithm 2 Adaptive Sequence Partitioning via Min-Max DP

Require: Total sequence length S, number of pipeline stages D, cost function $q(x, y)$, minimum chunk length $min\_len$
Ensure: Optimal sequence partition scheme $P^{*}$
1: Initialize DP table $W[0 \dots S][0 \dots max\_seq] \leftarrow \infty$
2: Initialize tracking array $\Phi[0 \dots S][0 \dots max\_seq] \leftarrow -1$
3: $W[0][0] \leftarrow 0$
4: for $y = 1$ to S do
5:   for $k = 1$ to $\min(max\_seq, y)$ do
6:     $best\_val \leftarrow \infty$, $best\_l \leftarrow -1$
7:     for $l = 0$ to $y - 1$ do
8:       $x \leftarrow y - l$ {Length of the current chunk}
9:       if $x < min\_len$ and this is not the last chunk then
10:        continue
11:      end if
12:      $cand \leftarrow \max(W[l][k - 1], q(x, l))$
13:      if $cand < best\_val$ then
14:        $best\_val \leftarrow cand$
15:        $best\_l \leftarrow l$
16:      end if
17:    end for
18:    $W[y][k] \leftarrow best\_val$
19:    $\Phi[y][k] \leftarrow best\_l$
20:  end for
21: end for
22: $global\_best\_cost \leftarrow \infty$, $P^{*} \leftarrow \emptyset$
23: for $k = 1$ to $max\_seq$ do
24:   if $W[S][k] = \infty$ then
25:     continue
26:   end if
27:   Backtrack using tracking array $\Phi$ starting from $(S, k)$ to reconstruct chunk sequence $P_{k}$
28:   Calculate the latency set $\{h_{i}\}$ for this sequence, where $h_{i} = q(x_{i}, y_{i - 1})$
29:   $T_{cost} \leftarrow \sum h_{i} + (D - 1) \cdot \max(h_{i})$
30:   if $T_{cost} < global\_best\_cost$ then
31:     $global\_best\_cost \leftarrow T_{cost}$
32:     $P^{*} \leftarrow P_{k}$
33:   end if
34: end for
35: return $P^{*}$
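Algorithm 2 can be transcribed into runnable Python almost line by line. The sketch below is one possible reading rather than the authors' implementation: the function name `min_max_partition` is hypothetical, `max_seq` defaults to S when not supplied, and a chunk is treated as "the last chunk" when it ends at position S, which is an interpretation of the exception stated in the constraint.

```python
import math


def min_max_partition(S, D, q, min_len=1, max_seq=None):
    """Min-max DP over chunk boundaries, then a global search over chunk counts."""
    max_seq = S if max_seq is None else min(max_seq, S)
    INF = math.inf
    W = [[INF] * (max_seq + 1) for _ in range(S + 1)]    # bottleneck latency
    phi = [[-1] * (max_seq + 1) for _ in range(S + 1)]   # best split point
    W[0][0] = 0.0

    # Forward pass (Lines 4-21 of Algorithm 2).
    for y in range(1, S + 1):
        for k in range(1, min(max_seq, y) + 1):
            for l in range(y):
                x = y - l
                # Assumption: only a chunk ending at S may be shorter than min_len.
                if x < min_len and y != S:
                    continue
                cand = max(W[l][k - 1], q(x, l))
                if cand < W[y][k]:
                    W[y][k] = cand
                    phi[y][k] = l

    # Global search and backtracking (Lines 22-34).
    best_cost, best_partition = INF, []
    for k in range(1, max_seq + 1):
        if W[S][k] == INF:
            continue
        chunks, y = [], S
        for j in range(k, 0, -1):
            l = phi[y][j]
            chunks.append(y - l)
            y = l
        chunks.reverse()
        history, latencies = 0, []
        for x in chunks:
            latencies.append(q(x, history))
            history += x
        cost = sum(latencies) + (D - 1) * max(latencies)
        if cost < best_cost:
            best_cost, best_partition = cost, chunks
    return best_partition, best_cost


def q_orin(x, y, layers=8):
    """Eq. (10) with the quoted per-layer coefficients; 8 hosted layers is hypothetical."""
    return layers * (2.56 * x + 1.32e-4 * x * y + 7.50e-5 * x * x + 2.0)


partition, cost = min_max_partition(S=64, D=4, q=q_orin, min_len=4, max_seq=16)
```

The three nested loops give a running time of $O(S^{2} \cdot max\_seq)$. Note that for each chunk count k the table stores only the partition with the smallest bottleneck, so the global search compares one candidate per k rather than every possible partition.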

3.2.4. Theoretical Boundary Analysis of Inter-Stage Transmission Latency

To make the theoretical evaluation of the total pipeline latency $T_{pipeline}$ more rigorous, we define the global communication bottleneck of the pipeline as $T_{comm\_max}$ (i.e., the maximum transmission latency across all inter-device hops) and compare it against the system’s global computational bottleneck $\max(h_{i})$. Depending on which of the two is larger, the system operates in one of two distinct regimes:

(a) Compute-Bound Regime ($T_{comm\_max} < \max(h_{i})$): When the maximum cross-device network transmission latency is strictly less than the maximum computational bottleneck of the pipeline, the entire system is compute-bound. In this ideal state, only the transmission of the first chunk $x_{1}$ across stages incurs an unavoidable initial network transmission latency (i.e., the communication “cold-start” overhead). As the pipeline advances, the cross-device transmission of all subsequent chunks is overlapped by the longer computation times. Consequently, the refined end-to-end total cost becomes:

T_{cost} = \sum_{i = 1}^{K} h_{i} + (D - 1) \cdot \max_{1 \leq i \leq K} h_{i} + \sum_{j = 1}^{D - 1} T_{comm}^{j \to j + 1}(x_{1}),

(15)

where $T_{comm}^{j \to j + 1}(x_{1})$ represents the transmission latency for transferring the activation tensor of the first chunk $x_{1}$ from the j-th device to the $(j + 1)$-th device. Under this mechanism, apart from the initial transmission latency of $x_{1}$, the subsequent network transmission latencies are largely hidden by their overlap with computation.

(b) Communication-Bound Regime ($T_{comm\_max} \geq \max(h_{i})$): Conversely, if severe congestion occurs in the edge network, causing the transmission latency to exceed the computational bottleneck, the network bandwidth becomes the new system bottleneck (i.e., the “Network Wall”). Under this extreme condition, the pipeline overlap mechanism fails, and the total latency is dominated by network throughput. Sequence partitioning alone cannot bypass this physical limitation.

(c) Real-World Feasibility in UAV Swarms: Although the communication-bound regime reveals the theoretical limit of this strategy, the vast majority of real-world UAV swarm inference scenarios naturally reside in the compute-bound regime. The prefill phase involves intensive matrix multiplications, and the computational complexity of causal attention grows quadratically ($O(N^{2})$), resulting in an extremely high computation density. In contrast, the data relayed between different devices are merely the intermediate hidden states at single-layer slices, which are relatively limited in data volume. Thus, in physical deployments, the system’s computation-to-communication ratio is typically far greater than 1. This implies that $T_{comm\_max} < \max(h_{i})$ is the norm, indicating that the Orion adaptive partitioning algorithm is practically relevant and robust.
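The regime test and the refined cost of Equation (15) can be checked numerically. The sketch below rests on an assumed size model that the text does not spell out: each token's hidden state is `hidden_dim` values of `bytes_per_value` bytes (4096 and 2 correspond to Llama-2 7B in FP16), and $T_{comm\_max}$ is taken over the largest chunk on the slowest hop. Function names are hypothetical.

```python
def comm_latency_ms(tokens, bandwidth_mbps, hidden_dim=4096, bytes_per_value=2):
    """Time to send one chunk's hidden states over a link (assumed size model)."""
    bits = tokens * hidden_dim * bytes_per_value * 8
    return bits / (bandwidth_mbps * 1e6) * 1e3


def refined_cost_ms(partition, chunk_latencies_ms, hop_bandwidths_mbps):
    """Return (regime, cost); cost follows Eq. (15) and is None when communication-bound."""
    num_stages = len(hop_bandwidths_mbps) + 1
    bottleneck = max(chunk_latencies_ms)
    # Assumption: the communication bottleneck is the largest chunk on the slowest hop.
    t_comm_max = max(comm_latency_ms(max(partition), bw) for bw in hop_bandwidths_mbps)
    if t_comm_max >= bottleneck:
        return "communication-bound", None
    # Only the first chunk pays the cold-start transfer on every hop.
    cold_start = sum(comm_latency_ms(partition[0], bw) for bw in hop_bandwidths_mbps)
    cost = sum(chunk_latencies_ms) + (num_stages - 1) * bottleneck + cold_start
    return "compute-bound", cost


# Hypothetical 4-stage pipeline with 1000 Mb/s links and two 16-token chunks.
regime, cost = refined_cost_ms([16, 16], [100.0, 101.0], [1000, 1000, 1000])
```

Under this size model a 16-token chunk takes about 1 ms to cross a 1000 Mb/s link, which illustrates why the cold-start term is small next to chunk latencies of tens of milliseconds or more.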

3.3. Predictive Decoding Mechanism

After Optimal LLM Partitioning and Adaptive Sequence Partitioning, our pipeline is fully established. Our analysis of decoder-based LLM applications reveals that their inherent autoregressive nature introduces significant pipeline bubbles during the transition between the prefill and decoding stages. Specifically, let $T_{prefill}^{(1)}$ denote the prefill completion time of the first device in the pipeline, and $T_{prefill}^{(D)}$ denote the completion time of the final device (the D-th device). The autoregressive decoding stage must wait for the global prefill phase to finish before it can generate the ground-truth first token, $t_{1}^{*}$. This creates a period of device idle time between the completion of the first device’s computation and the availability of the first generated token. We define this pipeline bubble as $T_{bubble} = T_{prefill}^{(D)} - T_{prefill}^{(1)}$, which results in extremely low resource utilization on the first device.

To address this limitation, Orion introduces a predictive decoding mechanism. The core insight involves leveraging intermediate hidden states computed during the prefill stage to speculatively predict the first output token before the prefill stage fully completes. We deploy a lightweight prediction module, $H_{\theta}$ (implemented as a multi-layer perceptron or a linear classification head), on the first device in the pipeline. This module takes the hidden state representation of the local final subsequence, $H_{local}$, as input and generates a probability distribution over the vocabulary for the first token. During inference, this module produces a candidate token with a negligible computational overhead, $T_{predict}$:

\hat{t}_{1} = \arg\max \left( H_{\theta}(H_{local}) \right) .

(16)

This operation initiates the decoding stage speculatively in advance, hiding the latency $T_{bubble}$ that would otherwise be spent waiting for prefill completion.

When the actual prefill stage completes on device D and produces the verified ground-truth first token, $t_{1}^{*}$, the system performs a validation check. If the predicted candidate token matches the actual token (i.e., $\hat{t}_{1} = t_{1}^{*}$), the system accepts the speculative execution and continues generating subsequent tokens, having eliminated the pipeline bubble. If the prediction proves incorrect (i.e., $\hat{t}_{1} \neq t_{1}^{*}$), the system discards the speculative computation (e.g., by truncating the speculatively generated KV cache) and restarts decoding from the correct token, $t_{1}^{*}$. This rollback process introduces a minor overhead for communication and state cleanup, $T_{rollback}$. This design trades the occasional penalty of recomputation against the consistent benefit of latency reduction, decreasing overall inference time without compromising generation quality. The lightweight nature of the prediction module ensures that the rollback overhead, $T_{rollback}$, remains small compared to the benefit of eliminating $T_{bubble}$. To formally quantify the trade-off of our predictive decoding mechanism, we define the expected latency savings, $E[\Delta T]$, for a single decoding cycle as follows:

E[\Delta T] = P_{acc} \cdot T_{bubble} - T_{predict} - (1 - P_{acc}) \cdot T_{rollback},

(17)

where $P_{acc}$ is the prediction accuracy, and $T_{bubble}$ denotes the idle pipeline stall time saved upon a correct prediction. In our cross-UAV framework, $T_{bubble}$ is relatively large due to the inherent sequential stage dependency.
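Setting $E[\Delta T] = 0$ in Equation (17) and solving for $P_{acc}$ gives the break-even accuracy $(T_{predict} + T_{rollback}) / (T_{bubble} + T_{rollback})$. The sketch below evaluates both quantities; the function names and the timing values are hypothetical and are not taken from the paper's measurements.

```python
def expected_savings(p_acc, t_bubble, t_predict, t_rollback):
    """Eq. (17): expected latency saved per decoding cycle (same unit as the inputs)."""
    return p_acc * t_bubble - t_predict - (1.0 - p_acc) * t_rollback


def break_even_accuracy(t_bubble, t_predict, t_rollback):
    """Accuracy at which Eq. (17) equals zero."""
    return (t_predict + t_rollback) / (t_bubble + t_rollback)


# Hypothetical timings in ms, chosen only to illustrate the asymmetry.
T_BUBBLE, T_PREDICT, T_ROLLBACK = 1000.0, 5.0, 20.0

threshold = break_even_accuracy(T_BUBBLE, T_PREDICT, T_ROLLBACK)
savings_at_low_accuracy = expected_savings(0.175, T_BUBBLE, T_PREDICT, T_ROLLBACK)
```

Because the threshold is a ratio of the small overheads to the large bubble, it drops as the bubble grows, so longer pipelines tolerate less accurate prediction heads.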

Crucially, the costs and benefits are highly asymmetric, because both penalties are extremely low. First, the prediction overhead $T_{predict}$ is negligible. Following the head-based speculative decoding paradigm [31], similar to Medusa [32], adding a lightweight MLP head introduces trivial computational overhead compared to the backbone model. Second, the rollback penalty $T_{rollback}$ remains extremely low even in a distributed setting. Locally, as established by foundational speculative decoding principles [33], rejecting a speculated token merely requires an $O(1)$ KV cache truncation without extra recomputation burden. In a distributed cross-UAV pipeline, $T_{rollback}$ additionally entails cross-device state synchronization (e.g., broadcasting the rollback command and the corrected token ID) and token correction scheduling. However, unlike the prefill phase, which transmits massive activation tensors, this rollback synchronization merely involves a few bytes of control metadata. Thus, even when accounting for this minimal network broadcasting overhead, the total $T_{rollback}$ remains on the order of milliseconds, which is orders of magnitude smaller than the macro-pipeline bubble we aim to hide. This asymmetric cost–benefit structure means that the system can achieve net-positive speedups even at low accuracy thresholds.
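The accept-or-rollback step can be illustrated with a toy cache model. This is a simplified sketch, not Orion's runtime: the `KVCache` class is a hypothetical stand-in that tracks only how many entries are valid, which is what makes truncation a constant-time operation, and the cross-device broadcast is left out.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    """Toy preallocated cache: only entries [0, valid_len) are considered live."""
    capacity: int
    valid_len: int = 0

    def append(self, n=1):
        if self.valid_len + n > self.capacity:
            raise ValueError("KV cache capacity exceeded")
        self.valid_len += n

    def truncate(self, length):
        # O(1): stale entries are simply ignored and overwritten later.
        self.valid_len = min(self.valid_len, length)


def resolve_first_token(cache, prompt_len, predicted, actual):
    """Keep speculative work on a match; otherwise discard it and restart from `actual`."""
    if predicted == actual:
        return "accepted", predicted
    cache.truncate(prompt_len)   # drop entries produced from the wrong token
    return "rolled_back", actual


cache = KVCache(capacity=128)
cache.append(64)   # prefill of a 64-token prompt
cache.append(3)    # three speculative decoding steps
status, first_token = resolve_first_token(cache, prompt_len=64, predicted=7, actual=9)
```

In a distributed pipeline each device would apply the same truncation after receiving the rollback command, which is why only the command and the corrected token ID need to be sent.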

4. Case Study

4.1. Experimental Setup

We implemented a comprehensive simulation framework in Python 3.13.5 to evaluate Orion’s performance rigorously. Our custom simulation accurately models UAV onboard capabilities, such as compute performance and memory capacity. It also captures aerial ad hoc network characteristics, including bandwidth and latency. Finally, it simulates the exact LLM inference process on UAV nodes. This comprehensive design enables a realistic reproduction of aerial edge computing conditions and provides reliable performance estimation.

Testbed: Our experimental testbed comprises five heterogeneous devices: three NVIDIA AGX Orin modules (each with 1.88 TFLOPS, 16 GB RAM) and one NVIDIA Orin NX module (3.33 TFLOPS, 32 GB RAM) acting as the UAV swarm, alongside one cloud base station (with an RTX 3090 GPU, 36 TFLOPS, 32 GB RAM). To evaluate internal network sensitivity, the ad hoc bandwidth between UAV nodes was varied from 400 to 1000 Mb/s, with a ±20% fluctuation applied to all bandwidth settings to simulate realistic aerial link instability. This fluctuation approximates the severe channel non-stationarity in FANETs caused by trajectory changes, antenna masking, and ISM interference, as documented in recent aerial measurements [34]. The air-to-ground (UAV-to-cloud) bandwidth was also varied from 100 to 400 Mb/s to assess the impact of external network conditions on the performance of cloud-reliant baselines.
Benchmarks: We evaluated Orion using the Llama-2 model family (7B, 13B, and 70B parameters) to test scalability. Experiments included varying sensor log lengths (16–128 tokens) and air-to-ground network bandwidths (100–400 Mb/s) to assess adaptability. Prefill latency was the primary metric.
Baselines: We compare Orion against five baselines:
– Edge-Solo: The entire LLM runs on a single UAV node (AGX Orin) without any partitioning. This represents the typical setup for independent onboard deployment.
– Cloud-Edge-Even: The LLM is evenly split between the most powerful UAV node and the cloud base station. This represents a naive cloud-offloading strategy.
– Cloud-Edge-Opt: The LLM is optimally partitioned between the most powerful UAV node and the cloud base station using our dynamic programming algorithm. This represents the best possible cloud-assisted baseline.
– EdgeShard [20]: A collaborative edge computing framework that utilizes dynamic programming for joint device selection and model partitioning to orchestrate pipeline-parallel LLM inference.
– Jupiter [22]: A state-of-the-art resource-efficient collaborative inference system that leverages intra-sequence pipeline parallelism for the prefill phase and speculative decoding for autoregressive generation.

The complete set of simulation parameters is listed in Table 1.

4.2. Latency on Llama-2 Models of Varying Scales

In this experiment, we fixed the sensor input length to 64 tokens to evaluate the scalability of Orion across different model sizes. The network environment was configured with a stable inter-UAV bandwidth of 1000 Mb/s and an air-to-ground bandwidth of 200 Mb/s.

The comprehensive evaluation of Orion across the Llama-2 model family (7B, 13B, and 70B parameters), as shown in Figure 4, demonstrates its superior performance and scalability in aerial edge environments. For the 7B model, Orion achieved a latency of 8864 ms, representing a substantial reduction compared to Cloud-Edge-Opt (46,522 ms). While slightly higher than Edge-Solo (8047 ms), Orion provides significantly stronger mission data protection and scalability, and outperforms distributed baselines Jupiter (11,260 ms) and EdgeShard (24,629 ms).

With the 13B model, Edge-Solo failed with out-of-memory (OOM) errors, and cloud-edge baselines exhibited severe delays above 50,000 ms (Cloud-Edge-Even at 54,485 ms, Cloud-Edge-Opt at 53,166 ms). In contrast, Orion successfully executed inference at 11,824 ms, not only massively outperforming cloud-edge alternatives but also maintaining a clear lead over Jupiter (16,870 ms) and EdgeShard (29,045 ms).

At the massive 70B scale, cloud-edge baselines entirely failed due to memory constraints, underscoring the inherent limitations of existing offloading approaches. Among all successful distributed solutions, Orion remained the most efficient with an 18,665 ms latency, significantly outperforming Jupiter (27,264 ms) and EdgeShard (43,058 ms). These findings highlight Orion’s unique capability to deploy increasingly large LLMs on memory-constrained UAV devices while maintaining a substantial performance advantage over existing state-of-the-art collaborative frameworks.

4.3. Latency on Llama-2 13B of Varying Sensor Log Lengths

In this experiment, we fixed the model to Llama-2 13B and evaluated latency under varying sensor log lengths (16, 32, 64, and 128 tokens). To ensure a consistent communication environment, the inter-UAV bandwidth was maintained at 1000 Mb/s (±20% fluctuation), while the air-to-ground bandwidth was set to 200 Mb/s.

As shown in Figure 5, Orion demonstrates remarkable scalability compared to all baselines. With a shorter sensor trace of 16 tokens, Orion achieved a latency of only 3229 ms, whereas Cloud-Edge-Opt required 13,118 ms, and EdgeShard and Jupiter required 7396 ms and 7770 ms, respectively. As the continuous sensor log length increased, Orion maintained highly stable growth, reaching 22,076 ms at 128 tokens. In stark contrast, cloud-edge baselines exhibited dramatic latency escalation (Cloud-Edge-Even reaching 107,646 ms and Cloud-Edge-Opt reaching 104,017 ms). Similarly, the collaborative frameworks EdgeShard and Jupiter experienced latency increases up to 44,870 ms and 25,187 ms, respectively, showing larger performance degradation than Orion. Overall, these results validate Orion’s effectiveness and adaptability in handling accumulated multi-modal inputs of varying lengths, proving its practical advantage for long-context sensor interactions in autonomous UAV swarms.

4.4. Latency on Llama-2 13B of Varying Air-to-Ground Bandwidths

In this experiment, we fixed the model to Llama-2 13B with a sensor input length of 64 tokens and evaluated performance under varying air-to-ground bandwidths (100, 150, 200, and 400 Mb/s). During these tests, the internal inter-UAV bandwidth was maintained at a nominal 1000 Mb/s (±20% fluctuation).

The results in Figure 6 reveal that cloud-reliant baselines are highly sensitive to aerial network conditions. When bandwidth increased from 100 Mb/s to 400 Mb/s, Cloud-Edge-Even latency plummeted from 105,424 ms to 30,037 ms, and Cloud-Edge-Opt showed a similar drastic fluctuation (from 101,155 ms down to 28,959 ms). This pronounced variance demonstrates the inherent vulnerability of cloud-based approaches when air-to-ground bandwidth is constrained or unstable during flight operations.

Conversely, Orion, EdgeShard, and Jupiter exhibited minimal sensitivity to external network variations due to localized swarm collaboration. Yet, EdgeShard and Jupiter still experienced slight latency fluctuations as they maintain partial cloud-device involvement. Orion, completely decoupled from the cloud during the swarm inference process, achieved the lowest and most consistent latency (ranging from 12,294 ms at 100 Mb/s to 11,163 ms at 400 Mb/s). While Jupiter and EdgeShard performed well, their overall latencies remained higher than Orion’s. These findings emphasize that, beyond its low-latency benefits, Orion possesses exceptional robustness, ensuring reliable deployment under dynamic and heterogeneous aerial network environments.

4.5. Latency on Llama-2 13B of Varying Inter-UAV Bandwidths

In this experiment, we fixed the model to Llama-2 13B with a sensor input length of 64 tokens and evaluated performance under varying inter-UAV bandwidths (400, 600, 800, and 1000 Mb/s). During these tests, the air-to-ground bandwidth was maintained at a constant 200 Mb/s.

The results in Figure 7 reveal that internal swarm network conditions significantly impact the performance of collaborative inference frameworks. When the inter-UAV bandwidth increased from 400 Mb/s to 1000 Mb/s, EdgeShard’s latency dropped sharply from 54,768 ms to 30,079 ms. This pronounced variance demonstrates that traditional collaborative frameworks like EdgeShard are highly sensitive to internal communication capacity, often suffering from severe pipeline bottlenecks when bandwidth is constrained.

Conversely, Orion and Jupiter were far less sensitive across the tested bandwidth range. As discussed in Section 3.2.4, this is primarily because both frameworks overlap computation with communication, which reduces their sensitivity to bandwidth fluctuations. Jupiter maintained a relatively consistent latency, ranging from 18,211 ms at 400 Mb/s to 16,984 ms at 1000 Mb/s. Orion, leveraging optimal model partitioning and adaptive sequence parallelism, achieved the lowest latency at every tested bandwidth, ranging from 16,746 ms at 400 Mb/s to 12,432 ms at 1000 Mb/s, although its latency varied more across the range than Jupiter’s. These findings indicate that Orion makes effective use of available inter-UAV link resources, supporting robust real-time inference in dynamic and heterogeneous aerial network environments.

4.6. Ablation Study

To evaluate the contribution of each core innovation in Orion, we conducted an ablation study using the Llama-2 model family (7B, 13B, and 70B). The experimental settings remained consistent with Section 4.2: a 64-token sensor input length, 1000 Mb/s inter-UAV bandwidth, and 200 Mb/s air-to-ground bandwidth. We compared the full Orion framework against two variants: (1) Orion-no-SP, which disables the Adaptive Sequence Partitioning strategy; and (2) Orion-no-pre, which removes the predictive decoding mechanism.

The results in Figure 8 highlight the impact of each component on inference latency. The exclusion of the Adaptive Sequence Partitioning strategy (Orion-no-SP) led to the most significant performance degradation across all scales. Specifically, for the Llama-2 70B model, the latency surged from 16,793 ms to 42,092 ms, an increase of about 151%. This confirms that without adaptive partitioning, the pipeline suffers from severe load imbalance and structural bubbles caused by the non-constant complexity of the causal attention mechanism.

The predictive decoding mechanism also plays a vital role in optimizing end-to-end responsiveness. Removing this component (Orion-no-pre) resulted in a visible latency increase, with the 70B model’s latency rising to 26,305 ms compared to the full Orion’s 16,793 ms. Similar trends were observed for the 7B and 13B models, where the absence of predictive decoding increased latency to 10,592 ms and 15,962 ms, respectively. These findings demonstrate that by speculatively generating tokens during idle intervals, Orion hides the prefill-decode bubbles that typically hinder conventional distributed inference. Overall, the synergy between adaptive sequence partitioning and predictive decoding is what enables Orion to achieve its real-time performance on resource-constrained UAV nodes.

4.7. Investigation of Prediction Accuracy on End-to-End Latency

To investigate the impact of prediction accuracy on end-to-end latency, particularly concerning the potential rollback overhead caused by lower accuracy in early layers (e.g., 17.5% as reported in Section 5.2), we conducted a break-even microbenchmark using our comprehensive simulation framework. In this experiment, the inter-UAV wireless bandwidth was set to 1000 Mb/s with a ±20% fluctuation to simulate the dynamic nature of aerial communication channels. To ensure statistical robustness and account for stochastic environmental factors, we performed 300 Monte Carlo simulations for each configuration and took the average values as the final results.

Figure 9 illustrates the end-to-end latency of Orion under varying prediction accuracies for the Llama-2 13B model with a 64-token sensor input sequence. As demonstrated, the asymmetry between the large pipeline bubble ($T_{bubble}$) and the negligible prediction and rollback overheads ($T_{predict}$ and $T_{rollback}$) results in an extremely low break-even threshold. With the support of high-bandwidth (1000 Mb/s) links, broadcasting several bytes of control metadata for rollback incurs only microsecond-level network latency, maintaining a minimal $T_{rollback}$.

The simulation results show that Orion’s latency curve intersects the baseline (Orion-no-pre) and achieves a net-positive latency reduction (i.e., $E[\Delta T] > 0$) at an accuracy threshold of merely 8.94%. Consequently, even at the lowest observed early-layer accuracy of 17.5%, Orion operates well within the net-positive gain regime. This confirms that, at the accuracies observed, our predictive decoding reduces overall latency rather than degrading it under realistic UAV communication conditions.

4.8. Robustness Analysis Under Environmental Uncertainty

To evaluate the robustness of the proposed static DP layer partitioning strategy against uncertainties in real-world deployments, we conducted extensive Monte Carlo noise injection experiments. The experiments were performed using the Llama-2 13B model (prefill sequence length $S = 64$) on a collaborative inference system comprising four heterogeneous edge devices (3× AGX Orin and 1× Orin NX). Environmental non-determinism was modeled along two physical dimensions: computation noise, where a multiplicative Gaussian perturbation $\epsilon_{comp} \sim N(0, \sigma_{comp}^{2})$ was applied to each layer’s base latency to simulate chip load fluctuations and thermal throttling; and communication noise, where a multiplicative perturbation $\epsilon_{comm} \sim N(0, \sigma_{comm}^{2})$ was applied to inter-device bandwidth to simulate edge network jitter and channel quality variations. Two scenarios were established: nominal noise ($\sigma_{comp} = 10\%$, $\sigma_{comm} = 20\%$) and extreme noise (both $\sigma = 30\%$), with $N = 100$ independent trials executed for each. In each trial, the static partition $R_{static}$ (derived under a clean baseline of 16.79 s) was re-evaluated to obtain $L_{static}$, which was then compared against the theoretical optimal latency $L_{opt}$ obtained by re-running the DP solver under the same perturbed parameters. We define the latency degradation rate as $(L_{static} - L_{opt}) / L_{opt}$.
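The noise model and the two summary metrics used in this analysis can be sketched as follows. The snippet assumes that "multiplicative" means scaling a base value by $(1 + \epsilon)$ and clamps the factor at a small positive floor so that latencies stay positive; both choices, the function names, and the layer latencies are illustrative assumptions. It does not re-run the layer-partitioning DP.

```python
import random
import statistics


def perturb(value, sigma, rng, floor=0.05):
    """Scale `value` by (1 + eps), eps ~ N(0, sigma^2); the floor is an assumption."""
    return value * max(floor, 1.0 + rng.gauss(0.0, sigma))


def degradation_rate(l_static, l_opt):
    """(L_static - L_opt) / L_opt, as defined in the text."""
    return (l_static - l_opt) / l_opt


def coefficient_of_variation(samples):
    """Standard deviation divided by the mean of the sampled latencies."""
    return statistics.pstdev(samples) / statistics.mean(samples)


# Hypothetical example: 8 layers of 50 ms each under nominal computation noise.
rng = random.Random(42)
layer_ms = [50.0] * 8
trial_totals = [sum(perturb(t, 0.10, rng) for t in layer_ms) for _ in range(100)]
cv = coefficient_of_variation(trial_totals)
```

Because independent per-layer perturbations partly cancel when summed, the CV of the total is smaller than the per-layer $\sigma$, which is consistent with a CV below the injected noise level.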

As illustrated in Figure 10, Orion’s static partitioning exhibits strong resilience. Under nominal noise, the latency distribution is highly concentrated with a mean of approximately 17.04 s (Figure 10a), yielding a coefficient of variation (CV) of merely 0.067, which indicates highly predictable performance. The per-trial latency curves (Figure 10b) show that despite environmental fluctuations, $L_{static}$ closely tracks the theoretical optimum $L_{opt}$ without catastrophic performance collapse. Although under continuous perturbations the static strategy remains the exact theoretical optimum (i.e., a Match) in only 15.0% of trials (Figure 10c), the average latency degradation is merely 9.3%. Even under the extreme scenario, the degradation distribution remains predominantly in the low-value region (Figure 10d), with the average performance loss limited to 12.7%. The summary metrics (Figure 10e) contrast the CV and match rates across scenarios, while the Cumulative Distribution Function (CDF) plot (Figure 10f) demonstrates that the nominal latency curve closely adheres to the clean baseline. These results empirically confirm that the static DP baseline absorbs physical environmental jitter, implying that the system does not require frequent, high-overhead runtime re-partitioning to chase marginal theoretical limits, thereby maximizing framework stability and minimizing scheduling overhead without sacrificing the user experience.

5. Future Work

In future work, we plan to explore several key directions to further improve the efficiency, scalability, and robustness of our collaborative inference framework for autonomous UAV swarms.

5.1. Decoding Stage Optimization

While this work primarily focuses on prefill optimization, autoregressive decoding latency remains a significant barrier to responsive real-time swarm coordination and decision-making. We plan to investigate speculative decoding techniques [22,31,35,36,37,38] that generate multiple candidate tokens in parallel using lightweight draft models, thereby reducing the number of sequential decoding steps. Another promising direction is distributed reasoning exploration. By leveraging Chain-of-Thought prompting [39], multiple UAV nodes can simultaneously generate and evaluate different intermediate reasoning paths for complex mission planning, resembling a swarm-level brainstorming process. These approaches could significantly improve decoding throughput in aerial edge environments.

5.2. Enhanced Predictive Decoding

Our experiments with linear and MLP prediction heads showed prediction accuracy ranging from 17.5% in early layers to 85% in later layers. Future work will optimize prediction head placement within the model architecture and develop UAV node allocation strategies that balance prediction accuracy, latency reduction, rollback cost, and onboard resource constraints.

We will also explore extending predictive decoding beyond the first token. While single-token prediction is simplest to implement, block-based prediction (2–4 tokens) could further reduce decoding steps [40]. Tree-based multi-branch prediction could increase hit rates but requires more complex KV cache management [41]. For dynamic UAV swarm scenarios, we will pursue a practical approach beginning with single-token prediction and gradually advancing to small-block prediction with adaptive sizing and confidence-based gating.

5.3. Toward Uncertainty-Aware Robust Partitioning

The current Orion framework assumes deterministic computation and communication latencies. While the Monte Carlo sensitivity analysis in Section 4.8 shows that the deterministic baseline tolerates moderate environmental noise, a theoretically grounded robust formulation is desirable for highly unpredictable deployments. We outline two promising directions.

First, interval-based robust optimization. Model the per-device computation time $T_{comp}^{(i)}$ and per-link communication time $T_{comm}^{(i,j)}$ as bounded intervals $[\underline{T}, \bar{T}]$ rather than point estimates. The partitioning problem then becomes a minimax optimization that minimizes the worst-case end-to-end latency. This can be solved by extending the dynamic programming formulation with interval arithmetic or by converting to a deterministic equivalent using worst-case propagation rules.
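The deterministic-equivalent route can be sketched on a deliberately simplified latency model. The assumptions here are ours, not Orion's: latency is the plain sum of per-layer compute times and inter-device link times, layers are assigned contiguously to devices in a fixed order, every device receives at least one layer, and `cap` is a hypothetical per-device layer limit standing in for memory constraints. Under this model latency is non-decreasing in every uncertain term, so the worst case over the intervals is attained at the upper bounds and the minimax problem reduces to a deterministic DP on those bounds.

```python
import math
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def robust_partition(n_layers: int, comp: List[Interval], comm: List[Interval],
                     cap: List[int]) -> Tuple[float, List[int]]:
    """Minimize worst-case latency of a contiguous layer partition.

    comp[i]: per-layer compute-time interval on device i.
    comm[i]: transfer-time interval on the link from device i to device i + 1.
    cap[i]:  maximum number of layers device i can hold (assumed memory limit).
    Returns (worst-case latency, layers per device).
    """
    n_dev = len(comp)
    hi_comp = [hi for _, hi in comp]  # worst case sits at the upper bounds
    hi_comm = [hi for _, hi in comm]

    # best[d][m]: minimal worst-case latency with the first m layers on devices 0..d.
    best = [[math.inf] * (n_layers + 1) for _ in range(n_dev)]
    take = [[0] * (n_layers + 1) for _ in range(n_dev)]
    for m in range(1, min(cap[0], n_layers) + 1):
        best[0][m] = m * hi_comp[0]
        take[0][m] = m
    for d in range(1, n_dev):
        for m in range(d + 1, n_layers + 1):
            for k in range(1, min(cap[d], m - d) + 1):  # k layers on device d
                cand = best[d - 1][m - k] + hi_comm[d - 1] + k * hi_comp[d]
                if cand < best[d][m]:
                    best[d][m], take[d][m] = cand, k

    if math.isinf(best[-1][n_layers]):
        raise ValueError("no feasible partition under the given capacities")

    # Backtrack the per-device layer counts.
    alloc, m = [], n_layers
    for d in range(n_dev - 1, -1, -1):
        alloc.append(take[d][m])
        m -= take[d][m]
    return best[-1][n_layers], alloc[::-1]
```

A pipelined objective, in which stages overlap, would need a different recurrence; the reduction to upper bounds still holds whenever the objective is monotone in each uncertain latency term.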

Second, chance-constrained optimization. Given estimated distributions of the uncertain parameters (e.g., from historical flight logs or online estimation), enforce probabilistic constraints of the form $\Pr(\mathrm{latency} \leq L_{0}) \geq 1 - \epsilon$, where $\epsilon$ is a small tolerance. This stochastic formulation can be tackled via sample average approximation or by relaxing to a conditional value-at-risk (CVaR) objective. Integrating either robust variant with our static DP baseline is an important next step toward deployment in highly dynamic flying ad hoc networks.
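Both evaluation routes can be sketched on a set of latency samples for one candidate partition. How the samples are obtained (for example, by replaying logged link and compute times through a latency model) is left open; `saa_feasible` and `empirical_cvar` are hypothetical helper names, and the CVaR estimator shown is a simple tail average rather than a tuned implementation.

```python
import math
from typing import Sequence


def saa_feasible(samples: Sequence[float], l0: float, eps: float) -> bool:
    """Sample average approximation of Pr(latency <= l0) >= 1 - eps."""
    within = sum(1 for x in samples if x <= l0)
    return within / len(samples) >= 1 - eps


def empirical_cvar(samples: Sequence[float], eps: float) -> float:
    """Mean of the worst eps-fraction of latency samples (CVaR at level 1 - eps)."""
    worst_first = sorted(samples, reverse=True)
    k = max(1, math.ceil(eps * len(worst_first)))
    return sum(worst_first[:k]) / k
```

For the eight samples 1.0, 2.0, ..., 8.0 with a tolerance of 0.25, the tail average is (8 + 7) / 2 = 7.5. Requiring the CVaR to stay below the latency budget is a conservative surrogate: it bounds the average of the tail rather than only its frequency, which is why it is commonly used as a tractable relaxation of the chance constraint.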

5.4. Extension to Multi-Modal Sensor Data

While Orion currently processes textual sensor logs, real-world UAV swarms often generate images and point clouds. To bridge this gap without redesigning the core partitioning strategy, we propose a lightweight sensor-to-text translation layer as a future extension. For RGB frames, an onboard detector (e.g., YOLOv8) outputs bounding boxes with class labels and confidence scores, which are serialized into short text lines such as “pedestrian at (320,240), confidence 0.92”. For LiDAR point clouds, voxel grid downsampling (e.g., 0.1 m resolution) reduces the point count, and each remaining point is converted to a range-angle text representation. The resulting text streams then feed seamlessly into Orion’s existing inference pipeline. As a longer-term direction, we plan to integrate vision-language models (e.g., LLaVA) that directly accept image tokens, enabling richer multimodal understanding without intermediate text conversion.
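A minimal sketch of such a translation layer is shown below, using only the Python standard library. The function names, the choice to keep one centroid per occupied voxel, and the exact wording of the LiDAR text line are assumptions made for illustration; the detector itself is not invoked, and its outputs are passed in as plain values.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def detection_to_text(label: str, cx: int, cy: int, conf: float) -> str:
    """Serialize one detector output into a short text line."""
    return f"{label} at ({cx},{cy}), confidence {conf:.2f}"


def voxel_downsample(points: List[Point], voxel: float = 0.1) -> List[Point]:
    """Keep one centroid per occupied voxel of the given edge length (metres)."""
    buckets: Dict[Tuple[int, int, int], List[Point]] = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel), math.floor(y / voxel), math.floor(z / voxel))
        buckets[key].append((x, y, z))
    # zip(*pts) groups the x, y, and z coordinates so each can be averaged.
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for pts in buckets.values()]


def point_to_text(x: float, y: float, z: float) -> str:
    """Convert a Cartesian point to a range-angle text line."""
    rng = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(z / rng)) if rng > 0 else 0.0
    return f"range {rng:.2f} m, azimuth {azimuth:.1f} deg, elevation {elevation:.1f} deg"
```

For example, `detection_to_text("pedestrian", 320, 240, 0.92)` reproduces the line quoted above. Token count grows with the number of retained points, so the voxel size directly trades spatial detail against the prefill cost of the resulting sensor log.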

5.5. Real-System Deployment and Physical Validation

While our current simulation framework accurately models heterogeneous computing and aerial network conditions, moving to physical deployment is essential to fully evaluate Orion’s potential. Real-world flying ad hoc networks (FANETs) are highly dynamic, involving unpredictable wireless interference, mobility-induced topology changes, and hardware thermal throttling. We plan to implement Orion on a physical UAV testbed (e.g., quadcopters equipped with NVIDIA Jetson modules) to capture real system overheads and validate robustness.

Furthermore, while our DP-based partitioning assumes a compact cluster of at most 8 UAVs (due to FANET physical constraints), larger swarms can be supported via two lightweight extensions. First, hierarchical grouping: partition the swarm into subgroups of ≤8 nodes based on network proximity, run Algorithm 1 within each subgroup, then pipeline across subgroups. Second, a greedy heuristic that sorts devices by compute/memory ratio and assigns layers incrementally, followed by local DP refinement on sliding windows of 3–5 devices.
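The two extensions can be sketched as follows, under one possible reading of the description. The assumptions are ours: device identifiers arrive already ordered by network proximity, each layer has a uniform memory footprint `layer_mem_gb`, and the greedy pass fills devices to capacity in descending compute/memory order. The sliding-window DP refinement is omitted, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Device = Tuple[str, float, float]  # (name, compute in TFLOPS, memory in GB)


def group_swarm(device_ids: List[str], max_group: int = 8) -> List[List[str]]:
    """Split a proximity-ordered device list into subgroups of at most max_group."""
    return [device_ids[i:i + max_group]
            for i in range(0, len(device_ids), max_group)]


def greedy_assign(devices: List[Device], n_layers: int,
                  layer_mem_gb: float) -> Dict[str, int]:
    """Assign layers to devices in descending compute/memory ratio.

    Each device takes as many remaining layers as its memory allows.
    """
    order = sorted(devices, key=lambda d: d[1] / d[2], reverse=True)
    alloc: Dict[str, int] = {}
    remaining = n_layers
    for name, _tflops, mem_gb in order:
        take = min(remaining, int(mem_gb // layer_mem_gb))
        alloc[name] = take
        remaining -= take
    if remaining > 0:
        raise ValueError("model does not fit in the combined device memory")
    return alloc
```

A swarm of 20 nodes would be split into subgroups of 8, 8, and 4. The greedy allocation is only a starting point; the local DP refinement mentioned above would then rebalance layers between neighbouring devices.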

6. Conclusions

This paper presents Orion, an edge-only collaborative inference framework that enables efficient LLM deployment for multi-sensor data processing in autonomous UAV swarms. By pooling distributed onboard computing resources, Orion addresses key challenges of deploying LLMs on resource-constrained aerial platforms. Its three core innovations (DP-based LLM partitioning, adaptive sequence partitioning, and predictive decoding) reduce prefill latency for long-context sensor logs, mitigate serialization bottlenecks, and improve overall swarm resource utilization. In simulation, Orion reduces end-to-end latency by 81% (Llama-2 7B) and 78% (Llama-2 13B) relative to the best cloud–UAV baseline, adapts better to UAV device heterogeneity than air-to-ground cloud-offloading approaches, and keeps mission data confidential through strictly local deployment. We also outline avenues for future research to advance this collaborative inference framework for FANETs. This work provides a promising direction for deploying LLMs in resource-constrained, highly dynamic aerial environments such as heterogeneous UAV swarms.

Author Contributions

Conceptualization, T.Y. and H.G.; methodology, T.Y. and H.G.; software, T.Y., Z.Z. and D.Z.; validation, T.Y., H.G., Z.Z. and D.Z.; resources, T.Y.; data curation, Z.Z.; writing—original draft preparation, T.Y. and H.G.; writing—review and editing, H.G. and D.Z.; supervision, H.G.; project administration, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62402450) and the Zhejiang Provincial Natural Science Foundation of China (Grant No. LQ24F020024).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
LLM	Large Language Model
DP	Dynamic Programming
FLOPs	Floating Point Operations
KV cache	Key-Value cache
FANET	Flying Ad hoc Network
OOM	Out-Of-Memory
MLP	Multi-Layer Perceptron

References

1. Ahmed, A.; Wang, L.; Kim, J.; Jin, J.; Cho, K.; Kwon, C.; Lee, D.J. LLM-guided distributed model predictive control for decentralized UAV formations. IEEE Access 2026, 14, 15226–15240.
2. Yan, G.C.; Du, J.; Chen, S.; Tian, X.G. Study on the Path Optimization Method of Autonomous Navigation of Uncrewed Aerial Vehicles Integrating Multi-Sensor Data. IEEE Access 2025, 13, 173016–173034.
3. Han, B.; Chen, Y.T.; Li, J.R.; Li, J.; Su, J.S. SwarmChain: Collaborative LLM Inference for UAV Swarm Control. IEEE Internet Things Mag. 2025, 8, 64–71.
4. Javaid, S.; Fahim, H.; He, B.; Saeed, N. Large Language Models for UAVs: Current State and Pathways to the Future. IEEE Open J. Veh. Technol. 2024, 5, 1166–1192.
5. Nguyen, T.M.; Truong, V.T.; Le, L.B. Agentic AI Meets Edge Computing in Autonomous UAV Swarms. IEEE Internet Things Mag. 2025, 8, 87–95.
6. Maletić, M.; Peti, M.; Petrović, T.; Bogdan, S. Spatial-Semantic Reasoning using Large Language Models for Efficient UAV Search Operations. In Proceedings of the 12th European Conference on Mobile Robots, Padova, Italy, 2–5 September 2025; pp. 1–8.
7. Semerikov, S.O.; Vakaliuk, T.A.; Kanevska, O.B.; Ostroushko, O.A.; Kolhatin, A.O. Edge intelligence unleashed: A survey on deploying large language models in resource-constrained environments. J. Edge Comput. 2025, 4, 179–233.
8. He, Y.; Fang, J.C.; Yu, F.R.; Leung, V.C. Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach. IEEE Trans. Mob. Comput. 2024, 23, 11253–11264.
9. Jin, H.P.; Wu, Y.Z. CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration. In Proceedings of the 32nd IEEE International Conference on Web Services, Helsinki, Finland, 7–12 July 2025; pp. 316–323.
10. Shi, W.S.; Cao, J.; Zhang, Q.; Li, Y.H.Z.; Xu, L.Y. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646.
11. Kristiani, E.; Verma, V.K.; Yang, C.-T. Deploying LLM Transformer on Edge Computing Devices: A Survey of Strategies, Challenges, and Future Directions. AI 2026, 7, 15.
12. Qu, G.Q.; Chen, Q.Y.; Wei, W.; Lin, Z.; Chen, X.H.; Huang, K.B. Mobile Edge Intelligence for Large Language Models: A Contemporary Survey. IEEE Commun. Surv. Tutor. 2025, 27, 3820–3860.
13. Li, Y.C.; Wen, H.; Wang, W.J.; Li, X.Y.; Yuan, Y.Z.; Liu, G.H.; Liu, J.C.; Xu, W.X.; Wang, X.; Sun, Y.; et al. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security. arXiv 2024, arXiv:2401.05459.
14. Wang, Y.M.; Lin, Y.; Zeng, X.D.; Zhang, G.N. PrivateLoRA For Efficient Privacy Preserving LLM. arXiv 2023, arXiv:2311.14030.
15. Wei, Y.T.; Wu, S.; Ji, Z.; Yu, Z.G.; Jiang, C.X.; Kuang, L.L. Multi-UAV Collaborative Edge Computing Algorithm for Joint Task Offloading and Channel Resource Allocation. J. Commun. Inf. Netw. 2024, 9, 137–150.
16. Cai, F.L.; Yuan, D.; Yang, Z.; Cui, L.Z. Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing. In Proceedings of the 31st IEEE International Conference on Web Services, Shenzhen, China, 7–13 July 2024; pp. 799–809.
17. Chen, Y.X.; Li, R.P.; Zhao, Z.F.; Peng, C.H.; Wu, J.J.; Hossain, E. NetGPT: An AI-Native Network Architecture for Provisioning Beyond Personalized Generative Services. IEEE Netw. 2024, 38, 404–413.
18. Shen, X.; Dong, P.; Lu, L.; Kong, Z.L.; Li, Z.G.; Lin, M.; Wu, C.; Wang, Y.Z. Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 18944–18951.
19. Lin, J.; Tang, J.M.; Tang, H.T.; Yang, S.; Xiao, G.X.; Han, S. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. GetMobile Mob. Comput. Commun. Rev. 2025, 28, 12–17.
20. Zhang, M.J.; Shen, X.M.; Cao, J.N.; Cui, Z.Y.; Jiang, S. EdgeShard: Efficient LLM Inference via Collaborative Edge Computing. IEEE Internet Things J. 2025, 12, 13119–13131.
21. Yang, J.; Wu, Q.; Feng, Z.Y.; Zhou, Z.; Guo, D.K.; Chen, X. Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts. IEEE Trans. Mob. Comput. 2025, 24, 13648–13662.
22. Ye, S.Y.; Ouyang, B.; Zeng, L.K.; Qian, T.Y.; Chu, X.W.; Tang, J. Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices. In Proceedings of the 44th IEEE International Conference on Computer Communications, London, UK, 19–22 May 2025; pp. 1–10.
23. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 1877–1901.
24. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
25. Varotto, L.; Fabris, M.; Michieletto, G.; Cenedese, A. Visual Sensor Network Stimulation Model Identification via Gaussian Mixture Model and Deep Embedded Features. Eng. Appl. Artif. Intell. 2022, 114, 105096.
26. Jadbabaie, A.; Makur, A.; Mossel, E.; Salhab, R. Inference in Opinion Dynamics under Social Pressure. IEEE Trans. Autom. Control 2022, 68, 3377–3392.
27. Lin, Y.Y.; Peng, S.J.; Wu, S.P.; Li, Y.B.; Lu, C.Z.; Ye, K.J. Serving LLM in Distributed GPU Cluster with Fine-Grain Pipeline Constraints. IEEE Trans. Serv. Comput. 2025, 18, 3164–3176.
28. Bekmezci, I.; Sahingoz, O.K.; Temel, S. Flying Ad-Hoc Networks (FANETs): A Survey. Ad Hoc Netw. 2013, 11, 1254–1270.
29. Tripathi, V.; Kadota, I.; Tal, E.; Rahman, M.S.; Warren, A.; Karaman, S.; Modiano, E. WiSwarm: Age-of-Information-Based Wireless Networking for Collaborative Teams of UAVs. In Proceedings of the IEEE INFOCOM 2023—IEEE Conference on Computer Communications, Hoboken, NJ, USA, 17–20 May 2023; pp. 1–10.
30. IEEE Std 802.11-2024; IEEE Standard for Information Technology–Telecommunications and Information Exchange Between Systems Local and Metropolitan Area Networks–Specific Requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. IEEE: New York, NY, USA, 2024.
31. Xia, H.M.; Yang, Z.; Dong, Q.X.; Wang, P.Y.; Li, Y.Q.; Ge, T.; Liu, T.Y.; Li, W.J.; Sui, Z.F. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 7655–7671.
32. Cai, T.; Li, Y.; Geng, Z.; Peng, H.; Lee, J.D.; Chen, D.; Dao, T. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 5209–5235.
33. Leviathan, Y.; Kalman, M.; Matias, Y. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19274–19286.
34. Lee, D.; Maeng, S.J.; Ozdemir, O.; Pandian, M.B.; Guvenc, I. Reliability of Wi-Fi, LTE, and 5G-Based UAV RC Links in ISM Bands: Uplink Interference Asymmetry Analysis and HARQ Design. IEEE Open J. Commun. Soc. 2026, 7, 386–406.
35. Do, D.-T.; Le, N.-K.; Nguyen, L.-M. AdaSpec: Adaptive Multilingual Speculative Decoding with Self-Synthesized Language-Aware Training and Vocabulary Simplification. In Proceedings of the 40th AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026; pp. 30530–30538.
36. Li, X.C.; Spatharakis, D.; Ghafouri, S.; Fan, J.K.; Vandierendonck, H.; John, D.; Ji, B.; Nikolopoulos, D.S. SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving. In Proceedings of the 10th ACM/IEEE Symposium on Edge Computing, Arlington, VA, USA, 3–6 December 2025; pp. 1–8.
37. Zheng, C.; Yang, T.T. Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding. In Proceedings of the 17th International Conference on Wireless Communications and Signal Processing, Chongqing, China, 23–25 October 2025; pp. 1–6.
38. Liu, X.; Luo, L.Z.; Tang, M.; Huang, C.; Chen, X. FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference. arXiv 2025, arXiv:2507.02620.
39. Wei, J.; Wang, X.Z.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 24824–24837.
40. Shi, L.H.; Li, Z.C.; Zhang, L.F.; Qi, B.Y.; Liu, G.M.; Zhao, H. Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios. In Proceedings of the 40th AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026; pp. 32947–32955.
41. Kumar, T.; Dao, T.; May, A. Speculative Speculative Decoding. arXiv 2026, arXiv:2603.03251.

Figure 1. Edge-enabled collaborative scenario with LLMs for autonomous UAV swarms.

Figure 2. Core architecture of generative LLMs and workflow of their inference process.

Figure 3. Overview of the Orion framework.

Figure 4. Latency result with varying model scales on UAV nodes.

Figure 5. Latency result with varying sensor log lengths.

Figure 6. Latency result with varying air-to-ground bandwidths.

Figure 7. Latency result with varying inter-UAV bandwidths.

Figure 8. Ablation results of Orion with varying model scales.

Figure 9. Break-even analysis of predictive decoding.

Figure 10. Sensitivity analysis of Orion’s static partitioning under environmental perturbations.

Table 1. Summary of simulation environment parameters.

Category	Parameter	Value/Range
Hardware (emulated)	UAV node (3 units)	NVIDIA Jetson AGX Orin (1.88 TFLOPS, 16 GB RAM)
	UAV node (1 unit)	NVIDIA Orin NX (3.33 TFLOPS, 32 GB RAM)
	Cloud node	NVIDIA RTX 3090 (36 TFLOPS, 32 GB RAM)
Network	Inter-UAV bandwidth	400–1000 Mb/s, $\pm 20\%$ random fluctuation
	Air-to-ground bandwidth	100–400 Mb/s
	Activation tensor size	8–32 MB (dependent on layer and sequence length)
Model and Input	LLM models	Llama-2 7B, 13B, 70B (FP16)
	Sensor log length	16, 32, 64, 128 tokens
	Attention	Causal with KV cache
Baselines	Compared methods	Edge-Solo; Cloud-Edge-Even; Cloud-Edge-Opt; EdgeShard [20]; Jupiter [22]; Orion (ours)
