Orion: A Collaborative Edge Inference Framework for Large Language Models Processing Multi-Sensor Data in UAV Swarms
Highlights
- Orion reduces LLM prefill latency by 78–81% on heterogeneous UAV swarm nodes compared to cloud-UAV baselines, and is the only framework among those compared that runs a 70B-parameter LLM entirely on memory-constrained UAV onboard computers.
- Adaptive sequence partitioning and predictive decoding eliminate pipeline bubbles and load imbalance, enabling near-linear scaling of inference latency with sensor sequence length.
- Real-time, privacy-preserving LLM inference becomes feasible for autonomous UAV swarms in bandwidth-limited or disconnected environments (e.g., disaster response, surveillance).
- The proposed collaborative edge framework provides a practical pathway to deploy large models on heterogeneous UAV fleets without cloud dependency, enhancing mission robustness and responsiveness.
Abstract
1. Introduction
- We propose Orion, an end-to-end collaborative inference framework for heterogeneous edge environments. Tailored to the single-sequence sensor requests typical of UAV swarms, it is the first to effectively resolve the computational idling and load imbalance that conventional pipeline architectures cause during the LLM prefill phase.
- We design a theoretically sound adaptive sequence partitioning algorithm and a predictive decoding mechanism. Together they eliminate the prefill pipeline bubbles inherent in state-of-the-art systems and achieve a high degree of overlap between prefill and decoding computation.
- Extensive experiments on a comprehensive simulation framework (using Llama-2 7B/13B/70B and simulated UAV sensor traces) demonstrate Orion’s superior efficiency and scalability. Orion achieves end-to-end latency reductions of 81% (7B) and 78% (13B) over the best baseline, and uniquely supports the 70B model on resource-constrained UAV nodes.
2. Background and Motivation
2.1. Core Architecture of Generative LLMs
- QKV Projection: The input tokens are first projected into three distinct vector representations: Query (Q), Key (K), and Value (V). These vectors form the foundation of the self-attention mechanism. They allow the model to compute contextual relationships and relevance scores among all tokens in the sequence.
- Masked Multi-Head Self-Attention: The attention scores are calculated between each token and all preceding tokens. A crucial masking mechanism is applied to prevent the model from attending to future tokens, preserving the autoregressive property essential for text generation. The “multi-head” aspect enables the model to simultaneously attend to information from different representation subspaces.
- Feed-Forward Network (FFN): A position-wise feed-forward network applies a non-linear transformation (e.g., SwiGLU) to each token independently. This network often constitutes a large portion of the model’s parameters and is responsible for refining attended representations.
- Residual Connections and Layer Normalization: Each sub-layer is wrapped with a residual connection and followed by layer normalization. This stabilizes training and enables the construction of very deep networks.
- KV cache: A critical performance optimization during inference. The Key and Value states generated for all previous tokens are stored in a dynamic cache, avoiding extremely costly recomputation for every new generated token.
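The interaction between causal masking and the KV cache can be sketched in a few lines. The following is a minimal single-head illustration in plain Python, not the paper's implementation: the helper names (`matvec`, `attend`, `KVCache`) and the toy 2-dimensional weights are assumptions made for this example. It shows that processing tokens one at a time against a cache gives the same outputs as recomputing masked attention over the whole sequence.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def full_causal_attention(X, Wq, Wk, Wv):
    """Recompute Q, K, V for the whole sequence; token t sees tokens 0..t only."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    return [attend(Q[t], K[: t + 1], V[: t + 1]) for t in range(len(X))]

class KVCache:
    """Stores the keys and values of already-processed tokens."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        """Process one new token: append its K/V, then attend over the cache."""
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)

# Toy 2-dimensional example (all numbers are illustrative assumptions).
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

reference = full_causal_attention(X, Wq, Wk, Wv)
cache = KVCache()
incremental = [cache.step(x, Wq, Wk, Wv) for x in X]
```

Because the cache only ever contains past and current tokens, the causal mask is implicit in the incremental path: each step projects a single token instead of the whole sequence.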
2.2. Generative Inference Process
- Prefill Stage: This stage begins once a user’s input prompt (represented as initial inputs X) is provided. The model performs a forward pass through all L layers for every token in the prompt to produce the first output token and populate the KV cache. The prefill stage is compute-bound. Due to the causal self-attention mechanism, its computational cost scales quadratically with the prompt length, making it exceptionally demanding for processing long-context sensor traces.
- Autoregressive Decoding Stage: After prefill, the model enters a generation loop, producing tokens one by one. In each iteration, the latest generated token is fed back as input for the next step, forming an autoregressive feedback loop. During this forward pass, the current hidden state is projected to compute the new query, while the newly generated key and value are appended to the KV cache at position t. Utilizing this dynamic KV cache avoids costly recomputation for historical tokens. This stage is primarily memory-bound. Its latency is dominated by the time required to read model parameters and the continuously growing KV cache from memory, rather than by raw computation.
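The contrast between the two stages can be made concrete with a little arithmetic. The sketch below counts the query-key pairs scored under a causal mask and estimates the KV cache footprint. The function names are hypothetical, and the 7B model shape (32 layers, hidden size 4096, standard multi-head attention, FP16) is taken from the public Llama-2 configuration rather than from this text, so treat the numbers as illustrative.

```python
def prefill_attention_pairs(n):
    """Query-key pairs scored under a causal mask for an n-token prompt.

    Token t (1-indexed) attends to t tokens, so the total is 1 + 2 + ... + n.
    """
    return n * (n + 1) // 2

def decode_step_pairs(cache_len):
    """Pairs scored for one new token: the cached tokens plus itself."""
    return cache_len + 1

def kv_cache_bytes(tokens, layers, hidden, bytes_per_value=2):
    """Bytes of K and V held for `tokens` positions.

    Assumes standard multi-head attention (K and V each of width `hidden`
    per layer) and FP16 storage by default.
    """
    return 2 * layers * hidden * bytes_per_value * tokens

# Sensor log lengths used in the evaluation.
pairs = {n: prefill_attention_pairs(n) for n in (16, 32, 64, 128)}

# Assumed Llama-2 7B shape: 32 layers, hidden size 4096.
cache_128 = kv_cache_bytes(128, layers=32, hidden=4096)
```

Doubling the prompt from 64 to 128 tokens roughly quadruples the attention pairs (2080 to 8256), while each decoding step adds only one more row of work but must read the entire cache, which is why the two stages stress different resources.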
2.3. Summary and Optimization Motivation
2.4. Relation to Alternative Inference Paradigms in Distributed Sensing
3. Proposed Solution: The Orion Framework
- Optimal LLM Partitioning (Section 3.1), which determines the layer-to-device mapping to minimize stage bottlenecks;
- Adaptive Sequence Partitioning (Section 3.2), which flattens the latency curve of causal attention through dynamic programming;
- Predictive Decoding (Section 3.3), which hides prefill-decode bubbles via speculative execution.
3.1. Optimal LLM Partitioning Strategy
3.1.1. System Model and Problem Formulation
3.1.2. Joint Optimization via Dynamic Programming
3.1.3. Summary of Algorithm 1
Algorithm 1. Optimal Joint Device Selection and LLM Partitioning
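As a rough illustration of how a dynamic program can pick a layer-to-device mapping that minimizes the slowest pipeline stage, consider the sketch below. It is not the paper's Algorithm 1. It assumes a fixed device order, contiguous layer blocks, per-layer compute and memory costs, and it ignores communication time entirely. All names (`partition_layers`, `layer_flops`, `layer_mem`, `dev_speed`, `dev_mem`) are hypothetical. A device may receive an empty block, which is how device selection is modeled here.

```python
def partition_layers(layer_flops, layer_mem, dev_speed, dev_mem):
    """Min-max DP over contiguous layer blocks (illustrative sketch).

    Returns (bottleneck_time, [(start, end), ...]) with one half-open layer
    range per device, or None if the memory limits cannot be met.
    """
    L, D = len(layer_flops), len(dev_speed)
    INF = float("inf")

    # Prefix sums give the cost of any block in O(1).
    pf, pm = [0.0], [0.0]
    for f, m in zip(layer_flops, layer_mem):
        pf.append(pf[-1] + f)
        pm.append(pm[-1] + m)

    # best[j][i]: smallest bottleneck when the first j devices host layers 0..i-1.
    best = [[INF] * (L + 1) for _ in range(D + 1)]
    cut = [[0] * (L + 1) for _ in range(D + 1)]
    best[0][0] = 0.0

    for j in range(1, D + 1):
        for i in range(L + 1):
            for k in range(i + 1):  # device j hosts layers k..i-1 (maybe none)
                if best[j - 1][k] == INF:
                    continue
                if pm[i] - pm[k] > dev_mem[j - 1]:
                    continue  # block does not fit in this device's memory
                stage = (pf[i] - pf[k]) / dev_speed[j - 1]
                cand = max(best[j - 1][k], stage)
                if cand < best[j][i]:
                    best[j][i] = cand
                    cut[j][i] = k

    if best[D][L] == INF:
        return None

    # Walk the recorded cut points backwards to recover the blocks.
    bounds, i = [], L
    for j in range(D, 0, -1):
        k = cut[j][i]
        bounds.append((k, i))
        i = k
    bounds.reverse()
    return best[D][L], bounds

# Toy instance: 8 identical layers, two slow devices and one twice as fast.
result = partition_layers([1.0] * 8, [1.0] * 8, [1.0, 1.0, 2.0], [8.0, 8.0, 8.0])
infeasible = partition_layers([1.0] * 8, [1.0] * 8, [1.0, 1.0, 2.0], [1.0, 1.0, 1.0])
```

In the toy instance the faster device receives twice as many layers as each slow one, so all three stages finish in the same time. A fuller model would add an inter-stage transfer term to each candidate.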
3.2. Adaptive Sequence Partitioning Strategy
3.2.1. Quadratic Inference Cost Model for Causal Attention
3.2.2. Min-Max DP for Sequence Partitioning
3.2.3. Summary of Algorithm 2
Algorithm 2. Adaptive Sequence Partitioning via Min-Max DP
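The idea of flattening the causal-attention latency curve can be illustrated with a generic min-max dynamic program. The sketch below is not the paper's Algorithm 2. It assumes a simplified cost model in which a chunk's cost is the number of query-key pairs it scores plus an optional per-token linear term. The names `chunk_cost`, `partition_sequence`, `attn_unit`, and `linear_unit` are hypothetical.

```python
def chunk_cost(a, b, attn_unit=1.0, linear_unit=0.0):
    """Assumed cost of prefilling tokens a..b-1 under a causal mask.

    Token t (0-indexed) attends to t + 1 tokens, so later chunks of the same
    length cost more than earlier ones.
    """
    tri = lambda x: x * (x + 1) // 2
    return attn_unit * (tri(b) - tri(a)) + linear_unit * (b - a)

def partition_sequence(n, parts, cost=chunk_cost):
    """Split n tokens into `parts` non-empty contiguous chunks, minimizing
    the most expensive chunk. Requires n >= parts."""
    INF = float("inf")
    # best[c][i]: smallest max-chunk cost when the first i tokens form c chunks.
    best = [[INF] * (n + 1) for _ in range(parts + 1)]
    cut = [[0] * (n + 1) for _ in range(parts + 1)]
    best[0][0] = 0.0

    for c in range(1, parts + 1):
        for i in range(c, n + 1):
            for k in range(c - 1, i):  # last chunk covers tokens k..i-1
                if best[c - 1][k] == INF:
                    continue
                cand = max(best[c - 1][k], cost(k, i))
                if cand < best[c][i]:
                    best[c][i] = cand
                    cut[c][i] = k

    # Recover chunk boundaries from the recorded cut points.
    bounds, i = [], n
    for c in range(parts, 0, -1):
        k = cut[c][i]
        bounds.append((k, i))
        i = k
    bounds.reverse()
    return best[parts][n], bounds

# A 128-token sensor log split into 4 chunks.
bottleneck, chunks = partition_sequence(128, 4)
even_split_bottleneck = max(chunk_cost(a, a + 32) for a in range(0, 128, 32))
```

Under this cost model an even 32-token split leaves the last chunk far more expensive than the first, whereas the min-max solution assigns progressively shorter chunks toward the end of the sequence so that all chunks cost roughly the same.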
3.2.4. Theoretical Boundary Analysis of Inter-Stage Transmission Latency
3.3. Predictive Decoding Mechanism
4. Case Study
4.1. Experimental Setup
- Testbed: Our experimental testbed comprises five heterogeneous devices: three NVIDIA Jetson AGX Orin modules (each with 1.88 TFLOPS, 16 GB RAM) and one NVIDIA Orin NX module (3.33 TFLOPS, 32 GB RAM) acting as the UAV swarm, alongside one cloud base station (with an RTX 3090 GPU, 36 TFLOPS, 32 GB RAM). To evaluate sensitivity to the internal network, the ad hoc bandwidth between UAV nodes was varied from 400 to 1000 Mb/s, with a random fluctuation applied to all bandwidth settings to simulate realistic aerial link instability. This fluctuation approximates the severe channel non-stationarity in FANETs caused by trajectory changes, antenna masking, and ISM interference, as documented in recent aerial measurements [34]. The air-to-ground (UAV-to-cloud) bandwidth was also varied from 100 to 400 Mb/s to assess the impact of external network conditions on the performance of cloud-reliant baselines.
- Benchmarks: We evaluated Orion using the Llama-2 model family (7B, 13B, and 70B parameters) to test scalability. Experiments included varying sensor log lengths (16–128 tokens) and air-to-ground network bandwidths (100–400 Mb/s) to assess adaptability. Prefill latency was the primary metric.
- Baselines: We compare Orion against five baselines:
  - Edge-Solo: The entire LLM runs on a single UAV node (AGX Orin) without any partitioning. This represents the typical setup for independent onboard deployment.
  - Cloud-Edge-Even: The LLM is evenly split between the most powerful UAV node and the cloud base station. This represents a naive cloud-offloading strategy.
  - Cloud-Edge-Opt: The LLM is optimally partitioned between the most powerful UAV node and the cloud base station using our dynamic programming algorithm. This represents the best possible cloud-assisted baseline.
  - EdgeShard [20]: A collaborative edge computing framework that utilizes dynamic programming for joint device selection and model partitioning to orchestrate pipeline-parallel LLM inference.
  - Jupiter [22]: A state-of-the-art resource-efficient collaborative inference system that leverages intra-sequence pipeline parallelism for the prefill phase and speculative decoding for autoregressive generation.
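To get a feel for what the bandwidth settings above mean per pipeline hop, the following back-of-the-envelope sketch converts a payload size into an ideal transfer time. It assumes activation payloads of 8 to 32 MB (the range given in the setup parameters), treats 1 MB as 10^6 bytes and 1 Mb as 10^6 bits, and ignores protocol overhead, propagation delay, and the random bandwidth fluctuation. The function name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(size_mb, bandwidth_mbps):
    """Ideal time to send `size_mb` megabytes over a `bandwidth_mbps` megabit/s link."""
    return size_mb * 8.0 / bandwidth_mbps

# Inter-UAV link: 400-1000 Mb/s. Air-to-ground link: 100-400 Mb/s.
inter_uav = {
    (size, bw): transfer_seconds(size, bw)
    for size in (8, 32)
    for bw in (400, 1000)
}
air_to_ground = {
    (size, bw): transfer_seconds(size, bw)
    for size in (8, 32)
    for bw in (100, 400)
}

for (size, bw), secs in sorted(inter_uav.items()):
    print(f"inter-UAV      {size:>2} MB @ {bw:>4} Mb/s: {secs * 1000:.0f} ms")
for (size, bw), secs in sorted(air_to_ground.items()):
    print(f"air-to-ground  {size:>2} MB @ {bw:>4} Mb/s: {secs * 1000:.0f} ms")
```

Under these assumptions a 32 MB activation takes about 0.26 s at 1000 Mb/s between UAVs but about 2.56 s at 100 Mb/s to the ground, which gives a sense of why the air-to-ground link matters so much for the cloud-reliant baselines.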
4.2. Latency on Llama-2 Models of Varying Scales
4.3. Latency on Llama-2 13B of Varying Sensor Log Lengths
4.4. Latency on Llama-2 13B of Varying Air-to-Ground Bandwidths
4.5. Latency on Llama-2 13B of Varying Inter-UAV Bandwidths
4.6. Ablation Study
4.7. Investigation of Prediction Accuracy on End-to-End Latency
4.8. Robustness Analysis Under Environmental Uncertainty
5. Future Work
5.1. Decoding Stage Optimization
5.2. Enhanced Predictive Decoding
5.3. Toward Uncertainty-Aware Robust Partitioning
5.4. Extension to Multi-Modal Sensor Data
5.5. Real-System Deployment and Physical Validation
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| UAV | Unmanned Aerial Vehicle |
| LLM | Large Language Model |
| DP | Dynamic Programming |
| FLOPs | Floating Point Operations |
| KV cache | Key-Value cache |
| FANET | Flying Ad hoc Network |
| OOM | Out-Of-Memory |
| MLP | Multi-Layer Perceptron |
References
- Ahmed, A.; Wang, L.; Kim, J.; Jin, J.; Cho, K.; Kwon, C.; Lee, D.J. LLM-guided distributed model predictive control for decentralized UAV formations. IEEE Access 2026, 14, 15226–15240.
- Yan, G.C.; Du, J.; Chen, S.; Tian, X.G. Study on the Path Optimization Method of Autonomous Navigation of Uncrewed Aerial Vehicles Integrating Multi-Sensor Data. IEEE Access 2025, 13, 173016–173034.
- Han, B.; Chen, Y.T.; Li, J.R.; Li, J.; Su, J.S. SwarmChain: Collaborative LLM Inference for UAV Swarm Control. IEEE Internet Things Mag. 2025, 8, 64–71.
- Javaid, S.; Fahim, H.; He, B.; Saeed, N. Large Language Models for UAVs: Current State and Pathways to the Future. IEEE Open J. Veh. Technol. 2024, 5, 1166–1192.
- Nguyen, T.M.; Truong, V.T.; Le, L.B. Agentic AI Meets Edge Computing in Autonomous UAV Swarms. IEEE Internet Things Mag. 2025, 8, 87–95.
- Maletić, M.; Peti, M.; Petrović, T.; Bogdan, S. Spatial-Semantic Reasoning using Large Language Models for Efficient UAV Search Operations. In Proceedings of the 12th European Conference on Mobile Robots, Padova, Italy, 2–5 September 2025; pp. 1–8.
- Semerikov, S.O.; Vakaliuk, T.A.; Kanevska, O.B.; Ostroushko, O.A.; Kolhatin, A.O. Edge intelligence unleashed: A survey on deploying large language models in resource-constrained environments. J. Edge Comput. 2025, 4, 179–233.
- He, Y.; Fang, J.C.; Yu, F.R.; Leung, V.C. Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach. IEEE Trans. Mob. Comput. 2024, 23, 11253–11264.
- Jin, H.P.; Wu, Y.Z. CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration. In Proceedings of the 32nd IEEE International Conference on Web Services, Helsinki, Finland, 7–12 July 2025; pp. 316–323.
- Shi, W.S.; Cao, J.; Zhang, Q.; Li, Y.H.Z.; Xu, L.Y. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646.
- Kristiani, E.; Verma, V.K.; Yang, C.-T. Deploying LLM Transformer on Edge Computing Devices: A Survey of Strategies, Challenges, and Future Directions. AI 2026, 7, 15.
- Qu, G.Q.; Chen, Q.Y.; Wei, W.; Lin, Z.; Chen, X.H.; Huang, K.B. Mobile Edge Intelligence for Large Language Models: A Contemporary Survey. IEEE Commun. Surv. Tutor. 2025, 27, 3820–3860.
- Li, Y.C.; Wen, H.; Wang, W.J.; Li, X.Y.; Yuan, Y.Z.; Liu, G.H.; Liu, J.C.; Xu, W.X.; Wang, X.; Sun, Y.; et al. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security. arXiv 2024, arXiv:2401.05459.
- Wang, Y.M.; Lin, Y.; Zeng, X.D.; Zhang, G.N. PrivateLoRA For Efficient Privacy Preserving LLM. arXiv 2023, arXiv:2311.14030.
- Wei, Y.T.; Wu, S.; Ji, Z.; Yu, Z.G.; Jiang, C.X.; Kuang, L.L. Multi-UAV Collaborative Edge Computing Algorithm for Joint Task Offloading and Channel Resource Allocation. J. Commun. Inf. Netw. 2024, 9, 137–150.
- Cai, F.L.; Yuan, D.; Yang, Z.; Cui, L.Z. Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing. In Proceedings of the 31st IEEE International Conference on Web Services, Shenzhen, China, 7–13 July 2024; pp. 799–809.
- Chen, Y.X.; Li, R.P.; Zhao, Z.F.; Peng, C.H.; Wu, J.J.; Hossain, E. NetGPT: An AI-Native Network Architecture for Provisioning Beyond Personalized Generative Services. IEEE Netw. 2024, 38, 404–413.
- Shen, X.; Dong, P.; Lu, L.; Kong, Z.L.; Li, Z.G.; Lin, M.; Wu, C.; Wang, Y.Z. Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 18944–18951.
- Lin, J.; Tang, J.M.; Tang, H.T.; Yang, S.; Xiao, G.X.; Han, S. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. GetMobile Mob. Comput. Commun. Rev. 2025, 28, 12–17.
- Zhang, M.J.; Shen, X.M.; Cao, J.N.; Cui, Z.Y.; Jiang, S. EdgeShard: Efficient LLM Inference via Collaborative Edge Computing. IEEE Internet Things J. 2025, 12, 13119–13131.
- Yang, J.; Wu, Q.; Feng, Z.Y.; Zhou, Z.; Guo, D.K.; Chen, X. Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts. IEEE Trans. Mob. Comput. 2025, 24, 13648–13662.
- Ye, S.Y.; Ouyang, B.; Zeng, L.K.; Qian, T.Y.; Chu, X.W.; Tang, J. Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices. In Proceedings of the 44th IEEE International Conference on Computer Communications, London, UK, 19–22 May 2025; pp. 1–10.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 1877–1901.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
- Varotto, L.; Fabris, M.; Michieletto, G.; Cenedese, A. Visual Sensor Network Stimulation Model Identification via Gaussian Mixture Model and Deep Embedded Features. Eng. Appl. Artif. Intell. 2022, 114, 105096.
- Jadbabaie, A.; Makur, A.; Mossel, E.; Salhab, R. Inference in Opinion Dynamics under Social Pressure. IEEE Trans. Autom. Control 2022, 68, 3377–3392.
- Lin, Y.Y.; Peng, S.J.; Wu, S.P.; Li, Y.B.; Lu, C.Z.; Ye, K.J. Serving LLM in Distributed GPU Cluster with Fine-Grain Pipeline Constraints. IEEE Trans. Serv. Comput. 2025, 18, 3164–3176.
- Bekmezci, I.; Sahingoz, O.K.; Temel, S. Flying Ad-Hoc Networks (FANETs): A Survey. Ad Hoc Netw. 2013, 11, 1254–1270.
- Tripathi, V.; Kadota, I.; Tal, E.; Rahman, M.S.; Warren, A.; Karaman, S.; Modiano, E. WiSwarm: Age-of-Information-Based Wireless Networking for Collaborative Teams of UAVs. In Proceedings of the IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, Hoboken, NJ, USA, 17–20 May 2023; pp. 1–10.
- IEEE Std 802.11-2024; IEEE Standard for Information Technology–Telecommunications and Information Exchange Between Systems Local and Metropolitan Area Networks–Specific Requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. IEEE: New York, NY, USA, 2024.
- Xia, H.M.; Yang, Z.; Dong, Q.X.; Wang, P.Y.; Li, Y.Q.; Ge, T.; Liu, T.Y.; Li, W.J.; Sui, Z.F. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 7655–7671.
- Cai, T.; Li, Y.; Geng, Z.; Peng, H.; Lee, J.D.; Chen, D.; Dao, T. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 5209–5235.
- Leviathan, Y.; Kalman, M.; Matias, Y. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19274–19286.
- Lee, D.; Maeng, S.J.; Ozdemir, O.; Pandian, M.B.; Guvenc, I. Reliability of Wi-Fi, LTE, and 5G-Based UAV RC Links in ISM Bands: Uplink Interference Asymmetry Analysis and HARQ Design. IEEE Open J. Commun. Soc. 2026, 7, 386–406.
- Do, D.-T.; Le, N.-K.; Nguyen, L.-M. AdaSpec: Adaptive Multilingual Speculative Decoding with Self-Synthesized Language-Aware Training and Vocabulary Simplification. In Proceedings of the 40th AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026; pp. 30530–30538.
- Li, X.C.; Spatharakis, D.; Ghafouri, S.; Fan, J.K.; Vandierendonck, H.; John, D.; Ji, B.; Nikolopoulos, D.S. SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving. In Proceedings of the 10th ACM/IEEE Symposium on Edge Computing, Arlington, VA, USA, 3–6 December 2025; pp. 1–8.
- Zheng, C.; Yang, T.T. Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding. In Proceedings of the 17th International Conference on Wireless Communications and Signal Processing, Chongqing, China, 23–25 October 2025; pp. 1–6.
- Liu, X.; Luo, L.Z.; Tang, M.; Huang, C.; Chen, X. FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference. arXiv 2025, arXiv:2507.02620.
- Wei, J.; Wang, X.Z.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 24824–24837.
- Shi, L.H.; Li, Z.C.; Zhang, L.F.; Qi, B.Y.; Liu, G.M.; Zhao, H. Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios. In Proceedings of the 40th AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026; pp. 32947–32955.
- Kumar, T.; Dao, T.; May, A. Speculative Speculative Decoding. arXiv 2026, arXiv:2603.03251.










| Category | Parameter | Value/Range |
|---|---|---|
| Hardware (emulated) | UAV node (3 units) | NVIDIA Jetson AGX Orin (1.88 TFLOPS, 16 GB RAM) |
| | UAV node (1 unit) | NVIDIA Orin NX (3.33 TFLOPS, 32 GB RAM) |
| | Cloud node | NVIDIA RTX 3090 (36 TFLOPS, 32 GB RAM) |
| Network | Inter-UAV bandwidth | 400–1000 Mb/s, random fluctuation |
| | Air-to-ground bandwidth | 100–400 Mb/s |
| | Activation tensor size | 8–32 MB (dependent on layer and sequence length) |
| Model and Input | LLM models | Llama-2 7B, 13B, 70B (FP16) |
| | Sensor log length | 16, 32, 64, 128 tokens |
| | Attention | Causal with KV cache |
| Baselines | | Edge-Solo; Cloud-Edge-Even; Cloud-Edge-Opt; EdgeShard [20]; Jupiter [22]; Orion (ours) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Yang, T.; Guo, H.; Zhao, Z.; Zhu, D. Orion: A Collaborative Edge Inference Framework for Large Language Models Processing Multi-Sensor Data in UAV Swarms. Drones 2026, 10, 410. https://doi.org/10.3390/drones10060410

