Article

Scenario-Driven Evaluation of Autonomous Agents: Integrating Large Language Model for UAV Mission Reliability

by
Anıl Sezgin
Research and Development, Siemens A.S, Istanbul 34870, Turkey
Drones 2025, 9(3), 213; https://doi.org/10.3390/drones9030213
Submission received: 4 February 2025 / Revised: 11 March 2025 / Accepted: 12 March 2025 / Published: 17 March 2025
(This article belongs to the Section Drone Communications)

Abstract

The Internet of Drones (IoD) integrates autonomous aerial platforms into security, logistics, agriculture, and disaster relief. Decision-making in the IoD suffers from limited real-time adaptability, platform interoperability, and scalability. Conventional decision frameworks built on heuristic algorithms and narrow Artificial Intelligence (AI) falter in complex environments. To mitigate these shortcomings, this study proposes an augmented decision model that combines large language models (LLMs) with retrieval-augmented generation (RAG) to enhance IoD intelligence. Centralized intelligence is achieved by processing environmental factors, mission logs, and telemetry with real-time adaptability. Efficient retrieval of contextual information through RAG is merged with LLMs for timely, correct decision-making. Contextualized decision-making vastly improves a drone network’s adaptability in uncertain environments. With LLMs and RAG, the model introduces a scalable, adaptable solution for IoD operations. It supports the development of autonomous aerial platforms across industries, with future work targeting computational efficiency, ethics, and extended operational environments. An in-depth analysis was conducted on a collection of drone telemetry logs and operational factors. Decision accuracy, response time, and contextual relevance were measured to gauge system effectiveness. The model performed strongly, achieving a BLEU score of 0.82 and a cosine similarity of 0.87, demonstrating its effectiveness for operational commands. Decision latency averaged 120 milliseconds, confirming its suitability for real-time IoD use cases.

1. Introduction

The Internet of Drones is a cutting-edge development in autonomous technology, ushering in a new era in which networked drones skillfully conduct complex operations across numerous industries. IoD refers to an interconnected network of drones that communicate with each other and with centralized control systems over the internet, enabling coordinated operations, data sharing, and real-time decision-making across various industries. Industries including logistics, search and rescue, agricultural tracking, and security operations have embraced this technology. As the use of drones expands, a critical need arises for efficient coordination and decision-support structures. The IoD aims to provide a platform for integrating drones in an organized environment in which information, assets, and intelligence can be shared to accomplish critical objectives.
IoD faces the following key operational challenges:
  • Limited Adaptability: conventional decision-making frameworks struggle in dynamic environments, failing to respond effectively to weather fluctuations, obstacles, and mission changes.
  • Interoperability Issues: inconsistent data formats and communication protocols between drones and ground stations reduce efficiency.
  • Scalability Constraints: as IoD networks grow, data processing delays and network congestion hinder their performance.
  • Security Risks: IoD systems remain vulnerable to cyberattacks, GPS spoofing, and unauthorized access, threatening mission integrity [1].
The past decade has seen an unprecedented range of breakthroughs and improvements in AI and machine learning (ML). These breakthroughs have unlocked new and emerging avenues for enhancing and optimizing decision-making processes in autonomous entities and complex systems. AI integration has demonstrated significant improvements in reducing cognitive workloads and enhancing situational awareness, which directly impacts UAV operator performance and safety [2]. Among these breakthroughs, one stands out in prominence: the development of LLMs. LLMs are advanced deep learning models trained on massive text datasets, enabling them to understand, generate, and process natural language with high contextual awareness. These models are based on transformer architectures and have demonstrated remarkable performance in a wide range of NLP tasks. LLMs are powerful, high-performance tools, not only for analysis but also for generating complex datasets, including ones with unstructured textual information. Advanced AI-assisted UAV systems must integrate real-time physiological monitoring and decision-support techniques, such as the analytic hierarchy process (AHP), to optimize AI-UAV interactions and improve operator efficiency [3]. With tools such as retrieval-augmented generation, LLMs can extensively review and analyze information specific to a range of environments and, consequently, generate actionable information that is both highly specific and relevant. To evaluate the effectiveness of generated textual outputs, metrics such as BLEU (Bilingual Evaluation Understudy) are commonly used. BLEU is a widely accepted measure for assessing the quality of machine-generated texts by comparing them to human-written references, ensuring linguistic accuracy and coherence. Taken together, these technological breakthroughs have enormous potential to remedy the inaccuracies prevalent in conventional IoD decision frameworks. Recent studies have consistently testified to the key role played by AI-powered frameworks in optimizing operational efficiency in the IoD environment, specifically in cases that demand a high level of real-time adaptability in reaction to fluctuations and shifts in the environment [4].
Despite these advancements, most existing IoD decision-making models still rely on narrow AI frameworks or rule-based systems, which introduce several limitations:
  • Inefficiency in Real-Time Processing: conventional models struggle to analyze telemetry data dynamically as conditions change.
  • Fragmented Data Sources: IoD networks require multi-source data fusion, integrating telemetry logs, sensor feedback, and mission directives.
  • Scalability Issues in Expanding Networks: as IoD systems grow, data bottlenecks and inefficiencies reduce their responsiveness.
The absence of strong decision frameworks impacts not only the performance of individual drones but also the overall efficiency of IoD networks. As IoD networks grow in complexity, AI-driven security frameworks such as convolutional attention networks have become crucial for detecting cyber threats in real-time and ensuring the integrity of UAV operations [5]. For instance, during a disaster, a drone must quickly analyze situational information and coordinate actions for search and rescue operations or relief efforts. Delays and poor decisions can have disastrous consequences. In commercial use cases such as delivery, inefficient routing or failure to adapt to unanticipated obstacles can incur additional expenses and reduce customer satisfaction. The integration of LLMs in IoD frameworks is a key move towards overcoming such vulnerabilities. With their high contextual awareness and reasoning capabilities, LLMs can make decision-making in IoD networks more flexible and accurate. Nevertheless, the direct use of LLMs in IoD decision processes necessitates careful consideration of computational efficiency, real-time requirements, and scalability.
To address these challenges, this study proposes a centralized decision-making model that integrates LLMs with RAG-based architectures, thereby offering the following solutions:
  • Real-Time Analysis: LLMs process telemetry data and mission logs dynamically.
  • Contextual Data Retrieval: RAG enables drones to access mission-specific knowledge for informed decision-making.
  • Optimized Multi-Drone Coordination: a cloud-powered intelligence unit ensures synchronized drone operations.
  • Enhanced Situational Awareness: the model processes unstructured data (e.g., environmental reports and mission objectives) to improve adaptability.
The proposed scheme is analyzed with a dataset comprising drone operational scenarios and telemetry. The dataset is leveraged to emulate a variety of IoD scenarios, including urban navigation, disaster response, and agricultural monitoring at a large scale. Preliminary observations confirm that the integration of LLMs with RAG significantly fortifies IoD decision-making, with improvements in terms of adaptability, scalability, and reliability.
The rest of this study is organized in the following sections: Section 2 presents an overview of the state of the art in LLMs in autonomous systems and in IoD decision-making. Section 3 presents in detail the proposed scheme, including its architecture and its modules. Section 4 presents the experimental configuration, reveals the evaluation results, and identifies the implications of observations. Section 5 concludes with a summary of the key contribution and overall impact of this study. Section 6 outlines future work directions.

2. Related Work

The incorporation of large language models in autonomous platforms, particularly in relation to the Internet of Drones, has emerged as a key field of inquiry. This section integrates studies spanning robotics, UAV optimization, security protocols, human–machine interfaces, and synthetic data creation, providing a systematic basis for understanding LLMs’ contribution to IoD platforms. The review is organized thematically to shed light on the development of the ideas and technology that underlie the proposed platform.

2.1. Foundational Integration of LLMs in Robotics

The foundational role of LLMs in robotics is established through studies that explore their capability for enabling human-like thinking, perception, and control. Study [6] highlights post-GPT-3.5 LLM improvements, providing robots with enhanced adaptability in operations and contextual thinking capabilities. Study [6] recognizes multimodal frameworks in which LLMs understand texts, images, and sensor feedback, with real-time environment navigation capabilities. Computational requirements and ethics in high-stakes scenarios, however, raise ongoing concerns. Similarly, ref. [7] demonstrates LLMs’ capability for generating toolpaths and processing parameters independently, with an 81.88% success rate in an industrial scenario. These studies validate LLMs’ role in mapping natural language commands with robotic actions, a backbone for IoD systems with real-time adaptability.
Extending this, study [8] expands LLMs’ function in multimodal processing, problem-solving collaboration, and natural language interpretation. The study recognizes contextual ambiguity and concerns regarding data privacy but identifies LLMs’ role in supporting complex, adaptable operations. Complementing this, study [9] clusters LLM use cases under perception, decision, and motion control, with strong integration with VLMs and diffusion models emphasized for strong trajectory planning. Together, these founding studies firmly establish LLMs at the nucleus of autonomous decision-making in dynamically changing environments.

2.2. Enhancing Human–Robot Interaction and Task Execution

Human–robot interaction (HRI) frameworks with LLMs are imperative for IoD platforms with collaboration requirements in task execution. Study [10] introduces a feedback loop in which GPT-4 and VLMs allow for the real-time refinement of robotic plans, with a 14% performance gain in real-world environments. This is in agreement with [11], in which GPT-4 is combined with vision transformers and learning-based control for a 65% success rate in unstructured environments, with a strong focus placed on zero-shot learning and iterative feedback.
Study [12] extends HRI fundamentals to groups of drones, translating natural language commands into geometric structures with 93% accuracy for complex structures. Interactive editing and real-time feedback in such a system exhibit LLMs’ contribution to multi-agent system scalability. Study [13] employs fine-tuned BERT models to translate natural language into drone control commands with 89% accuracy, proving the ease of use in UAV operations. Together, these studies exhibit LLMs’ contribution to mapping human intention onto autonomous execution, a key imperative for IoD platforms.

2.3. Dynamic Control and Adaptation in Robotic Systems

Adaptive control frameworks with LLMs counteract unpredictability in IoD environments. Study [14] introduces a hybrid controller combining Lyapunov theory and LLM-guided contextual reasoning, with 100% success in adapting 2-link manipulators to new dynamics. That study’s emphasis on chain-of-thought prompts and real-time adaptability is shared with [15], which uses Pythonic program generation for a 37% improvement in success in complex simulations. The integration of precondition checking and assertions in the latter ensures task feasibility, a necessity for drones in dynamic airspaces. Recently, ref. [16] extended this paradigm with GPT-4 for automating stepwise policy creation in discrete controller synthesis, achieving an 82.2% reduction in state space and a 91.2% reduction in transitions in drone-control scenarios, as well as a considerable improvement in computational efficiency for real-time adaptability. Study [17] extends these tenets to autonomous locomotion task codes, minimizing human intervention and achieving human-level performance in simulations. Study [18] demonstrates that LLMs can generate multibody dynamics simulations from natural language, though they struggle with complex syntax. All these studies validate LLMs’ potential for optimizing control policies and adapting to changing constraints, key for IoD platforms that must recalibrate in real-time.

2.4. Security and Anomaly Detection in Autonomous Systems

Security is paramount for IoD systems, whose failure can have catastrophic consequences. Study [19] provided a comprehensive review of IoD security requirements, categorizing common vulnerabilities and proposing countermeasures for secure drone network operations. Study [20] employs FastText embeddings and ANN search for detecting malicious API requests with minimal training data, achieving high recall and accuracy. Scalability via incremental updates in this model aligns with [21], which employs blockchain and LLMs for secure V2X communications, with 18% less latency in high-traffic scenarios. The latter’s reputation-based incentive mechanism promotes node dependability, a model extendable to drone swarms.
Study [22] categorizes AI-powered defenses for GPS spoofing and signal jamming, with an emphasis towards lightweight models for real-time analysis of threats. Complementing this, ref. [23] introduces a safety taxonomy matrix for vulnerability analysis, with strong adversarial training proving to be a countermeasure. Together, these studies form a blueprint for securing IoD platforms for emerging threats.

2.5. UAV-Specific Applications and Optimization

Optimizing UAV operations through AI techniques is a recurring theme. Study [24] provided an extensive review of IoD applications, including smart cities, cloud-based mission control, and real-time UAV decision-making, emphasizing the role of AI in enhancing drone autonomy and operational efficiency. Study [25] addresses the planning of trajectories and radio resource management, with an emphasis placed on AI for enhancing scalability. Study [26] compares and contrasts SVSD, SVMD, and MVMD models, with a consideration for NP-hard concerns such as real-time re-allocation and the energy efficiency of the tasks. These studies make a contribution towards IoD frameworks through a discussion of multi-agent coordination optimization algorithms.
Study [27] introduces the ASDA framework, with a 75% improvement in the diversity of the synthetic datasets through procedural domain randomization. This study closes the sim-to-real gap, with robust model training for drone landing pad detection feasible. Similarly, ref. [28] compares and contrasts CNNs for aerial action recognition through a proposed integration with transformers for occlusion improvements. These studies illustrate the requirements for high-quality datasets and algorithm adaptability in use cases for UAVs.

2.6. Collaborative and Multi-Agent Systems

Collaborative systems with LLMs are critical for IoD’s scalability. Study [29] proposes a single, unifying workflow for agent profiling, interaction, and evolution, with a strong emphasis placed on LLMs for enhancing inter-agent communications. Study [30] couples LLMs with decentralized consensus to offer traceable content creation, a process beneficial for secure data exchanges in a swarm of drones. Study [31] identifies edge-based TinyML models for real-time decision-making, balancing computational efficiency with cost-effectiveness, a critical consideration for budget-constrained drones.

2.7. Synthetic Data Generation and Simulation

Study [32] introduces LLM-ENFT, a system that carries out autonomous fault restoration in edge networks, improving throughput through dynamic congestion management. LLM-ENFT employs synthetic network scenarios for edge environments, increasing system robustness. Similarly, study [33] compares LLM-based test-case generation tools such as StarCoder, demonstrating their capability to automate test-case creation and minimize the software-testing workload. Such frameworks highlight synthetic data’s contribution to enhancing the dependability of IoT-driven systems, particularly in edge computation and software testing.
To adapt such a model to agricultural environments, the model in [34] integrates GIS, satellite data, weather APIs, and GPT models to generate synthetic crop suitability recommendations. By processing real-time geospatial data and simulating agricultural performance through AI-powered simulations, the model predicts best-fit crops for a region with 80% accuracy against ground-truth datasets. The GPT-4 model synthesizes information on temperature, moisture, and precipitation, generating contextual agricultural information and effectively mapping real-world observations onto AI simulations. This model is in consonance with the synthetic data use cases in [32,33], in which simulation scenarios inform decision-making in complex environments. Integration with generative models not only reduces overreliance on past datasets but also enables flexible farm planning under variable climatic scenarios, further showcasing synthetic data’s capability to enhance IoD dependability and precision in agricultural frameworks.

2.8. Code Generation and Software Development

Recent advancements in LLM code generation focus on adaptability, security, and modularity. Study [35] combines LLMs with genetic programming to automate context-aware code development for dynamic environments such as IoT-powered drone networks, and study [36] reduces cross-platform migration faults by 71.5% through syntax transformation and security checking for critical use cases such as autonomous navigation. To address complexity, study [37] employs hierarchical decomposition of work, breaking problem-solving into modular subtasks in harmony with IoD requirements such as swarm coordination. Study [38] integrates unstructured API documentation and coded output through RESTBERTa, achieving 81.95% parameter-matching accuracy and 88.44% endpoint-discovery accuracy; by guiding LLMs to disambiguate schemas, it reduces compatibility faults by 37% in OpenAPI pipelines, processes critical for IoD operations such as the real-time aggregation of telemetry or interoperability between multi-drone platforms. Capitalizing on API-focused automation, study [39] introduces an LLM-powered motif-aware linearizing graph transformer (L-MTAR) for Web API recommendation, combining semantic representations from fine-tuned LLMs with higher-order structural motifs to represent hidden dependencies between APIs. By combining spectral graph theory and motif adjacency matrices, L-MTAR reaches state-of-the-art recommendation accuracy (HR@10 of 0.9306, NDCG@10 of 0.9081) and reduces computational complexity to linear time, allowing for efficient use in IoD environments with limited resources and real-time requirements for API coordination. Complementing these breakthroughs, study [40] demonstrates that LLM-powered AI agents are feasible for the autonomous administration of Linux servers: a GPT-4-powered agent achieved 100% success in 150 Dockerized operations (e.g., Bash scripting, process tracking, and security fortification) through iterative feedback, minimizing human errors and operational burden. Secure execution is assured through its Dockerized sandbox, and its adaptability, honing strategies over 1.13–1.43 average trials per activity, underlines LLMs’ potential for critical system administration such as the real-time coordination of containers and vulnerability countermeasures in IoD networks with distributed servers.

2.9. Future Directions and Ethical Considerations

The studies surveyed in this review cumulatively highlight several challenges, including computational overhead, ethical concerns, and restrictions in real-time processing capabilities. Study [41] identifies scenarios relevant to human–AI collaboration, re-emphasizing the need for ethical frameworks, and studies [15,30] reiterate the importance of security and explainability in LLM use cases. Study [42] identifies critical gaps in multilingual and domain-specific datasets, raises ethical concerns regarding privacy, bias, and the environmental impact of LLM training, and re-emphasizes concerns with synthetic data use. It is critical that future studies address these gaps while supporting retrieval-augmented generation, on-device base models, and fair resource development.

3. Methodology

3.1. Dataset Overview

The dataset used in developing this framework is a carefully curated collection generated with the Hypersense platform, an enhanced Internet of Drones (IoD) technology. Hypersense’s design enables the collection and documentation of rich telemetry and sensor information during drone operations. This involves not only unprocessed sensor readings but also actions defined by an expert during critical events in a mission. Information is extracted from a shared database in addition to system logs kept by the platform. From these logs, we systematically build structured JSON objects encapsulating real-time and historical information about drone operations. These JSON objects combine sensor readings and expert-defined actions, so each operational detail is captured in its proper context. The dataset used in this study consists of 150,000 telemetry log entries, collected from both real-world and simulated drone missions. It spans over 65 flight hours, covering operational scenarios such as urban navigation, disaster response, and agricultural monitoring. Each telemetry entry includes critical parameters such as the battery level, altitude, GPS coordinates, speed, and environmental conditions, including the wind speed and temperature. Additionally, expert-defined actions recorded during mission-critical events provide valuable insights for refining decision-making strategies.
To enhance the effectiveness of the training process and ensure model generalization, several preprocessing steps were applied to the dataset. First, noise filtering was conducted to remove incomplete, redundant, or inconsistent telemetry records that could negatively impact the learning performance. This step helped to maintain data integrity and reliability. Feature normalization was performed to standardize key sensor readings, such as altitude and velocity, across different drone models and operational conditions. By ensuring consistency in numerical values, this step improved the model’s ability to learn patterns from the data and make accurate predictions under varying circumstances. Context enrichment was another crucial preprocessing step, where historical mission logs and expert annotations were integrated into the dataset. This enrichment process provided the model with a broader context for decision-making, allowing it to learn from past experiences and apply those insights to new and unseen scenarios. Following these preprocessing steps, the structured dataset was used for both the fine-tuning and real-time inference testing of the proposed LLM-RAG model. The diverse range of flight conditions and mission challenges included in the dataset enabled the model to generate operationally reliable drone commands with improved contextual awareness.
For instance, both environment variables and drone-specific information such as battery state, velocity, height, wind conditions, and connectivity state are included in the dataset. Expert-defined actions such as moving towards a target location and returning to the base under certain conditions (e.g., battery exhaustion) are included, too. Expert-defined actions form a critical part of simulating real-life decision-making in reaction to operational challenges.
A typical data structure in the dataset is described in Table 1.
In this example, the JSON record captures a “GPS loss of signal” incident, after which “Switch to manual navigation” is performed. Such logs enable thorough post-analysis and form a foundation for developing predictive models and decision-support systems. This operational dataset is critical for training algorithms that will mimic and extend drone operations in real environments.
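To make this structure concrete, the following is a minimal illustrative JSON record; the field names and values are hypothetical stand-ins for the schema summarized in Table 1, not the exact Hypersense format:

{
  "timestamp": "2024-11-02T14:31:07Z",
  "drone_id": "UAV-117",
  "telemetry": {
    "battery_pct": 12,
    "altitude_m": 48.2,
    "speed_mps": 6.1,
    "wind_mps": 4.3,
    "connectivity": "degraded",
    "gps": { "lat": 41.0082, "lon": 28.9784, "signal": "lost" }
  },
  "event": "GPS loss of signal",
  "expert_action": "Switch to manual navigation"
}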
To preserve the integrity and usability of the dataset for machine learning, a sequence of preprocessing operations is performed: filtering out noise from irrelevant information, normalizing sensor values for uniformity between variables, and extracting features to prioritize the factors most critical for accurate prediction. These preprocessing operations validate the effectiveness of the dataset for a decision-support system built with a large language model and, therefore, ensure its suitability for complex analysis and enhanced decision-making for drone operations.
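A minimal Python sketch of these preprocessing operations is given below; the column names and the derived feature are hypothetical, while the use of Pandas for this stage follows the tooling reported in Section 4.2:

import pandas as pd

def preprocess(logs: pd.DataFrame) -> pd.DataFrame:
    # Noise filtering: drop incomplete or duplicated telemetry records.
    logs = logs.dropna(subset=["battery_pct", "altitude_m", "speed_mps"])
    logs = logs.drop_duplicates()
    # Normalization: z-score key sensor readings so values are comparable
    # across drone models and operating conditions.
    for col in ["altitude_m", "speed_mps", "wind_mps"]:
        logs[col] = (logs[col] - logs[col].mean()) / logs[col].std()
    # Feature extraction: derive a simple indicator the decision core can
    # prioritize (hypothetical example feature).
    logs["low_battery"] = logs["battery_pct"] < 15
    return logs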

3.2. System Architecture

The architecture of the proposed system consists of three key components: information collection and preprocessing, a decision nucleus, and an execution and feedback mechanism. The three work together synergistically to enable continuous operations across a variety of scenarios through an integrated pipeline that runs through collection, processing, and actionable decision output phases. In the first stage, the collection and preprocessing phases leverage the Internet of Drones platform, collecting and collating telemetry information, mission logs, and real-time environment information. Through this mechanism, a continuous flow of uniform and reliable information is assured from a decentralized network of sensors. Noise filtering, normalization, and feature extraction improve information quality, transforming unprocessed inputs into organized forms for subsequent operations. This stage not only preserves information integrity but also gears the system’s adaptability towards reacting to changing inputs, such as fluctuations in sensor readings, a loss of connectivity, and changing mission parameters.
The decision-making nucleus forms the core of the system, combining strong LLM capabilities with retrieval-augmented generation (RAG). By dynamically combining real-time contextual information with information derived from an enriched database, the nucleus creates commands tailored to operational use at a specific location and point in time. The RAG mechanism retrieves the information relevant to a specific scenario and then processes it through an LLM to produce output specific to environmental and situational factors. For example, when a critical battery failure is detected in a drone, the nucleus compiles relevant information, such as the distance to the base and the current environment, to produce a command such as “Return to base”, ensuring operational continuity and safety.
The system employs LLaMA3.2 1B Instruct, a lightweight model specifically tuned for producing strings of drone commands. It processes structured inputs (e.g., “Battery: 10%, Obstacles: 2, GPS: [lat, long]”) and generates actions with contextual awareness such as “Return to base via sector B-3”. For efficiency in edge deployment, the model is quantized for reduced computational loading, supporting real-time inference in edge hardware. Fine-tuning entails mapping outputs to comply with operational requirements and safety protocols, with a preference for mission-critical objectives such as collision avoidance or conserving energy.
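As a sketch of how such a model could be invoked, the snippet below uses the Hugging Face transformers API with the publicly available Llama 3.2 1B Instruct checkpoint; the system prompt and decoding settings are illustrative assumptions, not the article’s exact fine-tuned setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Structured telemetry input, as in the example from the text.
state = "Battery: 10%, Obstacles: 2, GPS: [41.0082, 28.9784]"
messages = [
    {"role": "system", "content": "Generate a safe, concise drone command."},
    {"role": "user", "content": state},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=32)
# Decode only the newly generated tokens, i.e., the command itself.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))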
Cosine similarity is calculated by encoding queries and historical mission information into dense vector representations. Contextual representations of texts (e.g., a “Battery critical: 8%” alarm for a drone or a previous log message such as “Returned safely with 7% battery”) are produced by a pre-trained language model. Sentence representations, generated by averaging the token-level outputs, are then compared via cosine similarity. For instance, when a drone reports a sensor problem, the system encodes the query, compares it with a structured knowledge base of previous missions by computing similarity, and retrieves the most relevant entries (e.g., low-battery protocols).
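The following NumPy sketch illustrates this retrieval step; the encode callable stands in for the pre-trained language model’s mean-pooled sentence encoder, which the article does not specify:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Equation (2): angular closeness of two dense vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, kb_texts: list, encode, k: int = 3):
    # Encode the live query (e.g., "Battery critical: 8%") and each
    # knowledge-base entry, then rank entries by cosine similarity.
    q = encode(query)
    scores = [cosine(q, encode(t)) for t in kb_texts]
    top = np.argsort(scores)[::-1][:k]
    return [(kb_texts[i], scores[i]) for i in top]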
Generated commands are scored with BLEU, a measure of conformance between system-generated outputs and expert-approved references. N-gram overlaps (up to 4-grams, e.g., “set altitude to 50 m”) are measured against a validated corpus of approved commands. Low scores trigger manual checking for conformance with safety protocols, and high scores confirm operational suitability. For example, a command such as “Reroute to avoid storm” is rated against reference actions such as “Evacuate northwest for turbulence”, accounting for lexical variation while preserving technical accuracy. This stage augments automated decision-making with quality assurance, a requirement for high-consequence environments.
The final module, the execution and feedback mechanism, couples decision outputs with actions performed by drones. Commands, including navigation corrections and obstacle avoidance maneuvers, are dispatched to drones and executed in real-time. A feedback loop observes performance and the environment, and this information is fed back into the system iteratively to update and improve decision-making.
The algorithm flow of the proposed model, as shown in Algorithm 1, employs a mix of LLMs and RAG techniques in the Internet of Drones platform for decision improvement. The algorithm consists of collecting data, preprocessing, the retrieval of contexts, the creation of commands, execution, and refinement through feedback.
Algorithm 1. IoD Decision-Making System
1: Initialization: create a knowledge base with telemetry logs, mission records, and contextual inputs. Load the pre-trained LLM.
2: for each input_data do
3: Extract telemetry, mission, and contextual data from input_data.
4: Retrieve relevant context from the knowledge base.
5: Compute the similarity score using cosine similarity.
6: Generate command using the LLM based on the retrieved context.
7: Compute the BLEU score for command accuracy.
8: Execute the generated command and log the process.
9: Monitor execution and collect performance metrics.
10: Update knowledge base with performance feedback.
11: end for
12: Return: the generated command, similarity score, BLEU score, and performance metrics.
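A compact Python rendering of Algorithm 1 is sketched below; the knowledge_base, llm, and executor objects and their methods are hypothetical placeholders for the modules described above, not the article’s implementation:

def iod_decision_loop(input_stream, knowledge_base, llm, executor):
    """Minimal sketch of Algorithm 1 (IoD decision-making loop)."""
    for input_data in input_stream:                        # step 2
        query = knowledge_base.build_query(input_data)     # step 3: telemetry/mission/context
        context, similarity = knowledge_base.retrieve(query)  # steps 4-5: RAG + cosine similarity
        command = llm.generate(query, context)             # step 6: LLM command generation
        bleu = llm.bleu_vs_references(command)             # step 7: BLEU accuracy check
        metrics = executor.run_and_monitor(command)        # steps 8-9: execute, log, monitor
        knowledge_base.update(metrics)                     # step 10: feedback update
        yield command, similarity, bleu, metrics           # step 12: returned outputs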

3.3. Decision Flow

The decision flow, which can be seen in Figure 1, describes a systematic sequence of processes that enable efficient and flexible operations in the Internet of Drones and allow for a proper reaction to changing circumstances. In the first stage, data collection involves the real-time collection of contextual and telemetry information directly from the drones. Collected information covers critical statistics such as battery level, velocity, height, weather, and connectivity, and they together form the basis for sound decision-making.
In the following preprocessing stage, raw information is subjected to in-depth cleaning, normalization, and transformation according to the requirements of analysis in the system. Preprocessing seeks to remove unnecessary noise, correct discrepancies, and normalize disparate types of inputs and, therefore, enable the integration of disparate sources of information into a single, coherent dataset capable of supporting the decision-making nucleus effectively.
Once preprocessing is completed, the stage of context retrieval utilizes the retrieval-augmented generation algorithm to obtain relevant contextual information from a pre-indexed store of information. Historical records of telemetry, operational protocols, and information specific to a mission form part of such a store, allowing for the current state in a drone to become associated with contextual and historical datasets. The context extracted then forms an input for command generation.
The generation of commands forms the key part of the analysis process, in which a well-tuned large language model integrates the processed information and the retrieved context to generate commands tailored to specific scenarios. The generated commands are optimized for the operational environment at hand, ensuring that drones operate at an ideal level even in complex scenarios. For instance, in case of a loss of connectivity, the system can issue a “Return to base” command based on an evaluation of the involved risks.
The execution stage follows, in which generated commands are communicated to the drones, and subsequently, the drones execute actions such as changing routes, evading obstacles, or completing specific tasks. To form a feedback loop, this stage obtains performance statistics, situational information, and any fluctuations in environment factors. The accumulated feedback is then added to the system, enriching the knowledge base and allowing for improvements in future decision cycles. This iterative approach ensures continuous improvements in system adaptability, operational effectiveness, and accuracy in decision-making.

4. Results and Experiments

4.1. Evaluation Metrics

The evaluation of the proposed Internet of Drones system employed a rich set of performance metrics spanning several dimensions: decision accuracy, response time, and RAG-specific measures such as cosine similarity and BLEU scores, which together assess the system’s effectiveness in real-time IoD operations. Decision accuracy measures how often the system-generated drone commands match expert-labeled correct actions. It is defined in Equation (1).
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1)
True Positives (TPs) represent correctly generated commands that match expert recommendations, ensuring that the system effectively aligns with expected decision-making processes. True Negatives (TNs) account for instances where incorrect commands are accurately rejected, preventing erroneous actions. False Positives (FPs) occur when the system mistakenly generates an incorrect command, which can lead to unintended drone behavior. False Negatives (FNs) refer to correct commands that were not generated by the system, potentially resulting in missed operational actions. A higher accuracy value indicates that the model successfully produces correct drone operation commands while minimizing errors, thereby improving overall system reliability.
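As a worked example of Equation (1), the sketch below computes decision accuracy from hypothetical confusion-matrix counts (not reported values):

def decision_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Equation (1): share of correctly handled commands.
    return (tp + tn) / (tp + tn + fp + fn)

print(decision_accuracy(tp=481, tn=37, fp=12, fn=8))  # ≈ 0.963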
Cosine similarity is used to compare the vector representations of the retrieved context and input data. It is defined in Equation (2).
$\cos(\theta) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ (2)
The vector A represents the retrieved contextual information, capturing relevant past telemetry logs, mission data, and expert annotations that aid in decision-making. Similarly, the vector B corresponds to the input query, which includes real-time drone telemetry and environmental conditions that require analysis. The terms ||A|| and ||B|| denote the Euclidean norms, or magnitudes, of the vectors A and B, respectively, measuring their overall scale in the vector space. A cosine similarity score close to one indicates a strong contextual relationship between the retrieved knowledge and the input query, ensuring that the system effectively retrieves meaningful information for accurate command generation.
The BLEU score measures the linguistic similarity between system-generated drone commands and expert-provided reference commands. The BLEU score is calculated using n-gram precision and a brevity penalty, as shown in Equation (3).
$\text{BLEU} = BP \times \exp\left( \displaystyle\sum_{n=1}^{N} w_n \log p_n \right)$ (3)
The brevity penalty (BP) in BLEU scoring, defined in Equation (4), prevents the model from favoring overly short translations by penalizing system-generated commands that are significantly shorter than the reference command, where r is the length of the reference command and c is the length of the system-generated command. The n-gram precision (pn) measures the proportion of n-grams in the system-generated output that match those in the reference command, ensuring accuracy at multiple word levels. The weighting factor (wn) determines the contribution of different n-gram levels, with BLEU-4 typically using equal weights (w1 = w2 = w3 = w4 = 1/4) to balance unigrams, bigrams, trigrams, and 4-grams. A higher BLEU score indicates stronger linguistic and contextual alignment between system-generated drone commands and expert-defined reference outputs, ensuring that the generated commands are both syntactically correct and contextually appropriate for real-world operations.
$BP = \min\left(1, e^{\,1 - r/c}\right)$ (4)
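A minimal sketch of this computation using NLTK, the library the experiments report using (Section 4.2), is shown below; the command pair is illustrative:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "return to base via sector B-3".split()
candidate = "return to base through sector B-3".split()

# BLEU-4 with equal weights w1..w4 = 1/4 (Equation (3)); smoothing keeps
# short commands from scoring zero when a higher-order n-gram is missing.
score = sentence_bleu(
    [reference], candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 2))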
Decision accuracy evaluation utilized a range of methodologies that measured the agreement between commands produced by the system and expert recommendations, verifying accurate interpretation and execution in a variety of scenarios. Confusion matrix analysis compared predicted commands with actual expert recommendations, enabling an analysis of true positive, false positive, and false negative cases. Statistical analysis produced a rich view of performance across operational scenarios. Domain-specific simulations emulated real-life scenarios, such as battery failure and connectivity loss, to test the accuracy of the commands generated by the system. Precision, recall, and F1-score values were computed for each scenario, with a strong focus on the model’s ability to maintain a proper balance between accuracy and responsiveness.
In the RAG-based approach, similarity scores were utilized to evaluate both the contextual pertinence and the suitability of knowledge base entries with regard to the input queries. The calculation of cosine similarity, the basis for this scoring, measured the angular distance between representations of the input queries and entries in the knowledge base. This approach ensured that only information that was both contextually and semantically relevant was retrieved, and therefore, the LLM’s command generation process was greatly enhanced. In addition, BLEU (Bilingual Evaluation Understudy) metrics were utilized to evaluate generated command quality against expert-provided outputs. BLEU scores provided a quantitative evaluation of both the contextual accuracy and the language fidelity of the generated commands, guaranteeing compliance with operational standards.
Moreover, cross-validation was performed to verify that the results were not unduly impacted by specific subsets of the data. This involved partitioning the dataset into training and testing sets and averaging the outputs to obtain reliable accuracy values. Case-specific error analysis identified edge cases in which the system’s performance departed from the predicted behavior, allowing for focused improvements in the LLM and RAG modules. Response time was measured as the elapsed time between data collection and command execution, guaranteeing compliance with real-time requirements. This metric was measured under a variety of loads to verify that the platform could maintain performance even under high loads. Together, these approaches present a focused and thorough evaluation of the system’s ability to provide timely and correct decision-making across operational scenarios.

4.2. Experimental Setup

The experiments involved a simulation environment for the IoD, carefully crafted to replicate real-life operational scenarios and support a complete analysis of the proposed system. The experimental dataset included rich telemetry logs, mission logs, and contextual inputs derived from a range of scenarios such as battery failure, loss of connectivity, and critical mission adaptations. All scenarios were designed to thoroughly challenge the robustness and adaptability of the system under changing operational environments.
The experimental environment involved a cloud processing hub with a centralized processing unit, providing efficient computation, information transmission, and computational scalability. Real-time integration and processing for the concurrent operations of multiple drones were supported seamlessly in this arrangement. Input information, as seen in Figure 2, entered a structured pipeline, beginning with telemetry logs, mission logs, and contextual information. Telemetry information such as battery, GPS, height, and velocity facilitated the real-time tracking of drones, and mission information captured routes and executed jobs. Contextual information included additional details such as weather and emergency overrides. All these sources were collected in a centralized hub, ready for processing in downstream stages such as RAG-based context retrieval and LLM-based command creation.
To ensure clarity in the technical implementation, we provide details regarding the specific software tools used for data processing, model training, and evaluation. Data preprocessing was conducted using Python libraries, particularly NumPy v2.1.2 and Pandas v2.2.0, to structure and clean telemetry logs before feeding them into the decision-making pipeline. For model training and fine-tuning, we employed PyTorch v2.5.1, which provided a flexible and efficient framework for optimizing the model’s performance. To evaluate the generated commands and decision outputs, cosine similarity and BLEU score calculations were performed using scikit-learn v1.5.2 and NLTK v3.9.1, ensuring an objective assessment of the model’s accuracy and contextual alignment with expert-generated references. These tools and methodologies enhance the transparency and reproducibility of our experimental setup while providing a robust framework for IoD decision-making.
The model was fine-tuned on 150,000 telemetry log entries using the pre-trained LLaMA 3.2 1B model, with training lasting approximately 24 h at a batch size of 32 and a learning rate of 3 × 10−5. To optimize performance, several techniques were applied. Gradient checkpointing was used to reduce memory overhead, allowing training on limited GPU resources while preserving model accuracy. Mixed-precision training (FP16) was enabled to accelerate computations and lower VRAM consumption. For real-time inference, the fine-tuned model was further optimized using quantization (INT8 precision). This optimization reduced the computational complexity of the inference while maintaining high decision accuracy. The model was then deployed on the same local system, achieving an average inference time of 120 milliseconds per command generation. These optimizations ensure that the model remains efficient and responsive within real-time IoD mission constraints.
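A hedged sketch of this configuration using the Hugging Face Trainer arguments and PyTorch dynamic quantization is shown below; the output path, epoch count, and checkpoint name are assumptions, while the batch size, learning rate, FP16, and gradient checkpointing mirror the reported setup:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

args = TrainingArguments(
    output_dir="llama32-iod-finetune",   # hypothetical path
    per_device_train_batch_size=32,      # reported batch size
    learning_rate=3e-5,                  # reported learning rate
    fp16=True,                           # mixed-precision training
    gradient_checkpointing=True,         # reduce memory overhead
    num_train_epochs=3,                  # assumed; only ~24 h total is reported
)

# Post-training INT8 quantization for real-time inference; dynamic
# quantization of Linear layers is one common route to INT8 precision.
model = AutoModelForCausalLM.from_pretrained("llama32-iod-finetune")  # hypothetical checkpoint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)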
Real-time processing and data integration for a group of concurrent drones were facilitated through this structured input pipeline of telemetry logs, mission logs, and contextual information, including battery level, GPS, height, velocity, routes flown, tasks performed, override events, and weather. All these sources were consolidated at the centrally positioned hub, ready for the downstream operations of RAG-based context retrieval and LLM-based command generation.
High-stress tests simulating peak-demand scenarios were incorporated into the experimental testing. Drones and data sources were added sequentially to evaluate performance under variable workloads. Even at high loads, no degradation in performance was experienced during command execution, with negligible latency and high command accuracy.
Experiments also incorporated step-by-step workflow testing, observing the decision flow through data collection, preprocessing, context retrieval, command generation, and execution. Feedback received from executed commands was analyzed systematically in order to tune system parameters and iteratively improve future decisions.
The key performance factors evaluated in the experiments were computational latency, the elapsed time between data collection and command issuance; command accuracy, the agreement between system-generated and recommended actions; and feedback-loop efficiency, the effectiveness of using performance feedback to update the knowledge base. This experimental setup not only validated the technological feasibility of the proposed scheme but also shed significant light on its real-world usability and extendibility in real-time IoD environments.

4.3. Results Analysis

The research results confirm the significant advantages of the proposed framework in enhancing decision-making processes in the Internet of Drones environment. Response times were kept within acceptable real-time constraints, with an average command generation latency of 120 milliseconds. In addition, the evaluation of the RAG-based framework facilitated a deeper understanding of contextual and linguistic capabilities. Cosine similarity values were utilized in testing for the pertinence of extracted knowledge base entries with regard to input queries, with an overall average value of 0.87 for all scenarios. This high value reflects that extracted contexts consistently conformed to operational requirements, and therefore, accuracy and dependability in command generation processes were guaranteed.
BLEU values provided additional confirmation of the linguistic and contextual accuracy of the generated commands. For example, in the “Battery failure detected” scenario, the system generated the command “Return to base”, matching the expert reference command. The BLEU value for these outputs, calculated by comparing the n-grams of the system-generated command and the reference command, was 0.85, reflecting high concordance in both syntactic and semantic terms. Across all evaluated scenarios, an average BLEU value of 0.82 reflects a consistent capability to generate correct, contextual outputs while preserving semantic and linguistic integrity. BLEU values effectively measured the conformance of generated commands with operational requirements and expert expectations.
The examination of individual cases showed that the model handled a variety of scenarios with ease, such as battery failure, disconnection, and rerouting in high-wind scenarios. In all cases, both the extracted context and the generated commands were evaluated in terms of cosine similarity and BLEU, supporting their relevance and accuracy. This in-depth evaluation emphasized the robustness of the RAG module in extracting relevant information and the skill of the LLM in generating actionable and accurate commands.
Table 2 showcases sample scenarios, their corresponding actions, cosine similarity scores, and BLEU scores.
These scores highlight the framework’s ability to accurately align system-generated actions with expert-recommended solutions, ensuring optimal performance across a variety of scenarios. Cosine similarity scores demonstrate the contextual relevance of retrieved information, while BLEU scores validate the linguistic and operational fidelity of the commands.
Table 3 provides a comprehensive comparison of the system’s decision accuracy across various operational scenarios, highlighting how closely system-generated commands align with expert recommendations. For each scenario, the total number of actions is analyzed alongside correctly executed actions, resulting in a calculated decision accuracy. In the “Battery failure detected” scenario, for instance, the system achieved 96.2% accuracy, demonstrating a high level of dependability in issuing precise commands in critical situations. Similarly, for “Connectivity loss”, the model achieved 94.5% accuracy, ensuring that the drone adapts appropriately to communication disruptions.
While Table 3 showcases the effectiveness of the LLM-based decision-making system within the IoD platform, it is also essential to contextualize these results relative to traditional classification techniques. To provide a well-rounded evaluation, we compared the LLM’s performance with Decision Trees, Random Forests, and Neural Networks, all of which were trained on equivalent operational datasets designed to mimic real-world drone operations. The results indicate that the LLM consistently outperforms traditional classifiers across all scenarios. For example, in the “Battery failure detected” scenario, the LLM achieved 96.2% accuracy, significantly higher than Decision Trees (81.1%), Neural Networks (77.3%), and Random Forests (85.2%). This discrepancy underscores the LLM’s superior ability to process complex contextual factors, such as prioritizing safe landing protocols while dynamically assessing real-time conditions (e.g., proximity to obstacles). Across more challenging scenarios, such as “High wind conditions during flight”, the LLM maintained 91.8% accuracy, while Decision Trees and Neural Networks struggled with accuracies of 72.5% and 55.84%, respectively, indicating difficulties in adapting to rapidly changing environmental conditions. Random Forests performed relatively well (80.1%) but still lagged behind the LLM. Likewise, in “Severe weather and GPS issues”, the LLM exhibited 94.2% accuracy, compared to 81.0% for Decision Trees, 73.2% for Neural Networks, and 86.0% for Random Forests. These results highlight the LLM’s enhanced ability to generalize across diverse operational scenarios and maintain robust decision-making under uncertain conditions.
Traditional classifiers, despite performing well in handling structured datasets, often struggle in dynamic, real-time environments where decision-making requires contextual awareness and reasoning. The LLM, leveraging its retrieval-augmented generation architecture, demonstrates a more adaptive, context-aware approach to command generation. These findings emphasize the necessity of advanced LLM-driven models for complex drone operations, particularly in scenarios where environmental variables change unpredictably and require immediate, intelligent responses.
Figure 3 illustrates the decision-making accuracy of the LLM model compared to the best-performing traditional classifier across thirteen different operational scenarios in the Internet of Drones (IoD) platform. The results highlight performance variations based on scenario complexity, with the LLM consistently outperforming traditional classifiers, particularly in dynamic conditions. For example, in the “Battery failure detected” scenario, the LLM achieves 96.2% accuracy, whereas the best traditional model, Random Forests, reaches 85.2% accuracy. Similarly, in “Connectivity loss,” the LLM attains 94.5% accuracy, compared to 82.6% for Random Forests. The LLM’s advantage becomes more evident in challenging scenarios, such as “High wind conditions” (91.8% vs. 80.1%) and “Severe weather and GPS issues” (94.2% vs. 86.0%), where environmental factors significantly impact drone operations. While traditional classifiers, particularly Random Forests, perform relatively well in structured decision-making tasks, they struggle with contextually complex and dynamically changing environments. The LLM’s ability to process real-time contextual information and adapt to varying operational constraints enables it to generate more accurate and reliable decisions across all tested scenarios.
Figure 4 presents the feature importance analysis using a Random Forest model, highlighting the relative impact of different features on the decision-making process. The results indicate that connectivity, GPS signal strength, and battery level are the most influential factors, significantly contributing to the model’s predictions.
Figure 5 presents partial dependence plots (PDPs) for the features “weather condition” and “distance to target”, both identified as having a low impact on the model’s decision-making. The left plot illustrates that variations in weather conditions (clear, rainy, and foggy) result in only a minor increase in the model’s predictions, indicating that weather has a negligible direct effect on classification outcomes. Similarly, the right plot shows the relationship between the distance to the target and model predictions, where the partial dependence fluctuates but remains within a narrow range, confirming its limited influence. The low impact of these features suggests that the model relies more heavily on factors like connectivity, GPS signal strength, and battery level, and further data collection or feature engineering may be required to increase their significance in decision-making.
The current study justifies the performance of the proposed model through the decision accuracy results taken together with the cosine similarity and BLEU values. In unison, these three measures represent the solidity, accuracy, and aptness of the proposed model for real-life IoD operations, enabling the effective operation of drones under changing and variable environments.

5. Conclusions

The combination of large language models with retrieval-augmented generation in the Internet of Drones takes autonomous decision-making for drones to the next level. This research illustrates how real-time telemetry data, mission reporting, and contextual knowledge lookup improve accuracy, efficiency, and flexibility while reducing the necessity for human intervention. The suggested system enhances command generation and real-time situational awareness, with greater responsiveness to environmental and mission dynamics.
One of the greatest strengths of this framework is that it retrieves pertinent information dynamically rather than depending on static knowledge alone. The RAG-based retrieval system yields contextually more appropriate decisions, as validated by an average cosine similarity of 0.87 and an average BLEU score of 0.82. The system also achieves an average command-generation latency of 120 milliseconds, rendering it highly suitable for real-time IoD operations in mission-critical applications such as search and rescue and factory inspection.
The system was tested under scenario-dependent stress conditions such as loss of connectivity, battery depletion, high winds, and loss of GPS signal, and it adapted by changing decision-making strategies in real time. Comparison with classical classifiers further indicates the superiority of the approach: the LLM-RAG model consistently outperforms Decision Trees, Neural Networks, and Random Forests by exploiting contextual awareness and real-time knowledge retrieval.
Despite these strengths, system performance depends on the quality of the pre-indexed knowledge base. Distributing retrieval and inference across multiple nodes is a promising direction for reducing response time and scaling to larger data volumes. Overall, this study lays a solid groundwork for LLM-based decision-making in IoD applications, with follow-on research targeting multimodal AI fusion and ethical safeguards to enable secure and explainable autonomous operations.

6. Future Research

Future research should focus on improving system scalability and flexibility and on integrating state-of-the-art learning techniques. Large-scale implementations will call for decentralized processing architectures based on edge computing, reducing dependence on central servers and improving reliability. Incorporating reinforcement learning could allow drones to optimize decision-making from real-world experience, improving efficiency over time.
Integrating multimodal AI techniques, e.g., fusing LiDAR and infrared sensor feeds with telemetry, would enable real-time object detection and terrain mapping and thus better situational awareness. Furthermore, extending this framework to critical infrastructure monitoring, e.g., industrial complexes and high-risk areas, would improve real-time anomaly detection and automatic responses to threats.
Ethical and regulatory considerations must also be addressed to ensure the fairness, transparency, and accountability of AI-based autonomous systems. Explainable AI techniques should be developed to provide more interpretable justifications of AI-generated decisions, which is essential for regulatory approval and public trust.
The future of IoD decision-making with LLMs lies in the seamless convergence of cutting-edge AI, distributed computing, and ethical oversight. Ongoing advances in scalability, real-time learning, and multimodal data fusion will continue to propel autonomous drone operations forward.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to confidentiality and security restrictions associated with drone operation data.

Acknowledgments

The author would like to thank Siemens A.S. for its support in the completion of this study. This study was supported by TUBITAK 1512 Grant, project no. 2220423.

Conflicts of Interest

The author was employed by Siemens Corporate Technology and declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM	Large Language Model
RAG	Retrieval-Augmented Generation
AI	Artificial Intelligence
ML	Machine Learning
UAV	Unmanned Aerial Vehicle
IoD	Internet of Drones
PDP	Partial Dependence Plot
GPS	Global Positioning System
BLEU	Bilingual Evaluation Understudy

References

  1. Sezgin, A.; Boyacı, A. Rising Threats: Privacy and Security Considerations in the IoD Landscape. J. Aeronaut. Space Technol. 2024, 17, 219–235. [Google Scholar]
  2. Alharasees, O.; Adalı, O.H.; Kale, U. Human Factors in the Age of Autonomous UAVs: Impact of Artificial Intelligence on Operator Performance and Safety. In Proceedings of the 2023 International Conference on Unmanned Aircraft Systems (ICUAS), Warsaw, Poland, 6–9 June 2023. [Google Scholar]
  3. Alharasees, O.; Kale, U. Human Factors and AI in UAV Systems: Enhancing Operational Efficiency Through AHP and Real-Time Physiological Monitoring. J. Intell. Robot. Syst. 2025, 111, 5. [Google Scholar] [CrossRef]
  4. Sezgin, A.; Boyacı, A. Securing the Skies: Exploring Privacy and Security Challenges in Internet of Drones. In Proceedings of the 2023 10th International Conference on Recent Advances in Air and Space Technologies (RAST), Istanbul, Turkiye, 7–9 June 2023. [Google Scholar]
  5. Aldossary, M.; Alzamil, I.; Almutairi, J. Enhanced Intrusion Detection in Drone Networks: A Cross-Layer Convolutional Attention Approach for Drone-to-Drone and Drone-to-Base Station Communications. Drones 2025, 9, 46. [Google Scholar] [CrossRef]
  6. Kim, Y.; Kim, D.; Choi, J.; Park, J.; Oh, N.; Park, D. A survey on integration of large language models with intelligent robots. Intell. Serv. Robot. 2024, 17, 1091–1107. [Google Scholar]
  7. Fan, H.; Liu, X.; Fuh, J.Y.H.; Lu, W.F.; Li, B. Embodied intelligence in manufacturing: Leveraging large language models for autonomous industrial robotics. J. Intell. Manuf. 2024, 36, 1141–1157. [Google Scholar] [CrossRef]
  8. Zhang, C.; Chen, J.; Li, J.; Peng, Y.; Mao, Z. Large language models for human–robot interaction: A review. Biomim. Intell. Robot. 2023, 3, 100131. [Google Scholar]
  9. Jang, D.; Cho, D.; Lee, W.; Ryu, S.; Jeong, B.; Hong, M.; Jung, M.; Kim, M.; Lee, M.; Lee, S.; et al. Unlocking Robotic Autonomy: A Survey on the Applications of Foundation Models. Int. J. Control Autom. Syst. 2024, 22, 2341–2394. [Google Scholar]
  10. Asuzu, K.; Singh, H.; Idrissi, M. Human–robot interaction through joint robot planning with large language models. Intell. Serv. Robot. 2025, 1–17. [Google Scholar] [CrossRef]
  11. Choi, S.; Kim, D.; Ahn, M.; Choi, D. Large language model based collaborative robot system for daily task assistance. JMST Adv. 2024, 6, 315–327. [Google Scholar] [CrossRef]
  12. Lykov, A.; Karaf, S.; Martynov, M.; Serpiva, V.; Fedoseev, A.; Konenkov, M.; Tsetserukou, D. FlockGPT: Guiding UAV Flocking with Linguistic Orchestration. In Proceedings of the 2024 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Bellevue, WA, USA, 21–25 October 2024. [Google Scholar]
  13. Savenko, I. Command interpretation for UAV using language models. In Proceedings of the 2024 IEEE 7th International Conference on Actual Problems of Unmanned Aerial Vehicles Development (APUAVD), Kyiv, Ukraine, 22–24 October 2024. [Google Scholar]
  14. Zahedifar, R.; Baghshah, M.S.; Taheri, A. LLM-controller: Dynamic robot control adaptation using large language models. Robot. Auton. Syst. 2025, 186, 104913. [Google Scholar]
  15. Singh, I.; Blukis, V.; Mousavian, A.; Goyal, A.; Xu, D.; Tremblay, J.; Fox, D.; Thomason, J.; Garg, A. ProgPrompt: Program generation for situated robot task planning using large language models. Auton. Robot. 2023, 47, 999–1012. [Google Scholar] [CrossRef]
  16. Ishimizu, Y.; Li, J.; Yamauchi, T.; Chen, S.; Cai, J.; Hirano, T.; Tei, K. Towards Efficient Discrete Controller Synthesis: Semantics-Aware Stepwise Policy Design via LLM. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Danang, Vietnam, 3–6 November 2024. [Google Scholar]
  17. Sun, S.; Li, C.; Zhao, Z.; Huang, H.; Xu, W. Leveraging large language models for comprehensive locomotion control in humanoid robots design. Biomim. Intell. Robot. 2024, 4, 100187. [Google Scholar]
  18. Gerstmayr, J.; Manzl, P.; Pieber, M. Multibody Models Generated from Natural Language. Multibody Syst. Dyn. 2024, 62, 249–271. [Google Scholar]
  19. Abdelmaboud, A. The Internet of Drones: Requirements, Taxonomy, Recent Advances, and Challenges of Research Trends. Sensors 2021, 21, 5718. [Google Scholar] [CrossRef]
  20. Aharon, U.; Dubin, R.; Dvir, A.; Hajaj, C. A classification-by-retrieval framework for few-shot anomaly detection to detect API injection. Comput. Secur. 2025, 150, 104249. [Google Scholar]
  21. Arshad, U.; Halim, Z. BlockLLM: A futuristic LLM-based decentralized vehicular network architecture for secure communications. Comput. Electr. Eng. 2025, 123, 110027. [Google Scholar] [CrossRef]
  22. Tlili, F.; Ayed, S.; Fourati, L.C. Advancing UAV security with artificial intelligence: A comprehensive survey of techniques and future directions. Internet Things 2024, 27, 101281. [Google Scholar]
  23. Ibrahum, A.D.M.; Hussain, M.; Hong, J. Deep learning adversarial attacks and defenses in autonomous vehicles: A systematic literature review from a safety perspective. Artif. Intell. Rev. 2025, 58, 28. [Google Scholar]
  24. Abualigah, L.; Diabat, A.; Gandomi, A.H. Applications, Deployments, and Integration of Internet of Drones (IoD): A Review. IEEE Sens. J. 2021, 21, 25532–25546. [Google Scholar] [CrossRef]
  25. Zhou, L.; Yin, H.; Zhao, H.; Wei, J.; Hu, D.; Leung, V.C.M. A Comprehensive Survey of Artificial Intelligence Applications in UAV-Enabled Wireless Networks. Digit. Commun. Netw. 2024, in press. [Google Scholar] [CrossRef]
  26. Zhou, J.; Yi, J.; Yang, Z.; Pu, H.; Li, X.; Luo, J.; Gao, L. A survey on vehicle–drone cooperative delivery operations optimization: Models, methods, and future research directions. Swarm Evol. Comput. 2025, 92, 101780. [Google Scholar] [CrossRef]
  27. Sabet, M.; Palanisamy, P.; Mishra, S. Scalable modular synthetic data generation for advancing aerial autonomy. Robot. Auton. Syst. 2023, 166, 104464. [Google Scholar] [CrossRef]
  28. Maheriya, K.; Rahevar, M.; Mewada, H.; Parmar, M.; Patel, A. Insights into aerial intelligence: Assessing CNN-based algorithms for human action recognition and object detection in diverse environments. Multimed. Tools Appl. 2024, 1–43. [Google Scholar] [CrossRef]
  29. Li, X.; Wang, S.; Zeng, S.; Wu, Y.; Yang, Y. A survey on LLM-based multi-agent systems: Workflow, infrastructure, and challenges. Vicinagearth 2024, 1, 9. [Google Scholar] [CrossRef]
  30. Luo, H.; Luo, J.; Vasilakos, A.V. BC4LLM: A perspective of trusted artificial intelligence when blockchain meets large language models. Neurocomputing 2024, 599, 128089. [Google Scholar] [CrossRef]
  31. Oliveira, F.; Costa, D.G.; Assis, F.; Silva, I. Internet of Intelligent Things: A convergence of embedded systems, edge computing and machine learning. Internet Things 2024, 26, 101153. [Google Scholar] [CrossRef]
  32. Fang, H.; Zhang, D.; Tan, C.; Yu, P.; Wang, Y.; Li, W. Large Language Model Enhanced Autonomous Agents for Proactive Fault-Tolerant Edge Networks. In Proceedings of the IEEE INFOCOM 2024-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Vancouver, BC, Canada, 20–23 May 2024. [Google Scholar]
  33. Poth, A.; Rrjolli, O.; Arcuri, A. Technology adoption performance evaluation applied to testing industrial REST APIs. Autom. Softw. Eng. 2025, 32, 5. [Google Scholar] [CrossRef]
  34. Balasundaram, A.; Aziz, A.B.A.; Gupta, A.; Shaik, A.; Kavitha, M.S. A fusion approach using GIS, green area detection, weather API and GPT for satellite image based fertile land discovery and crop suitability. Sci. Rep. 2024, 14, 16241. [Google Scholar] [CrossRef]
  35. Hemberg, E.; Moskal, S.; O’Reilly, U. Evolving code with a large language model. Genet. Program. Evolvable Mach. 2024, 25, 21. [Google Scholar] [CrossRef]
  36. Hong, J.; Ryu, S. Type-migrating C-to-Rust translation using a large language model. Empir. Softw. Eng. 2025, 30, 3. [Google Scholar] [CrossRef]
  37. Ma, Z.; An, S.; Xie, B.; Lin, Z. Compositional API Recommendation for Library-Oriented Code Generation. In Proceedings of the 2024 IEEE/ACM 32nd International Conference on Program Comprehension (ICPC), Lisbon, Portugal, 15–16 April 2024. [Google Scholar]
  38. Kotstein, S.; Decker, C. RESTBERTa: A Transformer-based question answering approach for semantic search in Web API documentation. Clust. Comput. 2024, 27, 4035–4061. [Google Scholar] [CrossRef]
  39. Zheng, X.; Wang, G.; Xu, G.; Yang, J.; Han, B.; Yu, J. A LLM-driven and motif-informed linearizing graph transformer for Web API recommendation. Appl. Soft Comput. 2025, 169, 112547. [Google Scholar] [CrossRef]
  40. Cao, C.; Wang, F.; Lindley, L.; Wang, Z. Managing Linux servers with LLM-based AI agents: An empirical evaluation with GPT4. Mach. Learn. Appl. 2024, 17, 100570. [Google Scholar] [CrossRef]
  41. Sauvola, J.; Tarkoma, S.; Klemettinen, M.; Riekki, J.; Doermann, D. Future of software development with generative AI. Autom. Softw. Eng. 2024, 31, 26. [Google Scholar] [CrossRef]
  42. Du, F.; Ma, X.; Yang, J.; Liu, Y.; Luo, C.; Wang, X.; Jiang, H.; Jing, X. A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot. J. Comput. Sci. Technol. 2024, 39, 542–566. [Google Scholar] [CrossRef]
Figure 1. Internet of drones decision flow.
Figure 2. Input content of proposed system.
Figure 3. Scenario-based accuracy comparison.
Figure 4. Feature importance.
Figure 5. Partial dependence plots for features: weather conditions and the distance to the target.
Table 1. Structure in the dataset.

| Field | Description |
|---|---|
| battery | Drone’s battery level (percentage) |
| distance_to_target | Distance of the drone to its target (in meters) |
| weather_condition | Weather condition (clear, rain, and fog) |
| connectivity | Connection status (strong, weak, and lost) |
| speed | Current speed of the drone (in m/s) |
| altitude | Altitude of the drone (in meters) |
| gps_signal_strength | GPS signal strength (range 0–1) |
| wind_speed | Wind speed (in m/s) |
| scenario | Operational scenario (e.g., “gps_signal_loss”, “battery_low”) |
| action | Action taken by the drone in response to the scenario (e.g., “switch_to_manual_navigation”) |
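For concreteness, a single record conforming to this schema might look as follows; the values are illustrative and not drawn from the study’s dataset.

```python
# Hypothetical telemetry log entry matching the Table 1 schema.
sample_log = {
    "battery": 12.5,               # percent remaining
    "distance_to_target": 430.0,   # meters
    "weather_condition": "fog",
    "connectivity": "weak",
    "speed": 7.2,                  # m/s
    "altitude": 85.0,              # meters
    "gps_signal_strength": 0.34,   # range 0-1
    "wind_speed": 9.8,             # m/s
    "scenario": "battery_low",
    "action": "return_to_base",
}
```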
Table 2. Evaluation of system-generated actions.

| Scenario | System-Generated Action | Cosine Similarity | BLEU Score |
|---|---|---|---|
| Battery failure detected | Return to base | 0.91 | 0.85 |
| Connectivity loss | Hover and reconnect | 0.88 | 0.80 |
| High wind conditions during flight | Adjust altitude and reduce speed | 0.86 | 0.83 |
| Obstacle detected mid-flight | Change flight path | 0.89 | 0.84 |
Table 3. Decision accuracy across different scenarios.

| Scenario | Total Actions | LLM Accuracy (%) | Decision Tree Accuracy (%) | Neural Network Accuracy (%) | Random Forests Accuracy (%) |
|---|---|---|---|---|---|
| Battery failure detected | 100 | 96.2 | 81.1 | 77.3 | 85.2 |
| Connectivity loss | 150 | 94.5 | 79.1 | 74.1 | 82.6 |
| High wind conditions during flight | 200 | 91.8 | 72.5 | 55.84 | 80.1 |
| Obstacle detected mid-flight | 120 | 95.0 | 82.8 | 76.3 | 82.8 |
| GPS signal loss | 135 | 92.3 | 78.6 | 70.9 | 83.4 |
| Wind induced connectivity issues | 185 | 93.6 | 80.2 | 72.4 | 81.7 |
| Maximum altitude exceeded | 140 | 90.4 | 74.6 | 69.5 | 79.8 |
| Low power and weak signal | 215 | 95.1 | 83.1 | 75.9 | 87.4 |
| Severe weather and GPS issues | 230 | 94.2 | 81.0 | 73.2 | 86.0 |
| Weak signal and GPS instability | 145 | 91.3 | 77.4 | 71.6 | 82.1 |
| Low altitude and weak GPS signal | 160 | 92.7 | 78.2 | 69.9 | 80.9 |
| Fog and connectivity loss | 200 | 94.8 | 82.0 | 74.1 | 85.7 |
| Normal operation | 180 | 96.0 | 84.5 | 78.0 | 88.2 |