1. Introduction
Applications of unmanned aerial vehicles have spread across various domains, including environmental monitoring, precision agriculture, logistics, infrastructure inspection, and public safety missions. As fleet sizes and mission complexities grow, effective management of such distributed systems has emerged as a significant concern. This necessity has propelled the creation of Internet of Drones platforms that facilitate centralized monitoring, coordination, and control of UAV missions by consolidating mission information, telemetry data, object detection logs, and status reports into unified systems.
In spite of such technological evolution, interaction with Internet of Drones systems remains largely technical in nature. End users typically must possess in-depth, system-specific application programming interface expertise, navigate extensive documentation, and manually craft complex queries to retrieve or interact with UAV-generated data. This technical barrier inhibits the broader adoption of IoD platforms, rendering them accessible only to technical specialists and excluding non-technical stakeholders who could significantly benefit from real-time UAV data.
In dynamic operational settings such as disaster relief, environmental monitoring, and military logistics, decision-makers must have real-time access to drone-provided information without the overhead of constructing technical queries. Large language models (LLMs) have shown tremendous promise in bridging this gap by enabling users to query complex data systems using natural language instructions. The ability to convert human conversation into structured data queries has therefore emerged as a critical requirement for future Internet of Drones systems.
Nonetheless, drone data ecosystems present unique challenges for information retrieval systems. UAVs generate continuous real-time telemetry data, sensor readings, and detection events, which require time-sensitive filtering and fast access structures. In addition, an IoD platform integrates heterogeneous multi-modal data types, including numeric telemetry streams, image-based detections, mission metadata, and environmental sensor recordings. Users may also need to issue elaborate, multi-part queries with several filters spanning multiple data tables. Moreover, UAV systems comprise different mission types and payloads, so different data schemas coexist within the same platform. Access control requirements further complicate query construction and data retrieval.
Traditional API-based systems attempt to address these needs using highly parameterized endpoints. Yet, constructing correct API requests demands significant familiarity with the platform’s structure, query parameters, and expected response formats, creating a critical usability bottleneck.
While large language model interfaces have seen notable progress in static relational database environments, their application to dynamic, real-time, multi-modal drone data remains underexplored. Existing systems often assume static and homogeneous data schemas, whereas IoD environments involve evolving, mission-specific, and context-sensitive information. Time-stamped and spatially contextual drone data require interpretation that accurately captures temporal, spatial, and event-based references.
Current solutions inadequately support UAV operational vocabularies, dynamic API orchestration for mission-specific queries, or retrieval-augmented architectures capable of efficiently mapping large language model inputs to structured API calls with minimal training data. Furthermore, they often fail to utilize the rich semantics encoded within API documentation (e.g., Swagger/OpenAPI specifications), missing an opportunity to systematically align user intents with available system functionalities.
In critical field operations such as disaster relief, wildfire response, or search-and-rescue missions, the ability to rapidly query UAV systems becomes a time-sensitive and potentially life-saving function. However, frontline operators often lack the technical training required to construct valid API calls or navigate complex documentation under stress. This creates a usability gap between those who control UAVs and those who rely on their data. Furthermore, latency in accessing mission telemetry or object detections—caused by manual query bottlenecks—can delay coordination, reduce situational awareness, and ultimately compromise mission outcomes. Therefore, empowering operators with natural language interfaces that eliminate technical barriers is not merely a convenience but a necessity for modern IoD deployments in field-critical applications.
To address these gaps, this paper proposes an architecture that combines a retrieval-augmented generation framework with structured API documentation to enable dynamic, large language model-driven access to IoD data. Our approach systematically extracts and annotates natural language query examples mapped to specific API endpoints and parameter configurations directly from the IoD platform’s API documentation. These annotated examples are embedded using semantic models to construct a vector-based retrieval database. Incoming user queries are matched against this database to find the most relevant examples, which are then refined through a lightweight language model to generate the appropriate API call.
This architecture minimizes the need for extensive fine-tuning, leverages existing documentation artifacts, and remains adaptable to evolving IoD system configurations. It enables end users to perform complex operational queries on UAV data using simple large language model inputs, enhancing usability, accessibility, and operational efficiency in IoD environments.
The contributions of this paper are as follows: (1) a semantic query interface is designed and implemented that allows users to interact with UAV telemetry, detection, and mission data using unconstrained natural language, eliminating the need for manual API reference or endpoint knowledge; (2) a retrieval-augmented generation pipeline is constructed by embedding 520 annotated query–endpoint–parameter triples using a transformer-based sentence encoder (all-MiniLM-L6-v2), and indexing them using Facebook AI Similarity Search (FAISS) for real-time similarity search. At query time, semantically close examples are retrieved and composed into structured few-shot prompts to contextualize the LLM’s inference; (3) the LLM (LLaMA 3.1 8B Instruct) interprets these prompts to produce schema-compliant API call structures, including endpoint resolution and parameter inference. The output is then subjected to strict validation against OpenAPI definitions to ensure structural correctness, type safety, and semantic alignment with the IoD backend; (4) the complete system is deployed in a microservice-based architecture, composed of containerized modules for embedding, retrieval, prompt construction, inference, validation, and API execution. This enables scalable deployment on IoD platforms, with support for runtime monitoring, error handling, and schema evolution.
The remainder of this paper is organized as follows. Section 2 reviews related work in large language model querying, API mapping, and drone data management. Section 3 details the proposed methodology, including dataset construction, system architecture, and implementation details. Section 4 presents the experimental setup and evaluation results. Section 5 provides the conclusion of this study, summarizing key findings and highlighting the significance of the proposed approach. Section 6 outlines directions for future research, focusing on opportunities to expand the system's capabilities, enhance scalability, and explore additional application domains.
Unlike our previous study [1], which emphasized autonomous decision-making based on telemetry logs, the current work introduces a natural language interface for interacting with IoD APIs using structured semantic queries. A detailed comparison of the two approaches is provided in Table 1.
3. Methodology
In this work, we propose an end-to-end system that integrates autonomous query understanding, schema-aware validation, and UAV-specific API interactions. The framework converts user-defined natural language queries into structured API calls that precisely fetch UAV-related information. The method involves three main stages: data preparation, retrieval-augmented inference, and system integration.
3.1. Data Preparation and Embedding Index Construction
To enable accurate query interpretation, we began by constructing a comprehensive dataset derived from the IoD platform’s API documentation. This dataset consisted of example queries, associated API endpoints, required parameters, and response schemas. For each API route, we generated multiple semantically diverse query variants that a user might naturally express. These were manually annotated with their corresponding endpoint and parameter mappings, resulting in a corpus of 520 annotated samples.
Each query was embedded using the all-MiniLM-L6-v2 sentence transformer, producing high-dimensional vectors that encode semantic similarity. These embeddings were stored in a FAISS vector index to support efficient similarity-based retrieval during runtime.
The dataset consists of 520 natural language queries, semi-automatically created and manually curated to reflect realistic operator requests from UAV mission logs, telemetry messages, and object detection reports. Query templates were crafted from commonly used access patterns in the IoD platform and then diversified further by paraphrasing and contextual enrichment. Every sample was annotated with the corresponding API endpoint, required parameters, and desired response format.
For indexing, we employed FAISS with a Hierarchical Navigable Small World (HNSW) index. The dimensionality of the vectors was 384, the dimensionality of the output of the all-MiniLM-L6-v2 model. The HNSW index parameters were configured as follows: M = 32 (maximum number of outgoing links), efConstruction = 100, and efSearch = 50. These parameters were chosen as a trade-off between retrieval accuracy and runtime performance in real-time inference settings. All embeddings were L2 normalized prior to indexing to guarantee metric consistency.
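For concreteness, the following sketch shows how such an index can be built with sentence-transformers and FAISS under the configuration described above; the corpus entries and the example endpoint are illustrative assumptions, not the platform's actual routes.

```python
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative corpus: 520 annotated (query, endpoint, parameters) samples.
corpus = [
    {"query": "how much battery does drone 5 have?",
     "endpoint": "/drones/{id}/battery", "parameters": {"id": 5}},
    # ... remaining annotated samples ...
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional output
embeddings = model.encode(
    [s["query"] for s in corpus],
    normalize_embeddings=True,                    # L2 normalization, as in Section 3.1
).astype("float32")

index = faiss.IndexHNSWFlat(384, 32)              # dimensionality 384, M = 32
index.hnsw.efConstruction = 100
index.hnsw.efSearch = 50
index.add(embeddings)

# Runtime retrieval: embed the user query and fetch the top-5 neighbors.
query_vec = model.encode(["list detections for mission 7"],
                         normalize_embeddings=True).astype("float32")
distances, ids = index.search(query_vec, 5)
examples = [corpus[i] for i in ids[0]]
```

Because all vectors are L2 normalized, nearest-neighbor ranking under the index's L2 metric coincides with cosine similarity.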
The dataset contains natural language requests and the corresponding API endpoints and parameters drawn from the IoD platform's structured documentation. Annotation was performed by a team of three domain experts with prior knowledge of UAV operations and API design. Annotators followed a predefined set of guidelines for interpreting user intent, matching it to API actions, and formatting the expected responses, ensuring consistency.
The original questions were generated from Swagger/OpenAPI v3.0 specifications by transforming formal text into its natural language counterpart. The base questions were then modified by systematic paraphrasing, tense changes, entity replacement, and query enrichment (e.g., “drone 5 battery level” was transformed to “how much battery does drone 5 have?”).
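As a small illustration of the entity replacement step, paraphrase templates can be expanded combinatorially before manual curation; the templates and entity names below are invented for illustration only.

```python
import itertools

# Hypothetical paraphrase templates and entity slots, in the spirit of
# the diversification procedure described above.
TEMPLATES = [
    "{entity} battery level",
    "how much battery does {entity} have?",
    "what is the current battery of {entity}?",
]
ENTITIES = ["drone 5", "UAV-12", "the survey drone"]

# Systematic entity replacement; outputs are then manually curated
# and annotated with endpoint and parameter mappings.
variants = [t.format(entity=e) for t, e in itertools.product(TEMPLATES, ENTITIES)]
```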
3.2. Retrieval-Augmented Inference Pipeline
At query time, a user submits a single natural language query to a unified /api/query endpoint. The input is embedded and matched against the pre-computed vector index to retrieve the top-k most similar examples. These examples are assembled into a few-shot prompt, which is forwarded to a lightweight large language model, specifically LLaMA 3.1 8B Instruct. The LLM infers the intended API endpoint and parameter configuration, producing a structured JSON object. The key components of a typical output are summarized in Table 2.
The system validates the LLM’s output against the API specification to ensure conformity with parameter types, required fields, and value formats. This guards against erroneous or hallucinated queries.
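A minimal sketch of this validation step is shown below, assuming a parsed OpenAPI v3.0 specification and a GET-style call produced by the LLM; the function and field names are ours, not the platform's.

```python
def validate_call(call: dict, spec: dict) -> tuple[bool, str]:
    """Check an LLM-proposed call against the OpenAPI specification."""
    path, params = call.get("endpoint"), call.get("parameters", {})
    if path not in spec.get("paths", {}):
        return False, f"unrecognized endpoint: {path}"
    declared = {p["name"]: p
                for p in spec["paths"][path].get("get", {}).get("parameters", [])}
    type_map = {"integer": int, "number": (int, float),
                "string": str, "boolean": bool}
    # Reject hallucinated parameters and enforce declared types.
    for name, value in params.items():
        if name not in declared:
            return False, f"unexpected parameter: {name}"
        expected = declared[name].get("schema", {}).get("type", "string")
        if not isinstance(value, type_map.get(expected, object)):
            return False, f"type mismatch for {name}: expected {expected}"
    # Enforce required fields.
    for p in declared.values():
        if p.get("required") and p["name"] not in params:
            return False, f"missing required parameter: {p['name']}"
    return True, "ok"
```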
Unlike supervised fine-tuning approaches, we expressly employed retrieval-augmented few-shot prompting solely at inference time. This architecture was selected to support flexibility, modularity, and low-latency performance in real-time IoD environments. Fine-tuning would require considerable computational overhead and repeated retraining to keep pace with API evolution, making it less suitable for rapidly changing deployment environments. In contrast, our few-shot approach enables the seamless addition of new examples by simply updating the FAISS index, without any alteration to the model itself. Fine-tuning can yield incremental task-specific improvements, but at the expense of our approach's emphasis on adaptability, maintainability, and lean deployment, which are paramount requirements for edge-oriented UAV systems.
3.3. System Design and Integration
The final API request, once validated, is sent to the IoD backend via Representational State Transfer (REST). The results are returned and wrapped in a top-level response that includes both interpreted query metadata and actual retrieved data. This allows for enhanced transparency and traceability in dynamic query execution.
The architecture comprises stateless components, including:
A unified gateway layer for handling incoming queries;
A dedicated service for embedding, retrieval, and vector-based search;
A large language model inference engine;
A schema-aware validator and dynamic API execution layer.
These components operate in coordination to process natural language queries. The gateway receives requests and forwards them to the retrieval service, which locates semantically similar examples from a pre-indexed vector database. These examples, along with the original query, are passed to the LLM engine to determine the correct API call. The schema validator ensures the generated API request adheres to platform specifications before it is executed. This pipeline ensures accurate, explainable, and secure access to complex UAV data through a single, dynamic interface.
All components communicate via REST and are containerized for scalable deployment. Swagger files are periodically parsed and transformed into structured schemas to update the retrieval corpus and validation layer.
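As a sketch of this refresh step, the periodic parser can walk the specification's paths and emit seed entries for the retrieval corpus and the validator; the file handling and entry layout here are illustrative assumptions.

```python
import json

def refresh_from_openapi(spec_file: str) -> list[dict]:
    """Parse a Swagger/OpenAPI v3.0 file into seed corpus entries."""
    with open(spec_file) as f:
        spec = json.load(f)
    entries = []
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            entries.append({
                # The operation summary serves as a seed question, later
                # paraphrased and enriched as described in Section 3.1.
                "query": op.get("summary", f"{method.upper()} {path}"),
                "endpoint": path,
                "method": method.upper(),
                "parameters": [p["name"] for p in op.get("parameters", [])],
            })
    return entries  # re-embedded and added to the FAISS index downstream
```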
To better illustrate the operational flow, the system architecture can be visualized across four main layers: user interaction, semantic retrieval, inference and validation, and execution and response. This layered structure is depicted in Figure 1.
The first stage, referred to as the user interaction layer, captures the entry point of the user query. When a user submits a natural language question, it is initially received by the gateway. The gateway acts as the centralized access point that routes incoming requests to the appropriate processing modules in the pipeline. It also handles tasks such as basic authentication, logging, and formatting, providing a uniform interface for the end user.
At the semantic retrieval layer, the user's query is converted to a vector representation using a transformer-based embedding model. This embedding encodes the semantic intent of the user's input, enabling meaningful comparison with past queries. The system then retrieves the top-k semantically most similar queries from the pre-computed dataset using the FAISS index. This retrieval grounds the user's request in previously annotated examples, enhancing contextual relevance.
The retrieved examples then enter the inference and validation layer. The retrieved instances and the original user query are combined to formulate a structured prompt. This is fed through the LLaMA 3.1 8B Instruct model, which infers the intended action and generates a candidate API call as a JSON message. The output is then validated against the API schema to verify parameter types, structural requirements, and conformance with registered endpoints.
Upon validation, the execution and response layer is invoked. The validated API call is executed against the IoD backend, which retrieves the requested information, such as mission records, telemetry logs, object detections, or live metrics. This information is then combined with the interpreted query metadata into a structured response. The final output is rendered to the user via the gateway, completing the interaction cycle with clear and traceable data.
For poorly formatted inputs, unforeseen query formats, or validation errors due to missing or incorrectly typed parameters, an error handling path is invoked to maintain system reliability and user confidence. This functionality guarantees that the user is presented with a correct, context-dependent error message specifying the issue—be it schema mismatch, missing parameters, or unrecognized endpoints. By trapping invalid outputs early in the processing pipeline, the system effectively blocks the transmission of defective or unwanted API calls to the backend, thereby ensuring both data integrity and system safety.
The error handling mechanism also logs erroneous queries for subsequent examination, facilitating ongoing improvements in both retrieval performance and prompt construction. Furthermore, alternative prompts can be employed in cases of significant failure to suggest restructured queries or direct users toward acceptable input formats. This degree of resilience contributes to the overall robustness of the system.
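For illustration, a structured error of this kind might take the following shape; all field names and the endpoint are hypothetical.

```python
# Hypothetical structured error returned when validation fails.
error_response = {
    "status": "rejected",
    "stage": "schema_validation",             # where the pipeline stopped
    "reason": "missing required parameter: mission_id",
    "interpreted_query": {
        "endpoint": "/api/detections",        # hypothetical route
        "parameters": {"object_type": "human"},
    },
    "suggestion": "Specify a mission, e.g. 'detections for mission 12'.",
    "logged": True,                           # stored for later corpus review
}
```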
Overall, this layered and modular architecture improves natural-language-to-API conversion while simultaneously supporting transparency, scalability, and real-time responsiveness, thus ensuring dependable performance and secure operation in mission-critical drone scenarios. The extensible, modular design also guarantees robustness and ease of maintenance, while retrieval-augmented generation enables the interpretation of sophisticated UAV queries in real time within operational settings.
4. Experimental Setup and Evaluation
The experimental setup and evaluation procedures for the proposed dynamic query interpretation system in the context of the Internet of Drones are described in this section. The testing environment, datasets, evaluation metrics, and analysis methodologies are detailed to assess the performance of the retrieval-augmented generation (RAG) architecture integrated with a large language model. Furthermore, comparative assessments across different system configurations are conducted to analyze the individual contributions of each component within the overall pipeline.
4.1. Experimental Setup
To simulate a realistic environment, we developed a testing framework aligned with the operational characteristics of an Internet of Drones system. The evaluation setup incorporated the platform’s API documentation, which defines available endpoints, request structures, and validation rules. This documentation was parsed and utilized to support structured prompt construction and ensure the accuracy of schema-based query validation.
The dataset used in this evaluation consists of 520 natural language query samples constructed from UAV telemetry, mission logs, and detection data originating from previously logged operations within an IoD platform. Each query was manually annotated with its corresponding API endpoint and structured parameters. To ensure diversity, the queries were generated in varying sentence structures, levels of specificity, and types of operations (e.g., detection queries, telemetry access, mission history, and anomaly reports).
For testing purposes, the dataset was split into 80% training examples used for retrieval corpus indexing and 20% held out as unseen queries. The model did not have direct access to test queries during inference; instead, it leveraged semantically similar examples retrieved via vector search to perform few-shot prompting.
We employed the all-MiniLM-L6-v2 model to embed each query into a 384-dimensional vector, which was then indexed using Facebook AI Similarity Search (FAISS) with an HNSW configuration to support fast top-k retrieval. At runtime, for each test query, the top five closest examples were retrieved and used to construct a prompt for the LLaMA 3.1 8B Instruct model.
The final evaluation pipeline begins by embedding the incoming user query into a vector space using the same transformer model that was used during index construction. This ensures consistency in semantic representation.
Next, the system retrieves the top-k most similar queries from the indexed training corpus. These examples serve as context to help the language model understand the intent and structure expected for similar inputs.
Once the relevant examples are identified, a structured prompt is constructed. This prompt includes both the retrieved examples and the current user query, formatted in a way that the LLM can parse and learn from the context. The structured prompt is then submitted to the LLM, which returns a predicted API call, including the target endpoint and associated parameters. After inference, the system parses the output and performs validation. The predicted endpoint and parameters are checked against the API schema to ensure conformity in data types, parameter formats, and required fields. If the output passes validation, the system simulates an API call using the predicted values. The result of this simulated call is compared against the annotated ground truth to evaluate correctness and system reliability.
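A condensed sketch of this evaluation loop is given below; it assumes a predict() function wrapping retrieval, prompting, and inference, and a validator like the one sketched in Section 3.2.

```python
def evaluate(test_set: list[dict], predict, spec: dict) -> dict:
    """Score predictions with endpoint accuracy (EA), parameter match
    rate (PMR), and schema conformity (SC) against annotated ground truth."""
    ea = pmr = sc = 0.0
    for sample in test_set:
        pred = predict(sample["query"])   # retrieval + few-shot prompt + LLM
        ea += pred["endpoint"] == sample["endpoint"]
        truth = sample["parameters"]
        hits = sum(pred["parameters"].get(k) == v for k, v in truth.items())
        pmr += hits / max(len(truth), 1)  # name-and-value match fraction
        valid, _ = validate_call(pred, spec)  # schema validation step
        sc += valid
    n = len(test_set)
    return {"EA": ea / n, "PMR": pmr / n, "SC": sc / n}
```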
The experimental environment was established with a simulated Internet of Drones dashboard interacting with a backend API server. A dedicated frontend interface allowed evaluators to submit natural language queries through a web form, which were fed into the retrieval and inference pipeline. The generated API requests were displayed in structured JSON format alongside the results of their execution, and the interface also provided feedback regarding schema adherence. This configuration supported a complete cycle of query submission, data gathering, and result presentation. An instance of this interface is shown in Figure 2.
4.2. Evaluation Metrics
To assess the performance of the system, we employ multiple evaluation metrics designed to capture different aspects of query interpretation quality and overall system robustness.
Endpoint accuracy (EA): Endpoint accuracy measures the proportion of queries for which the system correctly predicts the intended API endpoint. It quantifies the system's ability to map natural language queries to their corresponding API routes, as formalized in Equation (1).
Parameter match rate (PMR): Parameter match rate evaluates the extent to which the predicted parameters (both names and values) match ground truth annotations. This metric assesses the precision of the generated API call arguments, as described in Equation (2).
Schema conformity (SC): Schema conformity indicates the degree to which the LLM-generated API calls adhere to the predefined API schema, ensuring structural and type correctness. As shown in Equation (3), this metric reflects schema validation success.
Retrieval precision at top-k (RP@k): Retrieval precision at top-k measures the fraction of retrieved examples within the top-k candidates that are semantically relevant to the input query. This is expressed formally in Equation (4).
These formulations are adapted from standard evaluation methodologies for information retrieval and structured prediction tasks.
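Equations (1)-(4) are not reproduced in this excerpt; under the conventional definitions, they take the following form (a sketch assuming the standard formulations, with Q the set of test queries, ê(q) and e*(q) the predicted and annotated endpoints, P̂(q) and P*(q) the predicted and annotated parameter name-value pairs, ĉ(q) the generated call, and Top_k(q) and Rel(q) the retrieved and relevant example sets):

```latex
\begin{align*}
\mathrm{EA} &= \frac{\left|\{\, q \in Q : \hat{e}(q) = e^{*}(q) \,\}\right|}{|Q|} \tag{1} \\
\mathrm{PMR} &= \frac{1}{|Q|} \sum_{q \in Q}
  \frac{\left|\hat{P}(q) \cap P^{*}(q)\right|}{\left|P^{*}(q)\right|} \tag{2} \\
\mathrm{SC} &= \frac{\left|\{\, q \in Q : \mathrm{valid}\big(\hat{c}(q)\big) \,\}\right|}{|Q|} \tag{3} \\
\mathrm{RP@}k &= \frac{1}{|Q|} \sum_{q \in Q}
  \frac{\left|\mathrm{Top}_{k}(q) \cap \mathrm{Rel}(q)\right|}{k} \tag{4}
\end{align*}
```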
4.3. Results and Discussion
We evaluated our system using 104 held-out test queries. The following results summarize the average performance across the test set:
Endpoint accuracy (EA): 91.3%
Parameter match rate (PMR): 87.6%
Schema conformity (SC): 95.2%
Retrieval precision (RP@5): 89.7%
These results indicate that the system can correctly interpret a wide range of user queries and produce executable API calls with high reliability. As illustrated in Figure 3, the system achieves 91.3% endpoint accuracy, 87.6% parameter match rate, and 95.2% schema conformity. These results confirm that the proposed pipeline is capable of producing both semantically correct and structurally valid API queries, ensuring robust communication with the IoD backend.
The relatively high endpoint accuracy demonstrates the effectiveness of the prompt structure and retrieval examples. The parameter match rate was slightly lower, often because of ambiguous or underspecified user queries (e.g., missing time windows or vague object references). In such cases, the system either inserted defaults based on similar examples or flagged the issue via the validator. Although 91.3% endpoint accuracy is promising, it by no means eliminates the possibility of intermittent misinterpretation or breakdown in high-stakes situations. Hence, we regard this as a successful outcome for semi-automated functions, while recognizing that additional verification, especially in mission-critical environments, is necessary before claiming high-reliability deployment.
We observed that the inclusion of top-k examples in the prompt significantly enhanced the performance of the large language model. Without the retrieval of similar examples, the model’s endpoint accuracy dropped to 74.1%, and the hallucination rate nearly doubled, highlighting the critical role of semantic retrieval as a grounding mechanism in structured reasoning tasks.
To further evaluate the contribution of each architectural component, an ablation study was conducted by selectively disabling major subsystems and analyzing their impact on performance. The comparative results are summarized in Table 3.
The results confirm that both the retrieval component and the schema validator are essential for high performance and reliability.
Figure 4 provides a comparative view of system performance across different architectural configurations. The full system achieves the highest scores in all metrics, while disabling retrieval or schema validation leads to significant performance degradation. Notably, removing validation causes a sharp drop in schema conformity, indicating that LLMs alone are prone to producing structurally incorrect outputs without external constraints.
Another significant strength of our analysis is the high schema conformity rate of 95.2% achieved by the system across various types of queries. This metric indicates not only semantic equivalence between natural language intent and API structure but also structural validity confirmed against OpenAPI specifications. High schema compliance is essential for enhancing system reliability by reducing the number of erroneous or non-executable requests, thereby preventing API failures and guaranteeing type-safe parameter generation. In safety-critical UAV settings, where reliability and automation take precedence, this validation acts as a safeguard against inference errors and enables seamless integration into the operational backend.
4.4. Qualitative Evaluation Cases
One of the key goals of this system is to bridge the gap between natural language expressions and structured API calls. To evaluate the semantic precision and practical value of the proposed approach, we present a series of representative user queries and their corresponding interpretations by the LLM–RAG pipeline. These examples illustrate how well the system can infer intent, resolve ambiguous phrasing, and generate valid and executable API structures in diverse contexts.
The selected queries cover a range of operations frequently encountered in Internet of Drones (IoD) platforms, such as retrieving telemetry data, listing object detections, and filtering events by mission or time interval. Each case demonstrates how the system integrates retrieval-augmented context with prompt-based inference to derive API endpoint mappings and parameter configurations. The outputs shown were schema-validated and would be directly executable within the backend infrastructure.
Notably, the system demonstrates strong generalization for semantically similar inputs, even when user phrasing varies significantly. Even cases with vague intent or missing constraints still yield acceptable outputs, owing to the influence of relevant examples retrieved during the prompt construction phase.
Table 4 summarizes a selection of query-to-API transformations as observed in the evaluation phase.
These qualitative results provide insight into how the system generalizes across varied natural language queries and maintains structural consistency in its outputs. The presented queries differ in phrasing complexity, intent specificity, and temporal or spatial references. Despite this diversity, the system consistently produces semantically aligned and schema-valid API calls, demonstrating both linguistic flexibility and adherence to backend constraints. This confirms the strength of the retrieval-augmented prompt in guiding the large language model toward the correct endpoint and parameter configuration.
The examples further show the system’s capacity to handle implicit constraints and linguistic ambiguity. Queries involving relative time expressions (e.g., “yesterday afternoon” or “latest telemetry”) are accurately grounded using example-based inference and prior schema knowledge. The presence of diverse operators, such as filtering by region, altitude, or mission scope, also illustrates the model’s adaptability in multi-dimensional query interpretation. These cases validate the system’s robustness in real-world scenarios where user input may be underspecified, informal, or context dependent.
5. Conclusions
5.1. Overview of Findings
This paper proposes a retrieval-augmented generation system based on a large language model to dynamically interpret natural language queries for the Internet of Drones. The system allows users to query structured UAV telemetry and mission data using natural language input, lowering the technical barrier for data querying and command generation in sophisticated drone missions. A semantic interface facilitates natural language access to structured IoD data through the incorporation of a lightweight large language model and structured API documentation. Unlike our previous work, which dealt with decision-making logic mined from telemetry logs and mission scenarios, the present work focuses on real-time data accessibility and schema-compliant query generation. Through API documentation indexing and output structure verification, the system presented herein paves the way for intelligent and secure interaction with UAV platforms. This work thus extends the application of LLMs in autonomous systems from mission reliability to dynamic, natural data inquiry.
To demonstrate the real-world applicability of our system, we present a representative example drawn from post-disaster field deployments. Consider an earthquake relief operation in which drones are deployed to scan impacted zones and locate survivors through onboard thermal imaging cameras. A field operator without API expertise needs to fetch all human detections made by drones in "Zone 12" within the last hour. Instead of manually coding a formal API request, the operator simply states:
List all human sightings near Zone 12 in the past 60 min.
The system accepts this natural language input via the proposed semantic retrieval pipeline and translates it into a schema-compliant API call.
Table 5 presents the composition of the generated request, from the resolved endpoint to the required parameters. Such an immediate response allows command centers to allocate rescue resources effectively and without delays caused by technical constraints, demonstrating the platform's usability in time-critical environments.
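For illustration only, the generated call might take the following shape; the endpoint, parameter names, and timestamps are hypothetical, and the actual composition is given in Table 5.

```python
# Hypothetical schema-compliant call for the Zone 12 example.
generated_call = {
    "endpoint": "/api/detections",            # hypothetical route
    "method": "GET",
    "parameters": {
        "object_type": "human",               # "human sightings"
        "zone": "Zone 12",                    # spatial reference
        "since": "2024-05-14T09:00:00Z",      # resolved from "past 60 min"
        "until": "2024-05-14T10:00:00Z",      # illustrative timestamps
    },
}
```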
In addition to demonstrating schema-conformant generation, this research shows how structured retrieval and consistent prompt formatting can significantly mitigate hallucination risk in practical applications. Semantic variation in the example set was found to promote robustness to varied input expressions. Even without applying constrained decoding, schema validation proved an effective guard against malformed outputs.
We validated the efficacy of our method through extensive experimentation across a range of evaluation metrics. The system performed well in endpoint accuracy, schema validity, and parameter prediction accuracy, with a dramatic boost when utilizing semantically close retrieval examples. Comparative findings on various system settings verified the indispensable role of both the semantic retrieval layer and the schema validation component in producing correct and executable API responses.
Our results highlight the importance of augmenting lightweight transformer-based retrieval with guided prompting for enhancing structured prediction in LLMs. In addition, the modularity and containerization of the architecture provide maintainability and scalability in production environments.
As the variety of drones and missions increases, the RAG pipeline's efficiency can be improved by a semantically more diverse example corpus. Adding mission logs from various drone types (e.g., fixed-wing vs. multirotor) or task domains (e.g., mapping, surveillance, and delivery) would improve retrieval accuracy when input phrasing varies. Schema validation, however, depends on the accuracy and coverage of the OpenAPI specification. If unregistered endpoints are introduced during missions, validation failure acts as a defense mechanism rather than a system fault.
Though our system purposely avoids fine-tuning in favor of few-shot prompting and schema validation, future work can explore domain-specific LLM training for UAV systems. One direction is integrating time-series telemetry, object detection outputs, and spatial features into multi-modal training pipelines to improve contextual grounding. Further, synthetically created questions—extracted from actual mission scenarios or simulated logs—have the potential to reduce annotation effort while improving coverage.
Our work contributes to the development of intelligent, human-in-the-loop drone platforms by combining natural language interfaces and structured control with explainable and trustworthy AI techniques.
5.2. Limitations
Although the proposed system is effective at converting natural language queries into executable API calls, it has several limitations. First, the model depends heavily on retrieving semantically close examples to steer inference; where such examples do not exist or are mismatched, performance can degrade, particularly for out-of-distribution queries. Second, the framework presupposes a carefully designed API specification, so changes to endpoint definitions or parameter formats may require manual updates to prompt construction and the validation schema. Third, while schema validation helps reduce hallucinations during deployment by enforcing structural consistency and endpoint compliance, it does not fully eliminate the risk of hallucinated outputs. Finally, the present version accepts only English queries and presumes some background knowledge of the API ecosystem, restricting its use in general or multilingual settings.
6. Future Work
While the current system performs well in natural language query understanding and processing in IoD environments, several avenues remain for future research and development.
One promising direction is the integration of multi-modal data sources, e.g., visual data or environmental sensor data, into the query understanding process. This would enable users to pose queries that combine textual and visual data, making the system more expressive.
Another area of interest relates to temporal reasoning and contextual grounding. Enhancing the model’s capacity to decipher vague temporal expressions and preserve dialogue context within multi-turn conversations has the potential to greatly improve both usability and accuracy.
Continual learning techniques can also be investigated, where user feedback is leveraged to update the retrieval and inference models incrementally over time. This would facilitate system adaptability and personalization in evolving mission contexts.
Adding proactive features such as real-time notifications, anomaly detection, and automatic recommendation generation would turn the platform into more of a decision-support system. Furthermore, enabling the system to run efficiently in resource-scarce settings, such as edge-deployed UAVs, would extend its usability to the distributed, low-latency environments required for field operations.
In addition to the aforementioned directions, future work will also explore schema versioning strategies, incremental registration of newly introduced endpoints, and enabling multi-agent query orchestration for large-scale deployments. Moreover, adaptive prompting based on validation feedback and benchmark comparisons between fine-tuned and prompted models may offer insights into trade-offs regarding speed, memory, and quality, especially in edge-constrained UAV environments.