Multimodal AI for UAV: Vision–Language Models in Human–Machine Collaboration
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This work is interesting, proposing a multimodal AI for UAVs based on vision–language models. My comments are given below:
(1) As mentioned by the authors, the UAV will be used in complex environments. However, this work did not consider external interference, such as intentional electromagnetic interference. As shown in the recent study "A systematic three-stage safety enhancement approach for motor drive and gimbal systems in unmanned aerial vehicles", image and sensor readings will be affected by external noise, which will definitely affect the inputs of your multimodal system. More concepts can be found in "A review of intentional electromagnetic interference in power electronics: conducted and radiated susceptibility". The authors may add a few sentences and references to discuss these points and highlight them as limitations or future studies.
(2) The method proposed by the authors only provides a framework; there is no complete mathematical derivation or convergence proof. Whether the algorithm can converge stably under different initial conditions, or how the components are guaranteed to work together, is not shown in detail. This makes the methodological foundation seem weak.
(3) This work lacks a detailed comparison with other benchmarks and SOTA solutions.
(4) The number of experiments is limited, and the sample size is insufficient to support statistically significant conclusions. Relevant indicators mostly rely on subjective human ratings and lack systematic user research and statistical significance verification.
(5) All experiments were conducted in a controlled laboratory environment with only a few static obstacles and limited complexity. This setting is quite different from real UAV application environments, which, for example, suffer from the noise mentioned in (1).
(6) The experimental parts concerning explanation confused me. What do you mean by "model explanation"?
(7) Could you estimate the thresholds of your proposed method in terms of distance and speed?
Author Response
Comment 1: As mentioned by the authors, the UAV will be used in complex environments. However, this work did not consider external interference, such as intentional electromagnetic interference. As shown in the recent study "A systematic three-stage safety enhancement approach for motor drive and gimbal systems in unmanned aerial vehicles", image and sensor readings will be affected by external noise, which will definitely affect the inputs of your multimodal system. More concepts can be found in "A review of intentional electromagnetic interference in power electronics: conducted and radiated susceptibility". The authors may add a few sentences and references to discuss these points and highlight them as limitations or future studies.
Response 1: We appreciate your feedback and thank you for pointing this out. We addressed this issue at the end of the Discussion section as a limitation and a future research direction. To address all comments, we highlighted all changes in the revised document in red.
Comment 2: The method proposed by the authors only provides a framework; there is no complete mathematical derivation or convergence proof. Whether the algorithm can converge stably under different initial conditions, or how the components are guaranteed to work together, is not shown in detail. This makes the methodological foundation seem weak.
Response 2: We thank the reviewer for this important observation. We would like to clarify that the contribution of our work is primarily architectural rather than algorithmic. The proposed system does not introduce a novel optimization algorithm requiring mathematical derivation or convergence proof; rather, it integrates existing AI components (e.g., MiDaS for depth estimation and GPT-4.1-nano for reasoning) into a reference architecture for human–machine collaboration. As such, stability is not derived mathematically but evaluated empirically through controlled UAV navigation experiments. We added this clarification at the end of Section 3.2.
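To make the architectural (rather than algorithmic) nature of this integration concrete, here is a minimal sketch of one perception–reasoning cycle, assuming the MiDaS torch.hub interface; the `query_gpt` helper and the depth-summary format are hypothetical, not the paper's actual implementation:

```python
import cv2
import torch

# Load the small MiDaS model and its matching input transform via
# torch.hub (the public interface of the intel-isl/MiDaS repository).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

def estimate_depth(frame_bgr):
    """Return a relative (inverse) depth map for one camera frame."""
    img = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        prediction = midas(transform(img))
    return prediction.squeeze().cpu().numpy()

def navigation_step(frame_bgr, query_gpt):
    """One perception-reasoning cycle: depth estimation, then VLM reasoning.

    `query_gpt` is a hypothetical callable wrapping the GPT-4.1-nano
    service; it returns a navigation command plus a natural-language
    explanation for the operator.
    """
    depth = estimate_depth(frame_bgr)
    summary = f"relative depth range: {depth.min():.2f} to {depth.max():.2f}"
    return query_gpt(image=frame_bgr, depth_summary=summary)
```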
Comment 3: This work lacks a detailed comparison with other benchmarks and SOTA solutions.
Response 3: Thank you for pointing this out. In the revised manuscript, we have strengthened the Related Work section by expanding the comparative analysis of prior benchmarks and state-of-the-art (SOTA) solutions. Specifically, we added Table 1, which systematically contrasts recent UAV vision–language navigation (VLN) studies (AerialVLN, GRaD-Nav++, UAV-VLN, OpenUAV) with our approach. The table highlights differences in methodology (simulation vs. real UAV deployment), evaluation metrics (navigation success vs. human-centric measures), and the degree of human-in-the-loop validation.
Comment 4: The number of experiments is limited, and the sample size is insufficient to support statistically significant conclusions. Relevant indicators mostly rely on subjective human ratings and lack systematic user research and statistical significance verification.
Response 4: We thank you for this valuable observation. We have revised the end of the Discussion section to explicitly acknowledge these limitations.
Comment 5: All experiments were conducted in a controlled laboratory environment with only a few static obstacles and limited complexity. This setting is quite different from real UAV application environments, which, for example, suffer from the noise mentioned in (1).
Response 5: Thank you for the comment. We fully acknowledge that our evaluation was carried out in a controlled laboratory environment with static obstacles and limited complexity. This setup was chosen intentionally to ensure experimental safety, reproducibility, and isolation of variables during early-stage testing of our architecture. We have revised the manuscript to make this limitation explicit (see the end of the Discussion section). While the results should therefore be interpreted as exploratory, they still provide meaningful proof-of-concept evidence for the feasibility and human-centric benefits of integrating vision–language models into UAV navigation. In future work, we plan to extend our evaluation to more realistic outdoor and industrial settings, where UAVs must operate under noise, dynamic obstacles, and environmental uncertainty.
Comment 6: The experimental parts concerning explanation confused me. What do you mean by "model explanation"?
Response 6: We thank the reviewer for pointing out this ambiguity. In the manuscript, "model explanation" refers not to a technical interpretation of the internal mechanics of GPT-4.1-nano or MiDaS, but to the natural-language justifications generated by the system to explain its navigation decisions to the user. These explanations were presented alongside navigation commands to increase transparency, support situational awareness, and foster user trust. We have revised the manuscript to clarify this terminology and avoid confusion; specifically, we replaced the term "model's explanation" with "the system-generated natural-language explanations of navigation decisions".
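For illustration, the kind of paired output this refers to might look as follows; the field names are hypothetical, not the paper's actual schema:

```python
# Hypothetical example of a system output that pairs a navigation command
# with its natural-language explanation (field names are illustrative).
system_output = {
    "command": "yaw_left_15_deg",
    "explanation": (
        "Turning left because the depth map indicates an obstacle "
        "roughly one meter ahead on the right side of the frame."
    ),
    "confidence": 0.82,  # self-reported confidence shown to the user
}
```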
Comment 7: Could you estimate the thresholds of your proposed method in terms of distance and speed?
Response 7: Thank you for the comment. We describe the system's distance and speed thresholds in the Discussion section.
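As a rough illustration of how such thresholds could gate commands in practice, consider the following sketch; the numeric values are placeholders, not the thresholds reported in the manuscript:

```python
# Illustrative safety gate; the constants are placeholder values,
# not the distance/speed thresholds described in the Discussion section.
MIN_CLEARANCE_M = 1.0   # hover if the nearest obstacle is closer than this
MAX_SPEED_MPS = 0.5     # cap forward speed during assisted navigation

def gate_command(command: str, nearest_obstacle_m: float, speed_mps: float) -> str:
    """Suppress or clamp a command that violates distance/speed limits."""
    if nearest_obstacle_m < MIN_CLEARANCE_M:
        return "hover"  # too close to an obstacle: hold position instead
    if speed_mps > MAX_SPEED_MPS:
        return f"{command} (speed clamped to {MAX_SPEED_MPS} m/s)"
    return command
```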
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Journal: Electronics (ISSN 2079-9292)
Manuscript ID: electronics-3840281
Title: Multimodal AI for UAV: Vision-Language Models in Human-Machine Collaboration
Authors: Maroš Krupáš *, Ľubomír Urblík, Iveta Zolotová *
This article presents an integration of vision–language models (VLMs) with UAV navigation to improve user trust, transparency, and explainability. After reviewing the manuscript, my specific comments are below:
- Abstract:
- Consider simplifying technical details while concentrating on research gaps, key contributions, and significant quantitative outcomes.
- Elaborate on the novelty: why this type of implementation, and how does it go beyond past work with UAVs/VLMs? Explain further.
- Include quantitative results (%) in the abstract.
- Introduction:
- Explain/clarify the problem statement and the limitations of previous work.
- The research questions could be framed in a more hypothesis-driven manner oriented toward the desired results.
- Justify your decision to prioritize UAV navigation over other HMC applications by providing a more compelling rationale.
- Related Work:
- The related work tends to emphasize summary rather than comparative analysis; add a comparative table highlighting how prior studies differ in methodology, evaluation metrics, etc.
- Explicate the gaps in integrating VLMs with UAVs in real-life, human-in-the-loop scenarios and link them to your work.
- The Proposed System:
- Block diagrams that depict data flow and latency-critical paths would be useful for certain modules, such as reasoning workflow and interface design.
- Clarify the rationale behind using GPT-4.1-nano and MiDaS (more justification), including consideration of alternative models in terms of computation time and accuracy.
- Explain the method used to measure latency and determine if network jitter or cloud availability had an impact on the results.
- It is possible to enhance the prompting process for GPT-4.1-nano by incorporating formatting guidelines and examples of few-shot prompts.
- What is the count of participants in the lab evaluation besides the 14 from the user study, and how were the roles assigned to the participants (operator vs. observer)?
- Findings:
- The correlation analysis based on GPT confidence and human ratings is interesting, but statistical significance tests should be reported with effect sizes for all major metrics (one way to report these is sketched after this list).
- Some inconsistencies were observed in model explanations; determine the frequency of such deviations/inconsistencies and propose mitigation strategies.
- When assessing usability, explain how environmental mapping or calibration was performed before each trial to minimize navigation misalignment.
- To determine performance metrics, it is necessary to define "successful run" criteria in the presence of partially completed goals or minor collisions.
- Based on the observed high-latency outliers, can response times be improved by either caching or batching image and depth data?
- Discussion: The discussion could be enhanced by explicitly linking each research question to the experimental findings and explaining how the results answer these questions.
- Conclusion: Future work suggestions should include exploring fully on-device multimodal reasoning to eliminate cloud latency and improve operational robustness.
- There are a few noticeable grammar and typographical issues that need to be addressed.
- References: Ref [33] is a bachelor’s thesis, which may not meet the preferred source standards for key technical claims in certain journals.
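One way the requested significance tests and effect sizes could be reported, as referenced above: a minimal sketch assuming paired per-trial arrays of GPT confidence scores and human ratings (the numbers are made up, not data from the manuscript):

```python
import numpy as np
from scipy import stats

# Illustrative paired samples: per-trial GPT confidence vs. human rating.
gpt_confidence = np.array([0.82, 0.91, 0.64, 0.77, 0.88, 0.70])
human_rating = np.array([4, 5, 3, 4, 5, 3])  # e.g., 1-5 Likert scores

# Spearman's rho serves as both the test statistic and an effect size
# for monotonic association; SciPy returns the two-sided p-value.
rho, p_value = stats.spearmanr(gpt_confidence, human_rating)
print(f"Spearman rho = {rho:.2f} (effect size), p = {p_value:.3f}")
```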
Author Response
Comment 1 (Abstract): Consider simplifying technical details while concentrating on research gaps, key contributions, and significant quantitative outcomes. Elaborate on the novelty: why this type of implementation, and how does it go beyond past work with UAVs/VLMs? Include quantitative results (%) in the abstract.
Response 1: Thank you for the comments. To address them, we highlighted all changes in the revised document in red. We have rewritten the entire abstract accordingly, reducing technical detail while focusing more on research gaps, key contributions, and quantitative outcomes.
Comment 2 (Introduction): Explain/clarify the problem statement and the limitations of previous work. The research questions could be framed in a more hypothesis-driven manner oriented toward the desired results. Justify the decision to prioritize UAV navigation over other HMC applications by providing a more compelling rationale.
Response 2: Thank you for this comment. The Introduction now clearly presents the limitations of prior work, frames the research questions as testable hypotheses, and better justifies the focus on UAV navigation (highlighted in red).
Comment 3 (Related Work): The related work tends to emphasize summary rather than comparative analysis; add a comparative table highlighting how prior studies differ in methodology, evaluation metrics, etc. Explicate the gaps in integrating VLMs with UAVs in real-life, human-in-the-loop scenarios and link them to your work.
Response 3: Thank you for pointing this out. In the revised manuscript, we have strengthened the Related Work section by expanding the comparative analysis of prior benchmarks and state-of-the-art (SOTA) solutions. Specifically, we added Table 1, which systematically contrasts recent UAV vision–language navigation (VLN) studies (AerialVLN, GRaD-Nav++, UAV-VLN, OpenUAV) with our approach. The table highlights differences in methodology (simulation vs. real UAV deployment), evaluation metrics (navigation success vs. human-centric measures), and the degree of human-in-the-loop validation.
Comment 4 (The Proposed System): Block diagrams that depict data flow and latency-critical paths would be useful for certain modules, such as the reasoning workflow and interface design. Clarify the rationale behind using GPT-4.1-nano and MiDaS (more justification), including consideration of alternative models in terms of computation time and accuracy. Explain the method used to measure latency and determine whether network jitter or cloud availability had an impact on the results. The prompting process for GPT-4.1-nano could be enhanced by incorporating formatting guidelines and examples of few-shot prompts. What is the count of participants in the lab evaluation besides the 14 from the user study, and how were the roles assigned to the participants (operator vs. observer)?
Response 4: We thank the reviewer for these suggestions. In the revised manuscript, we have provided a revised block diagram of the proposed system architecture (see Figure 2). The diagram illustrates the overall data flow between all modules. To highlight timing-sensitive dependencies, we explicitly distinguish between standard data flow (black arrows) and the latency-critical path (red arrows) that connects the core application with the GPT-based multimodal reasoning service, emphasizing the workflow bottlenecks most relevant for performance. We believe this visual representation improves the transparency of our architectural design and directly addresses the reviewer's request. We also added more justification for using the MiDaS and GPT-4.1-nano models, and explained the latency measurement and its impact on the results in more detail. To address the GPT-4.1-nano prompting process, we added it as a limitation and future work in the Discussion section. Additionally, in Section 4.2 we clarified that there were only 14 participants in the study besides the authors, who acted only as observers and not operators.
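A hedged sketch of what the suggested few-shot, format-constrained prompting could look like with the OpenAI chat API; the instruction text and example exchange are hypothetical, not the prompts used in the paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical few-shot prompt: the system message fixes the output format,
# and one worked example demonstrates it before the real query.
messages = [
    {"role": "system", "content": (
        "You are a UAV navigation assistant. Reply with exactly two lines: "
        "'COMMAND: <action>' and 'WHY: <one-sentence explanation>'."
    )},
    {"role": "user", "content": "Depth summary: obstacle 0.8 m ahead-left."},
    {"role": "assistant", "content": (
        "COMMAND: yaw_right_20_deg\n"
        "WHY: The nearest obstacle is ahead-left, so turning right keeps clearance."
    )},
    {"role": "user", "content": "Depth summary: path clear for 3 m ahead."},
]

response = client.chat.completions.create(model="gpt-4.1-nano", messages=messages)
print(response.choices[0].message.content)
```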
Comment 5 (Findings): The correlation analysis based on GPT confidence and human ratings is interesting, but statistical significance tests should be reported with effect sizes for all major metrics. Some inconsistencies were observed in model explanations; determine the frequency of such deviations/inconsistencies and propose mitigation strategies. When assessing usability, explain how environmental mapping or calibration was performed before each trial to minimize navigation misalignment. To determine performance metrics, it is necessary to define "successful run" criteria in the presence of partially completed goals or minor collisions. Based on the observed high-latency outliers, can response times be improved by either caching or batching image and depth data?
Response 5: We have incorporated the reviewer's suggestions into the revised manuscript. A discussion of caching and batching strategies for image and depth data has been added to address the observed high-latency outliers. The definition of a "successful run" has been refined to clearly distinguish between full success and failure cases within the evaluation, as described at the beginning of the usability evaluation section. In addition, the process used to initialize usability and the other tested metrics, specifically the calibration and trial-setup procedures performed before each run, is now described in detail at the end of the evaluation methodology section.
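For illustration, the kind of frame-level caching mentioned here could be as simple as hashing each frame and reusing its depth map on repeats; the `estimate_depth` callable is hypothetical:

```python
import hashlib

# Hypothetical cache: reuse a depth map when an identical frame arrives
# again, avoiding a redundant (and slow) depth-inference call.
_depth_cache: dict = {}

def cached_depth(frame_bytes: bytes, estimate_depth):
    """Return a cached depth map, computing it only on a cache miss.

    `estimate_depth` stands in for the expensive inference call.
    """
    key = hashlib.sha1(frame_bytes).hexdigest()  # cheap content hash
    if key not in _depth_cache:
        _depth_cache[key] = estimate_depth(frame_bytes)
    return _depth_cache[key]
```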
Comment 6 (Discussion): The discussion could be enhanced by explicitly linking each research question to the experimental findings and explaining how the results answer these questions.
Response 6: We thank the reviewer for this helpful suggestion. In the revised manuscript, we strengthened the Discussion section by explicitly connecting each research question (RQ1 and RQ2) to the experimental findings.
Comment 7 (Conclusion): Future work suggestions should include exploring fully on-device multimodal reasoning to eliminate cloud latency and improve operational robustness.
Response 7: We have added a paragraph at the end of Section 6 (Conclusion) that discusses the transition from a cloud-based system to on-device computing, along with what we consider to be the main challenge of this approach.
Comment 8: There are a few noticeable grammar and typographical issues that need to be addressed.
Response 8: Thank you for pointing out this issue. We have thoroughly checked the entire manuscript and fixed the grammatical errors.
Comment 9 (References): Ref. [33] is a bachelor's thesis, which may not meet the preferred source standards for key technical claims in certain journals.
Response 9: Thank you for pointing this out. We have decided to create new figures instead of citing the bachelor's thesis.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Thanks for the revision.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript has been improved for publication in Electronics.