Article
Peer-Review Record

Exploring the Cognitive Capabilities of Large Language Models in Autonomous and Swarm Navigation Systems

Electronics 2026, 15(1), 35; https://doi.org/10.3390/electronics15010035
by Dawid Ewald 1,*, Filip Rogowski 2, Marek Suśniak 2, Patryk Bartkowiak 2 and Patryk Blumensztajn 2
Submission received: 6 November 2025 / Revised: 14 December 2025 / Accepted: 18 December 2025 / Published: 22 December 2025
(This article belongs to the Special Issue Data-Centric Artificial Intelligence: New Methods for Data Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript presents a prototype system in which a multimodal LLM (LLaVA-7B) acts as a cognitive controller for a mobile robot equipped with a camera and distance sensor. The study evaluates whether LLMs—without fine-tuning—can support real-time navigation, generate structured JSON control commands, and provide human-interpretable reasoning, with potential implications for future swarm robotics.

Overall, the paper is well-written, clearly structured, and experimentally thorough. It contributes an interesting demonstration of LLM-driven robotic control using a hybrid architecture. The methodology is described with good technical clarity, and the results show that LLaVA-7B can indeed provide consistent control behaviors with low latency in a semi-structured environment.

However, the work remains primarily a proof-of-concept, and several areas would benefit from deeper analysis, stronger positioning against existing literature, and more rigorous evaluation.

Weaknesses and Areas Needing Improvement are as follows:

-The study demonstrates integration of existing tools (Raspberry Pi, Ollama, LLaVA) rather than introducing a new algorithm, new architecture, or new cognitive model. The paper should clearly articulate what is scientifically new beyond the engineering integration.

-There is a lack of comparative baseline models. Only LLaVA-7B, LLaVA-standard, and Llama3 (text-only) are evaluated. Classical baselines (e.g., YOLO + rule-based controller, RL agent, CNN navigation model), smaller embodied vision models (e.g., MobileNet-V2 classifiers), and traditional robotics navigation frameworks should be considered for model comparison.

-Results (JSON validity, coherence, smoothness) are presented as single-point metrics. No standard deviations, repeated-trials analysis, or statistical significance tests are reported.

-The paper claims "emerging cognitive capabilities," but it does not analyze the model's error modes, failure cases, reasoning patterns, generalization across lighting conditions and viewing angles, or robustness to misinterpreted visual frames. Some example LLM reasoning outputs would strengthen the argument.

-Latency and scalability issues require deeper discussion. Latency is stated (<200 ms), but the breakdown of network latency vs. model inference, a scalability analysis for multi-robot swarms, and the throughput limits of the server-side model are not provided. Claims about future swarm behavior (Section 4.4) feel speculative without supporting experiments.

- There are some methodological gaps in the formal evaluation of navigation accuracy, task completion metrics, and energy consumption.

Author Response

We thank the reviewer for their encouraging comments regarding the clarity, structure, and experimental thoroughness of our work. We appreciate the recognition of our hybrid architecture and the demonstration of LLaVA-7B's potential in low-latency environments. We have taken the constructive criticism regarding baselines, statistical rigor, and error analysis seriously. Below, we detail how we have addressed each specific concern in the revised manuscript.

Comment 1: 

The study demonstrates integration of existing tools... rather than introducing a new algorithm... The paper should clearly articulate what is scientifically new beyond the engineering integration. 

Response: We agree that the individual components (LLaVA, Ollama, Raspberry Pi–based robot platform) are existing tools. However, we have revised the Introduction and Contribution sections to clarify that the scientific novelty lies in:

  1. The Zero-Shot Control Paradigm: Demonstrating that general-purpose VLMs can function as low-level robotic controllers without domain-specific fine-tuning or reinforcement learning, which challenges the prevailing assumption that navigation requires specialized models. 
  2. Prompt Engineering as Control Policy: We introduce a specific prompt structure that forces unstructured reasoning models to output strict, machine-parsable JSON control vectors, effectively turning natural language reasoning into kinematic actuation. We have highlighted these points in Section 1.1 of the revised manuscript. 
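For illustration, a minimal Python sketch of how such a prompt-as-policy can be assembled is given below; the JSON keys (m, s, t, d, r) follow the manuscript, while the wording, allowed values, and helper names are illustrative rather than the deployed prompt.

```python
# Illustrative sketch only: a fixed system prompt pins the output to a strict
# JSON schema, and a per-cycle user prompt injects the latest sensor context.
# The schema keys (m, s, t, d, r) follow the paper; everything else is assumed.

SYSTEM_PROMPT = (
    "You are the navigation controller of a small mobile robot. "
    "Reply with one short reasoning sentence followed by ONE JSON object with "
    'exactly these keys: "m" (direction), "s" (speed 0-100), "t" (turn angle in '
    'degrees), "d" (estimated free distance in metres), "r" (one-sentence reason). '
    "Output nothing after the JSON object."
)

def build_user_prompt(distance_m: float, history: list) -> str:
    """Compose the per-cycle prompt from the current range reading and recent decisions."""
    recent = "; ".join(history[-3:]) if history else "none"
    return (
        f"Ultrasonic distance ahead: {distance_m:.2f} m. "
        f"Recent decisions: {recent}. "
        "Decide the next motion command for the attached camera frame."
    )
```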

Comment 2: 

There is a lack of comparative baseline models... Classical baselines (e.g., YOLO + rule-based controller, RL agent...) should be considered for model comparison. 

Response: We acknowledge the importance of baselines. While training a full RL agent from scratch was outside the scope of this zero-shot study, we have expanded the Discussion section to include a qualitative comparison against classical pipelines (e.g., YOLO + logic). 

  • Clarification: We emphasize that while MobileNet- or YOLO-based approaches might be computationally lighter, they lack the semantic adaptability of our approach (e.g., understanding "move toward the object that looks like it might contain water" vs. simply "detect bottle"). 
  • Action: We have added a theoretical comparison table (Table X in the revised text) contrasting our VLM approach with standard CNN navigation regarding latency, adaptability, and computational cost. 

 

Comment 3: 

Results (JSON validity, coherence, smoothness) are presented as single-point metrics. No standard deviation, no repeated trials analysis... 

Response: 

This was an oversight in the initial submission. We have re-analyzed our experimental logs and updated Section 3 (Results). 

  • We now report the mean and standard deviation for latency and trajectory smoothness metrics. 
  • We have included confidence intervals for the JSON validity rates to provide a statistically robust view of the model's reliability over time. 
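For transparency, the added statistics can be recomputed from the per-trial logs along the following lines (a minimal sketch; the numbers shown are placeholders, not our measured values):

```python
# Sketch of the added statistics; the log values below are placeholders.
import math
import statistics

latencies_ms = [182, 191, 176, 204, 188]        # hypothetical per-cycle latencies
json_valid = [True] * 97 + [False] * 3          # hypothetical per-request validity flags

mean_lat = statistics.mean(latencies_ms)
std_lat = statistics.stdev(latencies_ms)

# 95% Wilson score interval for the JSON-validity rate
n = len(json_valid)
p = sum(json_valid) / n
z = 1.96
denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

print(f"latency: {mean_lat:.1f} +/- {std_lat:.1f} ms")
print(f"JSON validity: {p:.1%} (95% CI {center - half:.1%} to {center + half:.1%})")
```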

Comment 4: 

The paper claims "emerging cognitive capabilities," but it does not analyze error modes... Some example LLM reasoning outputs would strengthen the argument. 

Response: 

We have added a new subsection, "Qualitative Analysis of Reasoning," to address this excellent suggestion. 

  • Success vs. Failure: We now present direct excerpts of the reasoning field from the JSON output during success cases and failure cases (e.g., hallucinations where the model perceived an obstacle due to glare). 
  • Robustness: We added a discussion on how lighting variations affected the "Explanation" field, even when the control vector remained correct, providing insight into the model's internal alignment. 

Comment 5: 

Latency and scalability issues require deeper discussions... breakdown of network latency vs. model inference... 

Response: 

We have deepened the technical analysis in Section 4.2: 

  • Latency Decomposition: We added a breakdown of the total control-loop time into HTTP transmission time for the REST request/response that carries the image and context, and model inference time on the LLaVA-7B server. This makes explicit how much of the latency is due to communication versus computation. 
  • Scalability: We have revised the "Swarm" section to explicitly state it is a discussion of architectural potential rather than current capability. We calculated the theoretical throughput limits of our current inference server to show how many concurrent robots the system could support before latency degrades. 
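The throughput estimate itself reduces to a back-of-the-envelope calculation; the sketch below illustrates the reasoning with placeholder figures, not our measured values:

```python
# Capacity estimate for a single sequential-inference server; placeholder numbers.
def max_concurrent_robots(inference_s: float, queries_per_robot_hz: float) -> int:
    """Robots one server can serve before requests start to queue."""
    server_capacity_hz = 1.0 / inference_s            # inferences completed per second
    return int(server_capacity_hz // queries_per_robot_hz)

# e.g. ~150 ms per inference and one query per robot every 0.5 s:
print(max_concurrent_robots(inference_s=0.15, queries_per_robot_hz=2.0))  # -> 3
```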

 

Comment 6: 

There are some methodological gaps in the formal evaluation of navigation accuracy, task completion metrics, and energy consumption. 

Response: 

  • Task Completion: We have added a “success rate” metric to our experiments, defined as the proportion of autonomous runs completed without any collision or safety-stop event within the predefined time horizon. This better reflects navigation reliability in our exploratory setting. 
  • Energy: While we did not measure electrical consumption directly, we have added a note acknowledging that the high inference cost of 7B models presents an energy efficiency challenge compared to lightweight CNNs, marking this as a critical limitation for battery-operated deployment in the Future Work section. 

Reviewer 2 Report

Comments and Suggestions for Authors

This study develops a client-server system using a customized LLaVA-7B multimodal LLM as the cognitive controller for a Raspberry Pi 4-based mobile robot, enabling real-time, safe autonomous navigation in dynamic environments via prompt engineering and hybrid communication. Experimental results confirm the model’s ability to generate valid, context-aligned control commands, laying the groundwork for future LLM-driven swarm robotics with cooperative decision-making. Below are some areas that need further clarification:

  1. How does the hybrid communication architecture (REST API + TCP sockets) specifically balance throughput and low latency to support real-time robotic control?
  2. The study notes no full fine-tuning was done, but it references a “customized instance of LLaVA-7B” and prompt engineering. What specific customizations were applied to the model (e.g., adjustments to input/output layers, integration of sensor data preprocessing pipelines, or modification of context window handling) to ensure it reliably outputs structured JSON (with keys like “m” for direction, “s” for speed) and interprets multimodal inputs? Clarifying this would reveal whether lightweight model tweaks (not full fine-tuning) are sufficient for other robotic use cases.
  3. How would scaling the system to multi-robot swarm scenarios address potential communication overhead between edge robots and the central LLM server?
  4. The text-only Llama3 baseline has a 21.5% safety event rate, while the customized LLaVA-7B has 0%, a stark difference attributed to “lack of visual grounding.” But what specific limitations of text-only reasoning lead to unsafe commands? For example, did Llama3 frequently misjudge obstacle distances (relying on textual sensor data alone) or fail to adapt to dynamic changes (e.g., moving students)? Additionally, how did the on-board safety layer (ultrasonic sensor) interact with Llama3’s outputs? Were most safety events triggered by the sensor overriding Llama3’s incorrect “Forward” commands, or by delayed responses from the model? Answering this would clarify the unique value of multimodal input for robotic safety.

Author Response

We thank the reviewer for the insightful comments and for the opportunity to clarify and strengthen the manuscript. All points raised have been carefully considered, and the manuscript has been revised accordingly as described below. 

 

  1. How does the hybrid communication architecture (REST API + TCP sockets) specifically balance throughput and low latency to support real-time robotic control? 

Response 

The hybrid architecture separates perception and actuation into two channels. The REST API is used exclusively for high-throughput inference requests that contain the camera image, the distance reading, and a dynamically built prompt. Once the LLaVA-7B model returns a sentence and a JSON command, the validated decision is translated into a small symbolic control instruction and transmitted over a persistent TCP socket to the Raspberry Pi (e.g., DirForward, DirStop). 

This design allows us to use HTTP where larger payloads and robustness are needed (image + context) and a lightweight TCP stream for low-latency execution of motion commands on the local network. In practice, the full control loop (capture, transmit, infer, validate, actuate) maintains low end-to-end latency with typical LLaVA-7B inference times, which we observed to be sufficient for smooth, real-time navigation in the lecture-hall environment. 
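For clarity, the two-channel flow can be summarized in a simplified Python sketch. The REST payload follows the public Ollama /api/generate interface and the command strings (DirForward, DirStop) mirror the manuscript; host addresses, ports, and the per-call socket handling are illustrative (the deployed system keeps the TCP connection persistent).

```python
# Simplified two-channel sketch: REST carries the heavy perception payload,
# a small TCP message carries the resulting motion command. Addresses and
# ports are placeholders.
import base64
import json
import socket
import urllib.request

def query_vlm(image_path: str, prompt: str, host: str = "http://192.168.0.10:11434") -> str:
    """Send image + prompt to the LLaVA-7B server via the Ollama REST API."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    payload = json.dumps({
        "model": "llava:7b",
        "prompt": prompt,
        "images": [img_b64],
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["response"]

def send_command(cmd: str, robot_addr=("192.168.0.20", 5555)) -> None:
    """Push a compact symbolic command (e.g. "DirForward", "DirStop") to the robot."""
    with socket.create_connection(robot_addr, timeout=1) as sock:
        sock.sendall(cmd.encode())
```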

Manuscript change: A sentence clarifying this has been added to Section 2.5 of the revised manuscript. 

  2. The study notes no full fine-tuning was done, but it references a “customized instance of LLaVA-7B” and prompt engineering. What specific customizations were applied to the model (e.g., adjustments to input/output layers, integration of sensor data preprocessing pipelines, or modification of context window handling) to ensure it reliably outputs structured JSON (with keys like “m” for direction, “s” for speed) and interprets multimodal inputs? Clarifying this would reveal whether lightweight model tweaks (not full fine-tuning) are sufficient for other robotic use cases. 

Response 

We use the term “customized LLaVA-7B” to denote a deployment-level adaptation of the pretrained model rather than any modification of its weights or architecture. As stated in the manuscript, no full fine-tuning was performed. The base LLaVA-7B model is run via the Ollama runtime, and its behavior is shaped by: 

  1.  an extended system prompt that defines the robot’s role, the required JSON schema (m, s, t, d, r) and safety-related constraints; 
  2.  a dynamically constructed user prompt that injects the latest camera image, the current distance measurement and a short textual history of recent decisions; and 
  3.  decoding parameters tuned to favour consistent reasoning and syntactically valid JSON rather than generative diversity. 

A lightweight validation layer then extracts the JSON object from the model output and checks that all required keys and value ranges are present. If validation fails, a safe default command (stop) is issued. No input/output layers or attention mechanisms of LLaVA-7B were altered; all customisation happens at the prompt and inference-pipeline level. 
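A condensed sketch of this validation layer is shown below; the key set (m, s, t, d, r) and the safe-stop fallback follow the manuscript, while the extraction regex and the accepted value ranges are assumptions made for illustration.

```python
# Deployment-level validation sketch: extract the first JSON object from the
# model reply, check required keys and ranges, fall back to a safe stop.
import json
import re

SAFE_STOP = {"m": "stop", "s": 0, "t": 0, "d": 0.0, "r": "validation failed"}
ALLOWED_DIRECTIONS = {"forward", "left", "right", "stop"}   # assumed value set

def parse_command(model_output: str) -> dict:
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return SAFE_STOP
    try:
        cmd = json.loads(match.group(0))
        speed_ok = 0 <= float(cmd.get("s", -1)) <= 100
    except (ValueError, TypeError):
        return SAFE_STOP
    if not all(k in cmd for k in ("m", "s", "t", "d", "r")):
        return SAFE_STOP
    if cmd["m"] not in ALLOWED_DIRECTIONS or not speed_ok:
        return SAFE_STOP
    return cmd
```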

Manuscript change: A clarifying paragraph has been added to Section 2.3. 

  3. How would scaling the system to multi-robot swarm scenarios address potential communication overhead between edge robots and the central LLM server? 

Response 

In the current prototype the architecture is evaluated with a single robot, but the design is intended to be extensible to swarm scenarios. In such a setting, two mechanisms would be used to address communication overhead between edge robots and the central LLM server. 

First, inference requests can be made event-driven rather than periodic, so that each robot only uploads images and queries the LLM when local sensing indicates ambiguity or a decision point (e.g., intersections, occlusions). Second, as outlined in the Conclusions, a compact LLM deployed on each Raspberry Pi can handle short, safe micro-trajectories locally, while the central LLM is queried only for higher-level route selection or coordination among multiple robots. 

In combination, these strategies reduce redundant image transmission and limit the load on the central server, allowing a single LLM instance to coordinate multiple agents without saturating the network. 
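The event-driven gating described above can be expressed as a small decision rule; the thresholds below are illustrative placeholders rather than tuned values from the study.

```python
# Illustrative gating rule: query the central LLM only at likely decision points.
def should_query_llm(distance_m: float, distance_change_m: float,
                     cycles_since_last_query: int) -> bool:
    near_obstacle = distance_m < 0.6                 # approaching something (placeholder)
    scene_changed = abs(distance_change_m) > 0.3     # abrupt change in the range reading
    stale_decision = cycles_since_last_query > 30    # periodic refresh as a fallback
    return near_obstacle or scene_changed or stale_decision
```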

Manuscript change: A short paragraph has been added to the Future Work section. 

  4. The text-only Llama3 baseline has a 21.5% safety event rate, while the customized LLaVA-7B has 0%, a stark difference attributed to “lack of visual grounding.” But what specific limitations of text-only reasoning lead to unsafe commands? For example, did Llama3 frequently misjudge obstacle distances (relying on textual sensor data alone) or fail to adapt to dynamic changes (e.g., moving students)? Additionally, how did the on-board safety layer (ultrasonic sensor) interact with Llama3’s outputs? Were most safety events triggered by the sensor overriding Llama3’s incorrect “Forward” commands, or by delayed responses from the model? Answering this would clarify the unique value of multimodal input for robotic safety. 

Response 

The main limitation of the text-only Llama3 baseline is that its decisions are not grounded in the actual camera image, but only in symbolic state descriptions and scalar distance readings. In practice, this led to two recurrent issues. 

First, in a subset of control cycles the text-only Llama3 baseline suggested motion commands that would move the robot closer to nearby obstacles, even though the ultrasonic sensor already reported a short distance. Without access to the visual scene, the model sometimes inferred that space was still available, whereas the geometry was in fact constrained. In those cases the on-board safety layer overrode the command by issuing an immediate stop. 

 

Second, in the dynamic lecture-hall environment, the text-only model could not directly observe people or moving objects. As a result, its internal representation of the scene could lag behind changes in the environment, which again increased the reliance on the sensor-based stop mechanism. 

 

By contrast, the multimodal LLaVA-7B controller had direct access to the images captured by the front camera. Its reasoning was grounded in the actual layout of tables, chairs and people, which resulted in control commands that were consistently aligned with the real geometry. Consequently, during our trials the safety layer did not need to intervene for the LLaVA-7B runs, yielding a 0% safety-event rate as reported in Table 1. 
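For concreteness, the interaction between a model command and the on-board safety layer can be sketched as follows; the 0.25 m threshold matches the decision node in Figure 1, while the function and field names are illustrative.

```python
# On-board safety override sketch: the ultrasonic reading can veto a forward
# command with an immediate stop, which is the path that produced the safety
# events for the text-only baseline.
SAFE_DISTANCE_M = 0.25   # threshold from the "Safe Distance?" decision node

def apply_safety_layer(cmd: dict, distance_m: float) -> dict:
    if cmd.get("m") == "forward" and distance_m <= SAFE_DISTANCE_M:
        return {**cmd, "m": "stop", "s": 0, "r": "safety override: obstacle too close"}
    return cmd
```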

Manuscript change: A new explanatory paragraph has been added to Section 5.3 (Safety Evaluation). 

Reviewer 3 Report

Comments and Suggestions for Authors

There are some recommendations that I hope will assist the author in improving their paper.

1) Abstract

- In Line 14, please add at the beginning a short introduction sentence that familiarizes the reader with the context of autonomous vehicles.

- In Line 14, please add “T” before “his”. The correct word is “This”.

- Add in Line 28 a short sentence stating for whom the research is useful.

- Add the literature review in this section or in a new section.

2) 1. Introduction

- Please add the research questions after Line 67.

You should include a paragraph after Line 76 in which you can summarize the content of each section. For instance, Section 2 presents…….. Section 3 describes ……

3) 2.2. Robot Hardware Setup (Raspberry Pi 4)

- Please add a new column in Table 1 about the manufacturer, city, and country for each hardware component.

4) 2.3. Software Environment and Model Configuration

- Convert into a table (with two columns) the information stated between Lines 99 and 107.

5) 2.7. Experimental Procedure and Metrics

- Please add a workflow or a logical scheme on the steps between Lines 141-144 and a comment in the text concerning the flow.

- Please cite Figure 1 in the text.

- Remove highlighting from “Experimental” from the caption of Figure 1.

- Please cite Figure 2 in the text.

- Please comment on the information from Figure 2.

6) 3. Experimental Setup and Evaluation Procedure

- Number as an equation the formula between Lines 161 and 162.

7) 4. Results and Discussion

- “Results” and “Discussion” should be distinctive sections.

- Currently, this section is focused more on discussion rather than results. Please add more detail about the results of your research, or extend your research by analyzing how the experimental platform behaves across different environments.

8) 5. Conclusions

- Add the limitations and future work of the study.

- Mention to whom the research is useful.

10) Add the Abbreviation list according to the MDPI template https://www.mdpi.com/files/word-templates/electronics-template.dot

11) References

- Add more references by including the literature review section.

- Please remove or replace the papers that are preprints with the ones that are reviewed and published in journals and/or conference proceedings (e.g., [4]).

- Add DOI to the future references.

Author Response

We sincerely thank the Reviewer for the thorough and constructive feedback. Your comments significantly contributed to improving the clarity, structure, and completeness of the manuscript. We carefully addressed every point raised, and all suggested revisions have now been incorporated into the updated version of the paper. 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

All my comments and concerns have been adequately addressed by the authors. The revisions improved the clarity and quality of the manuscript, and I am satisfied with the current version. Therefore, I recommend that the paper be accepted in its present form.

Author Response

Thank you for sending your comments; they allowed us to improve the quality of the article.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors significantly improved their paper.

However, some issues must be solved:

1) Add the “Literature Review” or “Background” section.

There are only 13 sources in the references list.

2) 2.7. Experimental Procedure and Metrics

- In Figure 1, please use the standard shapes for a logical workflow (https://www.smartdraw.com/flowchart/flowchart-symbols.htm?srsltid=AfmBOorm681iBdmGMTA2ZnNkVPsu62yZzPkiEuMLmxqt-z-pd-Oufadj). Thus, use the diamond shape for decision in the “Valid JSON?” and “Safe Distance? (> 0.25m)” shapes.

- Revise Figure 1 so that all the options must end with the “stop”.

- Please revise once again the arrow that begins at “Start Loop” and goes directly to “Log Telemetry & Decision”; this arrow is not correct. What decides whether the flow goes to “Capture Image & Range Data” or to “Log Telemetry & Decision”? I think the arrow should begin at “Log Telemetry & Decision” and end at “Start Loop”.

3) References

- Add more references by including the literature review section.

- You must have at least 30-35 sources.

- Use the MDPI template https://www.mdpi.com/files/word-templates/electronics-template.dot

Author Response

Comment 1:

Thank you for this suggestion. We have expanded the manuscript by adding a dedicated literature review in the form of a Related Work section, complemented by an Extended Literature Review subsection. These sections provide a structured background covering classical autonomous navigation, multimodal vision–language–action models, safety mechanisms, and recent advances in LLM-driven robotics. The literature review now clearly situates our work within the current state of the art.

Comment 2: 

Thank you very much for your detailed and constructive comments regarding Figure 1. We carefully revised the diagram following your suggestions and the flowchart conventions referenced on the SmartDraw website.

First, in accordance with the standard flowchart symbols described at https://www.smartdraw.com/flowchart/flowchart-symbols.htm, we replaced the previous decision representations with diamond-shaped nodes for both “Valid JSON?” and “Safe Distance? (> 0.25 m)”, ensuring correct and consistent use of decision symbols.

Second, regarding the request that “all the options must end with the stop”, the figure was revised so that all possible execution paths (Yes/No branches, including error handling and emergency stop) now converge at a single node, “Log Telemetry & Decision”. This node represents the end of one control-loop iteration. Since the system operates as a continuous control loop rather than a terminating algorithm, this unified endpoint clearly indicates the completion of each iteration before the process restarts.

Finally, we fully agree with your observation concerning the arrow that previously originated from “Start Loop” and pointed directly to “Log Telemetry & Decision”. This arrow was ambiguous and could suggest an incorrect alternative execution path without a decision condition. In the revised figure, this arrow has been removed. The control flow now proceeds unambiguously from “Start Loop” to “Capture Image & Range Data” at the beginning of the process. All branches subsequently return from “Log Telemetry & Decision” to the start of the loop, accurately reflecting the system’s control logic.

We believe that the revised Figure 1 now fully complies with standard flowchart conventions and clearly represents the intended workflow logic. Thank you again for your helpful guidance.

Comment 3:

In response to this comment, we significantly expanded the reference list in parallel with the literature review. The manuscript now includes more references, covering recent surveys, foundational works, and state-of-the-art research on large language models, multimodal AI, embodied navigation, and swarm robotics. These additional references are integrated throughout the Related Work and Extended Literature Review sections to strengthen the contextual grounding of the study.

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

The authors addressed all my recommendations!

Congratulations and good luck!

The paper is accepted in the present form.
