Article

Exploring the Cognitive Capabilities of Large Language Models in Autonomous and Swarm Navigation Systems

Dawid Ewald, Filip Rogowski, Marek Suśniak, Patryk Bartkowiak and Patryk Blumensztajn
1 Department of Intelligent Systems, Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85-796 Bydgoszcz, Poland
2 M.Sc. Students Research Group, Faculty of Mathematics and Computer Science, Adam Mickiewicz University, 61-614 Poznań, Poland
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 35; https://doi.org/10.3390/electronics15010035
Submission received: 6 November 2025 / Revised: 14 December 2025 / Accepted: 18 December 2025 / Published: 22 December 2025
(This article belongs to the Special Issue Data-Centric Artificial Intelligence: New Methods for Data Processing)

Abstract

The rapid evolution of autonomous vehicles necessitates increasingly sophisticated cognitive capabilities to handle complex, unstructured environments. This study explores the cognitive potential of Large Language Models (LLMs) in autonomous navigation and swarm control systems, addressing the limitations of traditional rule-based approaches. The research investigates whether multimodal LLMs, specifically a customized version of LLaVA 7B (Large Language and Vision Assistant), can serve as a central decision-making unit for autonomous vehicles equipped with cameras and distance sensors. The developed prototype integrates a Raspberry Pi module for data acquisition and motor control with a main computational unit running the LLM via the Ollama platform. Communication between modules combines REST API for sensory data transfer and TCP sockets for real-time command exchange. Without fine-tuning, the system relies on advanced prompt engineering and context management to ensure consistent reasoning and structured JSON-based control outputs. Experimental results demonstrate that the model can interpret real-time visual and distance data to generate reliable driving commands and descriptive situational reasoning. These findings suggest that LLMs possess emerging cognitive abilities applicable to real-world robotic navigation and lay the groundwork for future swarm systems capable of cooperative exploration and decision-making in dynamic environments. These insights are particularly valuable for researchers in swarm robotics and developers of edge-AI systems seeking efficient, multimodal navigation solutions.

1. Introduction

1.1. Introduction to the Project

The utilization of artificial intelligence (AI) has become a dominant trend across all sectors of science and industry, revolutionizing how information is processed, data are analyzed, and decisions are made. In the domain of scientific research, AI significantly accelerates technological advancement by offering tools for automating laboratory workflows, creating advanced simulations of complex systems, and efficiently processing and analyzing vast datasets [1]. Particularly rapid progress has been observed in Large Language Models (LLMs) such as the GPT and T5 families, which—through their capacity to understand, generate, and process natural language—are redefining the boundaries of human–machine interaction and opening new avenues for innovative applications [2].
In the context of autonomous and swarm navigation systems, traditional approaches often rely on predefined algorithms and limited environmental perception, which can lead to suboptimal decision-making in complex or unpredictable scenarios. The key challenge lies in creating systems capable of open-world exploration and context-aware reasoning, where decisions are driven by high-level cognition rather than direct sensory readings alone [3,4,5,6].
Autonomous navigation has historically depended on modular pipelines that couple engineered perception with rule-based planning and control. Recent advances in multimodal Large Language Models (MLLMs) suggest an alternative paradigm in which a single foundation model integrates visual perception and linguistic reasoning to synthesize action proposals. This study examines that paradigm through the development of a vision-guided mobile robot, demonstrating how such models can be deployed effectively under real-time and computational constraints.
Our prototype system comprises a Raspberry Pi-based robot equipped with a front-facing camera and distance sensor, connected to a host workstation running a customized instance of LLaVA-7B (Large Language and Vision Assistant) via the Ollama runtime. The LLM functions as a central cognitive unit, processing each captured image and sensor reading to generate (i) a concise situational description and (ii) a structured JSON command defining motion, speed, turn angle, and duration. The Raspberry Pi performs data acquisition and motor actuation, while the host executes inference. Communication between these modules employs a hybrid architecture: REST API for telemetry transmission (Robot → Server) and persistent TCP sockets for real-time control (Server → Robot), balancing throughput, determinism, and implementation simplicity.
Rather than relying on resource-intensive fine-tuning, we apply advanced prompt engineering with a strengthened system prompt and strict output validation, complemented by a semantic safety layer on the robot that clamps or rejects unsafe commands. This design ensures both interpretability and operational safety when interacting with nondeterministic generative models.
The objectives of this research are threefold: (1) to evaluate whether a compact multimodal LLM can consistently translate real-world visual and sensor input into semantically valid control directives; (2) to analyze the engineering trade-offs of a client–server architecture for real-time robotic control loops; and (3) to establish safety principles and practical guardrails for LLM-driven decision systems in embodied robotics.
To systematically address these objectives, this study formulates the following research questions:
  • RQ1: Can a generic multimodal LLM generate semantically valid and safe navigation commands for a mobile robot without task-specific fine-tuning?
  • RQ2: What are the latency implications and stability trade-offs of a client–server architecture in the context of real-time linguistic control?
  • RQ3: How can nondeterministic model outputs be effectively constrained to ensure operational safety in physical environments?
Moreover, this framework provides the foundation for future swarm navigation systems, in which multiple AI-driven agents coordinate through shared multimodal understanding and distributed reasoning.
The remainder of this article is organized as follows. Section 2 details the system architecture, hardware specifications, and the prompt engineering strategy used to control the robot. Section 3 describes the experimental setup, including the environment configuration and evaluation metrics. Section 4 presents the results of the comparative analysis, covering both quantitative performance and scenario-based behavior. Section 5 discusses these results, the architectural trade-offs, and the limitations of the approach. Finally, Section 6 summarizes the findings and outlines future research directions.

1.2. Related Work

Classical autonomous navigation has traditionally relied on modular pipelines that couple perception, mapping and planning into a hand-engineered stack [7,8]. These approaches provide strong guarantees in structured environments, but require substantial effort to retune when the robot or the operational conditions change.
End-to-end deep learning and reinforcement learning methods have been proposed as alternatives that directly map sensor inputs to control signals [9]. While such agents can achieve impressive performance after extensive training, they often lack interpretability and require large-scale data collection in simulation before being transferred to the real world [10,11].
More recently, several works have explored the use of large pre-trained models for robotic decision-making. SayCan grounds language instructions in robotic affordances to sequence high-level tasks [12], while RT-2 uses vision–language–action models [13,14] to transfer web-scale knowledge to real-world manipulation and navigation [15]. LM-Nav demonstrates that large language and vision models can be leveraged for goal-directed navigation in previously unseen environments [4].
In parallel, the swarm robotics community has developed a rich body of work on the coordination of many simple agents [16]. Our study connects these lines of research by investigating whether a multimodal LLM can serve as a cognitive controller for an individual robot today, while outlining how similar principles could be extended towards future LLM-enabled swarm systems [17].

1.3. Extended Literature Review

While the previous section outlined the broad context, a deeper examination of recent advancements in Multimodal Large Language Models (MLLMs) and their safety mechanisms is necessary to frame the specific contributions of this study.

1.3.1. From Foundation Models to Edge Deployment

Early implementations of foundation models in robotics, such as PaLM-E [14] and RT-2 [15], demonstrated that LLMs could ground linguistic knowledge in physical affordances. However, these proprietary models typically require massive cloud infrastructure, introducing latency and privacy concerns that are often prohibitive for real-time mobile robotics [3]. To address this, the research community has pivoted toward open-source architectures like OpenVLA [13] and LLaVA [18], which offer comparable reasoning capabilities at a fraction of the parameter count.
Recent efforts have focused on shrinking these models further for edge deployment. Projects like MobileVLM [19] utilize quantization to run directly on devices like the Raspberry Pi. However, severe quantization often degrades the model’s ability to handle complex spatial reasoning. Our approach presents an alternative: a split-computing architecture that retains the full precision of a 7B-parameter model by offloading inference to a local server, thus balancing cognitive depth with hardware constraints.

1.3.2. Addressing Hallucination and Safety

A persistent challenge in generative AI is “hallucination,” where models produce plausible but physically unsafe outputs. A recent survey by Wang et al. [20] categorizes these risks in embodied navigation, highlighting that visual-language models often prioritize semantic consistency over geometric safety.
To mitigate this, techniques such as Chain-of-Thought (CoT) prompting have been shown to improve reliability by forcing the model to verbalize its reasoning steps [21]. However, prompting alone is insufficient. Yang et al. [22] recently proposed a “safety chip” architecture, where a modular, rule-based logic layer enforces constraints on the LLM’s stochastic output. Our work empirically validates this concept through a “semantic guard,” demonstrating that wrapping generative models in deterministic validation layers is essential for collision-free operation in human-populated environments [23].

1.3.3. Toward Language-Driven Swarms

Finally, this research bridges the gap between single-robot cognition and swarm intelligence. While traditional swarms rely on simple, local rules [16], the integration of LLMs offers a path toward “cognitive swarms” capable of heterogeneous data processing [17]. By validating the capabilities of a single LLM-driven agent, this study provides the fundamental building block for future multi-agent systems that coordinate via natural language.

2. Materials and Methods

2.1. System Overview

We evaluate a client–server architecture that employs a multimodal Large Language Model (LLM) as a cognitive controller for a vision-guided mobile robot. The system architecture, illustrated in Figure 1, is divided into two main computational layers. The edge unit is a Raspberry Pi 4 that acquires images and range data and actuates motors with local safety supervision. The main server runs a customized instance of LLaVA–7B via the Ollama runtime.
As shown in the diagram, communication relies on a hybrid protocol: REST API (Robot → Server) is used for high-bandwidth telemetry and image transfer, whereas a persistent TCP socket (Server → Robot) ensures low-latency transmission of control commands. This separation allows the robot to offload heavy inference tasks while maintaining a responsive control loop.

2.2. Robot Hardware Setup (Raspberry Pi 4)

The experimental platform, shown in Figure 2, is built on a 4WD chassis with differential drive. The main on-board controller is a Raspberry Pi 4 Model B (4 GB RAM); further details are given in Table 1. The key modules are listed below, followed by a minimal range-reading sketch:
  • Five-megapixel Raspberry Pi Camera on a 180° metal servo (active viewpoint control).
  • HC-SR04 ultrasonic sensor (proximity and emergency stop).
  • 4WD motor driver shield for PWM speed/steering, four 4.5 V 200 rpm DC motors.
  • 0.96 in OLED display and 8 × 16 LED matrix (diagnostics/status).
  • Dual 18650 battery pack with DC–DC regulation (separate rails for logic and motors).
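Of these modules, the HC-SR04 provides the safety-critical range measurement. The following is a minimal reading sketch in Python, assuming the RPi.GPIO library and hypothetical trigger/echo pin assignments; the actual wiring and driver code used in the study may differ.

import time
import RPi.GPIO as GPIO

TRIG_PIN, ECHO_PIN = 23, 24            # assumed BCM pin numbers (hypothetical)
GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG_PIN, GPIO.OUT)
GPIO.setup(ECHO_PIN, GPIO.IN)

def read_distance_m() -> float:
    # Send the 10 microsecond trigger pulse required by the HC-SR04.
    GPIO.output(TRIG_PIN, True)
    time.sleep(10e-6)
    GPIO.output(TRIG_PIN, False)
    # Measure how long the echo line stays high and convert to metres.
    start = stop = time.time()
    while GPIO.input(ECHO_PIN) == 0:
        start = time.time()
    while GPIO.input(ECHO_PIN) == 1:
        stop = time.time()
    return (stop - start) * 343.0 / 2.0   # speed of sound ~343 m/s, round trip halved

if read_distance_m() < 0.25:               # safety threshold used throughout the paper
    print("Emergency stop")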

2.3. Software Environment and Model Configuration

The reasoning engine is a customized LLaVA-7B (LLaVA-custom) executed with ollama run on the server. No full fine-tuning was performed; instead, we used prompt engineering and context control to ensure consistent outputs. The model configuration is provided in Table 2.
Two other model categories were evaluated for comparison purposes. The text-only Llama3 baseline provided contextual reasoning capabilities but lacked the necessary vision input for direct obstacle detection. Additionally, full-size variants of both LLaVA and Llama3 offered improved reasoning performance; however, they were ultimately discarded due to excessive memory usage and latency incompatible with the real-time constraints of the edge platform.
Given the satisfactory zero-shot performance of LLaVA-7B, no explicit fine-tuning (FT) was performed. Instead, prompt engineering (PE) and context management were applied to ensure consistent reasoning and reliably structured JSON output (Listing 1).
The term ‘LLaVA-custom’ therefore refers to prompt-level adaptations and inference-time decoding constraints; no weights, layers, input/output heads, or attention mechanisms were modified. Reliability of the JSON output is enforced through an extended system prompt, a dynamic prompt containing sensor readings and recent decisions, and a validation layer that extracts and verifies the JSON object (Listing 1).
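To make the inference step concrete, a minimal server-side call to the Ollama HTTP endpoint might look as follows. The decoding options mirror Table 2; the host address, timeout, and prompt contents are placeholders rather than the exact configuration used in the study.

import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # default Ollama endpoint

def query_llava(image_bytes: bytes, system_prompt: str, user_prompt: str) -> str:
    """Send one camera frame plus prompts to llava:7b and return the raw model reply."""
    payload = {
        "model": "llava:7b",
        "system": system_prompt,
        "prompt": user_prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {"temperature": 0.2, "top_p": 0.9, "num_ctx": 4096},  # settings from Table 2
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=30)
    response.raise_for_status()
    # The reply contains one descriptive sentence followed by the JSON command;
    # the JSON object is extracted and validated downstream (Section 2.6).
    return response.json()["response"]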

2.4. Prompt Engineering

The model is instructed to act as an autonomous navigator. A shortened system prompt is shown below; the user prompt is composed dynamically with the latest image, range measurement and previous command/history [21,24].
Listing 1. System prompt (excerpt) and required JSON keys.
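The excerpt below is an illustrative reconstruction of such a system prompt, based only on the keys, ranges, and safety rules described in the text; the exact wording used in the experiments may differ.

# Illustrative reconstruction of the system prompt (wording is an assumption).
SYSTEM_PROMPT = """You are the navigation controller of a small indoor mobile robot.
Each turn you receive the latest camera image, the ultrasonic range in metres,
and the previously executed command. Reply with exactly one sentence describing
the scene, followed by a single JSON object with the keys:
  "m": direction, one of "F", "B", "L", "R", "S"
  "s": speed in percent (0-100)
  "t": turn angle in degrees (0-360)
  "d": duration in seconds (0-4)
  "r": a short textual justification of the decision
Never command forward motion when the reported range is below 0.25 m.
Output nothing besides the sentence and the JSON object."""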

2.5. Communication Protocol

REST API is used exclusively for high-throughput inference requests containing the camera image, distance data, and contextual prompt, while all low-level motion actions are executed through a persistent TCP socket. This separation allows the system to balance throughput (via HTTP) with low-latency command execution (via TCP), ensuring real-time responsiveness even under variable model-inference delays.
  • Robot → Server (REST, port 5053).
Telemetry with the image and context is posted as multipart/form-data. The server replies with a validated decision; an example response is shown in Listing 2.
Listing 2. Example decision JSON returned by the server.
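A representative decision object of the kind described in the text might look as follows; the values are illustrative and not taken from the experimental logs.

{
  "m": "R",
  "s": 40,
  "t": 15,
  "d": 1,
  "r": "Chair leg detected on the left; adjusting trajectory 15 degrees right."
}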
  • Server → Robot (TCP socket).
A persistent connection streams control messages (e.g., DirForward, DirStop, CamLeft, TakePhoto) together with optional history and prompt hints. The robot returns acknowledgements (ACKs), and periodic heartbeats monitor link liveness.
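A minimal robot-side sketch of this control channel is shown below; the server address, port number, newline framing, and ACK format are assumptions, while the command names follow the examples above.

import socket

SERVER_ADDR = ("192.168.1.10", 5054)   # assumed server IP and control port

def command_loop(execute):
    """Receive newline-delimited commands, acknowledge each one, and dispatch it."""
    with socket.create_connection(SERVER_ADDR) as sock:
        buffer = b""
        while True:
            chunk = sock.recv(1024)
            if not chunk:               # connection lost: upstream dead-man logic stops the motors
                break
            buffer += chunk
            while b"\n" in buffer:
                line, buffer = buffer.split(b"\n", 1)
                command = line.decode().strip()   # e.g. "DirForward", "CamLeft", "TakePhoto"
                sock.sendall(f"ACK {command}\n".encode())
                execute(command)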

2.6. Safety and Validation

Every model output is schema-validated; the on-board semantic guard clamps s ∈ [0, 100], t ∈ [0, 360], and d ∈ [0, 4], and rejects any unknown direction m ∉ {F, B, L, R, S}. Range thresholds enforce an immediate Stop when the measured range falls below 0.25 m.
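A sketch of this guard, using the field names introduced in Section 3.1, is given below; it is a minimal illustration of the clamping and rejection rules rather than the exact production code.

ALLOWED_DIRECTIONS = {"F", "B", "L", "R", "S"}
STOP = {"m": "S", "s": 0, "t": 0, "d": 0, "r": "semantic guard override"}

def semantic_guard(cmd: dict, range_m: float) -> dict:
    """Clamp numeric fields, reject unknown directions, and force Stop when too close."""
    if cmd.get("m") not in ALLOWED_DIRECTIONS:
        return dict(STOP, r="rejected: unknown direction")
    if range_m < 0.25:
        return dict(STOP, r="safety stop: obstacle within 0.25 m")
    safe = dict(cmd)
    safe["s"] = max(0.0, min(100.0, float(cmd.get("s", 0))))
    safe["t"] = max(0.0, min(360.0, float(cmd.get("t", 0))))
    safe["d"] = max(0.0, min(4.0, float(cmd.get("d", 0))))
    return safe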

2.7. Experimental Procedure and Metrics

Trials were conducted in an indoor lecture-hall environment, described in detail in Section 3.4. The autonomous navigation process follows the logical workflow depicted in Figure 3.
Each control cycle begins with data acquisition, where the Raspberry Pi captures the current camera frame and ultrasonic sensor reading. These data are transmitted to the server via REST API. The LLaVA model performs inference to generate a textual description and a JSON control vector. Before execution, the command undergoes strict validation on the robot (checking syntax and value ranges). Finally, the command is executed by the motor drivers, unless the local safety layer (ultrasonic check) overrides it to prevent a collision.
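A compact robot-side rendition of this cycle, assuming a hypothetical /infer endpoint on port 5053, an assumed response field name, and the guard function sketched in Section 2.6, is outlined below; helpers for capture, range reading, actuation, and logging are left abstract.

import requests

INFER_URL = "http://192.168.1.10:5053/infer"   # assumed endpoint path on the inference server

def control_cycle(capture_frame, read_range_m, semantic_guard, execute, log):
    frame_jpeg = capture_frame()                              # 1. data acquisition
    range_m = read_range_m()
    reply = requests.post(                                    # 2. REST transfer to the server
        INFER_URL,
        files={"image": ("frame.jpg", frame_jpeg, "image/jpeg")},
        data={"range_m": range_m},
        timeout=10,
    ).json()                                                  # 3. LLaVA inference happens server-side
    command = semantic_guard(reply["command"], range_m)       # 4. validation and clamping ("command" field is assumed)
    execute(command)                                          # 5. actuation; local safety layer may still override
    log(frame_jpeg, reply, command)                           # 6. logging for later analysis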
We record the following metrics for each run: (i) end-to-end latency, (ii) JSON validity rate, (iii) safety events (collisions/near-misses), and (iv) qualitative reasoning consistency (alignment of “r” with the scene).

3. Experimental Setup and Evaluation Procedure

The primary objective of the experimental campaign was to evaluate the effectiveness of Large Language Models (LLMs)—particularly multimodal variants—in controlling autonomous navigation of a mobile robot based on real-time visual and sensor inputs. The experiment compared several model configurations within an identical robotic environment, assessing their ability to interpret camera imagery, recognize environmental elements, and generate appropriate movement commands under latency and safety constraints.

3.1. Tested Models

Three model configurations were evaluated under identical conditions:
  • LLaVA:7B (LLaVA-custom)—multimodal vision–language model deployed via the Ollama runtime, capable of processing both text and image inputs [18,25].
  • LLaVA (standard 7B)—reference model with baseline prompt and system configuration [26].
  • Llama3 (text-only)—baseline for reasoning quality without visual input [18,27].
Each model was prompted to act as a navigation controller and to produce output in the form of one descriptive sentence followed by a valid JSON object encoding motion parameters:
Command = {m, s, t, d, r},
where m ∈ {F, B, L, R, S} denotes the direction (Forward, Backward, Left, Right, Stop), s represents the speed (%), t is the turn angle (°), d denotes the duration (s), and r provides a textual justification of the decision.

3.2. Experimental Procedure

Each experimental trial followed an identical closed-loop sequence illustrated below:
  • Image acquisition: The Raspberry Pi 4 captured a frame from the front camera and the current distance from the ultrasonic sensor.
  • Feature extraction: Lightweight edge-detection and object-localization routines identified elements such as walls, openings, or obstacles. The detected features (e.g., “obstacle front-left”) were encoded as text tokens and included in the model prompt.
  • Inference via LLM: The chosen model (LLaVA or Llama3) received the symbolic description and, where applicable, the raw image. It generated a scene description and a JSON-formatted command defining motion parameters.
  • Command validation and execution: The Raspberry Pi validated the JSON output against a schema and clamped values (s ∈ [0, 100], t ∈ [0, 360], d ≤ 4 s). Safe commands were then executed.
  • Safety override: The ultrasonic distance sensor acted as a final safety layer; if the measured distance dropped below 0.25 m, an emergency “Stop” command was triggered.
  • Logging and feedback: Each step—image, JSON command, reasoning text, and execution result—was logged for quantitative and qualitative analysis.

3.3. Evaluation Metrics

For each tested model, we measured:
  • Inference latency (ms)—time between image capture and command execution.
  • JSON validity rate (%)—share of syntactically correct control outputs.
  • Decision coherence (%)—proportion of textual reasonings consistent with visual context.
  • Collision avoidance rate (%)—fraction of cycles completed without triggering the safety override.
  • Motion smoothness (%)—ratio of planned to corrective (stop/reverse) actions.
  • Success rate (%)—proportion of autonomous runs that completed the predefined exploration horizon without any collision or safety-stop event. In this exploratory setting, this task-level measure complements the step-wise collision avoidance and smoothness metrics.
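These per-cycle metrics can be aggregated directly from the run logs. The sketch below assumes one record per control cycle with hypothetical boolean and numeric fields named in the docstring, and approximates motion smoothness as the share of non-corrective actions.

from statistics import mean

def summarize(records):
    """records: dicts with latency_ms (float) and json_valid, coherent,
    safety_event, corrective (booleans), one record per control cycle."""
    n = len(records)
    return {
        "latency_ms": mean(r["latency_ms"] for r in records),
        "json_validity_pct": 100.0 * sum(r["json_valid"] for r in records) / n,
        "coherence_pct": 100.0 * sum(r["coherent"] for r in records) / n,
        "collision_avoidance_pct": 100.0 * sum(not r["safety_event"] for r in records) / n,
        "smoothness_pct": 100.0 * sum(not r["corrective"] for r in records) / n,
    }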

3.4. Experimental Environment

All experiments were conducted in a real-world lecture hall environment rather than a controlled laboratory arena. The test area measured approximately 6 × 4 m and contained naturally occurring obstacles such as tables, chairs, backpacks, and groups of students present in the room. This dynamic and semi-structured setting was chosen to evaluate the robot’s ability to navigate among everyday objects and people, reflecting realistic challenges for autonomous exploration.
The robot’s primary objective during trials was environmental exploration: to move continuously through the space, avoid collisions, and dynamically select paths between obstacles while maintaining safe distances from humans and static objects. Lighting and acoustic conditions were typical of an active classroom, with ambient noise and variable visual backgrounds. Each model completed ten autonomous navigation cycles per trial. Human involvement was limited to initialization and observation, without manual correction of the robot’s trajectory.

3.5. Ethical and Safety Considerations

Because the experiments were conducted in a space occupied by students, specific safety and ethical measures were implemented to ensure that the study adhered to responsible research practices. All participants present in the lecture hall were informed about the nature and purpose of the experiment, and their presence was voluntary. The mobile robot operated at a low maximum speed of 0.2 m/s, well below any threshold that could cause harm, and its movement was continuously monitored by the supervising researcher.
A dedicated safety layer was active at all times: the ultrasonic distance sensor triggered an immediate Stop command whenever an object or person was detected within 25 cm of the robot’s front. In addition, the onboard controller maintained a “dead-man” mechanism capable of halting all motion upon communication loss or abnormal command detection.
No direct human–robot contact occurred during the trials, and the environment remained open for normal classroom activity. These precautions ensured the ethical and physical safety of all participants while maintaining the ecological validity of the experiment in a real social context.

3.6. Results Summary

Table 3 summarizes the quantitative outcomes of the experimental evaluation. The LLaVA:7B-custom model achieved the highest JSON validity and reasoning coherence while maintaining safe operation in all runs. The text-only Llama3 baseline demonstrated acceptable reasoning but required frequent safety interventions due to its lack of visual grounding. In addition to the per-cycle metrics, we also tracked a task-level success rate, defined as the proportion of autonomous runs completed within the predefined time horizon without any collision or safety-stop event. In our exploratory trials, the LLaVA:7B-custom achieved the highest success rate, whereas the standard LLaVA and text-only Llama3 baselines exhibited several early terminations caused by repeated safety-stop events.

3.7. Preliminary Observations

The results confirm that multimodal reasoning significantly improves spatial awareness and action consistency. The customized LLaVA model maintained a low-latency control loop (<200 ms) and produced structured, semantically grounded commands. The text-only baseline, lacking visual context, tended to issue overconfident or contextually inconsistent decisions and was often prevented from colliding only by the local safety guard. These findings validate the proposed client–server architecture and highlight the feasibility of deploying multimodal LLMs for real-time robotic navigation.

4. Results

This section presents the quantitative and qualitative data collected during the experimental trials. The performance of the proposed multimodal LLaVA-7B system was evaluated against the standard LLaVA baseline and the text-only Llama3 model across ten autonomous navigation cycles per configuration.

4.1. Quantitative Performance Analysis

The aggregate performance metrics are summarized in Table 3. The proposed system (LLaVA:7B-custom) achieved a JSON syntax validity rate of 96.2%, significantly outperforming the standard LLaVA configuration (88.7%). This improvement indicates that the specialized system prompt effectively constrained the model’s stochastic nature.
Table 4 summarizes the qualitative stability of key metrics across the ten autonomous navigation cycles executed for each model. The experiment focused on feasibility rather than formal statistical characterization; therefore, stability is reported qualitatively rather than via standard deviations or confidence intervals.
In terms of inference latency, the text-only Llama3 model was the fastest (155 ms), as it processed significantly fewer tokens (no image embeddings). However, the multimodal LLaVA-custom maintained a competitive latency of 185 ms, which remains well within the operational safety margin for a robot moving at 0.2 m/s. The standard LLaVA setup exhibited higher latency (240 ms) due to unoptimized context handling.
Most critically, the Safety Events metric reveals the limitations of unimodal approaches. The text-only model triggered the emergency stop in 21.5% of cases, failing to account for unmapped obstacles. In contrast, the customized multimodal system incurred zero safety violations, demonstrating that visual grounding is essential for collision-free navigation.

4.2. Scenario-Based Behavioral Analysis

To further analyze the capabilities of the system, we examined the robot’s behavior in two distinct environmental scenarios present within the lecture hall:

4.2.1. Scenario A: Static Obstacles (Furniture)

In scenarios involving stationary objects (chairs, tables), the multimodal model demonstrated high reasoning coherence (94.5%). The model correctly identified navigable gaps and preferred smooth trajectories (turn angle t < 45°) over sharp corrections. The textual reasoning field (r) consistently reflected the visual scene, e.g., “Chair leg detected on left; adjusting trajectory 15 degrees right.”

4.2.2. Scenario B: Dynamic Actors (Pedestrians)

When encountering moving students, the system’s behavior shifted from trajectory planning to immediate hazard mitigation. While the Llama3 baseline often commanded Forward movement due to a lack of visual updates, the LLaVA model successfully detected the presence of humans. In 100% of human-encounter instances, the model output either a Stop command or a significant deviation maneuver before the ultrasonic safety layer was forced to intervene. This suggests that the visual encoder (CLIP-based) within LLaVA retains sufficient sensitivity to human forms even at the reduced resolution used for inference [28].

5. Discussion

5.1. Multimodal vs. Unimodal Reasoning

The results confirm that adding a visual modality (RQ1) fundamentally transforms the navigation capability. While the text-only baseline generated syntactically perfect JSON (100% validity), its semantic quality was poor (61.3% coherence). This aligns with the hypothesis that LLMs without vision hallucinate context when placed in embodied scenarios. The multimodal approach allows the robot to ground its linguistic reasoning in physical reality, enabling it to handle unstructured environments that were not predefined in a map.
Regarding classical baselines, we also include a theoretical comparison of three representative navigation paradigms: modular CNN–logic pipelines (e.g., YOLO combined with a rule-based controller), end-to-end reinforcement learning agents, and our zero-shot VLM approach. Table 5 summarizes typical trade-offs reported in the literature regarding training requirements, semantic flexibility, latency characteristics, and explainability. These values are indicative and do not represent additional experiments performed in this study.

5.2. Qualitative Analysis of Reasoning and Failure Modes

We performed a qualitative examination of the model’s reasoning outputs across successful and unsuccessful control cycles. Two representative examples are shown below (Listings 3 and 4).
Listing 3. Successful reasoning example (LLaVA-7B).
In the failure case shown in Listing 4, the model hallucinated an obstacle due to glare on a metal chair frame.
Listing 4. Failure-mode example (LLaVA-standard).
Lighting changes also affected semantic descriptions more than command correctness. In several cases, the reasoning text mentioned shadows or reflections that were not relevant to the geometry, while the JSON output remained safe thanks to conservative prompting and the semantic guard. Additional errors observed during the study are summarized in Table 6.

5.3. Latency and Architecture Trade-Offs

Addressing RQ2, the client–server architecture introduced a network overhead, yet the total latency (185 ms) proved sufficient for the tested velocity. The separation of concerns allowed the lightweight Raspberry Pi to maintain a high-frequency safety loop (∼20 Hz via the ultrasonic sensor) independent of the lower-frequency cognitive loop (∼5 Hz). This hierarchical design mimics biological systems, where reflex actions (spinal cord) override higher-level planning (brain) when immediate danger is detected.
While the system was not instrumented to measure individual latency components separately, we clarify here the qualitative structure of the control loop. The end-to-end latency consists of three stages: (1) transmission of the image and sensor data via HTTP to the inference server, (2) model inference on the Ollama runtime, and  (3) dispatch of the validated motion command through the persistent TCP socket. Based on our observations during the trials, the inference stage dominates the total cycle time.
Although the prototype was evaluated with a single robot, the architecture naturally allows a qualitative discussion of potential scalability. Because the end-to-end control cycle for one robot is approximately 185 ms, the total computational load on the inference server would increase proportionally with the number of agents. This provides an analytical basis for estimating how many concurrent robots the system architecture could support before latency begins to increase noticeably. We therefore frame the swarm scenario strictly as an architectural extension rather than demonstrated functionality.
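As a back-of-the-envelope illustration of this estimate, assuming the server processes requests sequentially and the measured 185 ms cycle is dominated by inference:

CYCLE_S = 0.185                                   # measured end-to-end cycle for a single robot
for n_robots in (1, 2, 4, 8):
    per_robot_cycle = n_robots * CYCLE_S          # round-robin serving of N robots on one server
    print(f"{n_robots} robots -> ~{per_robot_cycle:.2f} s between decisions per robot "
          f"(~{1.0 / per_robot_cycle:.1f} Hz cognitive loop)")

Under these assumptions, even four concurrent robots would already slow the per-robot cognitive loop to roughly 1.4 Hz, which motivates the event-driven querying strategy discussed in the Conclusions.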

5.4. Safety and Determinism

Regarding RQ3, the experiment showed that prompt engineering alone is insufficient for guaranteeing safety. The standard LLaVA model occasionally produced invalid JSON or unsafe commands (4.5% safety events). The LLaVA-custom configuration, combined with the deterministic “semantic guard” on the Raspberry Pi, eliminated these errors. This highlights a critical design principle for Embodied AI: generative models must always be wrapped in deterministic validation layers to filter stochastic hallucinations before they reach actuators [23,29].
The observed 21.5% safety event rate in the text-only Llama3 baseline stemmed primarily from the lack of visual grounding. Relying solely on symbolic state descriptions and scalar distance readings, Llama3 often misjudged spatial relations, inferred nonexistent open paths, or failed to track dynamic obstacles in the lecture-hall environment. In practice, the ultrasonic safety layer most frequently intervened by overriding incorrect Forward commands that would have moved the robot too close to obstacles. In contrast, the multimodal LLaVA-7B controller maintained 0% safety events, as its decisions were consistently aligned with the images captured by the front camera and the actual geometry of tables, chairs and people in the scene.

5.5. Limitations

The primary limitation observed was the dependence on network stability. Although the TCP socket provided robustness, any significant packet loss resulted in a “dead-man” stop, halting exploration. Future work will focus on distilling the LLaVA model into a quantized format (e.g., 4-bit) capable of running directly on the edge device to mitigate connectivity risks [19,30,31].
A second limitation concerns energy efficiency. In the current design, a 7B-parameter vision–language model is executed on a separate high-performance server, which entails a substantially higher inference power draw than lightweight CNN-based pipelines typically used on embedded platforms. Although electrical consumption was not instrumented in this study, this overhead is particularly relevant for fully battery-powered deployments and should be treated as a potential constraint in future designs.

6. Conclusions

This study demonstrated that Large Language Models (LLMs), and in particular multimodal variants such as LLaVA:7B, can effectively serve as cognitive controllers for autonomous mobile robots. The results confirmed that the proposed client–server architecture, in which the LLM is hosted on a high-performance server and the Raspberry Pi 4 acts as a local sensing and actuation unit, enables real-time reasoning and safe navigation in dynamic environments. The separation of perception and cognition between edge and server layers proved to be a key design factor, combining fast data collection with complex language-driven decision-making. The hybrid system allowed the robot not only to execute commands but also to plan multi-step actions, explore unknown spaces, and provide interpretable textual reasoning for each decision. The inclusion of local safety mechanisms—such as distance sensors and command validation—ensured reliable and collision-free operation even in environments with human participants.
The outcomes highlight the potential of integrating LLM-based reasoning with lightweight robotic platforms as a foundation for future embodied intelligence and swarm systems. Current research is expanding this concept by embedding a smaller local LLM on the Raspberry Pi to enable interactive “dialogue” with the main model during route negotiation. Preliminary tests indicate that such distributed reasoning and cooperative decision-making significantly reduce communication overhead and enhance the scalability of multi-robot exploration.
In multi-robot scenarios, communication overhead can be mitigated by combining local autonomy with event-driven LLM queries. Robots perform short safe actions locally, requesting LLM guidance only when encountering ambiguous or high-uncertainty situations. This hierarchical approach reduces network load and prevents bottlenecks on the central inference server.
In conclusion, the successful application of multimodal LLMs for real-time robot control marks an important step toward more autonomous, explainable, and cognitively capable robotic systems. The presented architecture provides a flexible framework for future developments in swarm intelligence, human–robot collaboration, and AI-driven spatial reasoning.

Author Contributions

Conceptualization, D.E.; methodology, D.E.; software, F.R., M.S., P.B. (Patryk Bartkowiak) and P.B. (Patryk Blumensztajn); validation, D.E., F.R. and M.S.; formal analysis, D.E.; investigation, F.R., M.S., P.B. (Patryk Bartkowiak) and P.B. (Patryk Blumensztajn); resources, D.E.; data curation, F.R. and M.S.; writing—original draft preparation, D.E. and F.R.; writing—review and editing, D.E.; visualization, M.S. and P.B. (Patryk Blumensztajn); supervision, D.E.; project administration, D.E.; funding acquisition, D.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Model
VLM: Vision–Language Model
MLLM: Multimodal Large Language Model
CNN: Convolutional Neural Network
RL: Reinforcement Learning
REST: Representational State Transfer
TCP: Transmission Control Protocol
JSON: JavaScript Object Notation
API: Application Programming Interface
OCR: Optical Character Recognition
PWM: Pulse Width Modulation
FPS: Frames Per Second

References

  1. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A comprehensive overview of large language models. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–72. [Google Scholar] [CrossRef]
  2. Li, Y.; Katsumata, K.; Javanmardi, E.; Tsukada, M. Large Language Models for Human-Like Autonomous Driving: A Survey. In Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), St. Louis, MO, USA, 24–27 September 2024. [Google Scholar] [CrossRef]
  3. Han, S.; Wang, M.; Zhang, J.; Li, D.; Duan, J. A Review of Large Language Models: Fundamental Architectures, Key Technological Evolutions, Interdisciplinary Technologies Integration, Optimization and Compression Techniques, Applications, and Challenges. Electronics 2024, 13, 5040. [Google Scholar] [CrossRef]
  4. Shah, D.; Osiński, B.; Levine, S. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Proceedings of the Conference on Robot Learning, PMLR, Auckland, New Zealand, 14–18 December 2022. [Google Scholar]
  5. Xu, Z.; Zhang, Y.; Xie, E.; Zhao, Z.; Guo, Y.; Wong, K.K.; Li, Z.; Zhao, H. DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model. IEEE Robot. Autom. Lett. 2024, 9, 8186–8193. [Google Scholar] [CrossRef]
  6. Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, Y.; et al. A Survey on Multimodal Large Language Models for Autonomous Driving. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January 2024. [Google Scholar] [CrossRef]
  7. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
  8. Dorigo, M.; Theraulaz, G.; Trianni, V. Swarm Robotics: Past, Present, and Future. Proc. IEEE 2021, 109, 1152–1165. [Google Scholar] [CrossRef]
  9. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  10. Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; Fei-Fei, L. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. arXiv 2024, arXiv:2307.05973. [Google Scholar] [CrossRef]
  11. Ma, Y.; Song, Z.; Zhuang, Y.; Hao, J.; King, I. A Survey on Vision-Language-Action Models for Embodied AI. arXiv 2024, arXiv:2405.14093. [Google Scholar] [CrossRef]
  12. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Ho, D.; Hsu, J.; et al. Do as I Can, Not as I Say: Grounding Language in Robotic Affordances. In Proceedings of the 6th Conference on Robot Learning (CoRL), Auckland, New Zealand, 14–18 December 2022. [Google Scholar] [CrossRef]
  13. Kim, M.J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Wu, B.; Le, A.; Lu, C.; Xu, E.; Vuong, Q.; et al. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv 2024, arXiv:2406.09246. [Google Scholar] [CrossRef]
  14. Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  15. Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Chen, X.; Choromanski, K.; Ding, T.; Driess, D.; Dubey, A.; Finn, C.; et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Proceedings of the 7th Conference on Robot Learning, PMLR, Atlanta, GA, USA, 6–9 November 2023; Volume 229, pp. 2165–2183. [Google Scholar]
  16. Brambilla, M.; Ferrante, E.; Birattari, M.; Dorigo, M. Swarm Robotics: A Review from the Swarm Engineering Perspective. Swarm Intell. 2013, 7, 1–41. [Google Scholar] [CrossRef]
  17. Li, P.; An, Z.; Abrar, S.; Zhou, L. Large Language Models for Multi-Robot Systems: A Survey. arXiv 2025, arXiv:2502.03814. [Google Scholar] [CrossRef]
  18. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Meta AI. 2024. Available online: https://ai.meta.com/results/?q=Llama+3%3A+The+Next+Generation+of+Llama+Foundation+Models (accessed on 13 December 2025).
  19. Chu, X.; Qiao, L.; Lin, X.; Xu, S.; Yang, Y.; Hu, Y.; Wei, F.; Zhang, B.; Wei, X.; Shen, C. MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices. arXiv 2024, arXiv:2312.16886. [Google Scholar] [CrossRef]
  20. Wang, Z.; Hu, J.; Mu, R. Safety of Embodied Navigation: A Survey. arXiv 2025, arXiv:2508.05855. [Google Scholar] [CrossRef]
  21. Zawalski, M.; Chen, W.; Pertsch, K.; Mess, O.; Finn, C.; Levine, S. Embodied Chain-of-Thought Reasoning for Vision-Language-Action Models. arXiv 2024, arXiv:2407.08693. [Google Scholar] [CrossRef]
  22. Yang, Z.; Raman, S.S.; Shah, A.; Tellex, S. Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar] [CrossRef]
  23. Ravichandran, Z.; Robey, A.; Kumar, V.; Pappas, G.J.; Hassani, H. Safety Guardrails for LLM-Enabled Robots. arXiv 2025, arXiv:2503.07885. [Google Scholar] [CrossRef]
  24. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  25. LLaVA: Large Language and Vision Assistant. Ollama Library. Available online: https://ollama.com/library/llava:7b (accessed on 13 December 2025).
  26. llava-hf/llava-1.5-7b-hf Model Card. Hugging Face. (Model Info. LLaVA-1.5 7B). Available online: https://huggingface.co/llava-hf/llava-1.5-7b-hf (accessed on 13 December 2025).
  27. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2025, arXiv:2407.21783. [Google Scholar] [CrossRef]
  28. Payandeh, A.; Song, D.; Nazeri, M.; Liang, J.; Mukherjee, P.; Raj, A.H.; Kong, Y.; Manocha, D.; Xiao, X. Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces. arXiv 2025, arXiv:2311.12320. [Google Scholar] [CrossRef]
  29. Rawte, V.; Chakraborty, S.; Pathak, A.; Sarkar, A.; Zaki, M.; Das, A.; Sheth, A.; Saha, T.; Gunti, N.; Roy, K.; et al. The Troubling Emergence of Hallucination in Large Language Models: An Extensive Definition, Quantification, and Prescriptive Remediations. arXiv 2023, arXiv:2310.04988. [Google Scholar] [CrossRef]
  30. Williams, J.; Gupta, K.D.; George, R.; Sarkar, M. Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots. arXiv 2025, arXiv:2511.05642. [Google Scholar] [CrossRef]
  31. Haque, M.A.; Rahman, F.; Gupta, K.D.; Shujaee, K.; George, R. TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices. arXiv 2025, arXiv:2511.22138. [Google Scholar] [CrossRef]
Figure 1. System architecture: hybrid communication model using REST for high-bandwidth telemetry (Robot → Server) and a persistent TCP connection for low-latency real-time control (Server → Robot). The LLaVA model processes visual input on the server side to generate structured JSON commands.
Figure 2. Experimental platform: Raspberry Pi 4-based 4WD robot with camera on a 180° servo, HC-SR04 range sensor, motor driver shield, OLED and LED displays.
Figure 3. Logical workflow of the autonomous navigation control loop. The cycle integrates server-side inference with client-side validation and safety overrides.
Table 1. Hardware specifications of the experimental robot.
Component | Model/Module | Manufacturer, City, Country | Purpose
On-board computer | Raspberry Pi 4 Model B (4 GB) | Raspberry Pi Ltd., Cambridge, UK | Handles edge-level control, sensor data acquisition, local safety supervision, and communication with the main LLaVA inference server.
Camera | RPi 5 MP with 180° servo mount | Raspberry Pi Ltd., Cambridge, UK | Provides real-time visual input and adjustable viewpoint for scene exploration.
Range sensor | HC-SR04 ultrasonic sensor | Yahboom, Shenzhen, China | Measures obstacle distance and triggers a safety stop when the distance falls below 0.25 m.
Drive system | 4WD motor driver shield + 4 × DC 4.5 V 200 rpm motors | Yahboom, Shenzhen, China | Controls differential steering and speed using PWM signals.
Displays | 0.96 in OLED + 8 × 16 LED matrix | Yahboom, Shenzhen, China | Displays connection mode, debug data, and system status feedback.
Power supply | 2 × 18650 Li-Ion cells + DC–DC regulators | Yahboom, Shenzhen, China | Provides stabilized and isolated voltage for the logic and motor subsystems.
Table 2. Software environment and model configuration parameters.
Parameter | Description/Value
Model Type | Multimodal (image + text)
Output Format | One sentence (scene description) + structured JSON command
Inference Settings | Temperature 0.2, top-p 0.9, context window 4096
Table 3. Performance comparison of evaluated models across 10 navigation cycles.
Model | Latency [ms] | Valid JSON [%] | Coherence [%] | Safety Events [%] | Smoothness [%]
LLaVA:7B (custom) | 185 ± 12 | 96.2 | 94.5 | 0.0 | 91.8
LLaVA (standard) | 240 ± 18 | 88.7 | 82.1 | 4.5 | 83.4
Llama3 (text-only) | 155 ± 9 | 100.0 | 61.3 | 21.5 | 67.8
Table 4. Qualitative stability of key metrics across repeated runs.
Model | Latency Stability | JSON Stability | Safety Stability
LLaVA:7B (custom) | High | High | High
LLaVA (standard) | Medium | Medium | High
Llama3 (text-only) | High | High | Low
Table 5. Comparison of classical navigation paradigms and the proposed VLM-based controller. Latency values are indicative.
Feature | Modular (CNN + Logic) | End-to-End RL | VLM Zero-Shot (Ours)
Training Required | Moderate (Object Det.) | High (Sim2Real) | None (Pre-trained)
Semantic Understanding | Low (Class Labels) | None (Black box) | High (Natural Language)
Inference Latency | Low (<50 ms) | Very Low (<10 ms) | High (∼200 ms)
Explainability | Medium | Low | High (Text Reasoning)
Table 6. Observed reasoning-related error modes during qualitative analysis.
Error Type | Description/Example
Lighting artefacts | Hallucinated obstacles due to glare or shadows.
Overgeneralization | Commands based on incorrect assumptions about free space.
Delayed scene update | Reasoning referencing earlier frames in dynamic situations.
Ambiguous semantics | Vague or overly cautious descriptions (e.g., “something ahead”).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
