Article

Evaluating a Hybrid LLM Q-Learning/DQN Framework for Adaptive Obstacle Avoidance in Embedded Robotics

Rihem Farkh, Ghislain Oudinet and Thibaut Deleruyelle
1 LabISEN, KLaIM, ISEN Méditerranée, 83000 Toulon, France
2 LabISEN, VISION-AD, ISEN Méditerranée, 83000 Toulon, France
3 IM2NP, ISEN Méditerranée, 83000 Toulon, France
* Author to whom correspondence should be addressed.
AI 2025, 6(6), 115; https://doi.org/10.3390/ai6060115
Submission received: 10 April 2025 / Revised: 15 May 2025 / Accepted: 28 May 2025 / Published: 4 June 2025
(This article belongs to the Section AI in Autonomous Systems)

Abstract

This paper introduces a pioneering hybrid framework that integrates Q-learning/deep Q-network (DQN) with a locally deployed large language model (LLM) to enhance obstacle avoidance in embedded robotic systems. The STM32WB55RG microcontroller handles real-time decision-making using sensor data, while a Raspberry Pi 5 computer runs a quantized TinyLlama LLM to dynamically refine navigation strategies. The LLM addresses traditional Q-learning limitations, such as slow convergence and poor adaptability, by analyzing action histories and optimizing decision-making policies in complex, dynamic environments. A selective triggering mechanism ensures efficient LLM intervention, minimizing computational overhead. Experimental results demonstrate significant improvements over standalone Q-learning/DQN, including a deadlock recovery rate up to 41 percentage points higher (81% for Q-learning + LLM vs. 40% for Q-learning), a time to goal up to 34% faster (38 s vs. 58 s), and a collision rate up to 14 percentage points lower (11% vs. 25%). This novel approach presents a solution for scalable, adaptive navigation in resource-constrained embedded robotics, with potential applications in logistics and healthcare.

1. Introduction

Obstacle avoidance robots are autonomous or semi-autonomous systems designed to navigate their environment while detecting and avoiding obstacles in their path [1,2]. These robots leverage a combination of sensors, algorithms, and control systems to perceive their surroundings, make real-time decisions, and adjust their trajectory to prevent collisions. Common sensors used include ultrasonic sensors, LiDAR, infrared sensors, cameras, and radar, which provide data about the robot’s environment. This data is processed using algorithms, such as simultaneous localization and mapping (SLAM), potential field methods, or machine learning-based approaches to plan safe paths [3,4].
Obstacle avoidance is a critical capability for robots operating in dynamic or unstructured environments, such as autonomous vehicles, drones, warehouse robots, and service robots. Research in this field focuses on improving sensor accuracy, enhancing decision-making algorithms, and enabling robots to handle complex scenarios, like moving obstacles, narrow spaces, or uncertain environments. Classical path planning approaches encompass a range of methods, including traditional techniques like artificial potential fields (APFs) [5] and more advanced strategies such as A*, D* [6,7,8], and the dynamic window approach (DWA), all of which focus on generating safe trajectories using geometric or heuristic techniques [9]. Bio-inspired algorithms, including genetic algorithms (GAs) [10], particle swarm optimization (PSO) [11,12], and ant colony optimization (ACO) [13], draw from natural systems to optimize paths. PSO, for instance, mimics swarm behavior to explore path options, while ACO emulates ant foraging to find efficient routes. However, these methods are computationally intensive, requiring significant memory and processing power, which makes them impractical for resource-constrained embedded systems.
Recent advancements in artificial intelligence and deep learning have further improved the robustness and adaptability of obstacle avoidance systems, making them more reliable for real-world applications [14,15]. This research area continues to evolve, driven by the growing demand for autonomous systems in industries like logistics, healthcare, and agriculture [16,17].
Q-learning is a foundational algorithm in the field of reinforcement learning (RL), a subdomain of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. Introduced by Watkins in 1989 [18], Q-learning is a model-free, off-policy algorithm that enables an agent to learn optimal actions in a Markov decision process (MDP) without requiring prior knowledge of the environment’s dynamics. The algorithm works by iteratively updating a Q-value function, which estimates the expected future rewards for taking a specific action in a given state and following the optimal policy thereafter.
The core idea of Q-learning is to balance exploration (trying new actions) and exploitation (choosing known rewarding actions) to discover the optimal policy. It uses the Bellman equation to update Q-values, ensuring convergence to the optimal policy under certain conditions, such as sufficient exploration and a decaying learning rate. Q-learning has been widely applied in various domains, including robotics, game playing, autonomous systems, and control tasks, due to its simplicity and effectiveness [19,20].
Traditional Q-learning faces several limitations in dynamic environments. It struggles with slow convergence in large state spaces due to the exponential growth of the Q-table, making it impractical for complex scenarios. It also has difficulty handling uncertainty and non-stationary environments, as it assumes fixed state transitions and rewards. Q-learning has limitations when it comes to adapting to new or unseen states, often leading to suboptimal decisions in unfamiliar situations. Additionally, the exploration–exploitation trade-off can result in inefficient exploration, causing the robot to get stuck in local optima. Finally, Q-learning cannot leverage high-level reasoning or contextual knowledge, limiting its ability to handle complex, unstructured environments effectively. These limitations hinder its performance in real-world, dynamic applications [21,22,23,24,25].
The deep Q-network (DQN) is an advanced reinforcement learning algorithm that combines deep neural networks with Q-learning to tackle complex decision-making tasks. It enables agents to learn optimal behaviors from high-dimensional input data, such as images, by approximating the action–value function [26]. The DQN addresses challenges in traditional Q-learning by using experience replay and fixed Q-targets to stabilize training. Recent innovations, like double DQN, help to mitigate the overestimation of Q-values, thereby improving the learning efficiency and stability. The DQN has demonstrated remarkable success in various domains, including playing Atari games, robotics, and autonomous driving, achieving up to 96% success rates in some tasks. As of 2025, ongoing research focuses on enhancing DQN’s scalability and efficiency, making it a cornerstone of modern AI applications [27,28].
Despite these advancements, a significant research gap persists in achieving efficient, adaptive obstacle avoidance on resource-constrained embedded systems, such as low-power microcontrollers like the STM32WB55RG. For instance, a cleaning robot using Q-learning might repeatedly fail to navigate around a suddenly appearing obstacle (e.g., a moving cart), resulting in collisions or deadlocks due to its inability to adapt quickly or reason about environmental context [29]. This gap highlights the need for a lightweight, adaptive approach that combines real-time decision-making with high-level reasoning, enabling robust navigation on resource-constrained hardware.
Large language models (LLMs) are increasingly being explored for their potential in robotics, where they can enhance the interaction, decision-making, and adaptability of robotic systems [30]. By leveraging their advanced natural language understanding and generation capabilities, LLMs can serve as a bridge between human instructions and robotic actions, enabling more intuitive and flexible human–robot collaboration [31,32].
Large language models (LLMs) are advanced artificial intelligence systems designed to understand, generate, and manipulate human language. These models, such as OpenAI’s GPT series, Google’s BERT, and Meta’s LLaMA, are built using deep learning architectures, particularly transformers, which enable them to process and generate text with remarkable coherence and context awareness. LLMs are trained on vast amounts of text data from diverse sources, allowing them to learn linguistic patterns, grammar, and even world knowledge [33,34,35].
Recent advancements in robotics have been profoundly influenced by the integration of LLMs, which enable robots to interpret complex instructions, adapt to dynamic environments, and execute sophisticated decision-making tasks [36,37].
In a recent paper [38], the authors provided a comprehensive overview of recent LLM-driven robotic systems, highlighting their applications in navigation, human–robot interaction (HRI), and autonomous control. LLMs are rapidly transforming robotics by empowering agents to understand high-level directives, reason about intricate environments, and adapt to unpredictable scenarios. Recent research and industry developments have highlighted several key capabilities of LLM-enhanced robotic systems, including semantic mapping and contextual waypoint planning, natural-language dialogue with clarification and justification, dynamic task prioritization and contingency-plan generation, and closed-loop autonomous control based on ongoing mission goals [39].
Combining LLMs with Q-learning/DQN for obstacle avoidance in robotics represents a novel, interdisciplinary paradigm for robotic navigation that leverages the strengths of both technologies, merging symbolic reasoning with reinforcement learning (RL). LLMs contribute high-level environmental understanding (e.g., semantic obstacle classification, task decomposition) and natural language interaction (interpreting verbal instructions like “avoid the glass table”), while Q-learning/DQN provides local, real-time policy optimization through reward-driven exploration.
This paper introduces a novel framework integrating deep Q-networks (DQNs)/Q-learning with large language models (LLMs) to address key challenges in adaptive obstacle avoidance. The key innovation of our approach lies in integrating a locally deployed LLM on a Raspberry Pi 5 single-board computer with Q-learning and DQN algorithms running on an STM32WB55RG microcontroller, enabling real-time adaptive navigation. This hybrid architecture employs a split mechanism as follows: a quantized LLM (TinyLlama) analyzes historical action sequences and environmental contexts (e.g., obstacle patterns, navigation history) to dynamically refine Q-values and optimize exploration policies, directly addressing traditional Q-learning’s slow convergence in dynamic environments. Simultaneously, the STM32 microcontroller handles time-critical tasks—processing raw sensor data from time-of-flight (ToF) sensors at 100 Hz and executing motor control signals within 2 ms latency—while deferring high-level strategic adjustments to the LLM.
Crucially, we introduce a dynamic LLM triggering mechanism (Section 5) that invokes the LLM during edge cases (e.g., deadlocks, repeated collisions), minimizing latency and computational overhead compared to continuous LLM involvement. Furthermore, our robotics-specific prompt engineering (Section 6) tailors LLM outputs to low-level navigation tasks (e.g., “High-Risk Obstacle Avoidance”), unlike generic LLM applications in prior works.
Empirical evaluations highlight the effectiveness of integrating a locally deployed LLM with reinforcement learning algorithms, such as Q-learning and DQNs (Section 9). This hybrid approach significantly enhances the adaptive navigation performance in mobile robots. Notably, the system demonstrates improved deadlock recovery rates, reduced time to reach goals, decreased collision rates, and increased successful navigation attempts, especially in dynamic environments. These improvements underscore the LLM’s capacity to augment traditional reinforcement learning methods, leading to more robust and efficient navigation strategies.

2. System Architecture

The AlphaBot2 (Figure 1) is a flexible robotic platform tailored for educational purposes and rapid prototyping. It supports a variety of microcontrollers, including Raspberry Pi, Arduino, and STM32, and boasts a modular design that allows for integration with additional components like sensors, cameras, and wireless modules. The platform is equipped with DC motors for differential drive, enabling precise movement control [40].
Among its features are infrared sensors for line tracking and ultrasonic sensors for obstacle detection, which allow the AlphaBot2 to perform tasks such as line-following, obstacle avoidance, and AI-powered navigation. Control options include Bluetooth, Wi-Fi, or fully autonomous operation using real-time sensor feedback. With support for multiple programming environments, the AlphaBot2 serves as an excellent tool for learning robotics, developing autonomous systems, and experimenting with artificial intelligence.
The following diagram in Figure 2 represents a robot control system integrating Q-learning for adaptive navigation. The system consists of three time-of-flight (ToF) sensors (TOF1, TOF2, TOF3) that measure distances and send data through a TCA9548A I2C multiplexer to the WB55RG microcontroller.
The STM32WB55RG was selected for its low-power, real-time capabilities. The WB55RG processes the data from sensors and runs a Q-learning algorithm to optimize movement decisions. It then sends motor control signals via PWM to a TB6612FNG motor driver, which controls two motors (Motor A and Motor B).
To enhance decision-making, the system incorporates a large language model (LLM) through a Raspberry Pi 5 computer, which runs a local LLM for on-device inference. This approach reduces latency and dependence on external connectivity. The setup enables the robot to dynamically adjust its navigation strategy based on sensor inputs, Q-learning updates, and real-time LLM optimizations, improving adaptability and efficiency in complex environments. Additionally, a serial peripheral interface (SPI) link ensures rapid communication between components, so that real-time decision-making and obstacle avoidance receive timely adjustments.
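For illustration, the MicroPython-style sketch below shows the TCA9548A channel-select pattern (writing a single byte with one bit set per active channel to the multiplexer’s default address 0x70) used to poll the three ToF sensors in turn; the assumption of a MicroPython firmware, the bus number, and the read_distance_mm helper are illustrative, since the actual firmware is hardware specific.
from machine import I2C   # MicroPython firmware on the STM32WB55RG (an assumption)

TCA9548A_ADDR = 0x70       # default I2C address of the TCA9548A multiplexer
i2c = I2C(1)               # hardware I2C bus; the bus number is board specific

def select_channel(channel):
    # The TCA9548A routes I2C traffic to whichever channel bits are set in this byte.
    i2c.writeto(TCA9548A_ADDR, bytes([1 << channel]))

def read_all_tof():
    distances = []
    for ch in range(3):    # TOF1, TOF2, TOF3 sit on multiplexer channels 0-2
        select_channel(ch)
        distances.append(read_distance_mm(i2c))   # hypothetical ToF driver call
    return distances       # [d_front, d_left, d_right]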

3. DQN and Q-Learning for Obstacle Avoidance

Q-learning and deep Q-network (DQN) are both reinforcement learning algorithms, but they differ in scalability and approach. Q-learning relies on a Q-table, making it suitable for simple tasks but impractical for high-dimensional state spaces due to memory limitations. By contrast, the DQN uses deep neural networks to approximate Q-values, enabling it to process complex environments and raw inputs like images. To improve stability, the DQN incorporates experience replay and fixed Q-targets, making it more effective than Q-learning in dynamic scenarios. While Q-learning stores Q-values explicitly, the DQN compresses information through neural networks, allowing for better generalization and efficient memory usage.

3.1. Q-Learning

Q-Learning is a model-free RL algorithm where an agent learns to make decisions by interacting with its environment. It uses a Q-table to store expected rewards for taking specific actions in given states. Over time, the agent updates the Q-table using the following formula:
Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ]
where
  • Q(s,a) is the current Q-value for state s and action a;
  • α is the learning rate (controls how much new information overrides old information);
  • R(s,a) is the immediate reward for taking action (a) in state (s);
  • γ is a discount factor (weighs the importance of future rewards vs. immediate rewards);
  • max_{a′} Q(s′, a′) is the maximum Q-value over all actions a′ in the next state s′.
The ε-greedy policy is a strategy used in reinforcement learning (RL), including Q-learning, to balance exploration and exploitation when selecting actions.
  • Exploitation: Choosing the action with the highest Q-value to maximize immediate rewards.
  • Exploration: Randomly selecting an action to discover better strategies and to improve long-term performance.
A purely greedy approach, which always selects the best-known action, can trap the agent in a local optimum, while a purely random approach prevents effective learning. The ε-greedy policy strikes a balance by allowing for both exploration and exploitation.
In obstacle avoidance, exploration helps the robot discover new paths or strategies by trying out different actions, even if they are not initially optimal. Exploitation ensures the robot uses its learned knowledge to navigate efficiently and avoid collisions by choosing actions that are known to yield high rewards based on past experiences.
The following Table 1 shows how the Q-learning parameters were tuned experimentally.
The values for α = 0.1 and γ = 0.9 were selected based on empirical testing, balancing convergence speed with stability. Future work could explore adaptive learning rate strategies.
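As a worked example of the update rule (with illustrative values not drawn from our experiments), suppose Q(s, a) = 0.5, the robot receives a reward R(s, a) = −1 for approaching an obstacle, and the best Q-value in the next state is max_{a′} Q(s′, a′) = 0.8. With α = 0.1 and γ = 0.9, the update gives Q(s, a) ← 0.5 + 0.1 × (−1 + 0.9 × 0.8 − 0.5) = 0.5 − 0.078 = 0.422, nudging the estimate downward to reflect the penalty.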
Algorithm: Standard Q-Learning for Obstacle Avoidance
  • State Space: (s) = [dfront, dleft, dright], Discretized into “Close”, “Medium”, “Far” bins.
    where:
  • dfront = Distance to the nearest obstacle in front.
  • dleft = Distance to the nearest obstacle on the left.
  • dright = Distance to the nearest obstacle on the right.
  • d is binned as Close if d ≤ 3 cm, Medium if 3 cm < d ≤ 5 cm, and Far if d > 5 cm.
  • Action Space: A = {Move Forward, Turn Left, Turn Right, Stop}
    1. Initialize Q-tables
    -Initialize Q-table with zeros for all state-action pairs.
    -Set exploration rate ϵ = 1.0 (initial probability of random actions)
    2. Training Loop:
    For each training episode:
  • Reset the environment → Set initial state (s)
  • Loop for each time step until episode ends:
    2.1 Choose Action:
    With probability ε, choose a random action (a) (exploration).
    Otherwise, choose the action with the highest Q-value
    a = argmax_a Q(s, a)
    2.2 Execute Action & Observe Reward:
    Perform action (a) in the environment.
    Observe reward (r) and next state (s′).
    2.3 Update Q-Table (Standard Q-learning Update Rule):
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
    α → Learning rate (controls how much new experience overrides old values).
    γ → Discount factor (weighs future rewards vs. immediate rewards).
    2.4 Decay Exploration Rate:
    ε ← max(ε × εdecay, εmin), with εdecay = 0.995;
    εmin = 0.01 (ensures some exploration).
    2.5 Update State:
    Set (s) = (s′).
  • Repeat Until Episode Ends (collision or max steps reached)
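To make the procedure above concrete, the following minimal Python sketch implements the distance discretization, the ε-greedy policy, and the tabular Bellman update; the sensor-reading and motor-control calls are hardware specific and therefore omitted.
import random
from collections import defaultdict

ACTIONS = ["Move Forward", "Turn Left", "Turn Right", "Stop"]
ALPHA, GAMMA = 0.1, 0.9                      # values from Table 1
EPS_DECAY, EPS_MIN = 0.995, 0.01

def discretize(d_cm):
    # Map a raw ToF distance (cm) to the Close/Medium/Far bins defined above.
    if d_cm <= 3:
        return "Close"
    if d_cm <= 5:
        return "Medium"
    return "Far"

def get_state(d_front, d_left, d_right):
    return (discretize(d_front), discretize(d_left), discretize(d_right))

q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state, epsilon):
    # ε-greedy policy: explore with probability ε, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_table[state], key=q_table[state].get)

def update_q(state, action, reward, next_state):
    # Standard Q-learning (Bellman) update.
    best_next = max(q_table[next_state].values())
    q_table[state][action] += ALPHA * (reward + GAMMA * best_next - q_table[state][action])

def decay_epsilon(epsilon):
    return max(epsilon * EPS_DECAY, EPS_MIN)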

3.2. DQN for Obstacle Avoidance

The DQN is an extension of Q-learning, a popular RL algorithm, but leverages deep neural networks to approximate the Q-value function, allowing it to handle high-dimensional state spaces. By continuously interacting with the environment and receiving reward signals based on successful navigation, a robot using a DQN learns to make optimal movement decisions, avoiding obstacles efficiently while minimizing collisions (Figure 3).
Compared to traditional Q-learning, the DQN offers better generalization and adaptability, making it well-suited for real-world robotic applications where environments are complex and unpredictable. When combined with additional techniques, such as experience replay or target networks, the DQN can further enhance obstacle avoidance performance.
Since the STM32WB55RG has limited computational power and memory, running a full DQN with experience replay and backpropagation is not feasible on-device. However, the WB55RG can execute Q-learning or a pre-trained DQN for real-time decision-making, while training is performed off-board. We therefore adopt the following approach:
  • STM32WB55RG collects data (states, actions, rewards) from sensors and stores them.
    Initialize sensors ToF.
    For each step in the environment:
    Read sensor data → Get state (s) = [dfront, dleft, dright];
    Select an action (a) randomly or based on simple Q-table;
    Execute action and observe reward (r);
    Store (s,a,r,s′) in memory (e.g., SD card, Flash, or send via UART).
  • A computer (PC with GPU) trains the DQN model using collected data.
  • The trained network is deployed back to STM32WB55RG, where it runs only inference for decision-making.
    Load trained neural network (stored in Flash or loaded via UART).
    For each step:
    Read sensor data → Get state (s);
    Run the trained DQN model to get action a = argmax_a Q(s, a);
    Execute action (motor movement).
    Repeat until the task is completed.
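As an illustration of the off-board training step, the PyTorch sketch below trains a small Q-network on transitions logged by the microcontroller; the network size, hyperparameters, and the assumption that transitions arrive as Python tuples are illustrative choices, not the exact configuration used in our experiments. The trained weights can then be converted for inference-only execution on the WB55RG (e.g., with ST’s X-CUBE-AI toolchain or a hand-written fixed-point implementation).
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    # Small fully connected Q-network: 3 ToF distances in, 4 action values out.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 4))
    def forward(self, x):
        return self.net(x)

def train(transitions, epochs=20, gamma=0.9, batch_size=64):
    # transitions: list of (state, action_index, reward, next_state) tuples
    # logged by the STM32WB55RG and transferred via UART or SD card.
    policy, target = QNet(), QNet()
    target.load_state_dict(policy.state_dict())
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        random.shuffle(transitions)
        for i in range(0, len(transitions), batch_size):
            batch = transitions[i:i + batch_size]
            s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
            a = torch.tensor([b[1] for b in batch])
            r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
            s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
            q = policy(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                q_target = r + gamma * target(s2).max(1).values   # fixed Q-targets
            loss = nn.functional.mse_loss(q, q_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        target.load_state_dict(policy.state_dict())   # refresh target network
    return policy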

4. Hybrid Learning Approach

The hybrid learning approach integrates local Q-learning or a deep Q-network (DQN) on the STM32WB55RG microcontroller, which is responsible for real-time decision-making, sensor data processing, and motor control, while a Raspberry Pi 5 computer assists with dynamic strategy refinement using an on-device large language model (LLM). The WB55RG collects time-of-flight (ToF) sensor data and updates the Q-table (in Q-learning) or estimates the Q-values using a trained neural network (in DQN) to select the optimal action for navigation. However, when the WB55RG detects uncertainty (e.g., repeated failures, deadlocks, or unpredictable scenarios), it sends relevant information to the Raspberry Pi 5. The Raspberry Pi 5 then analyzes the sensor readings, the Q-table state (for Q-learning) or the DQN model predictions, and recent actions to generate a structured prompt for processing by an optimized local LLM. The LLM suggests an optimized action or refines the Q-values, which the Raspberry Pi 5 transmits back to the WB55RG for execution. This division of tasks ensures that the WB55RG remains focused on real-time navigation, while the Raspberry Pi 5 efficiently handles computationally intensive tasks, such as strategy refinement, LLM-based decision-making adjustments, or dynamic Q-value updates.
The flowchart in Figure 4 illustrates the data processing pipeline from sensor readings to decision-making and motor control.
The LLM improves Q-learning/DQN by analyzing sensor data and action history to suggest better decisions or modify Q-values in uncertain situations. The process works as follows:
  • Q-learning/DQN runs locally on STM32WB55RG:
    The robot observes state (s) using ToF sensors.
    The decision function for LLM prompt selection, implemented on STM32WB55RG, evaluates the robot’s current state by analyzing key parameters, such as obstacle proximity, recent states, Q-values, and action history.
    The Q-table (for Q-learning) or the predicted Q-value (for the DQN) is queried to select an action (a) using an ε-greedy policy.
    The robot executes action (a) and observes the reward (r) and the next state (s’).
    If Q-Learning is used, the Q-table is updated using the standard Bellman equation:
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
  • If the decision function detects challenges (e.g., repeated collisions, deadlocks, or ineffective learning), the WB55RG triggers the RPi5 LLM for strategic assistance.
  • The Raspberry Pi 5 structures a prompt including:
    Current state (s), recent actions (a), and the observed reward (r);
    Current Q-values (Q-learning) or Q-network outputs (DQN) for (s).
  • The prompt is sent to the RPi5 LLM.
  • The LLM responds with a strategic decision:
    Option 1: If the LLM recommends a new action (e.g., “Turn left instead of moving forward”), the STM32WB55RG executes it directly;
    Option 2: If the LLM adjusts Q-values (for Q-learning) (e.g., “Increase reward for turning left in state (s’)”), the WB55RG updates its model accordingly and triggers a new action.
  • To ensure the LLM-generated decision is effective, a feedback control mechanism is introduced:
    Reward comparison: After executing the LLM-suggested action, the new reward (r′) is compared to the previous reward (r):
    If (r’ > r) → The LLM decision is considered effective;
    If (r’ ≤ r) → The system sends a correction prompt to the LLM for refinement.
  • The updated Q-values influence future learning and actions.
  • As the robot learns from refined data, Q-learning/DQN gradually becomes more autonomous, reducing dependence on the LLM.
This hybrid model enhances adaptability by allowing real-time obstacle avoidance via fast local decision-making while leveraging LLM-driven strategy refinement for long-term efficiency improvements.
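To illustrate the reward-comparison feedback step described above, the following Python sketch shows one way the LLM suggestion could be applied and verified; execute_action, send_prompt_to_llm, and build_correction_prompt are hypothetical helpers standing in for the firmware and serial-link code, and the retry limit is an assumption.
def apply_llm_suggestion(state, prev_reward, llm_response, q_table, max_retries=2):
    # Execute an LLM suggestion and keep it only if it improves the observed reward.
    for _ in range(max_retries + 1):
        if llm_response.get("type") == "q_adjustment":
            # Option 2: the LLM proposes a Q-value adjustment; apply it, then act greedily.
            s, a, delta = llm_response["state"], llm_response["action"], llm_response["delta"]
            q_table[s][a] += delta
            action = max(q_table[state], key=q_table[state].get)
        else:
            # Option 1: the LLM directly recommends the next action.
            action = llm_response["action"]

        new_reward, next_state = execute_action(action)   # hypothetical firmware call
        if new_reward > prev_reward:                       # r' > r: suggestion accepted
            return action, new_reward, next_state

        # r' <= r: send a correction prompt and retry with the refined answer.
        llm_response = send_prompt_to_llm(
            build_correction_prompt(state, action, prev_reward, new_reward))
    return action, new_reward, next_state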
The following diagram in Figure 5 clarifies the system’s split architecture and the flow of data and decisions between the microcontroller and the LLM.

5. Decision Function for LLM Prompt Selection

To ensure optimal decision-making in dynamic environments, we propose a decision function for LLM prompt selection implemented directly on the STM32WB55RG. This function continuously evaluates the robot’s real-time state based on key parameters, such as sensor readings, Q-values, repeated states, exploration rate, and obstacle density.
By defining the specific thresholds (e.g., obstacle proximity, deadlock detection, and low Q-values), the WB55RG can dynamically select the most relevant prompt. When necessary, it triggers the RPi5 LLM to assist in complex decision-making tasks, as follows:
  • If an obstacle is critically close, the system prioritizes a high-risk obstacle avoidance prompt;
  • If the robot repeats the same state, it triggers a deadlock resolution prompt.
This approach ensures adaptive and efficient navigation while minimizing reliance on external computation.
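A minimal sketch of such a decision function follows; the thresholds (a 3 cm proximity limit, three repeated states, a Q-value floor, and an exploration-rate cutoff) are illustrative assumptions rather than the exact values used on the robot.
def select_llm_prompt(d_front, d_left, d_right, recent_states, q_values, epsilon,
                      close_threshold=3.0, low_q_threshold=-0.5):
    # Return the prompt type to request from the RPi5 LLM, or None to decide locally.
    if min(d_front, d_left, d_right) <= close_threshold:
        return "HIGH_RISK_OBSTACLE_AVOIDANCE"   # critically close obstacle
    if len(recent_states) >= 3 and len(set(recent_states[-3:])) == 1:
        return "DEADLOCK_RESOLUTION"            # same state keeps recurring
    if q_values and max(q_values.values()) < low_q_threshold:
        return "Q_TABLE_ENHANCEMENT"            # all Q-values for this state are poor
    if epsilon > 0.5:
        return "ADAPTIVE_EXPLORATION"           # ask whether to rebalance exploration
    return None                                 # no LLM intervention needed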

6. LLM Prompting Strategies for the Robot

To effectively utilize the LLM for navigation assistance, the system employs structured prompts tailored to different scenarios. To minimize data transmission from the STM32WB55RG to the Raspberry Pi 5 computer, redundant information is eliminated, ensuring that only the essential variables required for each specific prompt are sent. Multiple scenarios have been tested using the TinyLlama LLM to validate the effectiveness of these prompts in real-time decision-making. The results confirm that optimized prompts significantly reduce communication overhead while maintaining full functionality and response accuracy. Below are the optimized prompts, designed to enhance efficiency and adaptability across various navigation challenges.

6.1. LLM Prompting with Q-Learning

  • Navigation Optimization:
def prompt_navigation_optimization(sensor_data, q_table, action_history):
    # q_table is available for richer prompts (e.g., reporting current Q-values).
    prompt = f"""
    Given the sensor readings: {sensor_data}
    The last action taken: {action_history[-1]}
    What is the optimal next action to avoid obstacles and move towards the goal?
    """
    return send_to_llm(prompt)
  • Deadlock Resolution:
def prompt_deadlock_resolution(recent_states):
    prompt = f"""
    The robot is stuck in a loop with the following recent states: {recent_states}
    Suggest an escape strategy based on historical data and navigation patterns.
    """
    return send_to_llm(prompt)
  • Q-Table Enhancement:
def prompt_q_table_enhancement(low_q_table):
    prompt = f"""
    The following Q-values are underperforming: {low_q_table}
    Suggest optimized reward adjustments.
    """
    return send_to_llm(prompt)
  • Adaptive Exploration vs. Exploitation:
def prompt_adaptive_exploration(exploration_rate):
    prompt = f"""
    Current exploration rate: {exploration_rate}
    Should the robot increase or decrease exploration?
    """
    return send_to_llm(prompt)
  • High-Risk Obstacle Avoidance:
def prompt_high_risk_avoidance(critical_sensor_readings):
    prompt = f"""
    The robot has detected an object at a critically close range: {critical_sensor_readings}
    Recommend an immediate evasive action to avoid a collision.
    """
    return send_to_llm(prompt)
  • Adaptive Speed Adjustment:
def prompt_adaptive_speed(current_speed, obstacle_density):
    prompt = f"""
    Current speed: {current_speed}
    Sensor data shows the following obstacle density: {obstacle_density}
    Should the robot adjust its speed? If so, suggest the optimal speed.
    """
    return send_to_llm(prompt)
  • Learning Rate Optimization:
def prompt_learning_rate_optimization(learning_rate):
    prompt = f"""
    Current learning rate: {learning_rate}
    Should the robot increase or decrease its learning rate?
    """
    return send_to_llm(prompt)

6.2. LLM Prompting with DQN

  • Navigation Optimization:
def prompt_navigation_optimization(sensor_data, action_history):
    prompt = f"""
    Given the following sensor readings: {sensor_data}
    Robot's last actions: {action_history}
    Predict the best next action.
    """
    return send_to_llm(prompt)
  • Deadlock Resolution:
def prompt_deadlock_resolution(recent_states, recent_actions):
    prompt = f"""
    The robot is stuck in a loop with the following recent states: {recent_states}
    The last actions taken were: {recent_actions}
    Based on these patterns, suggest an escape strategy to break the loop and continue towards the goal.
    """
    return send_to_llm(prompt)
  • High-Risk Obstacle Avoidance:
def prompt_high_risk_avoidance(critical_sensor_data, last_decision):
    prompt = f"""
    A high-risk obstacle detected at: {critical_sensor_data}
    Last decision: {last_decision}
    Recommend an immediate evasive action.
    """
    return send_to_llm(prompt)
  • Adaptive Speed Adjustment:
def prompt_adaptive_speed(current_speed, obstacle_density):
    prompt = f"""
    Current speed: {current_speed}
    Sensor data shows the following obstacle density: {obstacle_density}
    Should the robot adjust its speed? If so, suggest the optimal speed.
    """
    return send_to_llm(prompt)
By structuring prompts in this manner, the LLM can provide precise, context-aware guidance, improving real-time decision-making and long-term learning refinement.

7. Preparing LLM Prompts

The optimal approach is to send raw sensor data and action history from the STM32WB55RG to the Raspberry Pi 5 computer, allowing it to handle prompt formatting. This reduces processing overhead on the STM32WB55RG, ensuring it remains focused on real-time navigation and motor control. The Raspberry Pi 5, with its greater computational power, can efficiently structure prompts and optimize LLM interactions, reducing latency and communication overhead. Additionally, keeping prompt formatting on the RPi5 improves debugging and maintainability, allowing updates without modifying the STM32WB55RG firmware. This setup enhances overall system efficiency, ensuring faster decision-making and better adaptability.
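As one way to realize this split, the sketch below shows a plausible compact record that the STM32WB55RG could stream to the Raspberry Pi 5 over a serial link, and how the Pi could unpack it before calling the prompt functions from Section 6; the field layout, port name, and helper usage are assumptions for illustration (using pyserial on the Pi side).
import json
import serial   # pyserial, running on the Raspberry Pi 5

# Hypothetical compact record emitted by the STM32WB55RG firmware, e.g.:
#   {"d": [4.2, 7.8, 3.1], "s": ["Close,Far,Medium"], "a": ["Turn Left"], "r": -1.0, "flag": "deadlock"}
# d: front/left/right ToF distances (cm); s: recent discretized states;
# a: recent actions; r: last reward; flag: reason the LLM is being consulted.

ser = serial.Serial("/dev/ttyAMA0", 115200, timeout=1)   # UART link to the WB55RG

def read_robot_record():
    # Read one newline-terminated JSON record from the microcontroller.
    line = ser.readline().decode("utf-8").strip()
    return json.loads(line) if line else None

def dispatch_prompt(record):
    # Only the fields needed for the flagged scenario are forwarded to the LLM.
    if record["flag"] == "deadlock":
        return prompt_deadlock_resolution(record["s"])
    if record["flag"] == "high_risk":
        return prompt_high_risk_avoidance(record["d"])
    return prompt_navigation_optimization(record["d"], None, record["a"])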

8. LLM on Raspberry Pi 5

Running large language models (LLMs) on a system with the computing power of a Raspberry Pi 5 is challenging due to the limited resources (CPU, RAM, etc.). However, with optimizations and lightweight models, it is possible to run smaller LLMs (see Table 2) or use cloud-based solutions for offloading computation.
GGUF is a modern model file format designed for the efficient storage and inference of machine learning models. It is particularly well suited for LLMs and supports advanced quantization techniques, such as 4-bit quantization.
TinyLlama 1.1B is a compact and efficient LLM designed to provide high performance while being lightweight enough to run on resource-constrained devices. It is a smaller variant of the LLaMA (Large Language Model Meta AI) architecture, optimized for efficiency and speed [41]. With only 1.1 billion parameters, it strikes a balance between efficient reasoning and minimal resource usage, making it ideal for real-time obstacle avoidance in robotic systems. In our project, we have specifically adopted TinyLlama 1.1B for its proven balance of capability and footprint. When deployed on a Raspberry Pi 5 computer, TinyLlama (quantized to 4-bit GGUF format) requires less than 1 GB of RAM and delivers response times under 1 s, ensuring that navigation decisions are made without significant delay. To further improve TinyLlama’s relevance and accuracy in dynamic environments, we integrate a retrieval-augmented generation (RAG) [46] pipeline that retrieves pertinent environmental context and past action records before each inference. This RAG layer allows the model to ground its responses in up-to-date sensor logs and obstacle maps, reducing hallucinations and improving consistency. By combining retrieval from a lightweight on-device knowledge store with TinyLlama’s generative capabilities, we achieve more precise, context-aware action suggestions. Unlike larger models, which demand high memory and computational power, TinyLlama runs smoothly even on the 4 GB version of Raspberry Pi 5, making it a practical choice for real-time decision-making in our project.
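For reference, a minimal sketch of serving the quantized model locally with the llama-cpp-python bindings is shown below; it also implements the send_to_llm helper used by the prompt functions in Section 6. The model path, context size, and sampling parameters are illustrative assumptions.
from llama_cpp import Llama

# Load the 4-bit GGUF TinyLlama model once at startup (the path is an assumption).
llm = Llama(model_path="/home/pi/models/tinyllama-1.1b.Q4_K_M.gguf",
            n_ctx=1024, n_threads=4)

def send_to_llm(prompt, max_tokens=64):
    # Short, low-temperature completions keep responses well under one second.
    result = llm(prompt, max_tokens=max_tokens, temperature=0.2, stop=["\n\n"])
    return result["choices"][0]["text"].strip()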

9. Comparison of Q-Learning/DQN vs. Q-Learning/DQN + LLM

We conducted comprehensive experiments to evaluate the impact of LLM-assisted reinforcement learning in embedded robotic navigation. Specifically, we compared four approaches—Q-learning, DQN, Q-learning + LLM, and DQN + LLM—under identical conditions across both static and dynamic obstacle scenarios. The evaluation focused on key performance metrics, including deadlock recovery rate, time to reach goal, collision rate, and successful navigation attempts.
In our system architecture, the Q-learning or DQN policies are executed on the STM32WB55RG microcontroller, while a lightweight LLM (TinyLlama) runs on an external Raspberry Pi 5 computer. The LLM is activated selectively using a decision function for LLM prompt selection, which intervenes when traditional learning-based navigation shows signs of failure—such as repeated collisions or deadlock scenarios.
The robot was tested in physical environments, including varying obstacle densities and dynamic elements such as moving obstacles. Each test scenario was repeated multiple times to ensure statistical reliability. The following Table 3 summarizes our experimental results across both static and dynamic conditions.
These results clearly demonstrate that LLM-assisted learning significantly improves navigation robustness and efficiency. Integrating the LLM yields clear, consistent gains across every key metric. In static scenarios, LLM augmentation boosts deadlock recovery by 33 percentage points (from 55% to 88%) and cuts the goal-reaching time by a third (from 45 s to 30 s), while reducing collisions by more than half (from 18% to 7%) and raising successful runs from 72% to 91%. In dynamic environments, where adaptability matters most, the LLM’s impact is even greater: deadlock recovery jumps 41 percentage points (40% → 81%), time-to-goal falls by 34% (58 s → 38 s), collision rates drop 14 percentage points (25% → 11%), and the success rate climbs from 66% to 87%. Similar improvements are observed when pairing the LLM with the DQN, underscoring that high-level reasoning from the LLM materially accelerates learning, enhances safety, and boosts overall navigation success.

10. Advantages of the Hybrid Q-Learning + LLM Approach over Classical and Deep RL Methods

10.1. Limitations of Classical Path Planning and RL Methods

While classical techniques like rapidly exploring random trees (RRTs) and swarm intelligence excel in global path planning, and artificial potential fields (APFs) provide reactive local obstacle avoidance, these methods face significant limitations in dynamic, unstructured environments [47]. RRT and swarm-based planners depend on static or semi-static maps, requiring full recomputation when environmental conditions change, which hinders real-time adaptability. For RRT, complex environments increase the node count, potentially exceeding memory capacity, and frequent collision-checking with high-frequency sensor data strains the computational resources [48]. Similarly, swarm-based planners with large populations or complex fitness functions demand excessive memory and processing power, and their convergence speed relies heavily on parameter tuning, risking failure to meet real-time constraints [49]. Although computationally lightweight and suitable for low-power microcontrollers like the STM32WB55RG, the APF struggles with local minima, potentially trapping the system in suboptimal paths, and lacks learning or memory capabilities, limiting its effectiveness in complex scenarios [50,51]. Modern reinforcement learning approaches, such as proximal policy optimization (PPO) and soft actor–critic (SAC), excel in continuous control tasks and complex policies. However, their reliance on large neural networks and backpropagation makes them computationally and memory intensive, rendering them impractical for deployment on resource-constrained embedded systems like the STM32WB55RG.

10.2. Hardware Constraints of Deep RL on STM32WB55RG

Running SAC or PPO directly on the STM32WB55RG microcontroller is not practical due to severe hardware constraints. Both SAC and PPO rely on neural networks with tens of thousands to millions of parameters, typically requiring significant RAM and compute resources for both inference and training [52,53]. The STM32WB55RG, based on the Cortex-M4F core, has only 320 KB of RAM and 1 MB of flash, which is insufficient to store and run the neural networks used in standard implementations of these algorithms. Additionally, the microcontroller’s 64 MHz clock speed and lack of hardware acceleration for floating-point operations make real-time inference infeasible, as even small neural networks would introduce excessive latency [54]. On-device training is effectively impossible due to these memory and computational limitations, and deploying standard SAC or PPO models would require aggressive quantization, pruning, or distillation into extremely tiny networks—likely at the cost of significant performance degradation [55].

10.3. Limitations of Concurrent LLM and Deep RL Execution on Raspberry Pi 5 in Real-World Embedded Robotics

In real-world embedded robotics applications, such as the one we target in this work, computational resources are significantly constrained, and decisions must be made in real time. While deep reinforcement learning algorithms, like PPO and SAC, are powerful, their execution—especially training—involves multiple neural networks, backpropagation, and replay buffers, all of which require considerable computing and memory [56]. Simultaneously, even a small LLM, like TinyLlama (1.1B parameters, 4-bit quantized), demands high CPU usage and consumes over 700 MB of RAM on a Raspberry Pi 5 computer during inference. Attempting to run both components concurrently leads to memory saturation, CPU overload, and latency spikes that are unacceptable for real-time control, often causing system instability or throttling due to thermal and power constraints. Moreover, since our system is designed to operate entirely in a real environment—not in a simulated or cloud-offloaded setting—offline training of PPO or SAC is not feasible. All learning and decision-making must occur on-device and online, further restricting the viability of resource-intensive deep RL methods. This makes our hybrid approach—using lightweight Q-learning on the microcontroller for fast, reactive control and an external LLM for occasional high-level guidance—a practical and scalable solution for embedded autonomous navigation.

10.4. Advantages of the Hybrid Q-Learning + LLM Approach

By contrast, our hybrid Q-learning + LLM approach is designed specifically for obstacle avoidance, not global path planning, offering a significant advantage by combining adaptive, lightweight learning with context-aware reasoning. The tabular or low-complexity Q-learning component enables real-time, on-device learning with minimal computational demand, while the external LLM (hosted on Raspberry Pi 5) supports high-level intervention in challenging scenarios, like deadlocks or repeated collisions. This synergy enables both reactive obstacle avoidance and adaptive replanning, bridging the gap between lightweight embedded control and advanced cognitive reasoning—without overwhelming the constraints of the hardware.

11. Conclusions

This paper presents a pioneering hybrid framework that integrates Q-learning/deep Q-network (DQN) with a locally deployed large language model (LLM) to enhance obstacle avoidance in embedded robotic systems. The STM32WB55RG microcontroller handles real-time decision-making using sensor data, while a Raspberry Pi 5 computer runs a quantized TinyLlama LLM to dynamically refine navigation strategies. The LLM overcomes traditional Q-learning limitations, such as slow convergence and poor adaptability, by analyzing action histories and optimizing decision-making policies in complex, dynamic environments. A selective triggering mechanism ensures efficient LLM intervention, minimizing the computational overhead. Experimental results in dynamic environments demonstrate significant improvements over standalone Q-learning/DQN, including a deadlock recovery rate up to 41 percentage points higher (81% for Q-learning + LLM vs. 40% for Q-learning), a time to goal up to 34% faster (38 s vs. 58 s), and a collision rate up to 14 percentage points lower (11% vs. 25%). These enhancements enable robust, real-time obstacle avoidance on resource-constrained microcontrollers, addressing critical challenges in embedded robotics.
The findings open promising avenues for future research and practical applications. Future studies could focus on optimizing on-device LLM inference for even smaller microcontrollers (e.g., STM32F4 series), integrating federated learning to enable fleets of robots to share navigation strategies without centralizing sensitive data, or exploring multi-modal sensor fusion (e.g., combining ToF with vision) to enhance adaptability. Practically, this framework could revolutionize hospital logistics by enabling service robots to navigate crowded wards and transport medical supplies around moving patients, or improve warehouse automation by allowing robots to dynamically adjust to shifting inventory layouts, potentially reducing operational downtime by up to 30% based on current efficiency gains. Additionally, the approach could be adapted for autonomous delivery drones, ensuring safe navigation in urban environments with unpredictable obstacles.
However, the system’s reliance on an LLM introduces potential biases and ethical ramifications that require careful consideration. The TinyLlama model, trained on diverse text corpora, may inherit biases that could lead to uneven performance, such as misjudging obstacle types (e.g., prioritizing avoidance of certain objects due to skewed training data), posing safety risks in critical applications like healthcare. Ethical concerns include the risk of unintended consequences, such as robots making unsafe decisions in unfamiliar contexts, or privacy issues, if LLM prompts inadvertently leak sensitive environmental data. To mitigate these issues, rigorous bias auditing, transparent uncertainty reporting, and robust fallback policies (e.g., reverting to Q-learning in case of LLM failure) are essential. Addressing these challenges will ensure the framework’s scalability and trustworthiness for real-world deployment, reinforcing its transformative potential for autonomous robotic systems.

Author Contributions

Conceptualization, R.F. and G.O.; methodology, R.F., G.O. and T.D.; software, R.F., G.O. and T.D.; validation, R.F., G.O. and T.D.; formal analysis, R.F., G.O. and T.D.; investigation, R.F., G.O. and T.D.; resources, R.F., G.O. and T.D.; data curation, R.F. and G.O.; writing—original draft preparation, R.F.; writing—review and editing, R.F., G.O. and T.D.; visualization, R.F., G.O. and T.D.; supervision, R.F., G.O. and T.D.; project administration, R.F., G.O. and T.D.; funding acquisition, T.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pang, W.; Zhu, D.; Sun, C. Multi-AUV Formation Reconfiguration Obstacle Avoidance Algorithm Based on Affine Transformation and Improved Artificial Potential Field Under Ocean Currents Disturbance. IEEE Trans. Autom. Sci. Eng. 2023, 21, 1469–1487. [Google Scholar] [CrossRef]
  2. Guo, B.; Guo, N.; Cen, Z. Obstacle Avoidance with Dynamic Avoidance Risk Region for Mobile Robots in Dynamic Environments. IEEE Robot. Autom. Lett. 2022, 7, 5850–5857. [Google Scholar] [CrossRef]
  3. Zhai, L.; Liu, C.; Zhang, X.; Wang, C. Local Trajectory Planning for Obstacle Avoidance of Unmanned Tracked Vehicles Based on Artificial Potential Field Method. IEEE Access 2024, 12, 19665–19681. [Google Scholar] [CrossRef]
  4. Li, J.; Xiong, X.; Yan, Y.; Yang, Y. A Survey of Indoor UAV Obstacle Avoidance Research. IEEE Access 2023, 11, 51861–51891. [Google Scholar] [CrossRef]
  5. Zhang, C.; Zhou, L.; Li, Y.; Fan, Y. A dynamic path planning method for social robots in the home environment. Electronics 2020, 9, 1173. [Google Scholar] [CrossRef]
  6. Yang, B.; Yan, J.; Cai, Z.; Ding, Z.; Li, D.; Cao, Y.; Guo, L. A novel heuristic emergency path planning method based on vector grid map. ISPRS Int. J. Geo-Inf. 2021, 10, 370. [Google Scholar] [CrossRef]
  7. Xiao, S.; Tan, X.; Wang, J. A simulated annealing algorithm and grid map-based UAV coverage path planning method for 3D reconstruction. Electronics 2021, 10, 853. [Google Scholar] [CrossRef]
  8. Guo, J.; Liu, L.; Liu, Q.; Qu, Y. An Improvement of D* Algorithm for Mobile Robot Path Planning in Partial Unknown Environment. In Proceedings of the 2009 Second International Conference on Intelligent Computation Technology and Automation, Changsha, China, 10–11 October 2009. [Google Scholar]
  9. Lin, T. A path planning method for mobile robot based on A and antcolony algorithms. J. Innov. Soc. Sci. Res. 2020, 7, 157–162. [Google Scholar]
  10. Wang, J.; Zhang, Y.; Xia, L. Adaptive Genetic Algorithm Enhancements for Path Planning of Mobile Robots. In Proceedings of the 2010 International Conference on Measuring Technology and Mechatronics Automation, Changsha, China, 13–14 March 2010; pp. 416–419. [Google Scholar]
  11. Saska, M.; Macăs, M.; Přeučil, L.; Lhotská, L. Robot path planning using particle swarm optimization of Ferguson splines. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Prague, Czech Republic, 20–22 September 2006; pp. 833–839. [Google Scholar]
  12. Li, C.-L.; Wang, N.; Wang, J.-F.; Xu, S.-Y. A Path Planing Algorithm for Mobile Robot Based on Particle Swarm. In Proceedings of the 2023 2nd International Symposium on Control Engineering and Robotics (ISCER), Hangzhou, China, 17–19 February 2023; pp. 319–322. [Google Scholar]
  13. Zhao, J.-P.; Gao, X.-W.; Liu, J.-G.; Chen, Y.-Q. Research of path planning for mobile robot based on improved ant colony optimization algorithm. In Proceedings of the 2010 2nd International Conference on Advanced Computer Control, Shenyang, China, 27–29 March 2010; pp. 241–245. [Google Scholar]
  14. Rasheed, J.; Irfan, H. Q-Learning of Bee-Like Robots through Obstacle Avoidance. In Proceedings of the 2024 12th International Conference on Control, Mechatronics and Automation (ICCMA), London, UK, 11–13 November 2024; pp. 166–170. [Google Scholar]
  15. Mohanty, P.K.; Saurabh, S.; Yadav, S.; Pooja; Kundu, S. A Q-Learning Strategy for Path Planning of Robots in Unknown Terrains. In Proceedings of the 2022 1st International Conference on Sustainable Technology for Power and Energy Systems (STPES), Srinagar, India, 4–6 July 2022; pp. 1–6. [Google Scholar]
  16. Kumaar, A.A.N.; Kochuvila, S. Mobile Service Robot Path Planning Using Deep Reinforcement Learning. IEEE Access 2023, 11, 100083–100096. [Google Scholar] [CrossRef]
  17. Masoud, M.; Hami, T.; Kourosh, F. Path Planning and Obstacle Avoidance of a Climbing Robot Subsystem using Q-learning. In Proceedings of the 2024 12th RSI International Conference on Robotics and Mechatronics (ICRoM), Tehran, Iran, 17–19 December 2024; pp. 149–155. [Google Scholar]
  18. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1989. [Google Scholar]
  19. Pieters, M.; Wiering, M.A. Q-learning with experience replay in a dynamic environment. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; pp. 1–8. [Google Scholar]
  20. Du, H.B.; Zhao, H.; Zhang, J.; Wang, J.; Qi, Q. A Path-Planning Approach Based on Potential and Dynamic Q-Learning for Mobile Robots in Unknown Environment. Mob. Robot. Autom. 2022, 2022, 2540546. [Google Scholar]
  21. Hanh, L.D.; Cong, V.D. Path following and avoiding obstacle for mobile robot under dynamic environments using reinforcement learning. J. Robot. Control. 2022, 13, 158–167. [Google Scholar] [CrossRef]
  22. Gharbi, A. A dynamic reward-enhanced Q-learning approach for efficient path planning and obstacle avoidance of mobile robots. Appl. Comput. Inform. 2024. ahead of print. [Google Scholar] [CrossRef]
  23. Ribeiro, T.; Gonçalves, F.; Garcia, I.; Lopes, G.; Ribeiro, A.F. Q-Learning for Autonomous Mobile Robot Obstacle Avoidance. In Proceedings of the 2019 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Porto, Portugal, 24–26 April 2019; pp. 1–7. [Google Scholar]
  24. Li, X. Path planning in dynamic environments based on Q-learning. In Proceedings of the 5th International Conference on Mechanics, Simulation; Control (ICMSC 2023), San Jose, CA, USA, 24–25 June 2023; Volume 63, pp. 222–230. [Google Scholar]
  25. Ha, V.T.; Vinh, V.Q. Experimental Research on Avoidance Obstacle Control for Mobile Robots Using Q-Learning (QL) and Deep Q-Learning (DQL) Algorithms in Dynamic Environments. Actuators 2024, 13, 26. [Google Scholar] [CrossRef]
  26. Tadele, S.B.; Kar, B.; Wakgra, F.G.; Khan, A.U. Optimization of End-to-End AoI in Edge-Enabled Vehicular Fog Systems: A Dueling-DQN Approach. IEEE Internet Things J. 2025, 12, 843–853. [Google Scholar] [CrossRef]
  27. Zhou, X.; Han, G.; Zhou, G.; Xue, Y.; Lv, M.; Chen, A. Hybrid DQN-Based Low-Computational Reinforcement Learning Object Detection with Adaptive Dynamic Reward Function and ROI Align-Based Bounding Box Regression. IEEE Trans. Image Process. 2025, 34, 1712–1725. [Google Scholar] [CrossRef]
  28. Guo, L.; Jia, J.; Chen, J.; Wang, X. QRMP-DQN Empowered Task Offloading and Resource Allocation for the STAR-RIS Assisted MEC Systems. IEEE Trans. Veh. Technol. 2025, 74, 1252–1266. [Google Scholar] [CrossRef]
  29. Luo, C.; Yang, S.X.; Stacey, D.A. Real-time path planning with deadlock avoidance of multiple cleaning robots. In Proceedings of the 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422), Taipei, Taiwan, 14–19 September 2003; Volume 3, pp. 4080–4085. [Google Scholar] [CrossRef]
  30. Li, G.; Han, X.; Zhao, P.; Hu, P.; Nie, L.; Zhao, X. RoboChat: A Unified LLM-Based Interactive Framework for Robotic Systems. In Proceedings of the 2023 5th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI), Hangzhou, China, 1–3 December 2023; pp. 466–471. [Google Scholar]
  31. Zhou, H.; Lin, Y.; Yan, L.; Zhu, J.; Min, H. LLM-BT: Performing Robotic Adaptive Tasks based on Large Language Models and Behavior Trees. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 16655–16661. [Google Scholar]
  32. Tan, R.; Lou, S.; Zhou, Y.; Lv, C. Multi-modal LLM-enabled Long-horizon Skill Learning for Robotic Manipulation. In Proceedings of the 2024 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM), Hangzhou, China, 8–11 August 2024; pp. 14–19. [Google Scholar]
  33. Shuvo, M.I.R.; Alam, N.; Fime, A.A.; Lee, H.; Lin, X.; Kim, J.-H. A Novel Large Language Model (LLM) Based Approach for Robotic Collaboration in Search and Rescue Operations. In Proceedings of the IECON 2024—50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, IL, USA, 3–6 November 2024; pp. 1–6. [Google Scholar]
  34. Alto, V. Building LLM Powered Applications: Create Intelligent Apps and Agents with Large Language Models; Packt Publishing: Birmingham, UK, 2024. [Google Scholar]
  35. Auffarth, B. Generative AI with LangChain: Build Large Language Model (LLM) Apps with Python, ChatGPT, and Other LLMs; Packt Publishing: Birmingham, UK, 2024. [Google Scholar]
  36. Chen, J.; Yang, Z.; Xu, H.G.; Zhang, D.; Mylonas, G. Multi-Agent Systems for Robotic Autonomy with LLMs. arXiv 2024, arXiv:2505.05762. [Google Scholar]
  37. Zu, W.; Song, W.; Chen, R.; Guo, Z.; Sun, F.; Tian, Z.; Pan, W.; Wang, J. Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework. arXiv 2024, arXiv:2311.08244. [Google Scholar]
  38. Mon-Williams, R.; Li, G.; Long, R.; Du, W.; Lucas, C.G. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nat. Mach. Intell. 2025, 7, 592–601. [Google Scholar] [CrossRef]
  39. Tao, Y.; Yang, J.; Ding, D.; Erickson, Z. LAMS: LLM-Driven Automatic Mode Switching for Assistive Teleoperation. In Proceedings of the 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Melbourne, Australia, 4–6 March 2025; pp. 242–251. [Google Scholar]
  40. Tardioli, D.; Matellán, V.; Heredia, G.; Silva, M.F.; Marques, L. (Eds.) ROBOT2022: Fifth Iberian Robotics Conference, Advances in Robotics; Springer Nature: Berlin, Germany, 2023; Volume 2. [Google Scholar]
  41. Zhang, P.; Zeng, G.; Wang, T.; Lu, W. TinyLlama: An Open-Source Small Language Model. arXiv 2024, arXiv:2401.02385. [Google Scholar] [CrossRef]
  42. Lamaakal, L.; Maleh, Y.; El Makkaoui, K.; Ouahbi, I.; Pławiak, P.; Alfarraj, O.; Almousa, M.; El-Latif, A.A.A. Tiny Language Models for Automation and Control: Overview, Potential Applications, and Future Research Directions. Sensors 2025, 25, 1318. [Google Scholar] [CrossRef] [PubMed]
  43. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B: A 7-billion-parameter language model engineered for superior performance and efficiency. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  44. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019. [Google Scholar] [CrossRef]
  45. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. arXiv 2020, arXiv:1909.10351. [Google Scholar]
  46. Su, C.; Wen, J.; Kang, J.; Wang, Y.; Su, Y.; Pan, H.; Zhong, Z.; Hossain, M.S. Hybrid RAG-Empowered Multimodal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-Based Contract Approach. IEEE Internet Things J. 2025, 12, 13428–13440. [Google Scholar] [CrossRef]
  47. Seif, R.; Oskoei, M.A. Mobile Robot Path Planning by RRT* in Dynamic Environments. Int. J. Intell. Syst. Appl. 2015, 7, 24–30. [Google Scholar] [CrossRef]
  48. LaValle, S.M.; Kuffner, J.J. Randomized Kinodynamic Planning. Int. J. Robot. Res. 2001, 20, 378–400. [Google Scholar] [CrossRef]
  49. Kennedy, J.; Eberhart, R. Chapter 5: Particle Swarm Optimization. In Swarm Intelligence; Morgan Kaufmann Publishers: Burlington, MA, USA, 2001; pp. 287–318. ISBN 978-1558605954. [Google Scholar]
  50. Pal, N.S.; Sharma, S. Robot Path Planning using Swarm Intelligence: A Survey. Int. J. Comput. Appl. 2013, 83, 5–12. [Google Scholar] [CrossRef]
  51. Barraquand, J.; Langlois, B.; Latombe, J.-C. Numerical potential field techniques for robot path planning. IEEE Trans. Syst. Man Cybern. 1992, 22, 224–241. [Google Scholar] [CrossRef]
  52. Engstrom, L.; Ilyas, A.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; Madry, A. Implementation Matters in Deep RL: A Case Study on PPO and TRPO. arXiv 2020, arXiv:2005.12729. [Google Scholar]
  53. Lu, J.; Wu, X.; Cao, S.; Wang, X.; Yu, H. An Implementation of Actor-Critic Algorithm on Spiking Neural Network Using Temporal Coding Method. Appl. Sci. 2022, 12, 10430. [Google Scholar] [CrossRef]
  54. Zephyr Project. Nucleo WB55RG—Zephyr Project Documentation. Available online: https://docs.zephyrproject.org/latest/boards/st/nucleo_wb55rg/doc/nucleo_wb55rg.html (accessed on 14 May 2025).
  55. Badalian, K.; Koch, L.; Brinkmann, T.; Picerno, M.; Wegener, M.; Lee, S.-Y.; Andert, J. LExCI: A framework for reinforcement learning with embedded systems. Appl. Intell. 2024, 54, 8384–8398. [Google Scholar] [CrossRef]
  56. Dulac-Arnold, G.; Levine, N.; Mankowitz, D.J.; Li, J.; Paduraru, C.; Gowal, S.; Hester, T. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Mach. Learn. 2021, 110, 2419–2468. [Google Scholar] [CrossRef]
Figure 1. AlphaBot2 robot.
Figure 2. System architecture.
Figure 3. Q-learning and DQN learning.
Figure 4. Data processing pipeline.
Figure 5. Detailed flow diagram of the hybrid decision-making loop.
Table 1. Q-learning parameters.
Parameter | Description | Value
α (Learning Rate) | Controls the weight of new experiences in Q-value updates. Higher values make learning faster but noisier. | 0.1
γ (Discount Factor) | Determines how much future rewards are considered. Closer to 1 means long-term rewards are prioritized. | 0.9
ε (Exploration Rate) | Probability of selecting a random action instead of the best-known action. Decays over time to prioritize exploitation. | Initially 1.0, decays to 0.1
Episodes | Number of training iterations | 10,000
Table 2. Small language models comparison.
Model | Size (Quantized GGUF/ONNX) | Min RAM Needed | Inference Speed (RPi 5) | Best For | Obstacle Avoidance Suitability and Justification
TinyLlama 1.1B [41] | ~700 MB (4-bit GGUF) | 1 GB | 0.5–1.2 s/response | Fast real-time decisions | Best choice for fast, real-time navigation decisions on a 4 GB RPi 5.
Phi-2 (Quantized) [42] | 1.8 GB (4-bit GGUF) | 2 GB | 1.5–3 s/response | Small-scale reasoning | Good for low-power reasoning.
DistilGPT-2 [42] | ~350 MB (ONNX) | 512 MB–1 GB | 0.3–1 s/response | Simple rule-based responses | Can work for basic action suggestions, but lacks deep reasoning.
GPT-2 Small [42] | ~500 MB (ONNX) | 1 GB | 1–2 s/response | Basic decision-making | Similar to DistilGPT-2; can be used for simple command-based decisions.
Mistral 7B (Quantized) [43] | ~3.8 GB (4-bit GGUF) | 5 GB | 2–5 s/response | Better reasoning (8 GB RPi 5 only) | More advanced reasoning, but too slow for real-time use on a 4 GB RPi 5. Works for high-level decision-making on an 8 GB RPi 5.
DistilBERT [44] | ~300 MB (ONNX) | 512 MB–1 GB | 0.2–0.8 s/response | Fast text-based reasoning | Not suitable for obstacle avoidance; designed for NLP tasks.
TinyBERT [45] | ~120 MB (ONNX) | 256 MB–512 MB | 0.1–0.5 s/response | Super lightweight NLP | Not optimized for real-time decisions.
Table 3. Experimental results.
Metric | Q-Learning | Q-Learning + LLM | DQN | DQN + LLM | Improvement (Q → Q + LLM) | Improvement (DQN → DQN + LLM)
Deadlock Recovery Rate (Static) | 55% | 88% | 70% | 92% | +33% | +22%
Deadlock Recovery Rate (Dynamic) | 40% | 81% | 62% | 89% | +41% | +27%
Time to Reach Goal (Static) | 45 s | 30 s | 35 s | 25 s | −33% | −29%
Time to Reach Goal (Dynamic) | 58 s | 38 s | 44 s | 31 s | −34% | −30%
Collision Rate (Static) | 18% | 7% | 12% | 5% | −11% | −7%
Collision Rate (Dynamic) | 25% | 11% | 17% | 8% | −14% | −9%
Successful Navigation Attempts (Static) | 72% | 91% | 85% | 94% | +19% | +9%
Successful Navigation Attempts (Dynamic) | 66% | 87% | 78% | 91% | +21% | +13%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
