Symmetry
  • Article
  • Open Access

17 November 2024

LIRL: Latent Imagination-Based Reinforcement Learning for Efficient Coverage Path Planning

1 School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450052, China
2 School of Information Engineering, Chang’an University, Xi’an 710064, China
3 Department of Computer Science, University of Bristol, Bristol BS8 1QU, UK
* Author to whom correspondence should be addressed.
This article belongs to the Section Computer

Abstract

Coverage path planning (CPP) in unknown environments presents unique challenges that often require the system to maintain a symmetry between exploration and exploitation in order to cover unknown areas efficiently. This paper introduces latent imagination-based reinforcement learning (LIRL), a novel framework that addresses these challenges by integrating three key components within a soft actor–critic architecture: memory-augmented experience replay (MAER), a latent imagination module (LIM), and multi-step prediction learning (MSPL). MAER enhances sample efficiency by prioritizing experience retrieval, LIM facilitates long-term planning via simulated trajectories, and MSPL optimizes the trade-off between immediate rewards and future outcomes through adaptive n-step learning. Operating together, these components create a dynamic equilibrium that enables efficient, adaptive decision-making. We evaluate LIRL across diverse simulated environments and demonstrate substantial improvements over state-of-the-art methods: the agent balances short-term actions with long-term planning and responds symmetrically to environmental changes. The results highlight LIRL’s potential for advancing autonomous CPP in real-world applications such as search and rescue, agricultural robotics, and warehouse automation. Our work contributes to the broader fields of robotics and reinforcement learning, offering insights into integrating memory, imagination, and adaptive learning for complex sequential decision-making tasks.

1. Introduction

Coverage path planning (CPP) is a fundamental problem in robotics with wide-ranging applications, from autonomous vacuum cleaning and precision agriculture to search-and-rescue operations and environmental monitoring [1]. At its core, CPP involves determining a path that symmetrically balances the exploration of uncharted territory with the exploitation of known areas to optimize for efficiency. The ability to maintain this balance—between covering new areas and refining already explored paths—is key to achieving optimal coverage performance in unknown or dynamic environments [2]. While significant strides have been made in CPP for known environments, the challenge of efficient coverage in unknown or partially observable environments remains a critical frontier in robotics research [3].
Recent advancements in deep reinforcement learning (RL) have demonstrated remarkable potential in addressing complex decision-making problems, including path planning [4]. The ability of RL to learn from experience and adapt to new situations makes it particularly appealing for CPP in unknown environments. However, the application of RL to CPP faces several interrelated challenges. Traditional RL approaches often require extensive interaction with the environment to learn effective policies, which can be impractical or costly in real-world scenarios [5]. Moreover, CPP necessitates consideration of long-term consequences of actions, a task that many RL algorithms focused on immediate or short-term rewards struggle to accomplish [6]. The need for rapid adaptation to new, unseen environments further complicates the application of current RL approaches to CPP [7]. Additionally, in real-world scenarios, the entire environment is rarely observable at once, requiring strategies to deal with partial information [8].
To address these challenges, we propose LIRL (latent imagination-based reinforcement learning), a novel framework that integrates memory augmentation, latent imagination, and multi-step prediction learning to enhance the efficiency and adaptability of CPP. Our method builds upon recent developments in model-based RL [9] and memory-augmented neural networks [10], extending them to the specific challenges of coverage path planning.
The LIRL framework comprises several interconnected components designed to work in synergy. At its core is a memory-augmented experience replay mechanism that leverages coverage efficiency metrics to prioritize learning from the most informative experiences. This is coupled with an attention-based memory network that enables the agent to store and retrieve critical decision points, enhancing its ability to make informed choices in complex environments [11]. A key innovation in LIRL is the latent imagination module, inspired by recent work on world models [12]. This module develops a latent space model allowing the agent to simulate potential trajectories without direct interaction with the environment, significantly improving sample efficiency and enabling reasoning about long-term consequences of actions [13].
To balance short-term rewards with long-term planning, LIRL implements an adaptive n-step learning algorithm. This approach mitigates the challenges of credit assignment in CPP tasks, where the benefits of actions may only become apparent after several steps [14]. The framework is completed by a policy network that integrates current state information, retrieved memories, and imagined future states to output action probabilities, and an environment interaction module that collects real experiences to update the memory and train the models.
The LIRL agent operates in a continuous cycle of perception, imagination, and action. At each time step, it observes the current state, retrieves relevant past experiences, and uses its latent imagination module to simulate potential future trajectories. Based on this rich combination of real and imagined experiences, the policy network decides on the next action. The resulting experience is then used to update the agent’s models and memory, creating a feedback loop that continuously enhances performance.
We evaluate LIRL on a diverse range of simulated CPP scenarios, encompassing both structured environments like indoor floor plans and unstructured terrains such as outdoor agricultural fields. Our comprehensive experiments demonstrate significant improvements in coverage efficiency, adaptation speed, and robustness compared to state-of-the-art methods [15].
The main contributions of this paper are as follows:
  • A novel RL framework, LIRL, that integrates memory augmentation, latent imagination, and multi-step learning for efficient CPP in unknown environments.
  • A prioritized experience replay mechanism based on coverage efficiency metrics, enhancing learning from the most informative experiences.
  • A latent imagination module that enables efficient planning and reasoning about long-term consequences in CPP tasks.
  • An adaptive n-step learning algorithm tailored for the unique challenges of CPP.
  • Comprehensive evaluations demonstrating LIRL’s effectiveness across various CPP scenarios, including comparisons with state-of-the-art methods and ablation studies.
The remainder of this paper is structured as follows. Section 2 reviews related work in CPP, RL, and memory-augmented neural networks. Section 3 presents a detailed description of the LIRL framework, including the formulation of the CPP problem as a partially observable Markov decision process (POMDP) and the architecture of each component. Section 4 describes our experimental setup and results, including performance comparisons and ablation studies. Section 5 discusses the implications of our work, its limitations, and directions for future research.

3. Method

In this section, we present the LIRL framework for efficient CPP in unknown environments. We first formulate the CPP problem as a POMDP and then describe the key components of our approach: the memory-augmented experience replay, the latent imagination module, and the multi-step prediction learning algorithm.

3.1. Problem Formulation

The CPP problem is formulated as a POMDP with discrete time steps $t \in \{1, \dots, T\}$. At each time step, the agent executes a continuous action $a_t \sim \pi(a_t \mid s_t)$ according to the policy $\pi$, while receiving the state $s_t$ and reward $r_t$ from the environment. The state $s_t$ represents the environment, including its geometry, the visited points, the agent’s pose, and uncertainty estimates from the learned world model (Section 3.3). The policy is represented by a neural network trained to maximize the expected discounted return $J$, defined as follows:
$J = \mathbb{E}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]$
Figure 1 illustrates the RL loop, in which observations, actions, and rewards are collected in a replay buffer, from which training batches are sampled for gradient-based learning. This setup enables the agent to learn an effective coverage strategy by interacting with the environment and adapting to a range of scenarios encountered during the CPP task.
Figure 1. Schematic of the proposed LIRL framework for efficient CPP.

3.1.1. State Space

To facilitate efficient coverage learning, the observation space must be appropriately mapped for input into the neural network policy. Visited points are represented as a coverage map, as shown in Figure 2. The environment is discretized into a high-resolution 2D grid, ensuring accurate representation. To address scalability for larger areas, we use multi-scale maps, crucial for encoding both local details and global structure. This representation enables the policy to capture spatial relationships at multiple levels of granularity, improving the agent’s ability to plan coverage paths across diverse environments.
Figure 2. Illustration of coverage maps at multiple scales.
Inspired by multi-layered maps for search-and-rescue operations [41], we use multi-scale maps for coverage and obstacle representation to enhance scalability. Our approach extracts multiple local neighborhoods of increasing size and decreasing resolution while maintaining a fixed grid size. Specifically, we begin with a local square crop $M^{(1)}$ of side length $d_1$ at the finest scale. The multi-scale representation $M = \{M^{(i)}\}_{i=1}^{m}$ consists of increasingly larger areas based on a fixed scale factor $s$, where the side length of each map $M^{(i)}$ is $d_i = s \cdot d_{i-1}$.
The resolution at the finest scale is chosen to capture sufficient detail for local navigation, while larger scales support long-term planning where less precision is required. This approach can represent an area of size $d \times d$ in $O(\log d)$ scales, with $O(wh \log d)$ grid cells, where $w$ and $h$ are fixed dimensions independent of $d$. This significantly improves efficiency compared to a single fixed-resolution map with $O(d^2)$ cells.
We use a multi-scale map $M_c$ for coverage representation (Figure 2), enhancing the agent’s ability to balance local detail with global context, thereby improving coverage path planning efficiency.
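To make the multi-scale representation concrete, the following Python sketch extracts several fixed-size crops of increasing extent and decreasing resolution around the agent from a global coverage grid. This is an illustrative sketch only; function and parameter names (extract_multiscale, base_size, scale_factor) are our own and not taken from the paper’s implementation.

```python
import numpy as np

def extract_multiscale(coverage: np.ndarray, center: tuple[int, int],
                       base_size: int = 32, scale_factor: int = 2,
                       num_scales: int = 3) -> list[np.ndarray]:
    """Return `num_scales` crops around `center`, each resampled to
    (base_size, base_size): the i-th crop covers base_size * scale_factor**i
    cells per side, so spatial extent grows while resolution drops per scale."""
    maps = []
    cy, cx = center
    for i in range(num_scales):
        stride = scale_factor ** i
        side = base_size * stride
        half = side // 2
        # Pad so crops near the border stay well defined (padded cells = unknown/0).
        padded = np.pad(coverage, half, mode="constant", constant_values=0.0)
        crop = padded[cy:cy + side, cx:cx + side]
        # Downsample by block-averaging back to the fixed grid size.
        crop = crop.reshape(base_size, stride, base_size, stride).mean(axis=(1, 3))
        maps.append(crop)
    return maps

if __name__ == "__main__":
    grid = (np.random.rand(256, 256) > 0.7).astype(np.float32)  # toy coverage map
    scales = extract_multiscale(grid, center=(128, 128))
    print([m.shape for m in scales])  # three (32, 32) maps of growing extent
```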
To respond rapidly to short-term obstacles, sensor data are incorporated into the input representation. Depth measurements from the LiDAR sensor are normalized to the range [ 0 , 1 ] and concatenated into the state vector S, ensuring consistent scaling across different configurations. To improve decision-making robustness, uncertainty estimates from the world model are also included in S, providing a comprehensive representation of the environment’s structure and confidence in the agent’s perception. This integrated approach allows the agent to adapt based on both immediate sensory input and the reliability of its predictive model, thereby enhancing navigation efficiency in dynamic or partially observable environments.

3.1.2. Action Space

Our policy model directly predicts the agent’s control signals, enabling adaptation to specific environmental conditions and avoiding the limitations of discrete action spaces. We consider an agent that controls its linear velocity $v$ and angular velocity $\omega$, with actions normalized to the range $[-1, 1]$ based on their respective maximum values:
$a_t = \left( \dfrac{v_t}{v_{\max}},\ \dfrac{\omega_t}{\omega_{\max}} \right)$
The normalized velocities determine the direction of travel and rotation, providing fine-grained control over movements and enabling more efficient and smooth coverage paths compared to discrete alternatives. This approach also facilitates policy transfer to different robotic platforms with varying kinematic constraints.
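As a minimal illustration of this normalization, the sketch below maps a policy output in $[-1, 1]^2$ back to platform velocity commands; the limits V_MAX and OMEGA_MAX are placeholder values, not figures from the paper.

```python
import numpy as np

V_MAX = 0.5      # m/s, placeholder platform limit
OMEGA_MAX = 1.0  # rad/s, placeholder platform limit

def denormalize_action(a: np.ndarray) -> tuple[float, float]:
    """Map a policy output a = (a_v, a_w) in [-1, 1]^2 to (v, omega)."""
    a = np.clip(a, -1.0, 1.0)
    return float(a[0] * V_MAX), float(a[1] * OMEGA_MAX)

v, omega = denormalize_action(np.array([0.8, -0.25]))
print(v, omega)  # 0.4 m/s forward, -0.25 rad/s turn
```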

3.1.3. Reward Function

To achieve comprehensive coverage of free space, we introduce a reward mechanism based on the exploration of previously uncovered areas. The reward term $R_{\mathrm{area}}$ is proportional to the newly covered area $A_{\mathrm{new}}$ at each time step. It is normalized to the range $[0, 1]$ and scaled by a factor $\lambda_{\mathrm{area}}$, resulting in the following reward formulation:
$R_{\mathrm{area}} = \lambda_{\mathrm{area}} \dfrac{A_{\mathrm{new}}}{2 r v_{\max} \Delta t},$
where $r$ is the agent’s radius and $\Delta t$ is the time step size.
Exclusively maximizing the coverage reward may discourage the agent from revisiting previously explored paths, potentially leading to suboptimal coverage patterns characterized by residual gaps. Such gaps impede complete coverage and necessitate costly revisits in later stages. To address this, we introduce an additional reward term based on minimizing the total variation (TV) of the coverage map, which reduces coverage boundaries and promotes a more uniform coverage pattern. Given a 2D signal $x$, the discrete isotropic total variation is defined as:
$\mathrm{TV}(x) = \sum_{i,j} \sqrt{(x_{i+1,j} - x_{i,j})^2 + (x_{i,j+1} - x_{i,j})^2}$
The incremental TV reward, $R_{\mathrm{TV}}$, is derived from the change in TV between consecutive time steps. This reward encourages smoother and more continuous coverage patterns by assigning positive rewards for reductions in TV and negative rewards for increases. The incremental reward is normalized by the maximum possible increase in TV per time step, expressed as:
$R_{\mathrm{TV}}(t) = -\lambda_{\mathrm{TV}} \dfrac{\mathrm{TV}(C_t) - \mathrm{TV}(C_{t-1})}{2 v_{\max} \Delta t},$
where $\lambda_{\mathrm{TV}}$ is chosen such that $|R_{\mathrm{TV}}| < |R_{\mathrm{area}}|$ on average, ensuring that standing still is not an optimal strategy. This combined reward structure balances immediate coverage gains with long-term path planning efficiency, promoting effective and complete coverage.
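The sketch below illustrates one way to compute the two reward terms from binary coverage maps at consecutive steps, following the formulas above. The λ values and the helper names (total_variation, coverage_rewards) are illustrative assumptions, and the sign convention matches the stated intent that TV reductions are rewarded.

```python
import numpy as np

def total_variation(x: np.ndarray) -> float:
    """Discrete isotropic total variation of a 2D map."""
    dx = np.diff(x, axis=0)[:, :-1]   # vertical differences, trimmed to matching shape
    dy = np.diff(x, axis=1)[:-1, :]   # horizontal differences
    return float(np.sqrt(dx ** 2 + dy ** 2).sum())

def coverage_rewards(cov_prev, cov_curr, cell_area, r_agent, v_max, dt,
                     lam_area=1.0, lam_tv=0.2):
    """Area reward for newly covered cells plus incremental TV reward."""
    a_new = (cov_curr > cov_prev).sum() * cell_area            # newly covered area A_new
    r_area = lam_area * a_new / (2.0 * r_agent * v_max * dt)   # normalized by max sweep
    delta_tv = total_variation(cov_curr) - total_variation(cov_prev)
    r_tv = -lam_tv * delta_tv / (2.0 * v_max * dt)             # reward TV reductions
    return r_area, r_tv
```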

3.2. Memory-Augmented Experience Replay

The Memory-Augmented Experience Replay (MAER) plays a pivotal role in our LIRL framework, helping to maintain symmetry between past experiences and future predictions. This mechanism enables the agent to retrieve and prioritize experiences symmetrically, focusing on both successful explorations and suboptimal paths that need improvement. This balance enhances both sample efficiency and long-term planning capabilities, allowing the agent to continuously refine its strategy for efficient coverage.
In our MAER, experiences are stored as tuples $(o_t, a_t, r_t, o_{t+1})$, where $o_t$ represents the observation at time $t$, $a_t$ is the action taken, $r_t$ is the reward received, and $o_{t+1}$ is the resulting observation. This formulation allows the agent to learn from past experiences while accounting for the partial observability inherent in many CPP scenarios.
To prioritize the most informative experiences, we introduce a novel sampling mechanism based on coverage efficiency. The probability of sampling an experience i is given by:
$p_i = \dfrac{(\Delta C_i / \Delta t_i + \epsilon)^{\alpha}}{\sum_j (\Delta C_j / \Delta t_j + \epsilon)^{\alpha}}$
Here, $\Delta C_i$ represents the increase in covered area resulting from the experience, and $\Delta t_i$ is the time taken to achieve this coverage. The term $\epsilon$ is a small constant (e.g., $10^{-5}$) that ensures non-zero sampling probabilities for all experiences, preventing rarely sampled experiences from being completely ignored. The exponent $\alpha$ controls the strength of prioritization, with higher values increasing the focus on high-efficiency experiences. In our experiments, we found $\alpha = 0.6$ to provide a good balance between prioritization and diversity in sampling.
This prioritization scheme ensures that experiences leading to efficient coverage are more likely to be sampled, allowing the agent to learn more effectively from successful strategies. However, to prevent overfitting to a narrow set of experiences, we combine this prioritized sampling with a standard uniform sampling approach, using a mixture coefficient β (typically set to 0.7 in our experiments) to balance between prioritized and uniform sampling.
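As a rough sketch of this sampling scheme, the function below computes coverage-efficiency priorities and mixes prioritized with uniform sampling using the quoted values α = 0.6 and β = 0.7. The buffer layout (parallel arrays of coverage gains and durations) and the function name are assumptions for illustration.

```python
import numpy as np

def sample_indices(delta_cov, delta_t, batch_size, alpha=0.6, beta=0.7,
                   eps=1e-5, rng=None):
    """Mix coverage-efficiency-prioritized sampling with uniform sampling.
    delta_cov[i], delta_t[i]: coverage gain and duration of stored experience i."""
    rng = rng or np.random.default_rng()
    prio = (delta_cov / delta_t + eps) ** alpha
    p = prio / prio.sum()
    n_prio = int(round(beta * batch_size))                       # prioritized share
    idx_prio = rng.choice(len(p), size=n_prio, p=p, replace=True)
    idx_unif = rng.integers(0, len(p), size=batch_size - n_prio)  # uniform share
    return np.concatenate([idx_prio, idx_unif])

# toy usage with random coverage statistics
dc = np.random.rand(1000)
dt = np.random.rand(1000) + 0.1
idx = sample_indices(dc, dt, batch_size=256)
```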
To further enhance the agent’s ability to leverage past experiences, we implement an attention-based memory retrieval mechanism. This mechanism allows the agent to selectively recall relevant information from its memory when making decisions. The retrieved memory at time t is computed as a weighted sum of memory entries:
$m_t = \sum_{i=1}^{K} w_i M_i$
where M i represents the i-th memory entry, K is the total number of memory entries, and w i are attention weights computed as:
$w_i = \dfrac{\exp\!\big(f(o_t)^{\top} g(M_i)\big)}{\sum_{j=1}^{K} \exp\!\big(f(o_t)^{\top} g(M_j)\big)}$
In this equation, f and g are embedding functions that map observations and memory entries to a common embedding space. Specifically, f is implemented as a multi-layer perceptron that encodes the current observation o t , while g is another multi-layer perceptron that encodes each memory entry M i . The dot product f ( o t ) T g ( M i ) measures the relevance of memory entry M i to the current observation o t . The softmax operation normalizes these relevance scores to produce the attention weights w i .
This attention mechanism allows the agent to focus on the most relevant past experiences when making decisions, effectively increasing the context available for decision-making beyond the current observation. By combining information from the current observation with selectively retrieved memories, the agent can make more informed decisions, particularly in partially observable environments where the current observation alone may be insufficient.
The retrieved memory m t is then concatenated with the current observation o t to form an augmented state representation s t = [ o t ; m t ] . This augmented state is used as input to the policy network, allowing the agent to condition its actions on both current observations and relevant past experiences.
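A compact PyTorch sketch of this attention read is shown below: two MLP embeddings play the roles of f and g, dot-product scores are normalized by a softmax, and the weighted sum of memory entries is concatenated with the observation. Layer sizes and class names are illustrative choices, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class MemoryRetrieval(nn.Module):
    """Attention read over K stored memory entries (the w_i and m_t expressions above)."""
    def __init__(self, obs_dim: int, mem_dim: int, embed_dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(obs_dim, embed_dim), nn.ReLU(),
                               nn.Linear(embed_dim, embed_dim))  # embeds o_t
        self.g = nn.Sequential(nn.Linear(mem_dim, embed_dim), nn.ReLU(),
                               nn.Linear(embed_dim, embed_dim))  # embeds M_i

    def forward(self, obs: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # obs: (obs_dim,), memory: (K, mem_dim)
        scores = self.g(memory) @ self.f(obs)          # (K,) relevance f(o_t)^T g(M_i)
        w = torch.softmax(scores, dim=0)               # attention weights w_i
        m_t = (w.unsqueeze(1) * memory).sum(dim=0)     # retrieved memory m_t
        return torch.cat([obs, m_t], dim=-1)           # augmented state [o_t; m_t]

retriever = MemoryRetrieval(obs_dim=16, mem_dim=16)
s_aug = retriever(torch.randn(16), torch.randn(32, 16))
print(s_aug.shape)  # torch.Size([32])
```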
By integrating prioritized sampling and attention-based memory retrieval, our MAER system significantly enhances the agent’s ability to learn efficient coverage strategies. It allows the agent to focus on the most informative experiences while also leveraging relevant past information, thereby improving sample efficiency and enabling more effective long-term planning in CPP tasks.

3.3. Latent Imagination Module

The Latent Imagination Module (LIM) is a key component of our LIRL framework, designed to enable the agent to reason about potential future states without direct environment interaction. This capability is particularly crucial in CPP tasks, where efficient exploration of unknown environments is paramount. The LIM is based on a variational world model that learns to encode observations, predict future states, and reconstruct observations from latent states.
Our LIM employs a variational world model consisting of three main components: an encoder, a dynamics model, and a decoder. This module introduces a critical symmetry between the agent’s real and imagined experiences, enabling it to reason about potential future states without direct interaction with the environment. The encoder, denoted as q ϕ ( z t | o t ) , maps an observation o t at time t to a distribution over latent states z t . The dynamics model, p θ ( z t + 1 | z t , a t ) , predicts the next latent state z t + 1 given the current latent state z t and action a t . Finally, the decoder, p ψ ( o t | z t ) , reconstructs the observation o t from the latent state z t . The symmetrical structure of the LIM allows the agent to balance immediate, sensory-based decision-making with future-oriented planning by imagining potential trajectories. This symmetry between real and imagined experiences fosters greater efficiency in path planning and long-term coverage strategies.
The encoder q ϕ ( z t | o t ) is implemented as a convolutional neural network followed by a fully connected layer, outputting the mean and diagonal covariance of a Gaussian distribution over the latent state. This allows the model to capture uncertainty in the encoding process. The dynamics model p θ ( z t + 1 | z t , a t ) is a recurrent neural network that takes the current latent state and action as input and outputs the parameters of a Gaussian distribution over the next latent state. This recurrent structure enables the model to capture temporal dependencies in the environment dynamics. The decoder p ψ ( o t | z t ) is implemented as a deconvolutional network that reconstructs the observation from the latent state.
To train this world model, we maximize the evidence lower bound (ELBO), which is derived from the variational inference framework:
$\mathcal{L}_{\mathrm{WM}} = \mathbb{E}_{q_{\phi}(z_{1:T} \mid o_{1:T})}\!\left[ \sum_{t=1}^{T} \Big( \log p_{\psi}(o_t \mid z_t) - \beta\, D_{\mathrm{KL}}\big( q_{\phi}(z_t \mid o_t)\,\|\,p_{\theta}(z_t \mid z_{t-1}, a_{t-1}) \big) \Big) \right]$
In this equation, T represents the length of the trajectory, and the expectation is taken with respect to the variational posterior q ϕ ( z 1 : T | o 1 : T ) . The first term, log p ψ ( o t | z t ) , encourages the model to accurately reconstruct observations from latent states. The second term is the Kullback–Leibler (KL) divergence between the encoded distribution q ϕ ( z t | o t ) and the predicted distribution p θ ( z t | z t 1 , a t 1 ) , which acts as a regularizer to ensure that the encoded and predicted latent states are consistent.
The hyperparameter β balances the trade-off between reconstruction accuracy and regularization. A higher β encourages the model to learn a more compressed and disentangled latent representation, while a lower β prioritizes reconstruction accuracy. In our experiments, we found that a value of β = 0.5 provides a good balance, allowing for accurate predictions while maintaining a structured latent space.
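For illustration, the sketch below shows one ELBO training step for a variational world model with a Gaussian encoder, a recurrent dynamics prior, and a decoder, using β = 0.5 as stated. The architecture is deliberately simplified to vector observations and Gaussian decoder error (the paper uses convolutional/deconvolutional networks); all class and layer choices here are assumptions.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class WorldModel(nn.Module):
    """Simplified variational world model: encoder q(z|o), dynamics p(z'|z,a), decoder p(o|z)."""
    def __init__(self, obs_dim, act_dim, z_dim=32, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))        # mean and log-std
        self.dyn_rnn = nn.GRUCell(z_dim + act_dim, hidden)            # recurrent dynamics
        self.dyn_out = nn.Linear(hidden, 2 * z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, obs_dim))
        self.z_dim, self.hidden = z_dim, hidden

    def elbo(self, obs, act, beta=0.5):
        """obs: (T, obs_dim), act: (T, act_dim); returns the negative ELBO to minimize."""
        T = obs.shape[0]
        h = torch.zeros(self.hidden)
        z_prev = torch.zeros(self.z_dim)
        loss = 0.0
        for t in range(T):
            mu_q, logstd_q = self.enc(obs[t]).chunk(2)
            q = D.Normal(mu_q, logstd_q.exp())                        # posterior q_phi(z_t | o_t)
            if t == 0:
                prior = D.Normal(torch.zeros(self.z_dim), torch.ones(self.z_dim))
            else:
                inp = torch.cat([z_prev, act[t - 1]]).unsqueeze(0)
                h = self.dyn_rnn(inp, h.unsqueeze(0)).squeeze(0)
                mu_p, logstd_p = self.dyn_out(h).chunk(2)
                prior = D.Normal(mu_p, logstd_p.exp())                # p_theta(z_t | z_{t-1}, a_{t-1})
            z = q.rsample()
            recon = self.dec(z)
            loss = loss + ((recon - obs[t]) ** 2).sum()               # -log p_psi(o_t | z_t), up to constants
            loss = loss + beta * D.kl_divergence(q, prior).sum()      # beta-weighted KL regularizer
            z_prev = z
        return loss / T

wm = WorldModel(obs_dim=16, act_dim=2)
loss = wm.elbo(torch.randn(10, 16), torch.randn(10, 2))
loss.backward()
```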
Once trained, the LIM enables the agent to imagine potential future trajectories by sampling from the learned dynamics model. Starting from an initial latent state z 0 , we can generate a sequence of imagined latent states and corresponding observations:
$z_{t+1} \sim p_{\theta}(z_{t+1} \mid z_t, a_t)$
$\hat{o}_{t+1} = \mathbb{E}_{p_{\psi}(o_{t+1} \mid z_{t+1})}\!\left[ o_{t+1} \right]$
Here, o ^ t + 1 represents the imagined observation at time t + 1 . By generating multiple such trajectories, the agent can evaluate different action sequences and their potential outcomes without actually executing them in the environment. This imagination-based planning allows the agent to reason about the long-term consequences of its actions, which is particularly beneficial in CPP tasks where efficient coverage often requires considering multi-step strategies.
The imagined trajectories are used to augment the real experiences in the replay buffer, effectively increasing the sample efficiency of the learning process. However, to prevent the agent from overfitting to potentially inaccurate imagined experiences, we introduce an imagination ratio $\gamma_{\mathrm{im}}$ that controls the proportion of imagined experiences used in each training batch. Typically, we set $\gamma_{\mathrm{im}} = 0.3$, meaning that 30% of each training batch consists of imagined experiences, while the remaining 70% are real experiences.
Furthermore, to ensure that the imagined experiences remain grounded in reality, we periodically update the initial latent state $z_0$ used for imagination based on real observations. This update frequency is controlled by a hyperparameter $\tau_{\mathrm{update}}$, which we typically set to 10 environment steps.
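The sketch below illustrates how imagined rollouts and the mixing ratio γ_im = 0.3 could be wired together. It assumes the simplified WorldModel sketch above and a placeholder policy callable; none of these names come from the paper’s code.

```python
import torch

@torch.no_grad()
def imagine_rollout(wm, policy, o0, horizon=5):
    """Roll the learned dynamics forward from a real observation o0,
    returning imagined (observation, action) pairs without touching the environment."""
    mu, _ = wm.enc(o0).chunk(2)
    z = mu                                     # start from the posterior mean of z_0
    h = torch.zeros(wm.hidden)
    traj = []
    for _ in range(horizon):
        a = policy(wm.dec(z))                  # act on the decoded (imagined) observation
        h = wm.dyn_rnn(torch.cat([z, a]).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        mu_p, logstd_p = wm.dyn_out(h).chunk(2)
        z = torch.distributions.Normal(mu_p, logstd_p.exp()).sample()
        traj.append((wm.dec(z), a))            # imagined next observation and action
    return traj

def mixed_batch(real_batch, imagined_batch, gamma_im=0.3):
    """Compose a training batch with roughly 30% imagined and 70% real transitions."""
    n_im = int(gamma_im * len(real_batch))
    return imagined_batch[:n_im] + real_batch[: len(real_batch) - n_im]
```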
By integrating the latent imagination module into our LIRL framework, we enable the agent to learn more efficient coverage strategies through a combination of real and imagined experiences. This approach significantly enhances the agent’s ability to plan and reason about long-term consequences in CPP tasks, particularly in unknown or partially observable environments where direct exploration can be costly or time-consuming.

3.4. Multi-Step Prediction Learning

The multi-step prediction learning (MSPL) component of our LIRL framework introduces a crucial symmetry in the prediction horizon, ensuring that the agent considers both the short-term and long-term outcomes of its actions. MSPL employs an adaptive n-step learning algorithm that dynamically adjusts the prediction horizon based on the current state and the agent’s uncertainty. This adaptive symmetry lets the agent weigh the immediate benefits of actions against their delayed effects, integrating current rewards with future gains to achieve efficient coverage across diverse environments.
At the core of MSPL is the n-step return, which provides a bridge between one-step temporal difference (TD) learning and Monte Carlo methods. The n-step return for a state s t is defined as:
$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n})$
where $\gamma \in [0, 1]$ is the discount factor, $r_{t+k}$ is the reward received $k$ steps after time $t$, and $V(s_{t+n})$ is the estimated value of the state encountered after $n$ steps. This formulation allows the agent to consider rewards over multiple time steps while still bootstrapping from the value function estimate.
The choice of n in the n-step return represents a trade-off. Larger values of n incorporate more actual rewards and can lead to faster learning in some cases, but they also introduce more variance into the estimates. Smaller values of n result in lower variance but may suffer from bias due to inaccurate value function estimates. In traditional RL, n is often a fixed hyperparameter. However, in our MSPL approach, we adaptively determine n based on the agent’s current uncertainty and the temporal difference error.
To implement this adaptive approach, we maintain a set of n-step returns for different values of $n$, typically ranging from 1 to a maximum value $N_{\max}$ (e.g., $N_{\max} = 10$). At each time step, we compute the temporal difference error for each n-step return:
$\delta_t^{(n)} = G_t^{(n)} - V(s_t)$
where V ( s t ) is the current estimate of the state value function. The temporal difference error δ t ( n ) provides a measure of how much the n-step return differs from the current value estimate.
We then select the value of n that minimizes the absolute temporal difference error:
$n_t = \arg\min_{n} \big| \delta_t^{(n)} \big|$
This selection criterion balances between bias and variance. When the value function estimates are accurate, shorter n-step returns (smaller n) will typically have smaller temporal difference errors. Conversely, when value function estimates are inaccurate, longer n-step returns may provide better estimates.
Once the appropriate n t is selected, we update the value function using the corresponding n-step return:
$V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t^{(n_t)}$
where α is the learning rate. This update rule moves the value function estimate towards the n-step return that is deemed most appropriate for the current state.
To further enhance the stability and efficiency of learning, we incorporate an uncertainty estimate into our n-step selection process. We maintain an uncertainty measure U ( s t ) for each state, updated using the Bellman error:
$U(s_t) \leftarrow (1 - \beta)\, U(s_t) + \beta\, \big| \delta_t^{(1)} \big|$
where β is a smoothing factor (typically set to 0.1 in our experiments). This uncertainty estimate is then used to modulate the n-step selection:
$n_t = \arg\min_{n} \Big( \big| \delta_t^{(n)} \big| + \lambda\, U(s_t) \Big)$
where λ is a hyperparameter that controls the influence of the uncertainty estimate. Higher values of λ encourage the use of shorter n-step returns in states with high uncertainty, promoting more conservative updates in these states.
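A small sketch of the adaptive selection is given below. It computes the n-step returns for n = 1..N_max and picks the horizon with the smallest penalized temporal difference error. Note one assumption: here the uncertainty penalty is weighted by n so that high uncertainty actually favors shorter horizons, consistent with the stated intent (a penalty constant in n would not change the argmin); the value estimates are placeholder inputs.

```python
import numpy as np

def n_step_returns(rewards, next_values, gamma=0.99, n_max=10):
    """G_t^(n) = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}) for n = 1..n_max,
    given the next n_max rewards and bootstrapped state values from time t."""
    returns, acc = [], 0.0
    for n in range(1, n_max + 1):
        acc += gamma ** (n - 1) * rewards[n - 1]
        returns.append(acc + gamma ** n * next_values[n - 1])
    return np.array(returns)

def select_n(returns, v_t, uncertainty, lam=0.5):
    """Pick n minimizing |delta_t^(n)| plus an uncertainty penalty.
    The penalty grows with n (assumed) so uncertain states favor short horizons."""
    n = np.arange(1, len(returns) + 1)
    td_errors = np.abs(returns - v_t)
    return int(np.argmin(td_errors + lam * uncertainty * n)) + 1

# toy usage with made-up rewards and value estimates
rs = np.full(10, 0.1)
vs = np.linspace(1.0, 0.5, 10)
G = n_step_returns(rs, vs)
n_t = select_n(G, v_t=0.9, uncertainty=0.2)
```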
The MSPL approach is particularly beneficial in CPP tasks, where the consequences of actions may not be immediately apparent. For instance, a decision to explore a particular area might only yield significant rewards after several time steps when a large unexplored region is discovered. By adaptively choosing the n-step return, our agent can better capture these delayed effects while maintaining stable learning.
Moreover, the adaptive nature of MSPL allows it to automatically adjust to different phases of the learning process. In the early stages of learning, when value estimates are likely to be inaccurate, it may favor longer n-step returns to accelerate learning. As the agent’s estimates improve, it can shift towards shorter n-step returns for more fine-grained updates.
The MSPL component integrates seamlessly with the other elements of our LIRL framework. The n-step returns are computed using both real experiences from the memory-augmented experience replay and imagined trajectories from the latent imagination module. This integration allows the agent to learn from a rich combination of actual and simulated experiences, further enhancing its ability to develop efficient coverage strategies.
By employing this adaptive multi-step prediction learning approach, our LIRL framework can effectively balance short-term and long-term predictions, leading to more efficient and robust learning in complex CPP tasks. The dynamic adjustment of the prediction horizon enables the agent to capture delayed effects of actions while maintaining stable learning, a crucial capability for effective coverage path planning in unknown environments.

3.5. Policy Learning Module

The Policy Learning Module is the cornerstone of our LIRL framework, integrating the MAER, LIM, and MSPL components to learn an effective policy for CPP. The module operates by maintaining a symmetrical balance between exploration and exploitation, leveraging the SAC algorithm’s sample efficiency and stability in continuous action spaces. The integration of past memory, future imagination, and multi-step prediction creates a harmonious feedback loop that preserves symmetry between short-term rewards and long-term planning. This equilibrium is critical in ensuring comprehensive coverage in unknown environments, where balanced decision-making is key to success.
In our framework, the policy π ϕ ( a | s ) is parameterized by a neural network with parameters ϕ , which maps states s to a distribution over actions a. The state s is an augmented representation that combines the current observation o t , the retrieved memory m t from MAER, and the latent state z t from LIM: s t = [ o t ; m t ; z t ] . This rich state representation allows the policy to make decisions based on current observations, relevant past experiences, and predictions about future states.
The SAC algorithm aims to learn a policy that maximizes both the expected return and the entropy of the policy:
$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \big( r_t + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \right]$
where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory, $\gamma$ is the discount factor, $r_t$ is the reward at time $t$, $\mathcal{H}$ is the entropy of the policy, and $\alpha$ is the temperature parameter that balances exploitation and exploration.
The training process involves alternating between collecting experiences, updating the policy, and updating the value functions. We use two Q-functions, Q θ 1 and Q θ 2 , to mitigate overestimation bias, and a value function V ψ to stabilize training. The policy is updated to maximize the expected Q-value and entropy:
$\nabla_{\phi} J(\pi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_{\phi}}\!\left[ \nabla_{\phi} \log \pi_{\phi}(a_t \mid s_t) \Big( \min_{i=1,2} Q_{\theta_i}(s_t, a_t) - \alpha \log \pi_{\phi}(a_t \mid s_t) \Big) \right]$
where D is the replay buffer containing both real and imagined experiences.
The Q-functions are updated to minimize the Bellman error:
$\mathcal{L}_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\!\left[ \Big( Q_{\theta_i}(s_t, a_t) - \big( r_t + \gamma\, ( V_{\psi}(s_{t+1}) - \alpha \log \pi_{\phi}(a_{t+1} \mid s_{t+1}) ) \big) \Big)^{2} \right]$
The value function is updated to minimize the mean squared error between its predictions and the Q-values:
$\mathcal{L}_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\!\left[ \Big( V_{\psi}(s_t) - \mathbb{E}_{a_t \sim \pi_{\phi}}\big[ \min_{i=1,2} Q_{\theta_i}(s_t, a_t) - \alpha \log \pi_{\phi}(a_t \mid s_t) \big] \Big)^{2} \right]$
The temperature parameter α is also learned to automatically adjust the balance between exploration and exploitation:
$\mathcal{L}(\alpha) = \mathbb{E}_{a_t \sim \pi_{\phi}}\!\left[ -\alpha \log \pi_{\phi}(a_t \mid s_t) - \alpha\, \mathcal{H}_{\mathrm{target}} \right]$
where $\mathcal{H}_{\mathrm{target}}$ is a target entropy.
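For reference, a PyTorch-style sketch of these updates is given below. The network classes (policy, q1, q2, value, target_value) are placeholders, the policy loss is implemented through the reparameterization trick (a common SAC implementation choice; the gradient expression above writes the score-function form), and the target entropy of −2 assumes the two-dimensional action space.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, policy, q1, q2, value, target_value, log_alpha,
               gamma=0.99, h_target=-2.0):
    """Compute SAC losses for one batch of (s, a, r, s') transitions.
    `policy(s)` is assumed to return (action, log_prob) via the reparameterization trick."""
    s, a, r, s_next = batch
    alpha = log_alpha.exp()

    # Q-function loss: regress onto r + gamma * (V(s') - alpha * log pi(a'|s'))
    with torch.no_grad():
        a_next, logp_next = policy(s_next)
        target_q = r + gamma * (target_value(s_next).squeeze(-1) - alpha * logp_next)
    loss_q = F.mse_loss(q1(s, a).squeeze(-1), target_q) + \
             F.mse_loss(q2(s, a).squeeze(-1), target_q)

    # Value loss: regress V(s) onto E[min Q(s, a~pi) - alpha * log pi(a|s)]
    a_pi, logp_pi = policy(s)
    min_q = torch.min(q1(s, a_pi), q2(s, a_pi)).squeeze(-1)
    loss_v = F.mse_loss(value(s).squeeze(-1), (min_q - alpha * logp_pi).detach())

    # Policy loss: maximize min Q - alpha * log pi (minimize the negative)
    loss_pi = (alpha.detach() * logp_pi - min_q).mean()

    # Temperature loss: drive the policy entropy toward the target H_target
    loss_alpha = (-log_alpha.exp() * (logp_pi.detach() + h_target)).mean()

    return loss_q, loss_v, loss_pi, loss_alpha
```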
Algorithm 1 integrates all components of the LIRL framework. Below is the pseudocode for the main training loop:
This training process allows the agent to continuously improve its policy by leveraging real experiences, imagined trajectories, and retrieved memories. The integration of MAER enables the agent to focus on the most informative experiences and leverage past knowledge. The LIM allows for efficient exploration through imagination, while MSPL facilitates learning from both immediate and delayed rewards. By combining these components within the SAC framework, our LIRL approach can learn effective coverage strategies that balance exploration and exploitation, adapt to unknown environments, and plan for long-term coverage objectives in complex CPP tasks.
Algorithm 1 LIRL Training Algorithm
1: Initialize policy $\pi_{\phi}$, Q-functions $Q_{\theta_1}, Q_{\theta_2}$, value function $V_{\psi}$, encoder $q_{\phi}$, dynamics model $p_{\theta}$, decoder $p_{\psi}$
2: Initialize replay buffer $\mathcal{D}$ and empty memory $\mathcal{M}$
3: for each episode do
4:   Observe initial state $s_0$
5:   for $t = 0$ to $T$ do
6:     Sample action $a_t \sim \pi_{\phi}(\cdot \mid s_t)$
7:     Execute $a_t$; observe $r_t$, $s_{t+1}$
8:     Store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
9:     Update memory $\mathcal{M}$ with $(s_t, a_t, r_t, s_{t+1})$
10:    if update step then
11:      for $j = 1$ to $N$ do
12:        Sample batch $B$ from $\mathcal{D}$
13:        Augment $B$ with imagined experiences from the LIM
14:        Retrieve relevant memories from $\mathcal{M}$
15:        Compute adaptive n-step returns using MSPL
16:        Update $Q_{\theta_1}$, $Q_{\theta_2}$, $V_{\psi}$, $\pi_{\phi}$, and $\alpha$
17:        Update the world model (encoder $q_{\phi}$, dynamics $p_{\theta}$, decoder $p_{\psi}$)
18:      end for
19:    end if
20:  end for
21: end for

3.6. Computational Complexity and Cost Analysis

To provide greater clarity regarding the computational feasibility of LIRL, we analyze the time complexity and computational costs associated with each key component of the framework: MAER, the LIM, and MSPL. This analysis aims to offer insights into the resources required for efficient implementation, as well as strategies for managing computational demands.

3.6.1. Memory-Augmented Experience Replay (MAER)

The MAER component utilizes prioritized experience retrieval to improve learning efficiency. For each training step, the prioritization process incurs a complexity of O ( log N ) for sampling experiences from the buffer, where N is the number of stored experiences. Memory retrieval, enhanced by an attention-based mechanism, adds an additional complexity of O ( N ) due to attention weight computations over the stored memory entries. This approach is efficient for learning from high-utility experiences but can be scaled down by adjusting the sampling frequency or buffer size for resource-constrained applications.

3.6.2. Latent Imagination Module (LIM)

The LIM allows the agent to generate imagined trajectories, reducing the need for extensive environment interactions. The computational cost of LIM is primarily associated with generating latent states and simulating imagined trajectories through the variational world model. Given a trajectory length T, the complexity for generating imagined rollouts is O ( T ) per rollout, where T can be controlled based on the task requirements. By tuning the frequency and length of imagined trajectories, LIM can be adapted to balance between sample efficiency and computational cost.

3.6.3. Multi-Step Prediction Learning (MSPL)

MSPL dynamically adjusts the prediction horizon by evaluating multiple n-step returns. This component has a complexity of O ( K ) per time step, where K is the maximum number of n-steps considered. In typical implementations, K is kept small (e.g., K = 5 ) to prevent excessive computational overhead while still capturing the delayed effects of actions. For settings where computation is limited, a fixed n-step approach may be substituted for MSPL’s adaptive process to reduce this overhead.

3.6.4. Overall Complexity Within the SAC Framework

Combining MAER, LIM, and MSPL within the SAC framework introduces an overall time complexity of O ( N + T + K ) per training iteration, reflecting the sum of prioritized sampling, latent rollouts, and multi-step prediction computations. While this integration offers enhanced efficiency and adaptability in unknown environments, the computational load can be adjusted by simplifying components or using approximations based on deployment constraints.
In summary, this complexity analysis provides a basis for understanding the computational demands of LIRL and highlights tuning options that allow practitioners to manage these requirements in real-world applications. We have included these considerations to ensure that LIRL’s implementation remains feasible and adaptable, even in resource-constrained settings.

4. Experiments

To evaluate the effectiveness of our proposed LIRL framework for efficient CPP, we conducted a series of comprehensive experiments. This section details our experimental setup, baseline comparisons, and results, followed by ablation studies to analyze the contribution of each component of our framework.

4.1. Experimental Setup

Environments

We tested our LIRL framework in three distinct simulated environments, each representing different challenges in CPP, as shown in Figure 3:
(1) Map1: A structured indoor space (50 m × 50 m) with rooms, corridors, and obstacles, simulating a typical office layout.
(2) Map2: A large open space (100 m × 100 m) with irregularly placed shelves and objects, representing a warehouse setting.
(3) Map3: An unstructured area (200 m × 200 m) with various terrains and obstacles, simulating a search-and-rescue scenario.
Figure 3. Schematic diagram of (a) Map1, (b) Map2, and (c) Map3.

4.2. Implementation and Test Environment

The implementation of the LIRL framework and all experiments were conducted using the following software and hardware specifications to ensure reproducibility and transparency in computational requirements.
The framework was developed primarily in Python, leveraging libraries such as PyTorch for neural network implementation and optimization, and OpenAI Gym for simulation environment management. All experiments were run on a Linux-based system (Ubuntu 20.04), which provides compatibility with the machine learning libraries and facilitates efficient computation management in a high-performance environment. The tests were conducted on a workstation equipped with an Intel Core i9 processor, 32 GB of RAM, and an NVIDIA RTX 3090 GPU with 24 GB of dedicated memory. This setup allowed us to manage the high computational demands associated with RL and extensive simulation-based testing. Key libraries included PyTorch for model development, NumPy and Pandas for data handling, Matplotlib for result visualization, and the OpenAI Gym toolkit for environment simulation. These libraries enabled efficient model training, data processing, and result analysis. Each experiment involved running simulations over 1000 episodes, with each episode consisting of a variable number of steps depending on the environment and task requirements. This provided sufficient training and evaluation data to assess the adaptability and efficiency of the LIRL framework.
This configuration ensured that the computational requirements of the LIRL framework were met, facilitating smooth model training and evaluation. The specified setup also provides a basis for others to replicate our results or conduct further experiments on comparable hardware.
To provide transparency and aid reproducibility, we detail the key parameters used in the LIRL framework, along with their assigned values and the rationale for each choice.
  • Learning Rate: The learning rate for both the policy and value networks was set to $3 \times 10^{-4}$. This value, commonly used in RL algorithms such as SAC, balances convergence speed with stability. Preliminary testing indicated that this rate optimized learning efficiency and stability within the LIRL framework.
  • Discount Factor ( γ ): We used a discount factor of 0.99 to emphasize long-term rewards, which is crucial for efficient coverage in large environments. This setting encourages the agent to prioritize actions that contribute to future coverage rather than immediate gains.
  • Temperature Parameter ( α ): The SAC temperature parameter, which controls exploration-exploitation balance, was set to 0.2. This value supports effective exploration without an excessive focus on random actions, allowing the agent to balance exploration with goal-oriented planning.
  • MAER Buffer Size: The memory-augmented experience replay (MAER) buffer was set to store up to 100,000 experiences, ensuring that the agent has access to a wide variety of past experiences for prioritized sampling. This buffer size is sufficient for complex environments while remaining manageable in terms of memory usage.
  • LIM Trajectory Length: In the latent imagination module (LIM), we set the imagined trajectory length T = 5 . This length allows the agent to predict short-term paths without incurring significant computational costs, achieving a balance between sample efficiency and processing load.
  • MSPL Horizon ( K ): The multi-step prediction learning (MSPL) horizon parameter was set to K = 5 , allowing the agent to consider multi-step returns while adapting based on uncertainty. This horizon length enhances long-term planning without the complexity associated with longer horizons.
These parameters were chosen based on RL best practices and were fine-tuned through preliminary experiments to achieve optimal performance within our test environments.
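For convenience, the values quoted in Sections 3 and 4 can be gathered into a single configuration dictionary, as sketched below; the key names are our own and purely illustrative.

```python
LIRL_CONFIG = {
    "learning_rate": 3e-4,          # policy and value networks
    "gamma": 0.99,                  # discount factor
    "alpha_init": 0.2,              # SAC temperature (also tuned automatically)
    "maer_buffer_size": 100_000,    # replay buffer capacity
    "maer_alpha": 0.6,              # prioritization exponent
    "maer_beta": 0.7,               # prioritized/uniform mixing coefficient
    "lim_trajectory_length": 5,     # imagined rollout horizon T
    "lim_imagination_ratio": 0.3,   # gamma_im, share of imagined samples per batch
    "lim_update_interval": 10,      # tau_update, environment steps between z_0 refreshes
    "world_model_beta": 0.5,        # KL weight in the ELBO
    "mspl_horizon": 5,              # maximum n considered (K)
    "mspl_uncertainty_smoothing": 0.1,
}
```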

4.2.1. Performance Metrics

We evaluated performance using the following metrics:
(1) Coverage Percentage: the proportion of the accessible area covered.
(2) Coverage Time: the time taken to achieve 95% coverage.
(3) Path Efficiency: the ratio of the covered area to the total path length.
(4) Collision Rate: the number of collisions per minute of operation.
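The sketch below shows one way these four metrics could be computed from a logged run; the data layout (per-step covered-cell counts, total path length, collision count) is an assumption made for illustration.

```python
import numpy as np

def compute_metrics(coverage_over_time, path_length, collisions, dt,
                    accessible_area, cell_area, threshold=0.95):
    """coverage_over_time: array of covered-cell counts per step; returns the four metrics."""
    coverage_over_time = np.asarray(coverage_over_time, dtype=float)
    covered_area = coverage_over_time[-1] * cell_area
    coverage_pct = covered_area / accessible_area
    # First step at which 95% of the accessible area is covered, converted to time.
    reached = np.flatnonzero(coverage_over_time * cell_area >= threshold * accessible_area)
    coverage_time = (reached[0] + 1) * dt if reached.size else float("inf")
    path_efficiency = covered_area / path_length
    collision_rate = collisions / (len(coverage_over_time) * dt / 60.0)  # per minute
    return coverage_pct, coverage_time, path_efficiency, collision_rate
```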

4.2.2. Baselines and Implementation Details

To evaluate the effectiveness of the LIRL framework, we selected several widely used algorithms for comparison: deep deterministic policy gradient (DDPG), soft actor–critic (SAC), proximal policy optimization (PPO), and the spiral-STC algorithm. Below, we provide the rationale for choosing each method and clarify the implementation sources.
The selected algorithms represent a range of approaches to CPP and RL:
  • DDPG and SAC are model-free RL algorithms suitable for continuous action spaces, similar to our LIRL framework. Their established use in CPP tasks makes them suitable benchmarks for performance evaluation.
  • PPO is a popular on-policy RL method known for stable performance in sequential decision-making tasks, providing a useful benchmark for sample efficiency and policy optimization.
  • Spiral-STC is a classical CPP algorithm, which allows for a comparison of LIRL’s performance against traditional path planning methods in terms of coverage efficiency and path smoothness.
For DDPG, SAC, and PPO, we based our implementations on open-source libraries (e.g., OpenAI Baselines) and matched the parameters as closely as possible to our LIRL framework for consistency. Minor adjustments were made to optimize their performance within our simulation environments. Spiral-STC was implemented in-house following its original algorithmic formulation, ensuring direct comparability in coverage and efficiency metrics.
These details clarify the rationale for selecting these comparison algorithms and ensure transparency in the implementation sources, contributing to a fair and comprehensive evaluation of the LIRL framework.

4.3. Results and Analysis

4.3.1. Overall Performance

Table 2 presents the performance of LIRL compared to the baselines across all three environments, averaged over 50 runs.
Table 2. Performance comparison of LIRL with baselines.
As shown in Table 2, LIRL consistently outperforms all baselines across all environments and metrics. In Map1, LIRL achieves 98.2 % coverage in 12.3 minutes, demonstrating a 20.4 % reduction in coverage time compared to the next best learning-based method (MB-I). The performance gap widens in more complex environments, with LIRL showing a 27.3 % reduction in coverage time in the warehouse environment and a 34.7 % reduction in the outdoor environment compared to MB-I.
LIRL also demonstrates superior path efficiency and collision avoidance. In Map3, LIRL achieves a path efficiency of 0.83 , which is 5.1 % higher than MB-I and 27.7 % higher than DDPG. The collision rate of LIRL is consistently low across all environments, with only 0.09 collisions per minute in the challenging outdoor environment, representing a 40 % reduction compared to MB-I and a 71 % reduction compared to DDPG.

4.3.2. Learning Efficiency

Figure 4 illustrates the learning curves of LIRL and the baselines in Map2, showing the coverage percentage achieved over training episodes.
Figure 4. Learning curves of LIRL and baselines in Map2.
LIRL demonstrates faster learning and convergence to higher performance levels than the other methods. It achieves 90% coverage within 500 episodes, whereas the strongest baselines require roughly 700–800 episodes to reach the same level of performance. This faster learning can be attributed to the effective combination of the memory-augmented experience replay and the latent imagination module, which allow for more efficient use of experiences and better long-term planning.

4.3.3. Adaptability to Environmental Changes

To evaluate the adaptability of LIRL, we introduced sudden changes to the environments during testing. These changes included the addition or removal of obstacles and alterations to the terrain. Figure 5 shows the coverage performance of LIRL and baselines before and after these changes in Map3.
Figure 5. Adaptability to environmental changes in Map3.
LIRL demonstrates superior adaptability, maintaining 97.2 % of its pre-change performance after significant environmental alterations, compared to 85.6 % for SAC. This adaptability can be attributed to the combination of the latent imagination module, which allows for rapid replanning, and the multi-step prediction learning, which enables effective learning from both immediate and delayed consequences of actions.

4.4. Ablation Studies

To understand the contribution of each component of LIRL, we conducted ablation studies by removing or replacing key components of the framework. Table 3 presents the results of these studies in Map2.
Table 3. Ablation study results in Map2.
The ablation studies reveal that each component of LIRL contributes significantly to its performance:
(1) Removing the MAER results in a 16.5% increase in coverage time and a 71.4% increase in collision rate, highlighting its importance in efficient learning and safe navigation.
(2) Without the LIM, coverage time increases by 21.8% and path efficiency decreases by 8.1%, demonstrating LIM’s crucial role in effective planning and exploration.
(3) Removing the MSPL leads to a 9.1% increase in coverage time and a 28.6% increase in collision rate, indicating its importance in balancing short-term and long-term decision-making.
(4) Replacing the adaptive n-step learning with a fixed n-step approach (n = 5) results in a 6.3% increase in coverage time, showing the benefits of dynamically adjusting the prediction horizon.
These results confirm that each component of LIRL plays a vital role in its overall performance, with their combination leading to significant improvements in coverage efficiency, path planning, and collision avoidance across various CPP scenarios.

5. Conclusions

This paper introduces LIRL, a novel framework designed to address the challenges of CPP in unknown environments. A core strength of the LIRL framework lies in its ability to maintain symmetry between exploration and exploitation, balancing short-term decision-making with long-term planning. This symmetry is achieved through the seamless integration of three key components: MAER, LIM, and MSPL, all operating within a SAC architecture. This innovative combination enables LIRL to effectively leverage past experiences, simulate potential future scenarios, and balance short-term and long-term decision-making in complex CPP tasks.
The MAER component enhances sample efficiency by prioritizing and retrieving relevant past experiences, allowing the agent to learn more effectively from its history. This is particularly crucial in CPP tasks, where certain experiences, such as discovering new areas or avoiding obstacles, may be rare but highly informative. The LIM empowers the agent with the ability to imagine and evaluate potential future trajectories without direct environment interaction. This capability is essential for long-term planning and efficient exploration in partially observable or dynamic environments, which are common challenges in real-world CPP applications. The MSPL component introduces an adaptive n-step learning mechanism that dynamically balances immediate rewards against long-term consequences. This adaptability is key to developing efficient coverage strategies that navigate the trade-offs between thorough exploration and timely task completion.
Our experimental results, conducted across various simulated environments, demonstrate LIRL’s superior performance in terms of coverage efficiency, path planning, and collision avoidance compared to state-of-the-art methods. The framework’s ability to maintain high performance even after significant environmental changes underscores its potential for real-world deployment, where adaptability is paramount.
Future work will focus on deploying LIRL in real-world settings, particularly in indoor and outdoor environments where it can be tested on tasks that involve navigating dynamic and partially observable terrains. This testing will be essential for confirming LIRL’s effectiveness in physical applications and exploring how it can be extended or adapted to multi-robot systems for large-scale CPP tasks. Additionally, incorporating advanced perception systems to recognize and prioritize regions based on semantic information could further enhance the framework’s adaptability, making it more context-aware and intelligent in handling complex environments.

Author Contributions

Conceptualization, Z.W. and M.Z.; methodology, Z.W. and T.S.; software, Z.W.; validation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W.; visualization, Z.W.; supervision, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

List of Abbreviations and Notations

Abbreviations
CPP: Coverage path planning
RL: Reinforcement learning
LIRL: Latent imagination-based reinforcement learning
MAER: Memory-augmented experience replay
LIM: Latent imagination module
MSPL: Multi-step prediction learning
POMDP: Partially observable Markov decision process
SAC: Soft actor–critic
DNC: Differentiable neural computer
UAV: Unmanned aerial vehicle
DDPG: Deep deterministic policy gradient
MERLIN: Memory-augmented reinforcement learning in a goal-directed agent
Notations
N: Number of stored experiences in the replay buffer
T: Length of a trajectory
K: Maximum number of n-steps considered in multi-step prediction learning
$s_t$: State of the agent at time step t
$a_t$: Action taken by the agent at time step t
$r_t$: Reward received by the agent at time step t
$\gamma$: Discount factor for future rewards
$\pi(a \mid s)$: Policy, mapping states to action probabilities
$V(s)$: Value function of a state
$Q(s, a)$: Action-value function for a state–action pair
$\alpha$: Temperature parameter in soft actor–critic for the exploration–exploitation balance
$G_t^{(n)}$: n-step return for state $s_t$
$\delta_t^{(n)}$: Temporal difference error for the n-step return
$\lambda$: Hyperparameter for uncertainty modulation in adaptive n-step selection
$U(s_t)$: Uncertainty estimate for state $s_t$

References

  1. Bormann, R.; Jordan, F.; Hampp, J.; Hägele, M. Indoor coverage path planning: Survey, implementation, analysis. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1718–1725.
  2. Galceran, E.; Carreras, M. A survey on coverage path planning for robotics. Robot. Auton. Syst. 2013, 61, 1258–1276.
  3. Jin, J.; Tang, L. Coverage path planning on three-dimensional terrain for arable farming. J. Field Robot. 2011, 28, 424–440.
  4. Huang, Q. Model-based or model-free, a review of approaches in reinforcement learning. In Proceedings of the 2020 International Conference on Computing and Data Science (CDS), Stanford, CA, USA, 1–2 August 2020; pp. 219–221.
  5. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor–critic algorithms and applications. arXiv 2018, arXiv:1812.05905.
  6. Sutton, R.S. Reinforcement Learning: An Introduction; A Bradford Book; MIT Press: Cambridge, MA, USA, 2018.
  7. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (PMLR), Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
  8. Igl, M.; Zintgraf, L.; Le, T.A.; Wood, F.; Whiteson, S. Deep variational reinforcement learning for POMDPs. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 2117–2126.
  9. Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering Atari with discrete world models. arXiv 2020, arXiv:2010.02193.
  10. Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwińska, A.; Colmenarejo, S.G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. Hybrid computing using a neural network with dynamic external memory. Nature 2016, 538, 471–476.
  11. Fortunato, M.; Tan, M.; Faulkner, R.; Hansen, S.; Puigdomènech Badia, A.; Buttimore, G.; Deck, C.; Leibo, J.Z.; Blundell, C. Generalization of reinforcement learners with working and episodic memory. Adv. Neural Inf. Process. Syst. 2019, 32.
  12. Ha, D.; Schmidhuber, J. Recurrent world models facilitate policy evolution. Adv. Neural Inf. Process. Syst. 2018, 31.
  13. Kaiser, L.; Babaeizadeh, M.; Milos, P.; Osinski, B.; Campbell, R.H.; Czechowski, K.; Erhan, D.; Finn, C.; Kozakowski, P.; Levine, S.; et al. Model-based reinforcement learning for Atari. arXiv 2019, arXiv:1903.00374.
  14. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
  15. Faust, A.; Oslund, K.; Ramirez, O.; Francis, A.; Tapia, L.; Fiser, M.; Davidson, J. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 5113–5120.
  16. Brown, S.; Waslander, S.L. The constriction decomposition method for coverage path planning. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 3233–3238.
  17. Cabreira, T.M.; Ferreira, P.R.; Di Franco, C.; Buttazzo, G.C. Grid-based coverage path planning with minimum energy over irregular-shaped areas with UAVs. In Proceedings of the 2019 International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, USA, 11–14 June 2019; pp. 758–767.
  18. Liu, S.; Xia, J.; Sun, X.; Zhang, Y. An efficient complete coverage path planning in known environments. J. Northeast Norm. Univ. 2011, 43, 39–43.
  19. Theile, M.; Bayerlein, H.; Nai, R.; Gesbert, D.; Caccamo, M. UAV coverage path planning under varying power constraints using deep reinforcement learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 1444–1449.
  20. Narottama, B.; Shin, S.Y. UAV coverage path planning with quantum-based recurrent deep deterministic policy gradient. IEEE Trans. Veh. Technol. 2023, 73, 7424–7429.
  21. Heydari, J.; Saha, O.; Ganapathy, V. Reinforcement learning-based coverage path planning with implicit cellular decomposition. arXiv 2021, arXiv:2110.09018.
  22. Aydemir, F.; Cetin, A. Multi-agent dynamic area coverage based on reinforcement learning with connected agents. Comput. Syst. Sci. Eng. 2023, 45, 215–230.
  23. Ghavamzadeh, M.; Mannor, S.; Pineau, J.; Tamar, A. Bayesian reinforcement learning: A survey. Found. Trends Mach. Learn. 2015, 8, 359–483.
  24. Brown, D.; Coleman, R.; Srinivasan, R.; Niekum, S. Safe imitation learning via fast Bayesian reward inference from preferences. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 13–18 July 2020; pp. 1165–1177.
  25. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 2021, 76, 243–297.
  26. Loquercio, A.; Segu, M.; Scaramuzza, D. A general framework for uncertainty estimation in deep learning. IEEE Robot. Autom. Lett. 2020, 5, 3153–3160.
  27. Wei, M.; Zheng, L.; Li, H.; Cheng, H. Adaptive neural network-based model path-following contouring control for quadrotor under diversely uncertain disturbances. IEEE Robot. Autom. Lett. 2024, 9, 3751–3758.
  28. Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), New York City, NY, USA, 19–24 June 2016; pp. 1842–1850.
  29. Karunaratne, G.; Schmuck, M.; Le Gallo, M.; Cherubini, G.; Benini, L.; Sebastian, A.; Rahimi, A. Robust high-dimensional memory-augmented neural networks. Nat. Commun. 2021, 12, 2468.
  30. Bae, H.; Kim, G.; Kim, J.; Qian, D.; Lee, S. Multi-robot path planning method using reinforcement learning. Appl. Sci. 2019, 9, 3057.
  31. Wayne, G.; Hung, C.C.; Amos, D.; Mirza, M.; Ahuja, A.; Grabska-Barwinska, A.; Rae, J.; Mirowski, P.; Leibo, J.Z.; Santoro, A.; et al. Unsupervised predictive memory in a goal-directed agent. arXiv 2018, arXiv:1803.10760.
  32. Edgar, I. A Guide to Imagework: Imagination-Based Research Methods; Routledge: London, UK, 2004.
  33. Pascanu, R.; Li, Y.; Vinyals, O.; Heess, N.; Buesing, L.; Racanière, S.; Reichert, D.; Weber, T.; Wierstra, D.; Battaglia, P. Learning model-based planning from scratch. arXiv 2017, arXiv:1707.06170.
  34. Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to control: Learning behaviors by latent imagination. arXiv 2019, arXiv:1912.01603.
  35. Liu, K.; Stadler, M.; Roy, N. Learned sampling distributions for efficient planning in hybrid geometric and object-level representations. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9555–9562.
  36. Argenson, A.; Dulac-Arnold, G. Model-based offline planning. arXiv 2020, arXiv:2008.05556.
  37. Tang, Y.; Munos, R.; Rowland, M.; Avila Pires, B.; Dabney, W.; Bellemare, M. The nature of temporal difference errors in multi-step distributional reinforcement learning. Adv. Neural Inf. Process. Syst. 2022, 35, 30265–30276.
  38. Schoknecht, R.; Riedmiller, M. Speeding-up reinforcement learning with multi-step actions. In Proceedings of the Artificial Neural Networks—ICANN 2002: International Conference, Madrid, Spain, 28–30 August 2002; pp. 813–818.
  39. De Asis, K.; Hernandez-Garcia, J.; Holland, G.; Sutton, R. Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  40. Han, S.; Chen, Y.; Chen, G.; Yin, J.; Wang, H.; Cao, J. Multi-step reinforcement learning-based offloading for vehicle edge computing. In Proceedings of the 2023 15th International Conference on Advanced Computational Intelligence (ICACI), Seoul, Republic of Korea, 6–9 May 2023; pp. 1–8.
  41. Klamt, T.; Behnke, S. Planning hybrid driving-stepping locomotion on multiple levels of abstraction. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1695–1702.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
