Symmetry
  • Article
  • Open Access

17 November 2024

LIRL: Latent Imagination-Based Reinforcement Learning for Efficient Coverage Path Planning

1 School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450052, China
2 School of Information Engineering, Chang’an University, Xi’an 710064, China
3 Department of Computer Science, University of Bristol, Bristol BS8 1QU, UK
* Author to whom correspondence should be addressed.
This article belongs to the Section Computer

Abstract

Coverage path planning (CPP) in unknown environments presents unique challenges that often require the system to maintain a symmetry between exploration and exploitation in order to cover unknown areas efficiently. This paper introduces latent imagination-based reinforcement learning (LIRL), a novel framework that addresses these challenges by integrating three key components within a soft actor–critic architecture: memory-augmented experience replay (MAER), a latent imagination module (LIM), and multi-step prediction learning (MSPL). MAER enhances sample efficiency by prioritizing experience retrieval, LIM facilitates long-term planning via simulated trajectories, and MSPL optimizes the trade-off between immediate rewards and future outcomes through adaptive n-step learning. Operating together, these components create a dynamic equilibrium that enables efficient, adaptive decision-making. We evaluate LIRL across diverse simulated environments and demonstrate substantial improvements over state-of-the-art methods: the agent balances short-term actions with long-term planning and responds symmetrically to environmental changes. The results highlight LIRL’s potential for advancing autonomous CPP in real-world applications such as search and rescue, agricultural robotics, and warehouse automation. Our work contributes to the broader fields of robotics and reinforcement learning, offering insights into integrating memory, imagination, and adaptive learning for complex sequential decision-making tasks.

1. Introduction

Coverage path planning (CPP) is a fundamental problem in robotics with wide-ranging applications, from autonomous vacuum cleaning and precision agriculture to search-and-rescue operations and environmental monitoring [1]. At its core, CPP involves determining a path that symmetrically balances the exploration of uncharted territory with the exploitation of known areas to optimize for efficiency. The ability to maintain this balance—between covering new areas and refining already explored paths—is key to achieving optimal coverage performance in unknown or dynamic environments [2]. While significant strides have been made in CPP for known environments, the challenge of efficient coverage in unknown or partially observable environments remains a critical frontier in robotics research [3].
Recent advancements in deep reinforcement learning (RL) have demonstrated remarkable potential in addressing complex decision-making problems, including path planning [4]. The ability of RL to learn from experience and adapt to new situations makes it particularly appealing for CPP in unknown environments. However, the application of RL to CPP faces several interrelated challenges. Traditional RL approaches often require extensive interaction with the environment to learn effective policies, which can be impractical or costly in real-world scenarios [5]. Moreover, CPP necessitates consideration of long-term consequences of actions, a task that many RL algorithms focused on immediate or short-term rewards struggle to accomplish [6]. The need for rapid adaptation to new, unseen environments further complicates the application of current RL approaches to CPP [7]. Additionally, in real-world scenarios, the entire environment is rarely observable at once, requiring strategies to deal with partial information [8].
To address these challenges, we propose LIRL (latent imagination-based reinforcement learning), a novel framework that integrates memory augmentation, latent imagination, and multi-step prediction learning to enhance the efficiency and adaptability of CPP. Our method builds upon recent developments in model-based RL [9] and memory-augmented neural networks [10], extending them to the specific challenges of coverage path planning.
The LIRL framework comprises several interconnected components designed to work in synergy. At its core is a memory-augmented experience replay mechanism that leverages coverage efficiency metrics to prioritize learning from the most informative experiences. This is coupled with an attention-based memory network that enables the agent to store and retrieve critical decision points, enhancing its ability to make informed choices in complex environments [11]. A key innovation in LIRL is the latent imagination module, inspired by recent work on world models [12]. This module develops a latent space model allowing the agent to simulate potential trajectories without direct interaction with the environment, significantly improving sample efficiency and enabling reasoning about long-term consequences of actions [13].
To balance short-term rewards with long-term planning, LIRL implements an adaptive n-step learning algorithm. This approach mitigates the challenges of credit assignment in CPP tasks, where the benefits of actions may only become apparent after several steps [14]. The framework is completed by a policy network that integrates current state information, retrieved memories, and imagined future states to output action probabilities, and an environment interaction module that collects real experiences to update the memory and train the models.
The LIRL agent operates in a continuous cycle of perception, imagination, and action. At each time step, it observes the current state, retrieves relevant past experiences, and uses its latent imagination module to simulate potential future trajectories. Based on this rich combination of real and imagined experiences, the policy network decides on the next action. The resulting experience is then used to update the agent’s models and memory, creating a feedback loop that continuously enhances performance.
We evaluate LIRL on a diverse range of simulated CPP scenarios, encompassing both structured environments like indoor floor plans and unstructured terrains such as outdoor agricultural fields. Our comprehensive experiments demonstrate significant improvements in coverage efficiency, adaptation speed, and robustness compared to state-of-the-art methods [15].
The main contributions of this paper are as follows:
  • A novel RL framework, LIRL, that integrates memory augmentation, latent imagination, and multi-step learning for efficient CPP in unknown environments.
  • A prioritized experience replay mechanism based on coverage efficiency metrics, enhancing learning from the most informative experiences.
  • A latent imagination module that enables efficient planning and reasoning about long-term consequences in CPP tasks.
  • An adaptive n-step learning algorithm tailored for the unique challenges of CPP.
  • Comprehensive evaluations demonstrating LIRL’s effectiveness across various CPP scenarios, including comparisons with state-of-the-art methods and ablation studies.
The remainder of this paper is structured as follows. Section 2 reviews related work in CPP, RL, and memory-augmented neural networks. Section 3 presents a detailed description of the LIRL framework, including the formulation of the CPP problem as a partially observable Markov decision process (POMDP) and the architecture of each component. Section 4 describes our experimental setup and results, including performance comparisons and ablation studies. Section 5 discusses the implications of our work, its limitations, and directions for future research.

3. Method

In this section, we present the LIRL framework for efficient CPP in unknown environments. We first formulate the CPP problem as a POMDP and then describe the key components of our approach: the memory-augmented experience replay, the latent imagination module, and the multi-step prediction learning algorithm.

3.1. Problem Formulation

The CPP problem is formulated as a POMDP with discrete time steps $t \in \{1, \dots, T\}$. At each time step, the agent executes a continuous action $a_t \sim \pi(a_t \mid s_t)$ according to the policy $\pi$, while receiving the state $s_t$ and reward $r_t$ from the environment. The state $s_t$ represents the environment, including its geometry, the visited points, the agent’s pose, and uncertainty estimates from the learned world model (Section 3.3). The policy is represented by a neural network trained to maximize the expected discounted return $J$, defined as follows:
$J = \mathbb{E}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]$
Figure 1 illustrates the RL loop, in which observations, actions, and rewards are collected in a replay buffer, from which training batches are sampled for gradient-based learning. This setup enables the agent to learn an effective coverage strategy by interacting with the environment and adapting to a range of scenarios encountered during the CPP task.
Figure 1. Schematic of the proposed LIRL framework for efficient CPP.

3.1.1. State Space

To facilitate efficient coverage learning, the observation space must be appropriately mapped for input into the neural network policy. Visited points are represented as a coverage map, as shown in Figure 2. The environment is discretized into a high-resolution 2D grid, ensuring accurate representation. To address scalability for larger areas, we use multi-scale maps, crucial for encoding both local details and global structure. This representation enables the policy to capture spatial relationships at multiple levels of granularity, improving the agent’s ability to plan coverage paths across diverse environments.
Figure 2. Illustration of coverage maps at multiple scales.
Inspired by multi-layered maps for search-and-rescue operations [41], we use multi-scale maps for coverage and obstacle representation to enhance scalability. Our approach extracts multiple local neighborhoods of increasing size and decreasing resolution while maintaining a fixed grid size. Specifically, we begin with a local square crop $M^{(1)}$ of side length $d_1$ at the finest scale. The multi-scale representation $M = \{M^{(i)}\}_{i=1}^{m}$ consists of increasingly larger areas based on a fixed scale factor $s$, where the side length of each map $M^{(i)}$ is $d_i = s \cdot d_{i-1}$.
The resolution at the finest scale is chosen to capture sufficient detail for local navigation, while larger scales support long-term planning where less precision is required. This approach can represent an area of size $d \times d$ in $O(\log d)$ scales, with $O(wh \log d)$ grid cells, where $w$ and $h$ are fixed dimensions independent of $d$. This significantly improves efficiency compared to a single fixed-resolution map with $O(d^2)$ cells.
We use a multi-scale map $M_c$ for coverage representation (Figure 2), enhancing the agent’s ability to balance local detail with global context, thereby improving coverage path planning efficiency.
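To make the multi-scale representation concrete, the following Python sketch extracts several fixed-size crops of increasing extent and decreasing resolution around the agent from a global coverage grid. This is an illustrative sketch only; function and parameter names (extract_multiscale, base_size, scale_factor) are our own and not taken from the paper’s implementation.

```python
import numpy as np

def extract_multiscale(coverage: np.ndarray, center: tuple[int, int],
                       base_size: int = 32, scale_factor: int = 2,
                       num_scales: int = 3) -> list[np.ndarray]:
    """Return `num_scales` crops around `center`, each resampled to
    (base_size, base_size): the i-th crop covers base_size * scale_factor**i
    cells per side, so spatial extent grows while resolution drops per scale."""
    maps = []
    cy, cx = center
    for i in range(num_scales):
        stride = scale_factor ** i
        side = base_size * stride
        half = side // 2
        # Pad so crops near the border stay well defined (padded cells = unknown/0).
        padded = np.pad(coverage, half, mode="constant", constant_values=0.0)
        crop = padded[cy:cy + side, cx:cx + side]
        # Downsample by block-averaging back to the fixed grid size.
        crop = crop.reshape(base_size, stride, base_size, stride).mean(axis=(1, 3))
        maps.append(crop)
    return maps

if __name__ == "__main__":
    grid = (np.random.rand(256, 256) > 0.7).astype(np.float32)  # toy coverage map
    scales = extract_multiscale(grid, center=(128, 128))
    print([m.shape for m in scales])  # three (32, 32) maps of growing extent
```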
To respond rapidly to short-term obstacles, sensor data are incorporated into the input representation. Depth measurements from the LiDAR sensor are normalized to the range [ 0 , 1 ] and concatenated into the state vector S, ensuring consistent scaling across different configurations. To improve decision-making robustness, uncertainty estimates from the world model are also included in S, providing a comprehensive representation of the environment’s structure and confidence in the agent’s perception. This integrated approach allows the agent to adapt based on both immediate sensory input and the reliability of its predictive model, thereby enhancing navigation efficiency in dynamic or partially observable environments.

3.1.2. Action Space

Our policy model directly predicts the agent’s control signals, enabling adaptation to specific environmental conditions and avoiding the limitations of discrete action spaces. We consider an agent that controls its linear velocity $v$ and angular velocity $\omega$, with actions normalized to the range $[-1, 1]$ based on their respective maximum values:
$a_t = \left( \dfrac{v_t}{v_{\max}},\ \dfrac{\omega_t}{\omega_{\max}} \right)$
The normalized velocities determine the direction of travel and rotation, providing fine-grained control over movements and enabling more efficient and smooth coverage paths compared to discrete alternatives. This approach also facilitates policy transfer to different robotic platforms with varying kinematic constraints.
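As a minimal illustration of this normalization, the sketch below maps a policy output in $[-1, 1]^2$ back to platform velocity commands; the limits V_MAX and OMEGA_MAX are placeholder values, not figures from the paper.

```python
import numpy as np

V_MAX = 0.5      # m/s, placeholder platform limit
OMEGA_MAX = 1.0  # rad/s, placeholder platform limit

def denormalize_action(a: np.ndarray) -> tuple[float, float]:
    """Map a policy output a = (a_v, a_w) in [-1, 1]^2 to (v, omega)."""
    a = np.clip(a, -1.0, 1.0)
    return float(a[0] * V_MAX), float(a[1] * OMEGA_MAX)

v, omega = denormalize_action(np.array([0.8, -0.25]))
print(v, omega)  # 0.4 m/s forward, -0.25 rad/s turn
```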

3.1.3. Reward Function

To achieve comprehensive coverage of free space, we introduce a reward mechanism based on the exploration of previously uncovered areas. The reward term $R_{\mathrm{area}}$ is proportional to the newly covered area $A_{\mathrm{new}}$ at each time step. It is normalized to the range $[0, 1]$ and scaled by a factor $\lambda_{\mathrm{area}}$, resulting in the following reward formulation:
$R_{\mathrm{area}} = \lambda_{\mathrm{area}} \dfrac{A_{\mathrm{new}}}{2 r v_{\max} \Delta t},$
where $r$ is the agent’s radius and $\Delta t$ is the time step size.
Exclusively maximizing the coverage reward may discourage the agent from revisiting previously explored paths, potentially leading to suboptimal coverage patterns characterized by residual gaps. Such gaps impede complete coverage and necessitate costly revisits in later stages. To address this, we introduce an additional reward term based on minimizing the total variation (TV) of the coverage map, which reduces coverage boundaries and promotes a more uniform coverage pattern. Given a 2D signal $x$, the discrete isotropic total variation is defined as:
$\mathrm{TV}(x) = \sum_{i,j} \sqrt{(x_{i+1,j} - x_{i,j})^2 + (x_{i,j+1} - x_{i,j})^2}$
The incremental TV reward, $R_{\mathrm{TV}}$, is derived from the change in TV between consecutive time steps. This reward encourages smoother and more continuous coverage patterns by assigning positive rewards for reductions in TV and negative rewards for increases. The incremental reward is normalized by the maximum possible increase in TV per time step, expressed as:
$R_{\mathrm{TV}}(t) = -\lambda_{\mathrm{TV}} \dfrac{\mathrm{TV}(C_t) - \mathrm{TV}(C_{t-1})}{2 v_{\max} \Delta t},$
where $\lambda_{\mathrm{TV}}$ is chosen such that $|R_{\mathrm{TV}}| < |R_{\mathrm{area}}|$ on average, ensuring that standing still is not an optimal strategy. This combined reward structure balances immediate coverage gains with long-term path planning efficiency, promoting effective and complete coverage.
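The sketch below illustrates one way to compute the two reward terms from binary coverage maps at consecutive steps, following the formulas above. The λ values and the helper names (total_variation, coverage_rewards) are illustrative assumptions, and the sign convention matches the stated intent that TV reductions are rewarded.

```python
import numpy as np

def total_variation(x: np.ndarray) -> float:
    """Discrete isotropic total variation of a 2D map."""
    dx = np.diff(x, axis=0)[:, :-1]   # vertical differences, trimmed to matching shape
    dy = np.diff(x, axis=1)[:-1, :]   # horizontal differences
    return float(np.sqrt(dx ** 2 + dy ** 2).sum())

def coverage_rewards(cov_prev, cov_curr, cell_area, r_agent, v_max, dt,
                     lam_area=1.0, lam_tv=0.2):
    """Area reward for newly covered cells plus incremental TV reward."""
    a_new = (cov_curr > cov_prev).sum() * cell_area            # newly covered area A_new
    r_area = lam_area * a_new / (2.0 * r_agent * v_max * dt)   # normalized by max sweep
    delta_tv = total_variation(cov_curr) - total_variation(cov_prev)
    r_tv = -lam_tv * delta_tv / (2.0 * v_max * dt)             # reward TV reductions
    return r_area, r_tv
```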

3.2. Memory-Augmented Experience Replay

The Memory-Augmented Experience Replay (MAER) plays a pivotal role in our LIRL framework, helping to maintain symmetry between past experiences and future predictions. This mechanism enables the agent to retrieve and prioritize experiences symmetrically, focusing on both successful explorations and suboptimal paths that need improvement. This balance enhances both sample efficiency and long-term planning capabilities, allowing the agent to continuously refine its strategy for efficient coverage.
In our MAER, experiences are stored as tuples $(o_t, a_t, r_t, o_{t+1})$, where $o_t$ represents the observation at time $t$, $a_t$ is the action taken, $r_t$ is the reward received, and $o_{t+1}$ is the resulting observation. This formulation allows the agent to learn from past experiences while accounting for the partial observability inherent in many CPP scenarios.
To prioritize the most informative experiences, we introduce a novel sampling mechanism based on coverage efficiency. The probability of sampling an experience i is given by:
$p_i = \dfrac{(\Delta C_i / \Delta t_i + \epsilon)^{\alpha}}{\sum_j (\Delta C_j / \Delta t_j + \epsilon)^{\alpha}}$
Here, $\Delta C_i$ represents the increase in covered area resulting from the experience, and $\Delta t_i$ is the time taken to achieve this coverage. The term $\epsilon$ is a small constant (e.g., $10^{-5}$) that ensures non-zero sampling probabilities for all experiences, preventing rarely sampled experiences from being completely ignored. The exponent $\alpha$ controls the strength of prioritization, with higher values increasing the focus on high-efficiency experiences. In our experiments, we found $\alpha = 0.6$ to provide a good balance between prioritization and diversity in sampling.
This prioritization scheme ensures that experiences leading to efficient coverage are more likely to be sampled, allowing the agent to learn more effectively from successful strategies. However, to prevent overfitting to a narrow set of experiences, we combine this prioritized sampling with a standard uniform sampling approach, using a mixture coefficient β (typically set to 0.7 in our experiments) to balance between prioritized and uniform sampling.
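As a rough sketch of this sampling scheme, the function below computes coverage-efficiency priorities and mixes prioritized with uniform sampling using the quoted values α = 0.6 and β = 0.7. The buffer layout (parallel arrays of coverage gains and durations) and the function name are assumptions for illustration.

```python
import numpy as np

def sample_indices(delta_cov, delta_t, batch_size, alpha=0.6, beta=0.7,
                   eps=1e-5, rng=None):
    """Mix coverage-efficiency-prioritized sampling with uniform sampling.
    delta_cov[i], delta_t[i]: coverage gain and duration of stored experience i."""
    rng = rng or np.random.default_rng()
    prio = (delta_cov / delta_t + eps) ** alpha
    p = prio / prio.sum()
    n_prio = int(round(beta * batch_size))                       # prioritized share
    idx_prio = rng.choice(len(p), size=n_prio, p=p, replace=True)
    idx_unif = rng.integers(0, len(p), size=batch_size - n_prio)  # uniform share
    return np.concatenate([idx_prio, idx_unif])

# toy usage with random coverage statistics
dc = np.random.rand(1000)
dt = np.random.rand(1000) + 0.1
idx = sample_indices(dc, dt, batch_size=256)
```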
To further enhance the agent’s ability to leverage past experiences, we implement an attention-based memory retrieval mechanism. This mechanism allows the agent to selectively recall relevant information from its memory when making decisions. The retrieved memory at time t is computed as a weighted sum of memory entries:
$m_t = \sum_{i=1}^{K} w_i M_i$
where M i represents the i-th memory entry, K is the total number of memory entries, and w i are attention weights computed as:
$w_i = \dfrac{\exp\!\big(f(o_t)^{\top} g(M_i)\big)}{\sum_{j=1}^{K} \exp\!\big(f(o_t)^{\top} g(M_j)\big)}$
In this equation, f and g are embedding functions that map observations and memory entries to a common embedding space. Specifically, f is implemented as a multi-layer perceptron that encodes the current observation o t , while g is another multi-layer perceptron that encodes each memory entry M i . The dot product f ( o t ) T g ( M i ) measures the relevance of memory entry M i to the current observation o t . The softmax operation normalizes these relevance scores to produce the attention weights w i .
This attention mechanism allows the agent to focus on the most relevant past experiences when making decisions, effectively increasing the context available for decision-making beyond the current observation. By combining information from the current observation with selectively retrieved memories, the agent can make more informed decisions, particularly in partially observable environments where the current observation alone may be insufficient.
The retrieved memory m t is then concatenated with the current observation o t to form an augmented state representation s t = [ o t ; m t ] . This augmented state is used as input to the policy network, allowing the agent to condition its actions on both current observations and relevant past experiences.
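A compact PyTorch sketch of this attention read is shown below: two MLP embeddings play the roles of f and g, dot-product scores are normalized by a softmax, and the weighted sum of memory entries is concatenated with the observation. Layer sizes and class names are illustrative choices, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class MemoryRetrieval(nn.Module):
    """Attention read over K stored memory entries (the w_i and m_t expressions above)."""
    def __init__(self, obs_dim: int, mem_dim: int, embed_dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(obs_dim, embed_dim), nn.ReLU(),
                               nn.Linear(embed_dim, embed_dim))  # embeds o_t
        self.g = nn.Sequential(nn.Linear(mem_dim, embed_dim), nn.ReLU(),
                               nn.Linear(embed_dim, embed_dim))  # embeds M_i

    def forward(self, obs: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # obs: (obs_dim,), memory: (K, mem_dim)
        scores = self.g(memory) @ self.f(obs)          # (K,) relevance f(o_t)^T g(M_i)
        w = torch.softmax(scores, dim=0)               # attention weights w_i
        m_t = (w.unsqueeze(1) * memory).sum(dim=0)     # retrieved memory m_t
        return torch.cat([obs, m_t], dim=-1)           # augmented state [o_t; m_t]

retriever = MemoryRetrieval(obs_dim=16, mem_dim=16)
s_aug = retriever(torch.randn(16), torch.randn(32, 16))
print(s_aug.shape)  # torch.Size([32])
```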
By integrating prioritized sampling and attention-based memory retrieval, our MAER system significantly enhances the agent’s ability to learn efficient coverage strategies. It allows the agent to focus on the most informative experiences while also leveraging relevant past information, thereby improving sample efficiency and enabling more effective long-term planning in CPP tasks.

3.3. Latent Imagination Module

The Latent Imagination Module (LIM) is a key component of our LIRL framework, designed to enable the agent to reason about potential future states without direct environment interaction. This capability is particularly crucial in CPP tasks, where efficient exploration of unknown environments is paramount. The LIM is based on a variational world model that learns to encode observations, predict future states, and reconstruct observations from latent states.
Our LIM employs a variational world model consisting of three main components: an encoder, a dynamics model, and a decoder. This module introduces a critical symmetry between the agent’s real and imagined experiences, enabling it to reason about potential future states without direct interaction with the environment. The encoder, denoted as q ϕ ( z t | o t ) , maps an observation o t at time t to a distribution over latent states z t . The dynamics model, p θ ( z t + 1 | z t , a t ) , predicts the next latent state z t + 1 given the current latent state z t and action a t . Finally, the decoder, p ψ ( o t | z t ) , reconstructs the observation o t from the latent state z t . The symmetrical structure of the LIM allows the agent to balance immediate, sensory-based decision-making with future-oriented planning by imagining potential trajectories. This symmetry between real and imagined experiences fosters greater efficiency in path planning and long-term coverage strategies.
The encoder q ϕ ( z t | o t ) is implemented as a convolutional neural network followed by a fully connected layer, outputting the mean and diagonal covariance of a Gaussian distribution over the latent state. This allows the model to capture uncertainty in the encoding process. The dynamics model p θ ( z t + 1 | z t , a t ) is a recurrent neural network that takes the current latent state and action as input and outputs the parameters of a Gaussian distribution over the next latent state. This recurrent structure enables the model to capture temporal dependencies in the environment dynamics. The decoder p ψ ( o t | z t ) is implemented as a deconvolutional network that reconstructs the observation from the latent state.
To train this world model, we maximize the evidence lower bound (ELBO), which is derived from the variational inference framework:
$\mathcal{L}_{\mathrm{WM}} = \mathbb{E}_{q_{\phi}(z_{1:T} \mid o_{1:T})}\!\left[ \sum_{t=1}^{T} \Big( \log p_{\psi}(o_t \mid z_t) - \beta\, D_{\mathrm{KL}}\big( q_{\phi}(z_t \mid o_t)\,\|\,p_{\theta}(z_t \mid z_{t-1}, a_{t-1}) \big) \Big) \right]$
In this equation, T represents the length of the trajectory, and the expectation is taken with respect to the variational posterior q ϕ ( z 1 : T | o 1 : T ) . The first term, log p ψ ( o t | z t ) , encourages the model to accurately reconstruct observations from latent states. The second term is the Kullback–Leibler (KL) divergence between the encoded distribution q ϕ ( z t | o t ) and the predicted distribution p θ ( z t | z t 1 , a t 1 ) , which acts as a regularizer to ensure that the encoded and predicted latent states are consistent.
The hyperparameter β balances the trade-off between reconstruction accuracy and regularization. A higher β encourages the model to learn a more compressed and disentangled latent representation, while a lower β prioritizes reconstruction accuracy. In our experiments, we found that a value of β = 0.5 provides a good balance, allowing for accurate predictions while maintaining a structured latent space.
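For illustration, the sketch below shows one ELBO training step for a variational world model with a Gaussian encoder, a recurrent dynamics prior, and a decoder, using β = 0.5 as stated. The architecture is deliberately simplified to vector observations and Gaussian decoder error (the paper uses convolutional/deconvolutional networks); all class and layer choices here are assumptions.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class WorldModel(nn.Module):
    """Simplified variational world model: encoder q(z|o), dynamics p(z'|z,a), decoder p(o|z)."""
    def __init__(self, obs_dim, act_dim, z_dim=32, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))        # mean and log-std
        self.dyn_rnn = nn.GRUCell(z_dim + act_dim, hidden)            # recurrent dynamics
        self.dyn_out = nn.Linear(hidden, 2 * z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, obs_dim))
        self.z_dim, self.hidden = z_dim, hidden

    def elbo(self, obs, act, beta=0.5):
        """obs: (T, obs_dim), act: (T, act_dim); returns the negative ELBO to minimize."""
        T = obs.shape[0]
        h = torch.zeros(self.hidden)
        z_prev = torch.zeros(self.z_dim)
        loss = 0.0
        for t in range(T):
            mu_q, logstd_q = self.enc(obs[t]).chunk(2)
            q = D.Normal(mu_q, logstd_q.exp())                        # posterior q_phi(z_t | o_t)
            if t == 0:
                prior = D.Normal(torch.zeros(self.z_dim), torch.ones(self.z_dim))
            else:
                inp = torch.cat([z_prev, act[t - 1]]).unsqueeze(0)
                h = self.dyn_rnn(inp, h.unsqueeze(0)).squeeze(0)
                mu_p, logstd_p = self.dyn_out(h).chunk(2)
                prior = D.Normal(mu_p, logstd_p.exp())                # p_theta(z_t | z_{t-1}, a_{t-1})
            z = q.rsample()
            recon = self.dec(z)
            loss = loss + ((recon - obs[t]) ** 2).sum()               # -log p_psi(o_t | z_t), up to constants
            loss = loss + beta * D.kl_divergence(q, prior).sum()      # beta-weighted KL regularizer
            z_prev = z
        return loss / T

wm = WorldModel(obs_dim=16, act_dim=2)
loss = wm.elbo(torch.randn(10, 16), torch.randn(10, 2))
loss.backward()
```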
Once trained, the LIM enables the agent to imagine potential future trajectories by sampling from the learned dynamics model. Starting from an initial latent state z 0 , we can generate a sequence of imagined latent states and corresponding observations:
$z_{t+1} \sim p_{\theta}(z_{t+1} \mid z_t, a_t)$
$\hat{o}_{t+1} = \mathbb{E}_{p_{\psi}(o_{t+1} \mid z_{t+1})}\!\left[ o_{t+1} \right]$
Here, o ^ t + 1 represents the imagined observation at time t + 1 . By generating multiple such trajectories, the agent can evaluate different action sequences and their potential outcomes without actually executing them in the environment. This imagination-based planning allows the agent to reason about the long-term consequences of its actions, which is particularly beneficial in CPP tasks where efficient coverage often requires considering multi-step strategies.
The imagined trajectories are used to augment the real experiences in the replay buffer, effectively increasing the sample efficiency of the learning process. However, to prevent the agent from overfitting to potentially inaccurate imagined experiences, we introduce an imagination ratio $\gamma_{\mathrm{im}}$ that controls the proportion of imagined experiences used in each training batch. Typically, we set $\gamma_{\mathrm{im}} = 0.3$, meaning that 30% of each training batch consists of imagined experiences, while the remaining 70% are real experiences.
Furthermore, to ensure that the imagined experiences remain grounded in reality, we periodically update the initial latent state $z_0$ used for imagination based on real observations. This update frequency is controlled by a hyperparameter $\tau_{\mathrm{update}}$, which we typically set to 10 environment steps.
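The sketch below illustrates how imagined rollouts and the mixing ratio γ_im = 0.3 could be wired together. It assumes the simplified WorldModel sketch above and a placeholder policy callable; none of these names come from the paper’s code.

```python
import torch

@torch.no_grad()
def imagine_rollout(wm, policy, o0, horizon=5):
    """Roll the learned dynamics forward from a real observation o0,
    returning imagined (observation, action) pairs without touching the environment."""
    mu, _ = wm.enc(o0).chunk(2)
    z = mu                                     # start from the posterior mean of z_0
    h = torch.zeros(wm.hidden)
    traj = []
    for _ in range(horizon):
        a = policy(wm.dec(z))                  # act on the decoded (imagined) observation
        h = wm.dyn_rnn(torch.cat([z, a]).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        mu_p, logstd_p = wm.dyn_out(h).chunk(2)
        z = torch.distributions.Normal(mu_p, logstd_p.exp()).sample()
        traj.append((wm.dec(z), a))            # imagined next observation and action
    return traj

def mixed_batch(real_batch, imagined_batch, gamma_im=0.3):
    """Compose a training batch with roughly 30% imagined and 70% real transitions."""
    n_im = int(gamma_im * len(real_batch))
    return imagined_batch[:n_im] + real_batch[: len(real_batch) - n_im]
```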
By integrating the latent imagination module into our LIRL framework, we enable the agent to learn more efficient coverage strategies through a combination of real and imagined experiences. This approach significantly enhances the agent’s ability to plan and reason about long-term consequences in CPP tasks, particularly in unknown or partially observable environments where direct exploration can be costly or time-consuming.

3.4. Multi-Step Prediction Learning

The multi-step prediction learning (MSPL) component of our LIRL framework introduces a crucial symmetry in the prediction horizon, ensuring that the agent considers both the short-term and long-term outcomes of its actions. MSPL employs an adaptive n-step learning algorithm that dynamically adjusts the prediction horizon based on the current state and the agent’s uncertainty. This adaptive symmetry lets the agent weigh the immediate benefits of actions against their delayed effects, integrating current rewards with future gains to achieve efficient coverage across diverse environments.
At the core of MSPL is the n-step return, which provides a bridge between one-step temporal difference (TD) learning and Monte Carlo methods. The n-step return for a state s t is defined as:
$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n})$
where $\gamma \in [0, 1]$ is the discount factor, $r_{t+k}$ is the reward received $k$ steps after time $t$, and $V(s_{t+n})$ is the estimated value of the state encountered after $n$ steps. This formulation allows the agent to consider rewards over multiple time steps while still bootstrapping from the value function estimate.
The choice of n in the n-step return represents a trade-off. Larger values of n incorporate more actual rewards and can lead to faster learning in some cases, but they also introduce more variance into the estimates. Smaller values of n result in lower variance but may suffer from bias due to inaccurate value function estimates. In traditional RL, n is often a fixed hyperparameter. However, in our MSPL approach, we adaptively determine n based on the agent’s current uncertainty and the temporal difference error.
To implement this adaptive approach, we maintain a set of n-step returns for different values of $n$, typically ranging from 1 to a maximum value $N_{\max}$ (e.g., $N_{\max} = 10$). At each time step, we compute the temporal difference error for each n-step return:
$\delta_t^{(n)} = G_t^{(n)} - V(s_t)$
where V ( s t ) is the current estimate of the state value function. The temporal difference error δ t ( n ) provides a measure of how much the n-step return differs from the current value estimate.
We then select the value of n that minimizes the absolute temporal difference error:
$n_t = \arg\min_{n} \big| \delta_t^{(n)} \big|$
This selection criterion balances between bias and variance. When the value function estimates are accurate, shorter n-step returns (smaller n) will typically have smaller temporal difference errors. Conversely, when value function estimates are inaccurate, longer n-step returns may provide better estimates.
Once the appropriate n t is selected, we update the value function using the corresponding n-step return:
$V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t^{(n_t)}$
where α is the learning rate. This update rule moves the value function estimate towards the n-step return that is deemed most appropriate for the current state.
To further enhance the stability and efficiency of learning, we incorporate an uncertainty estimate into our n-step selection process. We maintain an uncertainty measure U ( s t ) for each state, updated using the Bellman error:
$U(s_t) \leftarrow (1 - \beta)\, U(s_t) + \beta\, \big| \delta_t^{(1)} \big|$
where β is a smoothing factor (typically set to 0.1 in our experiments). This uncertainty estimate is then used to modulate the n-step selection:
$n_t = \arg\min_{n} \Big( \big| \delta_t^{(n)} \big| + \lambda\, U(s_t) \Big)$
where λ is a hyperparameter that controls the influence of the uncertainty estimate. Higher values of λ encourage the use of shorter n-step returns in states with high uncertainty, promoting more conservative updates in these states.
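A small sketch of the adaptive selection is given below. It computes the n-step returns for n = 1..N_max and picks the horizon with the smallest penalized temporal difference error. Note one assumption: here the uncertainty penalty is weighted by n so that high uncertainty actually favors shorter horizons, consistent with the stated intent (a penalty constant in n would not change the argmin); the value estimates are placeholder inputs.

```python
import numpy as np

def n_step_returns(rewards, next_values, gamma=0.99, n_max=10):
    """G_t^(n) = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}) for n = 1..n_max,
    given the next n_max rewards and bootstrapped state values from time t."""
    returns, acc = [], 0.0
    for n in range(1, n_max + 1):
        acc += gamma ** (n - 1) * rewards[n - 1]
        returns.append(acc + gamma ** n * next_values[n - 1])
    return np.array(returns)

def select_n(returns, v_t, uncertainty, lam=0.5):
    """Pick n minimizing |delta_t^(n)| plus an uncertainty penalty.
    The penalty grows with n (assumed) so uncertain states favor short horizons."""
    n = np.arange(1, len(returns) + 1)
    td_errors = np.abs(returns - v_t)
    return int(np.argmin(td_errors + lam * uncertainty * n)) + 1

# toy usage with made-up rewards and value estimates
rs = np.full(10, 0.1)
vs = np.linspace(1.0, 0.5, 10)
G = n_step_returns(rs, vs)
n_t = select_n(G, v_t=0.9, uncertainty=0.2)
```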
The MSPL approach is particularly beneficial in CPP tasks, where the consequences of actions may not be immediately apparent. For instance, a decision to explore a particular area might only yield significant rewards after several time steps when a large unexplored region is discovered. By adaptively choosing the n-step return, our agent can better capture these delayed effects while maintaining stable learning.
Moreover, the adaptive nature of MSPL allows it to automatically adjust to different phases of the learning process. In the early stages of learning, when value estimates are likely to be inaccurate, it may favor longer n-step returns to accelerate learning. As the agent’s estimates improve, it can shift towards shorter n-step returns for more fine-grained updates.
The MSPL component integrates seamlessly with the other elements of our LIRL framework. The n-step returns are computed using both real experiences from the memory-augmented experience replay and imagined trajectories from the latent imagination module. This integration allows the agent to learn from a rich combination of actual and simulated experiences, further enhancing its ability to develop efficient coverage strategies.
By employing this adaptive multi-step prediction learning approach, our LIRL framework can effectively balance short-term and long-term predictions, leading to more efficient and robust learning in complex CPP tasks. The dynamic adjustment of the prediction horizon enables the agent to capture delayed effects of actions while maintaining stable learning, a crucial capability for effective coverage path planning in unknown environments.

3.5. Policy Learning Module

The Policy Learning Module is the cornerstone of our LIRL framework, integrating the MAER, LIM, and MSPL components to learn an effective policy for CPP. The module operates by maintaining a symmetrical balance between exploration and exploitation, leveraging the SAC algorithm’s sample efficiency and stability in continuous action spaces. The integration of past memory, future imagination, and multi-step prediction creates a harmonious feedback loop that preserves symmetry between short-term rewards and long-term planning. This equilibrium is critical in ensuring comprehensive coverage in unknown environments, where balanced decision-making is key to success.
In our framework, the policy π ϕ ( a | s ) is parameterized by a neural network with parameters ϕ , which maps states s to a distribution over actions a. The state s is an augmented representation that combines the current observation o t , the retrieved memory m t from MAER, and the latent state z t from LIM: s t = [ o t ; m t ; z t ] . This rich state representation allows the policy to make decisions based on current observations, relevant past experiences, and predictions about future states.
The SAC algorithm aims to learn a policy that maximizes both the expected return and the entropy of the policy:
$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \big( r_t + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \right]$
where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory, $\gamma$ is the discount factor, $r_t$ is the reward at time $t$, $\mathcal{H}$ is the entropy of the policy, and $\alpha$ is the temperature parameter that balances exploitation and exploration.
The training process involves alternating between collecting experiences, updating the policy, and updating the value functions. We use two Q-functions, Q θ 1 and Q θ 2 , to mitigate overestimation bias, and a value function V ψ to stabilize training. The policy is updated to maximize the expected Q-value and entropy:
$\nabla_{\phi} J(\pi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_{\phi}}\!\left[ \nabla_{\phi} \log \pi_{\phi}(a_t \mid s_t) \Big( \min_{i=1,2} Q_{\theta_i}(s_t, a_t) - \alpha \log \pi_{\phi}(a_t \mid s_t) \Big) \right]$
where D is the replay buffer containing both real and imagined experiences.
The Q-functions are updated to minimize the Bellman error:
$\mathcal{L}_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\!\left[ \Big( Q_{\theta_i}(s_t, a_t) - \big( r_t + \gamma\, ( V_{\psi}(s_{t+1}) - \alpha \log \pi_{\phi}(a_{t+1} \mid s_{t+1}) ) \big) \Big)^{2} \right]$
The value function is updated to minimize the mean squared error between its predictions and the Q-values:
$\mathcal{L}_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\!\left[ \Big( V_{\psi}(s_t) - \mathbb{E}_{a_t \sim \pi_{\phi}}\big[ \min_{i=1,2} Q_{\theta_i}(s_t, a_t) - \alpha \log \pi_{\phi}(a_t \mid s_t) \big] \Big)^{2} \right]$
The temperature parameter α is also learned to automatically adjust the balance between exploration and exploitation:
$\mathcal{L}(\alpha) = \mathbb{E}_{a_t \sim \pi_{\phi}}\!\left[ -\alpha \log \pi_{\phi}(a_t \mid s_t) - \alpha\, \mathcal{H}_{\mathrm{target}} \right]$
where $\mathcal{H}_{\mathrm{target}}$ is a target entropy.
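For reference, a PyTorch-style sketch of these updates is given below. The network classes (policy, q1, q2, value, target_value) are placeholders, the policy loss is implemented through the reparameterization trick (a common SAC implementation choice; the gradient expression above writes the score-function form), and the target entropy of −2 assumes the two-dimensional action space.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, policy, q1, q2, value, target_value, log_alpha,
               gamma=0.99, h_target=-2.0):
    """Compute SAC losses for one batch of (s, a, r, s') transitions.
    `policy(s)` is assumed to return (action, log_prob) via the reparameterization trick."""
    s, a, r, s_next = batch
    alpha = log_alpha.exp()

    # Q-function loss: regress onto r + gamma * (V(s') - alpha * log pi(a'|s'))
    with torch.no_grad():
        a_next, logp_next = policy(s_next)
        target_q = r + gamma * (target_value(s_next).squeeze(-1) - alpha * logp_next)
    loss_q = F.mse_loss(q1(s, a).squeeze(-1), target_q) + \
             F.mse_loss(q2(s, a).squeeze(-1), target_q)

    # Value loss: regress V(s) onto E[min Q(s, a~pi) - alpha * log pi(a|s)]
    a_pi, logp_pi = policy(s)
    min_q = torch.min(q1(s, a_pi), q2(s, a_pi)).squeeze(-1)
    loss_v = F.mse_loss(value(s).squeeze(-1), (min_q - alpha * logp_pi).detach())

    # Policy loss: maximize min Q - alpha * log pi (minimize the negative)
    loss_pi = (alpha.detach() * logp_pi - min_q).mean()

    # Temperature loss: drive the policy entropy toward the target H_target
    loss_alpha = (-log_alpha.exp() * (logp_pi.detach() + h_target)).mean()

    return loss_q, loss_v, loss_pi, loss_alpha
```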
Algorithm 1 integrates all components of the LIRL framework. Below is the pseudocode for the main training loop:
This training process allows the agent to continuously improve its policy by leveraging real experiences, imagined trajectories, and retrieved memories. The integration of MAER enables the agent to focus on the most informative experiences and leverage past knowledge. The LIM allows for efficient exploration through imagination, while MSPL facilitates learning from both immediate and delayed rewards. By combining these components within the SAC framework, our LIRL approach can learn effective coverage strategies that balance exploration and exploitation, adapt to unknown environments, and plan for long-term coverage objectives in complex CPP tasks.
Algorithm 1 LIRL Training Algorithm
1: Initialize policy $\pi_{\phi}$, Q-functions $Q_{\theta_1}, Q_{\theta_2}$, value function $V_{\psi}$, encoder $q_{\phi}$, dynamics model $p_{\theta}$, decoder $p_{\psi}$
2: Initialize replay buffer $\mathcal{D}$ and empty memory $\mathcal{M}$
3: for each episode do
4:   Observe initial state $s_0$
5:   for $t = 0$ to $T$ do
6:     Sample action $a_t \sim \pi_{\phi}(\cdot \mid s_t)$
7:     Execute $a_t$; observe $r_t$, $s_{t+1}$
8:     Store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
9:     Update memory $\mathcal{M}$ with $(s_t, a_t, r_t, s_{t+1})$
10:    if update step then
11:      for $j = 1$ to $N$ do
12:        Sample batch $B$ from $\mathcal{D}$
13:        Augment $B$ with imagined experiences from the LIM
14:        Retrieve relevant memories from $\mathcal{M}$
15:        Compute adaptive n-step returns using MSPL
16:        Update $Q_{\theta_1}$, $Q_{\theta_2}$, $V_{\psi}$, $\pi_{\phi}$, and $\alpha$
17:        Update the world model (encoder $q_{\phi}$, dynamics $p_{\theta}$, decoder $p_{\psi}$)
18:      end for
19:    end if
20:  end for
21: end for

3.6. Computational Complexity and Cost Analysis

To provide greater clarity regarding the computational feasibility of LIRL, we analyze the time complexity and computational costs associated with each key component of the framework: MAER, the LIM, and MSPL. This analysis aims to offer insights into the resources required for efficient implementation, as well as strategies for managing computational demands.

3.6.1. Memory-Augmented Experience Replay (MAER)

The MAER component utilizes prioritized experience retrieval to improve learning efficiency. For each training step, the prioritization process incurs a complexity of O ( log N ) for sampling experiences from the buffer, where N is the number of stored experiences. Memory retrieval, enhanced by an attention-based mechanism, adds an additional complexity of O ( N ) due to attention weight computations over the stored memory entries. This approach is efficient for learning from high-utility experiences but can be scaled down by adjusting the sampling frequency or buffer size for resource-constrained applications.

3.6.2. Latent Imagination Module (LIM)

The LIM allows the agent to generate imagined trajectories, reducing the need for extensive environment interactions. The computational cost of LIM is primarily associated with generating latent states and simulating imagined trajectories through the variational world model. Given a trajectory length T, the complexity for generating imagined rollouts is O ( T ) per rollout, where T can be controlled based on the task requirements. By tuning the frequency and length of imagined trajectories, LIM can be adapted to balance between sample efficiency and computational cost.

3.6.3. Multi-Step Prediction Learning (MSPL)

MSPL dynamically adjusts the prediction horizon by evaluating multiple n-step returns. This component has a complexity of O ( K ) per time step, where K is the maximum number of n-steps considered. In typical implementations, K is kept small (e.g., K = 5 ) to prevent excessive computational overhead while still capturing the delayed effects of actions. For settings where computation is limited, a fixed n-step approach may be substituted for MSPL’s adaptive process to reduce this overhead.

3.6.4. Overall Complexity Within the SAC Framework

Combining MAER, LIM, and MSPL within the SAC framework introduces an overall time complexity of O ( N + T + K ) per training iteration, reflecting the sum of prioritized sampling, latent rollouts, and multi-step prediction computations. While this integration offers enhanced efficiency and adaptability in unknown environments, the computational load can be adjusted by simplifying components or using approximations based on deployment constraints.
In summary, this complexity analysis provides a basis for understanding the computational demands of LIRL and highlights tuning options that allow practitioners to manage these requirements in real-world applications. We have included these considerations to ensure that LIRL’s implementation remains feasible and adaptable, even in resource-constrained settings.

4. Experiments

To evaluate the effectiveness of our proposed LIRL framework for efficient CPP, we conducted a series of comprehensive experiments. This section details our experimental setup, baseline comparisons, and results, followed by ablation studies to analyze the contribution of each component of our framework.

4.1. Experimental Setup

Environments

We tested our LIRL framework in three distinct simulated environments, each representing different challenges in CPP, as shown in Figure 3:
(1) Map1: A structured indoor space (50 m × 50 m) with rooms, corridors, and obstacles, simulating a typical office layout.
(2) Map2: A large open space (100 m × 100 m) with irregularly placed shelves and objects, representing a warehouse setting.
(3) Map3: An unstructured area (200 m × 200 m) with various terrains and obstacles, simulating a search-and-rescue scenario.
Figure 3. Schematic diagram of (a) Map1, (b) Map2, and (c) Map3.

4.2. Implementation and Test Environment

The implementation of the LIRL framework and all experiments were conducted using the following software and hardware specifications to ensure reproducibility and transparency in computational requirements.
The framework was developed primarily in Python, leveraging libraries such as PyTorch for neural network implementation and optimization, and OpenAI Gym for simulation environment management. All experiments were run on a Linux-based system (Ubuntu 20.04), which provides compatibility with the machine learning libraries and facilitates efficient computation management in a high-performance environment. The tests were conducted on a workstation equipped with an Intel Core i9 processor, 32 GB of RAM, and an NVIDIA RTX 3090 GPU with 24 GB of dedicated memory. This setup allowed us to manage the high computational demands associated with RL and extensive simulation-based testing. Key libraries included PyTorch for model development, NumPy and Pandas for data handling, Matplotlib for result visualization, and the OpenAI Gym toolkit for environment simulation. These libraries enabled efficient model training, data processing, and result analysis. Each experiment involved running simulations over 1000 episodes, with each episode consisting of a variable number of steps depending on the environment and task requirements. This provided sufficient training and evaluation data to assess the adaptability and efficiency of the LIRL framework.
This configuration ensured that the computational requirements of the LIRL framework were met, facilitating smooth model training and evaluation. The specified setup also provides a basis for others to replicate our results or conduct further experiments on comparable hardware.
To provide transparency and aid reproducibility, we detail the key parameters used in the LIRL framework, along with their assigned values and the rationale for each choice.
  • Learning Rate: The learning rate for both the policy and value networks was set to $3 \times 10^{-4}$. This value, commonly used in RL algorithms such as SAC, balances convergence speed with stability. Preliminary testing indicated that this rate optimized learning efficiency and stability within the LIRL framework.
  • Discount Factor ( γ ): We used a discount factor of 0.99 to emphasize long-term rewards, which is crucial for efficient coverage in large environments. This setting encourages the agent to prioritize actions that contribute to future coverage rather than immediate gains.
  • Temperature Parameter ( α ): The SAC temperature parameter, which controls exploration-exploitation balance, was set to 0.2. This value supports effective exploration without an excessive focus on random actions, allowing the agent to balance exploration with goal-oriented planning.
  • MAER Buffer Size: The memory-augmented experience replay (MAER) buffer was set to store up to 100,000 experiences, ensuring that the agent has access to a wide variety of past experiences for prioritized sampling. This buffer size is sufficient for complex environments while remaining manageable in terms of memory usage.
  • LIM Trajectory Length: In the latent imagination module (LIM), we set the imagined trajectory length T = 5 . This length allows the agent to predict short-term paths without incurring significant computational costs, achieving a balance between sample efficiency and processing load.
  • MSPL Horizon ( K ): The multi-step prediction learning (MSPL) horizon parameter was set to K = 5 , allowing the agent to consider multi-step returns while adapting based on uncertainty. This horizon length enhances long-term planning without the complexity associated with longer horizons.
These parameters were chosen based on RL best practices and were fine-tuned through preliminary experiments to achieve optimal performance within our test environments.
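For convenience, the values quoted in Sections 3 and 4 can be gathered into a single configuration dictionary, as sketched below; the key names are our own and purely illustrative.

```python
LIRL_CONFIG = {
    "learning_rate": 3e-4,          # policy and value networks
    "gamma": 0.99,                  # discount factor
    "alpha_init": 0.2,              # SAC temperature (also tuned automatically)
    "maer_buffer_size": 100_000,    # replay buffer capacity
    "maer_alpha": 0.6,              # prioritization exponent
    "maer_beta": 0.7,               # prioritized/uniform mixing coefficient
    "lim_trajectory_length": 5,     # imagined rollout horizon T
    "lim_imagination_ratio": 0.3,   # gamma_im, share of imagined samples per batch
    "lim_update_interval": 10,      # tau_update, environment steps between z_0 refreshes
    "world_model_beta": 0.5,        # KL weight in the ELBO
    "mspl_horizon": 5,              # maximum n considered (K)
    "mspl_uncertainty_smoothing": 0.1,
}
```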

4.2.1. Performance Metrics

We evaluated performance using the following metrics:
(1) Coverage Percentage: the proportion of the accessible area covered.
(2) Coverage Time: the time taken to achieve 95% coverage.
(3) Path Efficiency: the ratio of the covered area to the total path length.
(4) Collision Rate: the number of collisions per minute of operation.
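The sketch below shows one way these four metrics could be computed from a logged run; the data layout (per-step covered-cell counts, total path length, collision count) is an assumption made for illustration.

```python
import numpy as np

def compute_metrics(coverage_over_time, path_length, collisions, dt,
                    accessible_area, cell_area, threshold=0.95):
    """coverage_over_time: array of covered-cell counts per step; returns the four metrics."""
    coverage_over_time = np.asarray(coverage_over_time, dtype=float)
    covered_area = coverage_over_time[-1] * cell_area
    coverage_pct = covered_area / accessible_area
    # First step at which 95% of the accessible area is covered, converted to time.
    reached = np.flatnonzero(coverage_over_time * cell_area >= threshold * accessible_area)
    coverage_time = (reached[0] + 1) * dt if reached.size else float("inf")
    path_efficiency = covered_area / path_length
    collision_rate = collisions / (len(coverage_over_time) * dt / 60.0)  # per minute
    return coverage_pct, coverage_time, path_efficiency, collision_rate
```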

4.2.2. Baselines and Implementation Details

To evaluate the effectiveness of the LIRL framework, we selected several widely used algorithms for comparison: deep deterministic policy gradient (DDPG), soft actor–critic (SAC), proximal policy optimization (PPO), and the spiral-STC algorithm. Below, we provide the rationale for choosing each method and clarify the implementation sources.
The selected algorithms represent a range of approaches to CPP and RL:
  • DDPG and SAC are model-free RL algorithms suitable for continuous action spaces, similar to our LIRL framework. Their established use in CPP tasks makes them suitable benchmarks for performance evaluation.
  • PPO is a popular on-policy RL method known for stable performance in sequential decision-making tasks, providing a useful benchmark for sample efficiency and policy optimization.
  • Spiral-STC is a classical CPP algorithm, which allows for a comparison of LIRL’s performance against traditional path planning methods in terms of coverage efficiency and path smoothness.
For DDPG, SAC, and PPO, we based our implementations on open-source libraries (e.g., OpenAI Baselines) and matched the parameters as closely as possible to our LIRL framework for consistency. Minor adjustments were made to optimize their performance within our simulation environments. Spiral-STC was implemented in-house following its original algorithmic formulation, ensuring direct comparability in coverage and efficiency metrics.
These details clarify the rationale for selecting these comparison algorithms and ensure transparency in the implementation sources, contributing to a fair and comprehensive evaluation of the LIRL framework.

4.3. Results and Analysis

4.3.1. Overall Performance

Table 2 presents the performance of LIRL compared to the baselines across all three environments, averaged over 50 runs.
Table 2. Performance comparison of LIRL with baselines.
As shown in Table 2, LIRL consistently outperforms all baselines across all environments and metrics. In Map1, LIRL achieves 98.2 % coverage in 12.3 minutes, demonstrating a 20.4 % reduction in coverage time compared to the next best learning-based method (MB-I). The performance gap widens in more complex environments, with LIRL showing a 27.3 % reduction in coverage time in the warehouse environment and a 34.7 % reduction in the outdoor environment compared to MB-I.
LIRL also demonstrates superior path efficiency and collision avoidance. In Map3, LIRL achieves a path efficiency of 0.83 , which is 5.1 % higher than MB-I and 27.7 % higher than DDPG. The collision rate of LIRL is consistently low across all environments, with only 0.09 collisions per minute in the challenging outdoor environment, representing a 40 % reduction compared to MB-I and a 71 % reduction compared to DDPG.

4.3.2. Learning Efficiency

Figure 4 illustrates the learning curves of LIRL and the baselines in Map2, showing the coverage percentage achieved over training episodes.
Figure 4. Learning curves of LIRL and baselines in Map2.
LIRL demonstrates faster learning and convergence to higher performance levels than the other methods. It achieves 90% coverage within 500 episodes, whereas the strongest baselines require roughly 700–800 episodes to reach the same level of performance. This faster learning can be attributed to the effective combination of the memory-augmented experience replay and the latent imagination module, which allow for more efficient use of experiences and better long-term planning.

4.3.3. Adaptability to Environmental Changes

To evaluate the adaptability of LIRL, we introduced sudden changes to the environments during testing. These changes included the addition or removal of obstacles and alterations to the terrain. Figure 5 shows the coverage performance of LIRL and baselines before and after these changes in Map3.
Figure 5. Adaptability to environmental changes in Map3.
LIRL demonstrates superior adaptability, maintaining 97.2 % of its pre-change performance after significant environmental alterations, compared to 85.6 % for SAC. This adaptability can be attributed to the combination of the latent imagination module, which allows for rapid replanning, and the multi-step prediction learning, which enables effective learning from both immediate and delayed consequences of actions.

4.4. Ablation Studies

To understand the contribution of each component of LIRL, we conducted ablation studies by removing or replacing key components of the framework. Table 3 presents the results of these studies in Map2.
Table 3. Ablation study results in Map2.
The ablation studies reveal that each component of LIRL contributes significantly to its performance:
(1) Removing the MAER results in a 16.5% increase in coverage time and a 71.4% increase in collision rate, highlighting its importance in efficient learning and safe navigation.
(2) Without the LIM, coverage time increases by 21.8% and path efficiency decreases by 8.1%, demonstrating LIM’s crucial role in effective planning and exploration.
(3) Removing the MSPL leads to a 9.1% increase in coverage time and a 28.6% increase in collision rate, indicating its importance in balancing short-term and long-term decision-making.
(4) Replacing the adaptive n-step learning with a fixed n-step approach (n = 5) results in a 6.3% increase in coverage time, showing the benefits of dynamically adjusting the prediction horizon.
These results confirm that each component of LIRL plays a vital role in its overall performance, with their combination leading to significant improvements in coverage efficiency, path planning, and collision avoidance across various CPP scenarios.

5. Conclusions

This paper introduces LIRL, a novel framework designed to address the challenges of CPP in unknown environments. A core strength of the LIRL framework lies in its ability to maintain symmetry between exploration and exploitation, balancing short-term decision-making with long-term planning. This symmetry is achieved through the seamless integration of three key components: MAER, LIM, and MSPL, all operating within a SAC architecture. This innovative combination enables LIRL to effectively leverage past experiences, simulate potential future scenarios, and balance short-term and long-term decision-making in complex CPP tasks.
The MAER component enhances sample efficiency by prioritizing and retrieving relevant past experiences, allowing the agent to learn more effectively from its history. This is particularly crucial in CPP tasks, where certain experiences, such as discovering new areas or avoiding obstacles, may be rare but highly informative. The LIM empowers the agent with the ability to imagine and evaluate potential future trajectories without direct environment interaction. This capability is essential for long-term planning and efficient exploration in partially observable or dynamic environments, which are common challenges in real-world CPP applications. The MSPL component introduces an adaptive n-step learning mechanism that dynamically balances immediate rewards against long-term consequences. This adaptability is key to developing efficient coverage strategies that navigate the trade-offs between thorough exploration and timely task completion.
Our experimental results, conducted across various simulated environments, demonstrate LIRL’s superior performance in terms of coverage efficiency, path planning, and collision avoidance compared to state-of-the-art methods. The framework’s ability to maintain high performance even after significant environmental changes underscores its potential for real-world deployment, where adaptability is paramount.
Future work will focus on deploying LIRL in real-world settings, particularly in indoor and outdoor environments where it can be tested on tasks that involve navigating dynamic and partially observable terrains. This testing will be essential for confirming LIRL’s effectiveness in physical applications and exploring how it can be extended or adapted to multi-robot systems for large-scale CPP tasks. Additionally, incorporating advanced perception systems to recognize and prioritize regions based on semantic information could further enhance the framework’s adaptability, making it more context-aware and intelligent in handling complex environments.

Author Contributions

Conceptualization, Z.W. and M.Z.; methodology, Z.W. and T.S.; software, Z.W.; validation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W.; visualization, Z.W.; supervision, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

List of Abbreviations and Notations

Abbreviations
CPP: Coverage path planning
RL: Reinforcement learning
LIRL: Latent imagination-based reinforcement learning
MAER: Memory-augmented experience replay
LIM: Latent imagination module
MSPL: Multi-step prediction learning
POMDP: Partially observable Markov decision process
SAC: Soft actor–critic
DNC: Differentiable neural computer
UAV: Unmanned aerial vehicle
DDPG: Deep deterministic policy gradient
MERLIN: Memory-augmented reinforcement learning in a goal-directed agent
Notations
N: Number of stored experiences in the replay buffer
T: Length of a trajectory
K: Maximum number of n-steps considered in multi-step prediction learning
$s_t$: State of the agent at time step t
$a_t$: Action taken by the agent at time step t
$r_t$: Reward received by the agent at time step t
$\gamma$: Discount factor for future rewards
$\pi(a \mid s)$: Policy, mapping states to action probabilities
$V(s)$: Value function of a state
$Q(s, a)$: Action-value function for a state–action pair
$\alpha$: Temperature parameter in soft actor–critic for the exploration–exploitation balance
$G_t^{(n)}$: n-step return for state $s_t$
$\delta_t^{(n)}$: Temporal difference error for the n-step return
$\lambda$: Hyperparameter for uncertainty modulation in adaptive n-step selection
$U(s_t)$: Uncertainty estimate for state $s_t$

References

  1. Bormann, R.; Jordan, F.; Hampp, J.; Hägele, M. Indoor coverage path planning: Survey, implementation, analysis. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1718–1725.
  2. Galceran, E.; Carreras, M. A survey on coverage path planning for robotics. Robot. Auton. Syst. 2013, 61, 1258–1276.
  3. Jin, J.; Tang, L. Coverage path planning on three-dimensional terrain for arable farming. J. Field Robot. 2011, 28, 424–440.
  4. Huang, Q. Model-based or model-free, a review of approaches in reinforcement learning. In Proceedings of the 2020 International Conference on Computing and Data Science (CDS), Stanford, CA, USA, 1–2 August 2020; pp. 219–221.
  5. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor–critic algorithms and applications. arXiv 2018, arXiv:1812.05905.
  6. Sutton, R.S. Reinforcement Learning: An Introduction; A Bradford Book; MIT Press: Cambridge, MA, USA, 2018.
  7. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (PMLR), Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
  8. Igl, M.; Zintgraf, L.; Le, T.A.; Wood, F.; Whiteson, S. Deep variational reinforcement learning for POMDPs. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 2117–2126.
  9. Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering Atari with discrete world models. arXiv 2020, arXiv:2010.02193.
  10. Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwińska, A.; Colmenarejo, S.G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. Hybrid computing using a neural network with dynamic external memory. Nature 2016, 538, 471–476.
  11. Fortunato, M.; Tan, M.; Faulkner, R.; Hansen, S.; Puigdomènech Badia, A.; Buttimore, G.; Deck, C.; Leibo, J.Z.; Blundell, C. Generalization of reinforcement learners with working and episodic memory. Adv. Neural Inf. Process. Syst. 2019, 32.
  12. Ha, D.; Schmidhuber, J. Recurrent world models facilitate policy evolution. Adv. Neural Inf. Process. Syst. 2018, 31.
  13. Kaiser, L.; Babaeizadeh, M.; Milos, P.; Osinski, B.; Campbell, R.H.; Czechowski, K.; Erhan, D.; Finn, C.; Kozakowski, P.; Levine, S.; et al. Model-based reinforcement learning for Atari. arXiv 2019, arXiv:1903.00374.
  14. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
  15. Faust, A.; Oslund, K.; Ramirez, O.; Francis, A.; Tapia, L.; Fiser, M.; Davidson, J. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 5113–5120.
  16. Brown, S.; Waslander, S.L. The constriction decomposition method for coverage path planning. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 3233–3238.
  17. Cabreira, T.M.; Ferreira, P.R.; Di Franco, C.; Buttazzo, G.C. Grid-based coverage path planning with minimum energy over irregular-shaped areas with UAVs. In Proceedings of the 2019 International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, USA, 11–14 June 2019; pp. 758–767.
  18. Liu, S.; Xia, J.; Sun, X.; Zhang, Y. An efficient complete coverage path planning in known environments. J. Northeast Norm. Univ. 2011, 43, 39–43.
  19. Theile, M.; Bayerlein, H.; Nai, R.; Gesbert, D.; Caccamo, M. UAV coverage path planning under varying power constraints using deep reinforcement learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 1444–1449.
  20. Narottama, B.; Shin, S.Y. UAV coverage path planning with quantum-based recurrent deep deterministic policy gradient. IEEE Trans. Veh. Technol. 2023, 73, 7424–7429.
  21. Heydari, J.; Saha, O.; Ganapathy, V. Reinforcement learning-based coverage path planning with implicit cellular decomposition. arXiv 2021, arXiv:2110.09018.
  22. Aydemir, F.; Cetin, A. Multi-agent dynamic area coverage based on reinforcement learning with connected agents. Comput. Syst. Sci. Eng. 2023, 45, 215–230.
  23. Ghavamzadeh, M.; Mannor, S.; Pineau, J.; Tamar, A. Bayesian reinforcement learning: A survey. Found. Trends Mach. Learn. 2015, 8, 359–483.
  24. Brown, D.; Coleman, R.; Srinivasan, R.; Niekum, S. Safe imitation learning via fast Bayesian reward inference from preferences. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 13–18 July 2020; pp. 1165–1177.
  25. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 2021, 76, 243–297.
  26. Loquercio, A.; Segu, M.; Scaramuzza, D. A general framework for uncertainty estimation in deep learning. IEEE Robot. Autom. Lett. 2020, 5, 3153–3160.
  27. Wei, M.; Zheng, L.; Li, H.; Cheng, H. Adaptive neural network-based model path-following contouring control for quadrotor under diversely uncertain disturbances. IEEE Robot. Autom. Lett. 2024, 9, 3751–3758.
  28. Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), New York City, NY, USA, 19–24 June 2016; pp. 1842–1850.
  29. Karunaratne, G.; Schmuck, M.; Le Gallo, M.; Cherubini, G.; Benini, L.; Sebastian, A.; Rahimi, A. Robust high-dimensional memory-augmented neural networks. Nat. Commun. 2021, 12, 2468.
  30. Bae, H.; Kim, G.; Kim, J.; Qian, D.; Lee, S. Multi-robot path planning method using reinforcement learning. Appl. Sci. 2019, 9, 3057.
  31. Wayne, G.; Hung, C.C.; Amos, D.; Mirza, M.; Ahuja, A.; Grabska-Barwinska, A.; Rae, J.; Mirowski, P.; Leibo, J.Z.; Santoro, A.; et al. Unsupervised predictive memory in a goal-directed agent. arXiv 2018, arXiv:1803.10760.
  32. Edgar, I. A Guide to Imagework: Imagination-Based Research Methods; Routledge: London, UK, 2004.
  33. Pascanu, R.; Li, Y.; Vinyals, O.; Heess, N.; Buesing, L.; Racanière, S.; Reichert, D.; Weber, T.; Wierstra, D.; Battaglia, P. Learning model-based planning from scratch. arXiv 2017, arXiv:1707.06170.
  34. Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to control: Learning behaviors by latent imagination. arXiv 2019, arXiv:1912.01603.
  35. Liu, K.; Stadler, M.; Roy, N. Learned sampling distributions for efficient planning in hybrid geometric and object-level representations. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9555–9562.
  36. Argenson, A.; Dulac-Arnold, G. Model-based offline planning. arXiv 2020, arXiv:2008.05556.
  37. Tang, Y.; Munos, R.; Rowland, M.; Avila Pires, B.; Dabney, W.; Bellemare, M. The nature of temporal difference errors in multi-step distributional reinforcement learning. Adv. Neural Inf. Process. Syst. 2022, 35, 30265–30276.
  38. Schoknecht, R.; Riedmiller, M. Speeding-up reinforcement learning with multi-step actions. In Proceedings of the Artificial Neural Networks—ICANN 2002: International Conference, Madrid, Spain, 28–30 August 2002; pp. 813–818.
  39. De Asis, K.; Hernandez-Garcia, J.; Holland, G.; Sutton, R. Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  40. Han, S.; Chen, Y.; Chen, G.; Yin, J.; Wang, H.; Cao, J. Multi-step reinforcement learning-based offloading for vehicle edge computing. In Proceedings of the 2023 15th International Conference on Advanced Computational Intelligence (ICACI), Seoul, Republic of Korea, 6–9 May 2023; pp. 1–8.
  41. Klamt, T.; Behnke, S. Planning hybrid driving-stepping locomotion on multiple levels of abstraction. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1695–1702.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
