Article

A Macro-Control and Micro-Autonomy Pathfinding Strategy for Multi-Automated Guided Vehicles in Complex Manufacturing Scenarios

College of Computer Science and Technology (College of Artificial Intelligence), Zhejiang Sci-Tech University, Hangzhou 310018, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5249; https://doi.org/10.3390/app15105249
Submission received: 26 March 2025 / Revised: 23 April 2025 / Accepted: 4 May 2025 / Published: 8 May 2025

Abstract

To effectively plan the travel paths of automated guided vehicles (AGVs) in complex manufacturing scenarios and avoid dynamic obstacles, this paper proposes a pathfinding strategy that integrates macro-control and micro-autonomy. At the macro level, a central system employs a modified A* algorithm for preliminary pathfinding, guiding the AGVs toward their targets. At the micro level, a distributed system incorporates a navigation and obstacle avoidance strategy trained by Prioritized Experience Replay Double Dueling Deep Q-Network with ε-Dataset Aggregation (PER-D3QN-EDAgger). Each AGV integrates its current state with information from the central system and the neighboring AGVs to make autonomous pathfinding decisions. The experimental results indicate that this strategy exhibits a strong adaptability to diverse environments, low path costs, and rapid solution speeds. It effectively avoids the neighboring AGVs and other dynamic obstacles, and maintains a high task completion rate of over 95% when the number of AGVs is below 200 and the obstacle density is below 0.5. This approach combines the advantages of centralized pathfinding, which ensures high path quality, with distributed planning, which enhances adaptability to dynamic environments.

1. Introduction

In recent years, due to the demand for cost reduction and increased efficiency, as well as the promotion of intelligent warehousing technology, an increasing number of factories have begun adopting AGVs (automated guided vehicles) as part of their industrial logistics systems. An AGV is an unmanned transport device guided by a central system, with a certain degree of autonomous decision-making capability [1,2] and LiDAR-based detection capability [3,4]. As a crucial component of the intelligent warehouse system, improving the pathfinding and coordination capabilities of AGVs is key to enhancing industrial logistics efficiency within the workshop.
The multi-AGV pathfinding problem aims to determine a set of conflict-free paths for multiple AGVs while minimizing the overall cost of the path. This problem falls into the category of multi-agent pathfinding (MAPF) [5]. The current solutions to the MAPF problem can be broadly classified into centralized and distributed approaches.

1.1. Centralized Solutions

Centralized methods rely on a central system to compute and assign a feasible conflict-free path to each AGV. These methods can be further divided into A*-based search algorithms, conflict-based search algorithms, and other heuristic algorithms.
A*-based search algorithms are extensions of the A* algorithm adapted for multi-agent scenarios. Representative algorithms include OD A*, ID A* [6], OD rM* [7], and other improved A* algorithms [8,9]. These algorithms either simultaneously expand all possible states of all agents at the next time step or construct a k-agent state space by sequentially expanding the states of each agent and then performing searches based on this state space.
In conflict-based search (CBS) algorithms and their variants [10,11,12,13,14], search tree nodes do not store agent states directly but instead store the solution (i.e., a set of paths), along with detected conflicts and constraints. The algorithm operates on the following two levels: the low-level search plans individual AGV paths under the given constraints, typically using A* or its variants; the high-level search evaluates solutions based on cost, the number of conflicts, or the number of conflicting agent pairs. Then, it iteratively adds constraints and resolves the problem until a conflict-free solution is found.
Other metaheuristic algorithms include Ant Colony Optimization (ACO) [15], Genetic Algorithm (GA) [16], Differential Evolution (DE) [17], Artificial Potential Field (APF) [18], Particle Swarm Optimization (PSO) [19], and others. These algorithms generally achieve high solution quality; however, like other centralized methods, they still suffer from poor adaptability to dynamic obstacles, limited real-time planning capability, and poor scalability to large-scale AGV environments.

1.2. Distributed and Hybrid Solutions

Distributed methods allow AGVs to plan their paths independently using only locally available information, without requiring global data or dependence on a central planning system. These distributed navigation models are typically trained using imitation learning or reinforcement learning techniques.
Multi-agent reinforcement learning (MARL) is widely used in MAPF and can be broadly divided into the following four categories:
First, in the analysis of emergent behaviors, also called independent learning (IL), each agent treats all other agents as part of the environment and independently applies single-agent RL algorithms, interacting and acting in a way that maximizes its own reward (e.g., IQL [20], IPPO [21]).
Next, with learning communication, agents exchange explicit messages to share intentions. During training, they learn when and what to communicate based on their local observations, deciding whether to send a message, what information to include, and which peers to address (e.g., NDQ [22]).
Learning cooperation involves using centralized training with decentralized execution. Agents learn implicit cooperation strategies by sharing value or policy information during training but acting individually at runtime (e.g., VDN [22], QMIX [23], QTRAN [24], MADDPG [25], COMA [26]).
Finally, in agent modeling, agents build models of the policies, goals, or types of other agents to predict their actions and achieve more effective coordination (e.g., ToMnet [27]).
However, reinforcement learning-based distributed navigation strategies face challenges such as sparse rewards, leading to slow convergence. Several approaches have been explored to address these issues, including the following:
Imitation and Reinforcement Learning: A previous study [28] proposed a learning framework called PRIMAL to train a distributed solution model, which combined reinforcement learning and imitation learning to improve model convergence and reduce training costs. Building on PRIMAL, PRIMAL2 [29] introduced specialized optimizations for corridor structures on the obstacle map to alleviate AGV blocking in narrow aisles. The centralized system provides the distributed system with early warnings about corridor structures, as well as information about other AGVs within them.
Task Decomposition to Shorten Decision Chain: Previous research [30,31] proposed an evolutionary reinforcement learning-based method called MAPPER. Under the guidance of a central planner, the algorithm would decompose long-duration tasks into multiple simpler tasks to help the distributed planner. This method mitigates the sparse rewards problem and improves agent performance in large-scale environments.
Enhanced Communication Methods: The literature [32] proposed a distributed method based on prioritized communication learning called PICO. PICO integrates implicit planning priorities into a decentralized MARL framework, enabling a dynamic communication topology for effective collision avoidance. Another study [33] proposed a distributed method based on Message-Aware Graph Attention Networks (MAGATs), which use a query–key mechanism to determine the relative importance of information from neighboring agents. A request–response communication protocol, DCC [34], was proposed to allow each agent to select the nearby agents whose messages are more likely to influence its own policy.
Problem Decomposition: A previous study [35] proposed a hybrid solution that divided MAPF into two sub-tasks, reaching the goal and avoiding collisions, solved each with RL methods, and then merged the learned “goal reach” and “collision avoidance” policies into a unified strategy.
Providing Additional Immediate Rewards: Some methods introduce artificial potential fields as extra rewards to mitigate the sparse reward problem [36]. Others use A* guidance, offering extra rewards for following or returning to the guided path [37].
Initialization of Models with Other Algorithms: Some approaches, such as using the flower pollination algorithm [38], initialize the model parameters to speed up learning and bypass the initial random exploration phase.

1.3. Comparison of Centralized and Distributed Solutions

Among the aforementioned solutions, centralized approaches offer high-quality solutions but lack robustness in complex environments. They struggle to adapt to dynamic obstacles and exhibit an exponential growth in time complexity as the number of AGVs increases, making them unsuitable for large-scale AGV deployments.
In contrast, distributed solutions excel in adaptability to dynamic environments and scale linearly with the number of AGVs, making them highly scalable. However, they suffer from lower solution quality, reduced path feasibility, and slow convergence during training, leading to long training times.
To reduce the complexity of the MAPF problem while combining the strengths of both centralized and distributed solutions, this paper decomposes the MAPF problem into the following two sub-problems: AGV pathfinding and AGV obstacle avoidance. The central system pre-plans paths using A* (ignoring other AGVs and unexpected obstacles), while each AGV leverages its onboard distributed navigation strategy to handle dynamic obstacle avoidance, achieving a hybrid approach of macro-level control and micro-level autonomy. By using a single agent A* for preliminary planning, the central system maintains linear scalability in solving time as the number of AGVs increases. Meanwhile, the distributed navigation strategy of AGVs enables them to decide whether to follow the centrally planned path or reroute around obstacles based on real-time environmental conditions. This hybrid strategy ensures robustness in complex manufacturing environments.

2. Problem Description and Model Formulation

The multi-stage lifelong multi-AGV pathfinding problem is a specialized form of the multi-agent pathfinding (MAPF) problem. A key distinction lies in the lifelong pathfinding requirement and multi-stage transportation tasks, which better reflect real-world factory operations.

2.1. Description of Multi-Stage Lifelong Multi-AGV Path Finding

To better reflect real-world manufacturing scenarios, where transportation tasks are assigned dynamically, spanning multiple floors and workshops, this paper defines the multi-stage lifelong multi-AGV pathfinding problem. In this problem, AGVs continuously execute transportation tasks assigned by the central system, plan their paths accordingly, and collaborate to complete multi-stage transportation while avoiding known and unforeseen obstacles. The objective is to minimize overall transportation costs and maximize transportation efficiency.
The multi-stage lifelong multi-AGV pathfinding problem can be described as a tuple (G, A, T), expressed as follows:
G = (N, E): A directed graph where nodes n ∈ N represent traversable locations for AGVs, and edges (n, n′) ∈ E indicate feasible transitions between nodes;
A = {a_1, a_2, a_3, …, a_k}: The set of AGVs, where k denotes the number of AGVs. Each AGV a_i has a recorded current position and other status information;
T = {t_1, t_2, t_3, …, t_m}: The set of multi-stage transportation tasks, where m is the total number of tasks. The number of tasks m is usually greater than the number of AGVs k, so an AGV can perform multiple tasks. Each multi-stage task consists of the following subtasks:
t_i = {s_1, s_2, …, s_n}
where n is the number of subtasks in a multi-stage task t_i. Each subtask s_i is defined by a set of key waypoints, denoted as follows:
s_i = {n_1, n_2, …, n_goal}
AGVs must visit these key waypoints sequentially to complete the subtask. Each multi-stage task can be collaboratively executed by multiple AGVs, with each AGV handling one or more subtasks. A subtask s_i is always executed by a single AGV.
Time is discretized into timesteps; at each timestep, each AGV selects a discrete action from its action space to move along an edge from its current node to the next node. The path of an AGV from its current location to a key waypoint is represented as follows:
P_{n_cur → n_key} = {n_cur, n_mid1, n_mid2, …, n_key}
The AGV moves step by step until it reaches the key waypoint. The complete path for a subtask s_i is then the concatenation of the paths between its consecutive key waypoints:
P_{s_i} = {P_{n_start → n_1}, P_{n_1 → n_2}, P_{n_2 → n_3}, …, P_{n_goal−1 → n_goal}}
For a multi-stage transportation task with n subtasks, the overall pathfinding solution consists of the following set of paths:
P_{t_i} = {P_{s_1}, P_{s_2}, …, P_{s_n}}
The final transportation plan for all AGVs is the collective set of planned paths for all tasks:
P_total = {P_{t_1}, P_{t_2}, …, P_{t_m}}

2.1.1. Optimization Objective

min (1/m) Σ_{i=1}^{m} T_i
where T_i is the total time required to complete the transportation task t_i along the path P_{t_i}. The goal is to minimize the average completion time of tasks under the condition that a certain number of AGVs operate at full capacity.

2.1.2. Basic Assumptions

  • Transportation tasks are assigned dynamically, and future tasks cannot be predicted at the current time;
  • AGV locations and motion states are known;
  • Partial obstacle information is known, including factory layout and pre-mapped equipment, while unknown obstacles (e.g., debris, human workers) must be detected by AGVs;
  • AGVs can detect obstacles within a limited range around themselves;
  • AGVs can communicate, broadcasting their operational states and planned trajectories;
  • Time is discretized. At each timestep, AGVs can take an action.

2.2. AGV-Referenced Action Space

In traditional MAPF problems, the action space for AGVs is typically defined in terms of global movement directions, including move up, move down, move left, move right, and wait. However, this formulation ignores orientation and turning actions, assuming that AGVs can move omni-directionally.
In real-world manufacturing environments, most AGVs cannot move omnidirectionally; they must rotate before changing direction. As Figure 1 shows, a left turn requires an AGV to stop, rotate 90° to the left, and then move forward. Although turning does not change the AGV’s position, it incurs a time penalty similar to moving forward.
To better simulate AGV movement, this paper redefines the action space based on the AGV’s own reference frame, allowing the AGV to move forward, rotate 90° left, rotate 90° right, rotate 180° back, or wait. Because most of an AGV’s rotation time is spent stopping and re-accelerating, the actual cost of the rotation itself is minimal. Although a single 180° turn takes longer than a 90° turn, it still takes much less time than performing two consecutive 90° turns. Therefore, in this work, a 180° rotation is treated as a basic action and occupies exactly one time step.
By incorporating rotation actions, this action space accurately models the time penalty for turning, improving the realism of AGV motion planning. Unifying all basic actions under the same time step duration simplifies subsequent model training; during training, each agent can select one basic action per time step, and each action receives environmental feedback within that same step.
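As an illustration, the sketch below encodes this AGV-referenced action space and the resulting state update in Python; the enum names, the heading convention, and the state tuple are our own assumptions rather than the paper's implementation.

```python
from enum import Enum

class Action(Enum):
    FORWARD = 0       # move one cell in the current heading
    ROTATE_LEFT = 1   # rotate 90° left in place
    ROTATE_RIGHT = 2  # rotate 90° right in place
    ROTATE_BACK = 3   # rotate 180° in place
    WAIT = 4          # stay put for one time step

# Assumed heading convention: 0 = North, 1 = East, 2 = South, 3 = West,
# on a grid whose row index y grows southward and column index x grows eastward.
HEADING_DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def apply_action(x, y, heading, action):
    """Return the (x, y, heading) state reached after one time step.

    Every basic action, including the 180° rotation, costs exactly one step.
    """
    if action is Action.FORWARD:
        dx, dy = HEADING_DELTAS[heading]
        return x + dx, y + dy, heading
    if action is Action.ROTATE_LEFT:
        return x, y, (heading - 1) % 4
    if action is Action.ROTATE_RIGHT:
        return x, y, (heading + 1) % 4
    if action is Action.ROTATE_BACK:
        return x, y, (heading + 2) % 4
    return x, y, heading  # WAIT
```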

2.3. Map Modeling

In MAPF problems, maps are typically represented as directed graphs, where the nodes represent locations that AGVs can occupy. Each node can accommodate at most one AGV at a time. Edges represent possible movements between nodes, constrained by AGV motion rules.
To model complex manufacturing environments, the factory floor must be grid-based while satisfying the following conditions:
  • Accurate representation of different map elements, such as obstacles, storage locations, and transportation paths;
  • Explicit definition of connectivity between nodes, incorporating AGV motion constraints.
This paper classifies the elements of the map and constructs a grid-based model, introducing directional nodes to reflect the constraints of the AGV movement.

2.3.1. Classification of Obstacles in Manufacturing Environment

Manufacturing environments contain various types of obstacles beyond pre-mapped static objects. These include workers, other AGVs, offline AGVs (stopped or disconnected vehicles), unregistered shelves and equipment. The obstacles are categorized in Table 1 along two dimensions.
Based on these two dimensions, obstacles are classified into the following four types:
  • Known, Movable: Other online AGVs, automated doors;
  • Unknown, Movable: Workers, forklifts;
  • Known, Fixed: Factory layout, pre-registered shelves;
  • Unknown, Fixed: Unregistered equipment, offline AGVs.

2.3.2. Grid-Based Map Representation

To accurately simulate an automated warehouse with mobile shelves, this article employs a grid-based environment model that categorizes different types of obstacles and storage zones, which is shown in Figure 2.
  • Black cells represent permanent obstacles, such as factory walls and pillars. AGVs cannot enter these areas;
  • Yellow cells represent occupied shelf nodes. Empty AGVs can pass through, but AGVs carrying a shelf cannot;
  • Green cells represent vacant shelf nodes, where AGVs can move freely;
  • White cells represent reserved passageways, ensuring accessibility to all shelf nodes;
  • AGVs are depicted as colored icons, and their current destination is marked by a matching star icon.
Figure 2. The grid map.

2.3.3. Directional Nodes

In MAPF problems, map nodes are typically represented by their X, Y coordinates. However, since the AGV movement depends on the orientation, a traditional X-Y coordinate system is insufficient.
To address this, this paper introduces directional nodes, where each grid cell is subdivided into four distinct nodes, each corresponding to a different AGV orientation—0° (North), 90° (East), 180° (South), and 270° (West)—but sharing the same X-Y coordinates. The concept of directional nodes was introduced to represent the AGV’s orientation and to support the AGV-Referenced Action Space.
The edges between the nodes are defined by the AGV-referenced actions and constrained by the grid map obstacles, as shown in Figure 3. Rotation actions create bidirectional links between the four directional nodes within the same grid cell, so an AGV can rotate within its current cell. Forward actions create a unidirectional link from a node in one cell to the corresponding directional node in the adjacent cell, allowing an AGV to move forward into that adjacent cell.
Before executing a forward action, collision detection is performed; if any of the directional nodes in the target cell is occupied by an obstacle, the path is considered blocked. Therefore, conflict detection effectively uses only the X-Y coordinates.
This modeling approach enhances the accuracy of AGV motion simulation and enforces realistic turning constraints.
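A minimal sketch of how the directional nodes and their edges could be enumerated for a grid map is given below; the function name, the (x, y, heading) node encoding, and the occupancy-grid format are illustrative assumptions.

```python
def build_directional_graph(grid):
    """Enumerate edges over (x, y, heading) nodes for an occupancy grid.

    grid[y][x] is True where the cell is blocked. Rotation edges connect the
    four directional nodes of the same cell; forward edges are unidirectional
    and are only added when the target cell is inside the map and free.
    """
    height, width = len(grid), len(grid[0])
    deltas = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # N, E, S, W
    edges = []
    for y in range(height):
        for x in range(width):
            if grid[y][x]:
                continue
            for h in range(4):
                node = (x, y, h)
                # Rotations: bidirectional links within the same grid cell
                # (each direction pair is emitted once from each side).
                for turn in (1, 2, 3):
                    edges.append((node, (x, y, (h + turn) % 4)))
                # Forward: unidirectional link into the neighbouring cell;
                # conflict detection only needs the target X-Y coordinates.
                dx, dy = deltas[h]
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height and not grid[ny][nx]:
                    edges.append((node, (nx, ny, h)))
    return edges
```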

3. Macro-Control and Micro-Autonomy Pathfinding Strategy

3.1. Overall Design of Macro-Control and Micro-Autonomy Pathfinding System

Centralized planning strategies require the pre-computation of a conflict-free path for each AGV before execution. This approach lacks adaptability: as the number of AGVs increases, the solution space expands exponentially, making centralized methods computationally expensive and slow in large-scale deployments. Centralized planning also struggles to handle unexpected obstacles or dynamic environmental changes, requiring frequent re-planning.
On the other hand, distributed planning strategies allow AGVs to make decisions independently based on locally available information. Although these methods are highly adaptable and scale linearly with the number of AGVs, they often suffer from lower path quality and convergence issues; they may produce suboptimal or inefficient routes, and training RL models for pathfinding can be slow.
To address these challenges, this paper proposes the following macro-control and micro-autonomy pathfinding strategy, which combines centralized pre-planning with distributed obstacle avoidance and pathfinding:
  • Macro-Level (Centralized Pre-Planning): The central system pre-plans AGV paths using a modified A* algorithm, ignoring potential dynamic obstacles and AGV conflicts. This pre-planned path serves as a guideline for AGVs.
  • Micro-Level (Distributed Decision Making): Each AGV autonomously decides its next move based on the pre-planned path from the central system, real-time local obstacle detection, and neighboring AGV states (shared via communication).
This hybrid strategy ensures that AGVs benefit from high-quality initial pathfinding while maintaining the flexibility to adapt to real-time changes. The system response time is short and grows linearly with the fleet size n: since the pre-planning part uses a single-agent A* algorithm, its time complexity is O(n), while the distributed planning part is executed independently and in parallel by each AGV, resulting in a time complexity of O(1).
The overall framework is shown in Figure 4.

3.1.1. Execution Process

The execution process of the macro-control and micro-autonomy pathfinding system is as follows:
  • Task Decomposition: The central system first breaks down transport tasks that are too difficult for a single AGV to complete, such as cross-floor or cross-workshop tasks, into smaller subtasks that a single AGV can handle. For each subtask, the system identifies key waypoints like a pickup node and a drop-off node. If the loading or unloading involves other machinery, a pre-transfer waiting point is also defined.
  • Task Scheduling: Then, the central system assigns subtasks. Subtasks without prerequisites, or whose prerequisite tasks have been assigned, are then allocated to AGVs based on the priority of the task and the estimated distance from the A* pre-planned path. Under full-load operation with no idle AGVs available, each AGV, upon completion of its current subtask, immediately requests and receives the next subtask from the central system.
  • Pathfinding: For each AGV that actively executes a subtask, the central system uses A* to calculate a pre-planned path from its current node to the next critical waypoint. The map, pre-planned path, and positions of the neighbor AGVs are transmitted to the vehicle. The AGV then combines this information with real-time LiDAR data and runs its distributed planning strategy to dynamically avoid obstacles and refine its local path.
  • Feedback and Calibration: Upon entering each new node, the AGV scans the ground code (a QR code on the floor) to recalibrate its position and reports its updated coordinates to the central system. If the AGV has deviated from its original pre-planned route, the central system recalculates and issues an updated path.

3.1.2. System Robustness Discussion

The macro-control and micro-autonomy pathfinding system is based both on AGV communication and on each AGV’s autonomous obstacle avoidance. We analyzed its behavior under the following special adverse conditions:
Communication Delays or Interruptions: Packet loss or latency between an AGV and the central system can hinder timely updates on the exact position of the AGV, causing the AGV to receive a stale pre-planned path or to share outdated data with its neighbors. However, such issues do not halt system operation. Once the AGV reaches its next node and rescans the ground code, it can broadcast its true location.
In rare cases where an AGV completely loses connectivity, its distributed planner will lack a fresh pre-planned path or updated information from the neighbor AGVs, potentially causing it to fail its current task. However, other AGVs will treat a disconnected vehicle as a dynamic unpredictable obstacle and simply avoid it. Thus, the system as a whole remains operational, demonstrating strong resilience to communication interruptions.
Unexpected Situations: An AGV might be at risk of collision if it receives delayed position updates on its neighbors or fails to predict the trajectory of a dynamic unknown obstacle. To protect against this, each AGV retains an Emergency Stop Mechanism; its LiDAR continuously monitors for unforeseen obstacles within a safety radius. If such an obstacle appears, the AGV immediately halts, overriding any commands to ensure safety.
Furthermore, during docking maneuvers with conveyors or elevators, the AGV temporarily suspends obstacle detection to ensure a smooth connection. In these cases, it proceeds slowly, ensuring that no severe collisions occur.

3.2. Centralized Planning with Modified A*

The central system’s role is to provide preliminary pathfinding using a modified A* algorithm. This pre-planned path is then passed to AGVs as a reference, helping them make more informed navigation decisions.
Since the centralized plan does not need to be fully conflict-free—AGVs will adjust their paths dynamically—the system does not consider AGV conflicts during the pre-planning process. This significantly reduces computation time, making the approach scalable to large AGV fleets.

3.2.1. A* Algorithm Overview

A* (A-star) [39] is a widely used graph search algorithm that balances breadth-first search (BFS) and greedy search. It evaluates nodes based on the following function:
f(n) = g(n) + h(n)
where g(n) represents the actual cost from the start node to node n, and h(n) is a heuristic estimate of the cost from node n to the target. A* expands the node with the lowest f(n) value, ensuring an efficient search for the shortest path.
The heuristic function h(n) is normally computed as the shortest path length without considering obstacles or conflicts. In a four-direction node space, h(n) is typically calculated as the Manhattan distance between the AGV’s current node and the target node, which can be expressed as follows:
h(n) = |x_cur − x_tar| + |y_cur − y_tar|
where (x_cur, y_cur) denotes the coordinates of the AGV’s current node, and (x_tar, y_tar) denotes the coordinates of the AGV’s target node.

3.2.2. Enhancements for AGV Path Finding

In order to adapt A* for AGVs in a manufacturing environment and better provide pre-planning guidance and expert experience for the subsequent distributed planning strategy, this article incorporates several modifications to the A* algorithm, including the following:
  • AGV-Referenced Action Space: The action space is updated to move forward, rotate 90° left, rotate 90° right, rotate 180° back, and wait. A* now models rotation as explicit actions so that turning penalties can be applied.
  • Directional Spatial Nodes: Instead of treating locations as simple X-Y coordinates, each AGV state includes directional information (North, East, South, West) and is represented using directional spatial nodes. Accordingly, when adding nodes to the open list or closed list, the algorithm also considers the directional attributes of the nodes.
  • Modified Heuristic Function: A rotation penalty p(n) is introduced. It is based on the expected number of rotation actions and is combined with the Manhattan distance to form a revised heuristic function h(n). This allows for a more accurate estimation of the value of each node and helps reduce unnecessary turns.
The AGV-Referenced Action Space and Directional Spatial Nodes ensure consistency with our subsequent experiments and provide direct guidance for the distributed planning strategy. The rotation penalty adjusts the heuristic function to account for rotation costs and avoid unnecessary rotations.
In our model, which accounts for AGV orientation, the Manhattan distance only estimates the number of forward moves the AGV must take. We therefore introduce p(n), the estimated number of rotation actions that an AGV would need in an ideal obstacle-free scenario to reach the target node, to modify the heuristic function h(n). The specific calculation method for p(n) is detailed in Table 2.
The heuristic function is revised as follows:
h(n) = |x_cur − x_tar| + |y_cur − y_tar| + p(n)
The evaluation function is updated as follows:
f(n) = g(n) + h(n)
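The sketch below illustrates the revised heuristic with a simple rotation-penalty estimate; the paper specifies the exact rule for p(n) in Table 2, so the counting logic here is only an assumed approximation for an obstacle-free world.

```python
def rotation_penalty(x, y, heading, gx, gy):
    """Estimated number of rotation actions in an obstacle-free world.

    Illustrative only: the paper defines its exact rule in Table 2.
    Heading convention: 0 = North (-y), 1 = East (+x), 2 = South (+y), 3 = West (-x).
    """
    deltas = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
    dx, dy = gx - x, gy - y
    hx, hy = deltas[heading]
    if dx == 0 and dy == 0:
        return 0
    # The current heading is useful if it points along one required axis.
    facing_useful = (dx * hx > 0) or (dy * hy > 0)
    if dx != 0 and dy != 0:
        # L-shaped route: one turn if already facing a useful axis, else two.
        return 1 if facing_useful else 2
    # Straight route: no turn if facing the goal; otherwise one rotation
    # (a 90° or 180° turn, each of which is a single one-step action).
    return 0 if facing_useful else 1

def revised_heuristic(x, y, heading, gx, gy):
    # Manhattan distance (forward moves) plus the estimated rotation actions.
    return abs(x - gx) + abs(y - gy) + rotation_penalty(x, y, heading, gx, gy)
```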

3.3. Distributed Planning Strategy

At the micro-level, AGVs need to make real-time decisions based on the following factors: the pre-planned path from the central system (the A* pre-planned path), real-time local obstacle detection, and the positions and pre-planned paths of the neighboring AGVs. To enable efficient and adaptive decision-making, this paper adopts a reinforcement learning-based distributed navigation strategy, using PER-D3QN-EDAgger. This approach improves traditional deep Q-learning models by integrating the following:
  • Double Dueling Deep Q-Network (D3QN) [40]: Improves Q-value estimation stability and prevents overestimation errors;
  • Prioritized Experience Replay (PER) [41]: Enhances training efficiency by prioritizing high-impact experience samples;
  • ε-DAgger Imitation Learning: Combines reinforcement learning and imitation learning, using the parameter ε to control whether to use expert demonstrations (the A* pre-planned path) for training. It improves learning efficiency while avoiding over-reliance on expert demonstrations, allowing the system to adapt to situations where the expert strategy has flaws.
PER and D3QN originate from existing research; this paper merely integrates and applies them. The ε-DAgger Imitation Learning strategy is our original contribution, designed to deeply combine imitation learning with reinforcement learning to accommodate flawed expert policies.

3.3.1. PER-D3QN Reinforcement Learning

The Prioritized Experience Replay Double Dueling Deep Q-Network (PER-D3QN) is an improved version of the Deep Q-Network (DQN) algorithm [42]. This method integrates the advantages of Double DQN [43], Dueling DQN [44], and Prioritized Experience Replay (PER) to enhance convergence speed and improve training stability.
By separating the state value and the action advantage, PER-D3QN efficiently processes states where certain actions are irrelevant or redundant, reducing unnecessary computation, improving learning efficiency, and increasing training stability.
However, directly summing the state value and advantage value may introduce bias. To mitigate this, Dueling DQN applies the following correction formula:
Q(s, a) = V(s) + ( A(s, a) − (1/|A|) Σ_{a′} A(s, a′) )
where |A| represents the number of actions, V(s) represents the state value in state s, and A(s, a) represents the advantage value of choosing action a in state s.
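A framework-agnostic sketch of this aggregation step is shown below; NumPy is used for clarity rather than the paper's PaddlePaddle implementation, and the function name is ours.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with the mean-advantage correction.

    state_value: array of shape (batch, 1); advantages: shape (batch, num_actions).
    """
    return state_value + (advantages - advantages.mean(axis=1, keepdims=True))

# Example with the five AGV actions of this paper:
V = np.array([[1.0]])
A = np.array([[0.5, -0.5, 0.0, 0.2, -0.2]])
Q = dueling_q_values(V, A)  # shape (1, 5)
```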
To address the issue of overestimation in DQN, PER-D3QN decouples action evaluation and selection. The Q-value evaluation is still performed by the target network, but action selection is handled by the evaluation network.
The target Q-value is updated as follows:
y_i = r_i, if done = true
y_i = r_i + γ · Q(s′_i, argmax_a Q(s′_i, a; θ); θ′), if done = false
where r_i is the reward; γ is the discount factor; s′_i is the next state; θ represents the parameters of the evaluation network; θ′ represents the parameters of the target network; and done is the termination flag of the decision chain. If done is true, the AGV has reached the target.
The loss function loss(θ) is defined as the mean squared error between y_i and the Q-value of the evaluation network, as follows:
loss(θ) = (1/M) Σ_{i=1}^{M} ( y_i − Q(s_i, a_i; θ) )²
where θ is updated with the aim of minimizing the loss function. The target network parameters θ′ are not updated immediately; instead, every N steps, θ is copied to the target network, which enhances training stability.
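A sketch of the target and loss computation described above is given below, assuming generic q_eval and q_target callables that return per-action Q-values for a batch of states; the helper names are ours.

```python
import numpy as np

def double_dqn_targets(rewards, next_states, dones, q_eval, q_target, gamma):
    """y_i = r_i if done, else r_i + gamma * Q_target(s', argmax_a Q_eval(s', a)).

    rewards, dones: arrays of shape (batch,); dones holds 1.0 for terminal steps.
    """
    best_actions = q_eval(next_states).argmax(axis=1)   # selection: evaluation net
    q_next = q_target(next_states)                      # evaluation: target net
    bootstrap = q_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * bootstrap

def td_loss(q_values, actions, targets):
    """Mean squared error between y_i and Q(s_i, a_i; θ)."""
    chosen = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)
```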
In standard DQN, Experience Replay samples experiences uniformly at random. PER assigns priorities to experiences so that samples with greater learning potential are used more frequently. The probability that an experience is sampled is given as follows:
P(i) = p_i^α / Σ_k p_k^α
where α controls how much prioritization is applied (if α = 0, the sampling is uniform), and p_i is the priority of an experience. The priority of each experience is determined using the Temporal Difference (TD) error, which measures the discrepancy between predicted and actual Q-values, as follows:
p_i = | y_i − Q(s_i, a_i; θ) | + η
where η is a small positive value that ensures all experiences have some probability of being sampled.
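The following is a minimal array-based sketch of proportional prioritized sampling that follows the two formulas above; a production buffer would normally use a sum-tree and importance-sampling weights, and the default α and η values are common choices rather than the hyperparameters listed in Table 4.

```python
import numpy as np

class SimplePrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, eta=1e-3):
        self.capacity, self.alpha, self.eta = capacity, alpha, eta
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:        # drop the oldest experience
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(abs(td_error) + self.eta)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                        # P(i) = p_i^α / Σ_k p_k^α
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):         # refresh after each learning step
            self.priorities[i] = abs(err) + self.eta
```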

3.3.2. ε-DAgger Imitation Learning Strategy

ε-DAgger Imitation Learning is based on Dataset Aggregation (DAgger) [45] and uses a parameter ε to control whether expert demonstrations (the A* pre-planned path) are used for training. ε-DAgger blends imitation learning with reinforcement learning, dynamically interleaving them during training. In training, when an AGV selects an action, it performs the following decision-making process:
  • With probability 1 − ε: It executes the standard reinforcement learning process and uses an ε-greedy policy to select an action; that is, with probability ε it selects a random action from the action space, and with probability 1 − ε it selects the best action according to the evaluation network. Consequently, the overall probability that the final model takes a random action is ε(1 − ε).
  • With probability ε: It imitates the expert demonstration (the A* pre-planned path), using A* to plan the path and choosing the next action along it. After receiving feedback from the environment, the expert demonstration is stored in the experience pool for learning.
In the early stage of training, AGVs heavily rely on A* to avoid random inefficient movements. As training progresses, AGVs increasingly rely on their own learned policy, gradually surpassing the performance of A*.
Compared to traditional DAgger, ε-DAgger avoids overfitting to the expert strategy and can adapt to a flawed expert strategy. It allows AGVs to discover more efficient paths while still benefiting from expert guidance.
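A sketch of this decision rule is shown below; expert_action (which follows the A* pre-planned path) and greedy_action are hypothetical helpers, and in practice both ε values would be annealed over the course of training.

```python
import random

def select_training_action(state, eps_dagger, eps_greedy,
                           expert_action, greedy_action, num_actions=5):
    """ε-DAgger action choice for one training step.

    With probability eps_dagger the A* expert demonstration is followed (and the
    resulting transition is still stored in the replay pool); otherwise the usual
    ε-greedy reinforcement-learning policy is used.
    """
    if random.random() < eps_dagger:
        return expert_action(state)           # imitate the A* pre-planned path
    if random.random() < eps_greedy:
        return random.randrange(num_actions)  # exploratory random action
    return greedy_action(state)               # best action from the evaluation network
```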

3.3.3. Action and Observation Space Design

Action Space

The AGV action space adopts the previously mentioned AGV-referenced action space. It is defined as follows: move forward, rotate 90° left, rotate 90° right, rotate 180° back, and wait. If an AGV attempts an invalid action (such as colliding with an obstacle), the action will be rolled back and replaced with a “wait” action.
The limited field of view mechanism simulates the constraints of the AGV’s LIDAR detection range, reduces the dimensionality of the input data, facilitates the generalization of the policy to maps of any size, and helps the model focus on resolving local AGV conflicts while minimizing interference from irrelevant factors.

Observation Space

The observation space is primarily composed of information from the AGV’s own LIDAR detection, the segmentation of the global map within a limited field of view, and communication with the neighboring AGVs.
Specifically, the observation representation consists of the following four feature layers, as shown in Figure 5:
  • Goal Layer: This layer represents the relative position of the key waypoint that the AGV must reach. Using the AGV’s own frame of reference, the matrix center corresponds to the AGV’s location. If the waypoint lies within the AGV’s nearby cells, the corresponding matrix entry is set to 1; otherwise, it remains at 0.
  • Obstacles Layer: This layer depicts all potential obstacles around the AGV by fusing the global map provided by the central system with the AGV’s LiDAR data. Cells containing obstacles have a value of 1, and free cells are 0. In addition, loaded AGVs treat shelves as obstacles, whereas unloaded AGVs can pass freely beneath shelves, which are then not considered obstacles.
  • A* Pre-Planned Path Layer: This layer encodes the pre-planned path from the central system to guide the AGV. Starting from the center of the matrix (the AGV position) and extending outward, each cell on the planned path is assigned the number of time steps required to reach it. If a single grid cell appears in multiple time steps (e.g., during a turn), the maximum value is used.
  • Neighbor AGV Layer: This layer conveys the relative positions and expected trajectories of the surrounding AGVs. Similarly to the A* Pre-Planned Path Layer, each neighboring AGV’s A* pre-planned path radiates outward from its current position. Cells on these paths are set to the time steps needed to arrive there, taking the maximum value where the paths overlap or intersect.
To maintain spatial consistency, all feature layers are rotated to align with the AGV orientation, ensuring that the forward direction always points upward in the input tensor.
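A sketch of assembling the four feature layers into the model input is shown below; the layer-building helpers are assumed to exist, the field of view is square, and the grid is assumed to have row 0 at the north edge so that np.rot90 aligns the forward direction with the top of the tensor.

```python
import numpy as np

def build_observation(goal_layer, obstacle_layer, astar_layer, neighbor_layer, heading):
    """Stack the four (fov x fov) layers and rotate them into the AGV frame.

    heading: 0 = North, 1 = East, 2 = South, 3 = West. After the rotation the
    AGV's forward direction always points toward the top of the tensor.
    """
    obs = np.stack([goal_layer, obstacle_layer, astar_layer, neighbor_layer])
    # k quarter-turns counterclockwise map the heading's edge of the map to "up".
    return np.rot90(obs, k=heading, axes=(1, 2)).copy()
```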

3.3.4. Reward Design

The specific reward design is shown in Table 3.
These rewards are designed to encourage AGVs to reach the target node as quickly as possible and guide AGVs to balance the subsequent pre-planned paths while adapting to dynamic obstacles.
In addition, providing extra rewards when an AGV follows the same actions as the A* pre-planned path helps mitigate the issue of sparse rewards during learning.

4. Results and Discussion

4.1. Experimental Setup

4.1.1. Hardware and Software Environment

  • System: Windows 11;
  • CPU: Intel Core i5-13600KF (Intel, Santa Clara, CA, USA);
  • GPU: NVIDIA GeForce RTX 3060 (NVIDIA, Santa Clara, CA, USA);
  • RAM: 32 GB;
  • Program Language: Python 3.8.20;
  • Program Framework: PaddlePaddle.

4.1.2. Hyperparameters

The specific hyperparameters are shown in Table 4.

4.2. PER-D3QN-EDAgger Model Training

To validate the effectiveness of different learning components, we conducted an ablation study comparing the following:
  • Full PER-D3QN-EDAgger (complete model);
  • PER-D3QN-EDAgger without A* rewards (removing the incentive for following the A* pre-planned path);
  • PER-D3QN without IL (removing the DAgger-based expert demonstrations; reinforcement learning alone);
  • PER-D3QN with pre-training using DAgger (expert demonstrations only used for initialization).
The training map environment is based on Figure 2 and generates different scenarios by varying the number of AGVs n_agv and the shelf density coefficient n_shelf.
The initialization of the experimental environment is conducted through the following steps (a simplified code sketch follows the list):
  • Shelf Initialization: At the start of each episode, each shelf node draws a random number in [0, 1] to reconstruct the obstacle layout. If the random number is less than the shelf density coefficient n_shelf, the cell is marked as occupied by a shelf; otherwise, it remains free and AGVs can pass through it. During training, n_shelf is sampled uniformly from 0.1 to 0.8; naturally, the larger n_shelf is, the higher the obstacle density of the environment.
  • AGV Placement: The number of AGVs n_agv is chosen uniformly from 1 to 20. Except when n_agv = 1, one AGV serves as a dynamic obstacle, making unpredictable random moves. In the environment configuration, all AGVs are initially placed at random in free-road cells.
  • Task Generation: Each training episode randomly creates between 10 and 30 transport tasks, each with exactly two key waypoints. For each task, one occupied shelf cell and one free shelf cell are selected without replacement as the start and end points. If there are too few shelf cells of either type, the environment will be reinitialized. Because free-road cells are reserved in advance, no task is ever rendered unreachable.
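A condensed sketch of this initialization procedure is given below; the ranges follow the description above, while the helper names, cell lists, and return format are illustrative assumptions.

```python
import random

def init_training_episode(shelf_cells, road_cells):
    """Randomly populate shelves, place AGVs, and create two-waypoint tasks."""
    n_shelf = random.uniform(0.1, 0.8)                  # shelf density coefficient
    occupied = [c for c in shelf_cells if random.random() < n_shelf]
    vacant = [c for c in shelf_cells if c not in occupied]

    n_agv = random.randint(1, 20)
    agv_positions = random.sample(road_cells, n_agv)    # AGVs start on free roads
    # One of the placed AGVs is driven as a randomly moving dynamic obstacle.

    n_tasks = random.randint(10, 30)
    if len(occupied) < n_tasks or len(vacant) < n_tasks:
        return None                                     # too few shelves: reinitialize
    starts = random.sample(occupied, n_tasks)           # chosen without replacement
    goals = random.sample(vacant, n_tasks)
    return occupied, agv_positions, list(zip(starts, goals))
```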
Every 10,000 training steps, a test run is executed, and the average reward is recorded. During testing, both the number of tasks and the number of AGVs are fixed at 10, while all other parameters match the training configuration. The results are shown in Figure 6.
From the results, PER-D3QN without IL performs poorly in both convergence and stability, while the imitation learning component significantly accelerates model training. Here, convergence refers to learning efficiency, and stability refers to robustness across environments.
PER-D3QN-DAgger demonstrates the best convergence performance but lacks stability compared to PER-D3QN-EDAgger. This is because PER-D3QN-DAgger overfits to the A* expert demonstrations, leading to insufficient dynamic obstacle avoidance. This highlights the ability of the ε-DAgger strategy to adapt to and surpass a flawed expert strategy.
The full PER-D3QN-EDAgger shows slightly better convergence than its variant without A* rewards, but their stability is comparable. This suggests that the additional A* rewards help mitigate the sparse reward issue in the early stages of training but have limited impact on model performance in the later stages.

4.3. Dynamic Multi-Stage Path Planning Experiments

To evaluate the generalization capability of the PER-D3QN-EDAgger model and its adaptability to dynamic multi-stage transportation tasks in complex production environments, we conduct simulation experiments in different scenarios, using the traditional centralized planning methods CBS and ID A* and the modern distributed planning method PRIMAL2 as references. The centralized planning methods re-plan at fixed intervals to avoid dynamic obstacles.
The experimental environment uses a larger warehouse-like simulated map, shown in Figure 7. The map is generated based on the aforementioned layout of buildings, roads, and shelf areas by controlling the number of AGVs and the shelf density parameter. The shelf density parameter primarily regulates the obstacle density of the environment, ranging from 0.05 to 0.9 in increments of 0.05. During environment initialization, if the random number drawn for a shelf node is below the shelf density parameter, a shelf is placed. The number of AGVs controls the scale of agents and dynamic obstacles, with 90% of the AGVs acting as system-controlled agents and 10% acting as dynamic obstacles that move randomly without conflicts. The AGV count ranges from 10 to 200 in increments of 10. Upon initialization, each AGV is randomly placed on the roads.
These two parameters can generate 18 × 20 = 360 different environments; in each environment, 10 sets of experiments are performed, with each experiment randomly generating 20 transportation tasks in the context of inter-warehouse relocations. Notably, 10 of them are two-stage transportation tasks; each two-stage task has two subtasks, and each subtask has two key waypoints. This results in a total of 200 tasks or 300 subtasks.
Key metrics such as task completion rate, path length, and response time are recorded. The response time is defined as the time taken from when an AGV is assigned a transport task to when it takes its first action in response; this metric reflects real-time planning capability. For centralized planning, it measures the total path planning time; for distributed planning, it measures the action selection time of an individual AGV; for the hybrid method, it is the sum of the two.
The maximum path length is set to 256 steps and the maximum response time is set to 60 s. If the cumulative path length or response time exceeds the limit during path planning, the task is considered failed. In such cases, the unfinished task is recorded as incomplete and the cumulative path length and response time are assigned their respective upper limits. The key experiment results are shown in Table 5.

4.3.1. Task Completion Rate

Figure 8 shows the average task completion rate of the algorithm in different environments. ID A* and CBS perform well in low AGV scenarios, but fail as the AGV count increases, due to an exponential growth in computation time. PER-D3QN-EDAgger maintains a high task completion rate of more than 95% when the number of AGVs is below 200 and the shelf density is below 0.5, demonstrating strong adaptability. In environments with a high AGV count, PER-D3QN-EDAgger achieves a success rate that is at least as high as PRIMAL2, while PRIMAL2 performs better under high shelf density conditions. This advantage may be attributed to PRIMAL2’s corridor structure optimization.

4.3.2. Response Time

Figure 9 shows the average response time of the algorithms in different environments. The response times of ID A* and CBS increase exponentially as the number of AGVs grows. Both PRIMAL2 and PER-D3QN-EDAgger deliver fast path planning times. PER-D3QN-EDAgger maintains near-instantaneous response times (50 ms), making it highly suitable for real-time planning in dynamic environments.

4.3.3. Average Path Cost

Figure 10 shows the average path cost of the algorithms in different environments. ID A* produces optimal paths in low-AGV environments, but often leads to deadlocks when the number of AGVs is greater than 30. CBS produces optimal paths in low-density environments, but struggles with congestion, often leading to deadlocks or excessive detours. Under low AGV counts, distributed planning methods incur slightly higher average path costs than centralized planning methods, but remain within acceptable limits. In comparison with PRIMAL2, PER-D3QN-EDAgger produces higher-quality paths the majority of the time. Thanks to the A* expert demonstrations and the extra A* rewards, PER-D3QN-EDAgger’s learned policy more closely mimics the A* centralized planning method, and PER-D3QN-EDAgger shows better adaptation to dynamic obstacles. However, in high shelf-density scenarios, PRIMAL2 achieves lower average path costs and thus demonstrates stronger adaptability.

5. Conclusions

This paper investigates the multi-stage lifelong pathfinding problem for multiple AGVs in complex manufacturing environments. The study focuses on key challenges such as multi-stage transportation tasks, AGV turning simulation, dynamic obstacles, real-time planning, and lifelong scheduling, proposing an efficient hybrid planning framework that combines centralized and distributed approaches.
To accurately model the factory floor and AGV movement patterns, we introduce a grid-based map representation combined with a directed node graph, where each AGV position includes directional attributes. Additionally, we refine the action space by separating rotation and movement actions, ensuring that turning penalties are properly reflected in the planning process. To handle dynamic obstacles, real-time scheduling, and large-scale AGV fleets, we propose the following macro-control and micro-autonomy pathfinding strategy:
  • Macro-Level (Centralized Planning): The central system decomposes multi-stage transportation tasks and pre-plans a path for each subtask using a modified A* algorithm.
  • Micro-Level (Distributed Planning): Each AGV uses a deep reinforcement learning-based navigation strategy, trained with PER-D3QN-EDAgger, to make real-time decisions by considering local obstacles, pre-planned paths, and neighboring AGVs.
The centralized pre-planning ensures high-quality paths, while the distributed decision-making enhances adaptability to dynamic environments. To optimize the distributed AGV navigation model, we introduce PER-D3QN-EDAgger, which integrates the following: Prioritized Experience Replay for faster learning; Double Dueling Deep Q-Networks for more stable Q-value estimation; and ε-DAgger Imitation Learning, which accelerates training by incorporating expert demonstrations (A* pre-planned paths).
Our experiments demonstrate that PER-D3QN-EDAgger significantly improves pathfinding performance in the following ways: fast convergence during training, effectively addressing the sparse reward issue in reinforcement learning-based MAPF; high adaptability, maintaining high task completion rates across various obstacle densities and AGV fleet sizes; low path costs, efficiently avoiding dynamic and static obstacles while minimizing travel distance; millisecond-level response times, ensuring real-time pathfinding and rapid obstacle avoidance. This strategy successfully meets the requirements of real-time scheduling and dynamic obstacle avoidance in lifelong, multi-stage, multi-AGV pathfinding scenarios.

6. Future Prospects

Although our proposed hybrid planning strategy achieves strong performance, several directions could further enhance AGV pathfinding in complex manufacturing environments, including the following:
Pilot Operation in Real Manufacturing Scenarios: The validation of the method proposed in this paper has thus far relied entirely on simulation experiments, which inherently cannot cover every scenario encountered in real manufacturing environments. Its applicability on actual production or logistics sites remains unverified. Therefore, in future work, we will build more realistic testbeds and seek collaboration with industry partners to conduct pilot runs in environments such as dark warehouses (fully automated, unlit facilities).
Improve System Robustness: Our distributed planning strategy depends on communication with both the central controller and the neighboring AGVs. If an AGV goes offline due to a fault, its transport tasks will be disrupted. To address this, it is necessary to train a fallback planner that relies purely on local data as an emergency contingency. In addition, we have not yet introduced an explicit deadlock avoidance mechanism: AGVs learn implicit coordination behaviors via communication-based training, but no training process can cover all possible deadlock scenarios. Hence, we will monitor AGV task execution in real time and incorporate an explicit deadlock-avoidance policy in future work.
Deeper Integration of Heuristics and Reinforcement Learning: The heuristic function in A* shares conceptual similarities with the reinforcement learning reward function, since both estimate future path costs. Future work could explore integrating heuristics into reinforcement learning rewards, where states closer to the goal receive lower penalties, further mitigating the sparse reward issue and improving training efficiency.
Expanding AGV Decision-Making Strategies: In our current model, AGVs follow two main decision-making strategies—obstacle avoidance (micro-level autonomy) and following pre-planned paths (macro-level control). Future work could introduce multiple obstacle avoidance strategies, allowing AGVs to adapt their navigation behavior based on local congestion levels, obstacle types, and operational zones. Additionally, AGVs could dynamically switch between centralized and decentralized solvers, optimizing pathfinding based on the real-time complexity of their environment.
Task Scheduling Optimization: While this paper focuses on pathfinding, task scheduling plays a crucial role in optimizing AGV efficiency in lifelong scenarios. Future research could explore intelligent task allocation methods for minimizing task switching delays, optimizing AGV workload balancing, and reducing idle time in dynamic task assignment scenarios.
By incorporating these advancements, AGV pathfinding frameworks can become even more efficient, scalable, and adaptable, ultimately improving the overall performance of smart logistics and industrial automation systems.

Author Contributions

Conceptualization, L.H. and J.Z.; methodology, J.L. and J.Z.; software, J.Z. and J.L.; validation, J.Z. and J.L.; writing—original draft preparation, J.L.; writing—review and editing, L.H. and J.Z.; funding acquisition and resources, L.H. and J.Z.; supervision, L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by “Pioneer” R&D Program of Zhejiang (Grant No. 2025C01088).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ge, X.; Li, L.; Chen, H. Research on online scheduling method for flexible assembly workshop of multi-agv system based on assembly island mode. In Proceedings of the 2021 IEEE 7th International Conference on Cloud Computing and Intelligent Systems (CCIS), Xi’an, China, 7–8 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 371–375. [Google Scholar]
  2. De Ryck, M.; Versteyhe, M.; Debrouwere, F. Automated guided vehicle systems, state-of-the-art control algorithms and techniques. J. Manuf. Syst. 2020, 54, 152–173. [Google Scholar] [CrossRef]
  3. Bhardwaj, A.; Sam, L.; Bhardwaj, A.; Martín-Torres, F.J. LiDAR remote sensing of the cryosphere: Present applications and future prospects. Remote Sens. Environ. 2016, 177, 125–143. [Google Scholar] [CrossRef]
  4. Wild, G.; Hinckley, S. Acousto-ultrasonic optical fiber sensors: Overview and state-of-the-art. IEEE Sens. J. 2008, 8, 1184–1193. [Google Scholar] [CrossRef]
  5. Stern, R. Multi-agent path finding—An overview. In Artificial Intelligence: 5th RAAI Summer School, Dolgoprudny, Russia, 4–7 July 2019, Tutorial Lectures; Springer: Cham, Switzerland, 2019; pp. 96–115. [Google Scholar]
  6. Standley, T. Finding optimal solutions to cooperative pathfinding problems. Proc. AAAI Conf. Artif. Intell. 2010, 24, 173–178. [Google Scholar] [CrossRef]
  7. Wagner, G.; Choset, H. Subdimensional expansion for multirobot path planning. Artif. Intell. 2015, 219, 1–24. [Google Scholar] [CrossRef]
  8. Guo, T.; Sun, Y.; Liu, Y.; Liu, L.; Lu, J. An automated guided vehicle path planning algorithm based on improved a* and dynamic window approach fusion. Appl. Sci. 2023, 13, 10326. [Google Scholar] [CrossRef]
  9. Chen, D.; Liu, X.; Liu, S. Improved A-star algorithm based on the two-way search for path planning of automated guided vehicle. J. Comput. Appl. 2021, 41, 309–313. [Google Scholar]
  10. Sharon, G.; Stern, R.; Felner, A.; Sturtevant, N.R. Conflict-based search for optimal multi-agent pathfinding. Artif. Intell. 2015, 219, 40–66. [Google Scholar] [CrossRef]
  11. Boyarski, E.; Felner, A.; Harabor, D.; Stuckey, P.J.; Cohen, L.; Li, J.; Koenig, S. Iterative-deepening conflict-based search. In Proceedings of the International Joint Conference on Artificial Intelligence-Pacific Rim International Conference on Artificial Intelligence 2020, Virtual Event, 11–17 July 2020; Association for the Advancement of Artificial Intelligence (AAAI): Washington, DC, USA, 2020; pp. 4084–4090. [Google Scholar]
  12. Boyarski, E.; Felner, A.; Le Bodic, P.; Harabor, D.D.; Stuckey, P.J.; Koenig, S. F-aware conflict prioritization & improved heuristics for conflict-based search. Proc. AAAI Conf. Artif. Intell. 2021, 35, 12241–12248. [Google Scholar]
  13. Chan, S.H.; Li, J.; Gange, G.; Harabor, D.; Stuckey, P.J.; Koenig, S. ECBS with flex distribution for bounded-suboptimal multi-agent path finding. In Proceedings of the Fourteenth International Symposium on Combinatorial Search, Guangzhou, China, 26–30 July 2021; Volume 12, pp. 159–161. [Google Scholar]
  14. Rahman, M.; Alam, M.A.; Islam, M.M.; Rahman, I.; Khan, M.M.; Iqbal, T. An adaptive agent-specific sub-optimal bounding approach for multi-agent path finding. IEEE Access 2022, 10, 22226–22237. [Google Scholar] [CrossRef]
  15. Yu, J.; Li, R.; Feng, Z.; Zhao, A.; Yu, Z.; Ye, Z.; Wang, J. A novel parallel ant colony optimization algorithm for warehouse path planning. J. Control Sci. Eng. 2020, 2020, 5287189. [Google Scholar] [CrossRef]
  16. Xu, L.; Liu, Y.; Wang, Q. Application of adaptive genetic algorithm in robot path planning. Comput. Eng. Appl. 2020, 56, 36–41. [Google Scholar]
  17. Chen, J.; Liang, J.; Tong, Y. Path planning of mobile robot based on improved differential evolution algorithm. In Proceedings of the 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), Shenzhen, China, 13–15 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 811–816. [Google Scholar]
  18. Niu, Q.; Li, M.; Zhao, Y. Research on improved artificial potential field method for AGV path planning. Mach. Tool Hydraul. 2022, 50, 19–24. [Google Scholar]
  19. Qiuyun, T.; Hongyan, S.; Hengwei, G.; Ping, W. Improved particle swarm optimization algorithm for AGV path planning. IEEE Access 2021, 9, 33522–33531. [Google Scholar] [CrossRef]
  20. Abed-Alguni, B.H.; Paul, D.J.; Chalup, S.K.; Henskens, F.A. A comparison study of cooperative Q-learning algorithms for independent learners. Int. J. Artif. Intell. 2016, 14, 71–93. [Google Scholar]
  21. De Witt, C.S.; Gupta, T.; Makoviichuk, D.; Makoviychuk, V.; Torr, P.H.; Sun, M.; Whiteson, S. Is independent learning all you need in the starcraft multi-agent challenge? arXiv 2020, arXiv:2011.09533. [Google Scholar]
  22. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning. arXiv 2017, arXiv:1706.05296. [Google Scholar]
  23. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 7234–7284. [Google Scholar]
  24. Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.E.; Yi, Y. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896. [Google Scholar]
  25. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  26. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. Proc. AAAI Conf. Artif. Intell. 2018, 32, 2974–2982. [Google Scholar] [CrossRef]
  27. Rabinowitz, N.; Perbet, F.; Song, F.; Zhang, C.; Eslami, S.A.; Botvinick, M. Machine theory of mind. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4218–4227. [Google Scholar]
  28. Sartoretti, G.; Kerr, J.; Shi, Y.; Wagner, G.; Kumar, T.S.; Koenig, S.; Choset, H. Primal: Pathfinding via reinforcement and imitation multi-agent learning. IEEE Robot. Autom. Lett. 2019, 4, 2378–2385. [Google Scholar] [CrossRef]
  29. Damani, M.; Luo, Z.; Wenzel, E.; Sartoretti, G. PRIMAL_2: Pathfinding via reinforcement and imitation multi-agent learning-lifelong. IEEE Robot. Autom. Lett. 2021, 6, 2666–2673. [Google Scholar] [CrossRef]
  30. Liu, Z.; Chen, B.; Zhou, H.; Koushik, G.; Hebert, M.; Zhao, D. Mapper: Multi-agent path planning with evolutionary reinforcement learning in mixed dynamic environments. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 11748–11754. [Google Scholar]
  31. Guan, H.; Gao, Y.; Zhao, M.; Yang, Y.; Deng, F.; Lam, T.L. Ab-mapper: Attention and bicnet based multi-agent path planning for dynamic environment. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 13799–13806. [Google Scholar]
  32. Li, W.; Chen, H.; Jin, B.; Tan, W.; Zha, H.; Wang, X. Multi-agent path finding with prioritized communication learning. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10695–10701. [Google Scholar]
  33. Li, Q.; Lin, W.; Liu, Z.; Prorok, A. Message-aware graph attention networks for large-scale multi-robot path planning. IEEE Robot. Autom. Lett. 2021, 6, 5533–5540. [Google Scholar] [CrossRef]
  34. Ma, Z.; Luo, Y.; Pan, J. Learning selective communication for multi-agent path finding. IEEE Robot. Autom. Lett. 2021, 7, 1455–1462. [Google Scholar] [CrossRef]
  35. Skrynnik, A.; Yakovleva, A.; Davydov, V.; Yakovlev, K.; Panov, A.I. Hybrid policy learning for multi-agent pathfinding. IEEE Access 2021, 9, 126034–126047. [Google Scholar] [CrossRef]
  36. Ni, P.; Mao, P.; Wang, N.; Yang, M. Robot path planning based on improved A-DDQN algorithm. J. Syst. Simul. 2024. [Google Scholar] [CrossRef]
  37. Wang, B.; Liu, Z.; Li, Q.; Prorok, A. Mobile robot path planning in dynamic environments through globally guided reinforcement learning. IEEE Robot. Autom. Lett. 2020, 5, 6932–6939. [Google Scholar] [CrossRef]
  38. Bai, Z.; Pang, H.; He, Z.; Zhao, B.; Wang, T. Path planning of autonomous mobile robot in comprehensive unknown environment using deep reinforcement learning. IEEE Internet Things J. 2024, 11, 22153–22166. [Google Scholar] [CrossRef]
  39. Ferguson, D.; Likhachev, M.; Stentz, A. A guide to heuristic-based path planning. In Proceedings of the International Workshop on Planning Under Uncertainty for Autonomous Systems, International Conference on Automated Planning and Scheduling (ICAPS), Monterey, CA, USA, 5–10 June 2005; pp. 9–18. [Google Scholar]
  40. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. Proc. AAAI Conf. Artif. Intell. 2018, 32, 3215–3222. [Google Scholar] [CrossRef]
  41. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  42. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  43. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. Proc. AAAI Conf. Artif. Intell. 2016, 30, 2094–2100. [Google Scholar] [CrossRef]
  44. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003. [Google Scholar]
  45. Ross, S.; Gordon, G.; Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 627–635. [Google Scholar]
Figure 1. Comparison Between Four-Direction Action Space and AGV-Referenced Action Space.
Figure 3. Comparison between four-neighborhood nodes and directional nodes. With directional nodes, a rotation action transfers the AGV between different nodes within the same grid cell, while a forward action moves the AGV to the node with the same direction in the grid cell ahead.
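To make the directional-node expansion concrete, the following is a minimal sketch of successor generation for (row, column, direction) nodes with four headings; the encoding, the function name, and the omission of obstacle and boundary checks are assumptions for exposition, not the paper's implementation.

```python
# Assumed heading encoding: 0 = up, 1 = right, 2 = down, 3 = left.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def neighbors(node):
    """Successors of a directional node (row, col, direction):
    two rotations stay in the same grid cell, one forward move
    advances one cell while keeping the current direction."""
    r, c, d = node
    dr, dc = MOVES[d]
    return [
        (r, c, (d + 1) % 4),   # rotate clockwise (same cell, new direction)
        (r, c, (d - 1) % 4),   # rotate counter-clockwise (same cell, new direction)
        (r + dr, c + dc, d),   # move forward into the cell ahead, same direction
    ]
```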
Figure 4. The overall framework of the macro-control and micro-autonomy pathfinding strategy.
Figure 5. The observation space is divided based on the grid map and consists of the following four information channels: the goal layer, obstacle layer, A* layer, and neighbor AGV layer.
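As an illustration of how such a four-channel observation could be assembled in practice, the sketch below stacks local windows of the four layers around an AGV into a single tensor; the field-of-view size, array names, and padding scheme are assumptions for exposition rather than the authors' exact code.

```python
import numpy as np

def build_observation(agv_pos, goal_map, obstacle_map, astar_map, neighbor_map, fov=9):
    """Stack the four information channels (goal, obstacle, A* guidance,
    neighboring AGVs) into one fov x fov observation centered on the AGV.
    The *_map arguments are hypothetical full-map layers of identical shape."""
    r, c = agv_pos
    half = fov // 2
    channels = []
    for layer in (goal_map, obstacle_map, astar_map, neighbor_map):
        # Pad so that a window near the map border is still fov x fov.
        padded = np.pad(layer, half, mode="constant", constant_values=0)
        channels.append(padded[r:r + fov, c:c + fov])
    return np.stack(channels, axis=0)  # shape: (4, fov, fov)
```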
Figure 6. The average rewards of different algorithms at different training iterations—the higher, the better.
Figure 7. A larger warehouse-like simulated grid map (100 × 100).
Figure 8. Comparison of the average task completion rate of the algorithms under different obstacle densities and AGV quantities—the higher, the better.
Figure 9. Comparison of the average response time of the algorithms under different obstacle densities and AGV quantities—the lower, the better.
Figure 10. Comparison of the average path cost of the algorithms under different obstacle densities and AGV quantities—the lower, the better.
Table 1. Obstacles can be categorized along two dimensions.

Category | Definition | Examples
Fixed or Movable | Fixed obstacles cannot move and only affect the connectivity of the grid cell they occupy; movable obstacles can move and may affect the connectivity of multiple grid cells within a single timestep. | Fixed: factory walls, pre-registered shelves. Movable: workers, forklifts, other AGVs.
Known or Unknown | Known obstacles are registered in the central system, and their state is globally shared; unknown obstacles are not registered in the central system, so only nearby AGVs can detect them with their own sensing devices. | Known: pre-mapped factory layout, online AGVs. Unknown: unregistered or offline equipment, workers.
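A minimal sketch of how these two dimensions might be represented in software is given below; the class and method names are hypothetical and only mirror the categorization in Table 1.

```python
from dataclasses import dataclass

@dataclass
class Obstacle:
    """Hypothetical obstacle record combining the two dimensions of Table 1."""
    movable: bool  # True: workers, forklifts, other AGVs; False: walls, registered shelves
    known: bool    # True: registered in the central system; False: only locally detectable

    def handled_by_central_system(self) -> bool:
        # Known obstacles are globally shared and can be planned around at the macro level.
        return self.known

    def requires_local_avoidance(self) -> bool:
        # Unknown or movable obstacles must be handled by the AGV's own sensors and policy.
        return (not self.known) or self.movable
```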
Table 2. Directional penalty.

Situation | Angle Between the AGV's Direction and Its Target | p(n) (Penalty)
The target is directly in front of the AGV. | angle = 0° | 0
The target is directly behind the AGV. | angle = 180° | 1
The target is directly to the left or right of the AGV. | angle = 90° | 1
The target is to the front-left or front-right of the AGV. | 0° < angle < 90° | 1
The target is to the rear-left or rear-right of the AGV. | 90° < angle < 180° | 2
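The penalty column of Table 2 can be read as a simple piecewise function of the angle; the sketch below reproduces exactly those values, with the function name and the use of degrees being our assumptions.

```python
def directional_penalty(angle_deg: float) -> int:
    """Return the penalty p(n) from Table 2 for the angle (in degrees, 0-180)
    between the AGV's current direction and the direction to its target."""
    if angle_deg == 0:      # target directly ahead
        return 0
    if angle_deg < 90:      # front-left / front-right
        return 1
    if angle_deg == 90:     # directly left or right
        return 1
    if angle_deg < 180:     # rear-left / rear-right
        return 2
    return 1                # directly behind (180 degrees)
```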
Table 3. Reward scheme.

Condition | Reward
AGV reaches goal | 10
AGV takes any action | −0.1
AGV follows the A* path to take action | 0.1
Collision or invalid move | −2
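Read as code, the reward scheme of Table 3 might look like the sketch below; the function signature and the assumption that the per-action penalty and the A* bonus combine additively are ours.

```python
def step_reward(reached_goal: bool, collided_or_invalid: bool, on_astar_path: bool) -> float:
    """Reward for one AGV action under the scheme of Table 3 (values from the paper;
    the additive combination of the step penalty and the A* bonus is an assumption)."""
    if reached_goal:
        return 10.0
    if collided_or_invalid:
        return -2.0
    reward = -0.1          # cost of taking any action
    if on_astar_path:
        reward += 0.1      # bonus for following the A* guidance path
    return reward
```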
Table 4. Hyperparameters.

Parameter | Value
Discount factor γ | 0.99
Greedy factor ε | 0.9 to 0.1 (over 50,000 steps)
Prioritization factor α | 1
Factor η | 0.1
Learning rate | 0.0003 to 0.00001 (over 50,000 steps)
Batch size | 128
Experience replay buffer size | 100,000
Target network update frequency | every 2500 steps
Total training steps | 1,000,000
Warmup size | 10,000
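For reference, these hyperparameters can be gathered into a small configuration object; the sketch below also includes one plausible linear annealing schedule for ε and the learning rate, which is an assumption since the paper only gives the start value, end value, and step count.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    gamma: float = 0.99               # discount factor
    eps_start: float = 0.9            # greedy factor epsilon (annealed)
    eps_end: float = 0.1
    alpha: float = 1.0                # prioritization factor
    eta: float = 0.1                  # factor eta
    lr_start: float = 3e-4            # learning rate (annealed)
    lr_end: float = 1e-5
    anneal_steps: int = 50_000
    batch_size: int = 128
    buffer_size: int = 100_000
    target_update_every: int = 2_500  # target network update frequency (steps)
    total_steps: int = 1_000_000
    warmup_size: int = 10_000

def linear_anneal(start: float, end: float, step: int, anneal_steps: int) -> float:
    """Linearly interpolate from start to end over anneal_steps, then hold at end."""
    frac = min(step, anneal_steps) / anneal_steps
    return start + frac * (end - start)
```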
Table 5. Experimental results.

AGV Count | 1–10 | 1–10 | 11–20 | 11–20
Shelf Density | 0.05–0.45 | 0.50–0.90 | 0.05–0.45 | 0.50–0.90

Average Success Rate
ID A* | 0.420 | 0.422 | 0 | 0
CBS | 0.982 | 0.854 | 0.638 | 0.488
PER-D3QN-EDAgger | 0.994 | 0.956 | 0.984 | 0.888
PRIMAL2 | 0.975 | 0.943 | 0.939 | 0.891

Average Response Time (s)
ID A* | 41.634 | 41.386 | 60 | 60
CBS | 3.415 | 13.691 | 29.632 | 41.987
PER-D3QN-EDAgger | 0.012 | 0.010 | 0.012 | 0.013
PRIMAL2 | 0.026 | 0.024 | 0.025 | 0.026

Average Path Cost (steps)
ID A* | 164.758 | 165.506 | 256 | 256
CBS | 20.517 | 49.265 | 106.266 | 149.899
PER-D3QN-EDAgger | 26.094 | 40.038 | 45.247 | 79.576
PRIMAL2 | 39.698 | 43.524 | 60.636 | 73.311
The environments are divided into four categories by AGV count (low: 1–10, high: 11–20) and shelf density (low: 0.05–0.45, high: 0.50–0.90); for each category, the table reports the average performance of each algorithm over the corresponding experiments.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
