An A*-Distance-Guided Exploration Strategy for Multi-AGV Path Planning

Zhou, Ying; Feng, Yixin; Mao, Peiyan; Wang, Pengfei

doi:10.3390/automation7040100

¹ China Institute of FTZ Supply Chain, Shanghai Maritime University, Shanghai 201306, China

² Logistics Engineering College, Shanghai Maritime University, Shanghai 201306, China

³ Anghua (Shanghai) Automation Engineering Co., Ltd., Shanghai 201399, China

* Author to whom correspondence should be addressed.

Automation 2026, 7(4), 100; https://doi.org/10.3390/automation7040100 (registering DOI)

Submission received: 7 May 2026 / Revised: 16 June 2026 / Accepted: 20 June 2026 / Published: 25 June 2026


Abstract

A common limitation of existing multi-AGV cooperative systems is their reliance on the obstacle-agnostic Manhattan distance as the basis for reward signals. This causes agents to receive misleading feedback, engage in excessive futile exploration, and ultimately achieve poor training quality. To address this, we introduce an A*-distance guidance mechanism for multi-agent reinforcement learning (MARL) path planning, built on the precise path distance computed via the A* algorithm (A*-distance). Within the QMIX framework, we incorporate an A*-distance-based guiding function into the action selection mechanism. This function evaluates candidate actions by quantifying their immediate effect on the A*-distance, providing positive incentives for actions that bring the agent closer to the goal and applying negative penalties for those that lead it farther away. This effectively biases exploration towards actions that genuinely shorten the obstacle-aware path to the goal, suppresses ineffective exploration, and accelerates policy convergence. Experiments in four warehouse environments (simple obstacles, complex obstacles, large-scale, and congested) show that, compared with standard QMIX, the proposed method achieves higher global average reward and faster convergence. The advantage grows as environment scale and obstacle density increase. In the large-scale and congested environments, standard QMIX and the other MARL baselines fail to solve the task, whereas the proposed method still succeeds. It is the only learning-based method to solve these hardest tasks while keeping path length close to that of dedicated search-based solvers. Ablation experiments further show that the A*-distance-guided action selection is the primary contributor to these gains, while the A*-distance reward plays a supporting role.

Keywords: automated guided vehicle; multi-agent reinforcement learning; path planning

1. Introduction

The rapid expansion of e-commerce has pushed warehouse systems toward ever-shorter fulfillment cycles and an increasingly diverse mix of order types. Order picking lies at the heart of this challenge [1]. In this context, automated guided vehicles (AGVs) have become the central execution unit for material handling and sorting in modern logistics facilities. Compared with single-AGV systems, fleets of cooperating AGVs can dramatically increase warehouse throughput, but they also introduce substantially greater coordination complexity. Achieving efficient, collision-free path planning for multiple AGVs—so as to optimize overall system performance—has therefore emerged as one of the most pressing open problems in intelligent warehousing [2].

Classical AGV path planning algorithms include the A* algorithm, Artificial Potential Fields (APF), Rapidly-exploring Random Trees (RRT), and Genetic Algorithms (GA), among others [3,4]. Among these, A* has been widely adopted for its efficiency in finding optimal paths. Its heuristic search strategy evaluates each node by adding the actual cost from the start to that node to a heuristic estimate of the remaining cost to the goal, and expands the lowest-cost node first, which yields both path optimality and search efficiency.
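The node evaluation described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a 4-connected grid with unit move cost, uses Manhattan distance as the heuristic, and the names `a_star_distance` and `GRID` are hypothetical.

```python
import heapq

def a_star_distance(grid, start, goal):
    """Return the shortest 4-connected path length from start to goal, or None.

    grid is a list of rows; 1 marks an obstacle and 0 a free cell.
    Cells are (row, col) tuples and every move costs 1 (assumptions of this sketch).
    """
    def h(cell):
        # Manhattan distance: admissible and consistent on a unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    best_g = {start: 0}
    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):
            continue  # stale queue entry, a cheaper route was found already
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return None  # goal unreachable

# A wall separates start and goal, so the true path is much longer
# than the Manhattan distance of 2.
GRID = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
print(a_star_distance(GRID, (0, 0), (2, 0)))  # 8
```

The gap between the Manhattan distance (2) and the obstacle-aware path length (8) in this toy map is the kind of discrepancy the rest of the paper targets.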

However, each of these classical methods has inherent limitations when applied to multi-AGV scenarios. The A* algorithm struggles with dynamic obstacles introduced by other vehicles [5]; APF is prone to local minima and has difficulty navigating narrow corridors [6]; RRT typically produces suboptimal paths whose quality depends heavily on random sampling [7]; GA suffers from high computational cost and premature convergence [8]. These shortcomings make it difficult for any single method to simultaneously satisfy the real-time, optimality, and tight-coordination requirements of multi-AGV path planning.

In recent years, reinforcement learning (RL), particularly multi-agent reinforcement learning (MARL), has offered a promising new direction for addressing these challenges. RL agents learn a mapping from environment states to optimal actions through continuous interaction. Common RL algorithms include Q-learning, Deep Q-Network (DQN), and Deep Deterministic Policy Gradient (DDPG) [9], as well as algorithms designed specifically for multi-agent cooperation, such as Multi-Agent Deep Deterministic Policy Gradient (MADDPG) and QMIX [10,11]. QMIX is a MARL algorithm built on the Centralized Training with Decentralized Execution (CTDE) paradigm. Under this framework, each AGV plans its own path while simultaneously coordinating with peers, ensuring that individual agents’ locally optimal actions remain aligned with the team-level optimum, a property that makes QMIX especially well suited for tightly cooperative multi-AGV planning.

Despite this promise, RL and MARL algorithms require large amounts of training data and may suffer from instability and slow convergence [12]. The root cause lies in the conventional ε-greedy exploration strategy: because it selects actions uniformly at random during the exploration phase, it generates a large volume of ineffective or suboptimal transitions. As a result, many training episodes are needed before useful strategies emerge, and in complex multi-agent environments the policy may ultimately converge to a suboptimal solution [13].

Researchers have proposed a variety of improvements targeting the ε-greedy policy, exploration mechanisms, and reward shaping.

In the RL domain, El Wafi et al. proposed a dynamic hyperparameter tuning strategy for ε-greedy that adaptively adjusts the learning rate (α) and discount factor (γ) during training to improve the exploration–exploitation balance in complex environments [14]. Ben-Akka et al. introduced an adaptive ε-decay scheme paired with a custom reward that penalizes revisiting states, to address slow convergence in high-obstacle-density environments [15]. Mou et al. developed an improved greedy Q-learning algorithm for path planning of unmanned surface vehicles [16]. Gharbi proposed a dynamic reward-enhanced Q-learning (DRQL) method that accelerates convergence by replacing blind exploration with informative feedback signals [17].

In the DRL domain, Zhang et al. introduced an optimistic ε-greedy strategy in which an optimism network biases exploration toward high-potential actions, correcting value underestimation caused by insufficient exploration and preventing convergence to suboptimal policies [18]. Pei et al. proposed a noisy D3QN algorithm that replaces the ε-greedy policy with parameter-space noise to achieve adaptive exploration [19]. Wang et al. designed an adaptive trajectory-constrained exploration strategy that uses offline suboptimal demonstration trajectories as a reference and leverages maximum mean discrepancy (MMD) to encourage exploration of new regions under sparse and deceptive reward settings [20]. Li et al. combined Dueling DQN with prioritized experience replay and APF, using APF to probabilistically intervene in action selection during training [21]. Wang et al. proposed an improved frontier-based exploration strategy guided by deep RL, replacing random or heuristic target selection with learned target points to reduce exploration distance in unknown environments [22]. Xue et al. developed an action curiosity module (ACM) that dynamically computes curiosity rewards from obstacle-avoidance prediction errors, encouraging exploratory behavior while using cosine annealing to prevent policy degradation from over-exploration [23]. Yin et al. proposed RND3QN, a mapless local path planning algorithm that uses reward values as an exploration metric alongside an auxiliary reward function to enrich the reward distribution in sparse settings [24]. Futuhi et al. directed exploration toward infrequently visited states through a dedicated $ε_{t}$-greedy search procedure managed by a state-visit memory unit (SVMU) [25].

Existing research thus tackles inefficient, low-quality exploration mainly by refining the exploration strategy itself. Motivated by this body of work, we incorporate A*-distance guidance into MARL as a framework-agnostic enhancement. By introducing an A*-distance-guided action selection strategy into the QMIX framework, we steer agents toward more efficient exploration. To validate the approach, we conduct experiments in simple, complex, large-scale, and congested warehouse simulation environments and compare against standard QMIX. Results confirm that the proposed A*-QMIX method delivers superior learning quality. The main contributions of this paper are as follows:

(1): We propose an A*-distance-based improvement strategy from two complementary perspectives: reward shaping and action selection. On the reward side, A*-distance replaces Manhattan distance in computing reward and penalty terms, so that the reward signal accurately reflects the true traversal cost under obstacle constraints. On the action-selection side, the A*-distance guiding function is embedded in the ε-greedy exploration mechanism, guiding agents to prioritize directionally rational actions, reducing wasteful exploration, and improving training quality.
(2): We conduct comparative experiments across four environments (simple obstacles, complex obstacles, large-scale, and congested), using final reward, global average reward, and episodes to 90% success rate as evaluation metrics. Results confirm that the proposed method outperforms standard QMIX on these metrics in every environment, with the advantage becoming more pronounced as environment scale and obstacle density increase. Compared with other MARL baselines such as VDN and MAPPO, the proposed method also demonstrates clear superiority, particularly in the large-scale and congested environments, where competing methods largely fail. Notably, the proposed method is the only learning-based method that reliably solves the two hardest tasks, achieving path lengths that closely match or, in certain cases, even surpass those of dedicated search-based solvers such as CBS and LaCAM.

2. Environment Modeling and Problem Definition

This section formally defines the multi-AGV cooperative path planning problem and establishes the foundations for the methods proposed in subsequent sections. We discretize the warehouse environment into a grid map, cast the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), and adopt the QMIX framework, built on the Centralized Training with Decentralized Execution (CTDE) paradigm, as the algorithmic backbone of this work.

2.1. Environment Modeling

In a real warehouse, AGVs navigate autonomously within a two-dimensional space defined by shelving units and traversable corridors. Path planning is subject to multiple constraints, including fixed obstacles, dynamic obstacles posed by other AGVs, and corridor width.

Grid-based map representation is the dominant approach for modeling warehouse environments [26]. In practice, shelves are arranged in regular rows and corridors are fixed. Although AGV motion is continuous, navigation over any short time horizon can be treated as movement between discrete grid cells.

Accordingly, we model the warehouse environment as a two-dimensional grid map shown in Figure 1, in which black cells represent impassable obstacles (shelves, walls) and white cells represent free space. Each AGV’s motion is discretized into single-step moves to adjacent cells at every time step. To evaluate the algorithm’s generalization across varying difficulty levels, we construct four environments: Simple Obstacle Environment, Complex Obstacle Environment, Large-Scale Environment, and Congested Environment, as shown in Figure 2.

2.2. Dec-POMDP Formulation for Multi-AGV Path Planning

In multi-AGV cooperative path planning, each vehicle can only access local environment information centered on itself rather than the complete global state, making standard fully observable MDPs inapplicable. We therefore model the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), defined by the tuple:

〈 N, S, A, P, R, Ω, O, γ 〉

where:

(1) $N = \{1, \dots, n\}$: the set of $n$ AGVs performing tasks simultaneously in the warehouse.

(2) $S$: the global state space. A state comprises the current positions of all AGVs, their target positions, and the distribution of obstacles in the grid map. A single AGV cannot directly obtain the complete global state.

(3) $A$: the action space available to each AGV.

(4) $P(s_{t+1} | s_{t}, a_{t}) : S \times A \to S$: the state transition function, determined by the new positions of the AGVs after executing the joint action and by the conflict resolution results.

(5) $R$: the reward function, which assigns AGVs collision penalties, arrival rewards, step penalties, and distance penalties.

(6) $Ω$: the observation space of the entire system; $Ω^{i}$ is the set of local observations available to AGV $i$.

(7) $O$: the observation function, which maps the global state to a local observation consisting of position information, priority information, and 5 × 5 local environment perception information.

(8) $γ \in [0, 1]$: the discount factor, which balances immediate and future rewards.

At each time step, the environment outputs the global state $s_{t} \in S$, each AGV obtains the local observation $o_{t}^{i} = O(s_{t}, i)$ according to the observation function, and selects an action $a_{t}^{i}$ based on its own historical observation-action sequence $τ^{i} \in (Ω^{i} \times A^{i})^{*}$, in order to maximize the cumulative return $\sum_{k=0}^{\infty} γ^{k} r_{t+k}$.

2.3. A*-QMIX Framework

We select QMIX as the base framework and use it to validate the A*-distance-guided mechanism. QMIX’s CTDE architecture aligns naturally with the multi-AGV cooperative planning setting: the training phase can leverage global state information to coordinate agent behavior, while during execution each AGV acts independently on its local observation alone. Building on this foundation, we equip QMIX with the A*-distance guidance; the resulting configuration is hereafter referred to as A*-QMIX. The overall framework, shown in Figure 3, comprises three sequential components: the agent network, the conflict resolution rule pool, and the mixing network.

In the agent network component, each agent $i$ receives two inputs at every time step $t$: the current local observation $o_{t}^{i}$ (a multi-channel feature comprising position information, priority information, and the 5 × 5 local FOV) and the action executed at the previous step, $α_{t-1}^{i}$. The historical observation-action trajectory $τ^{i}$ is compressed into an internal state representation that encodes the information necessary for the current decision. The output layer then computes local action-value estimates $Q_{i}(τ^{i}, \cdot)$, representing the expected cumulative return for each candidate action given the current history. This Q-value vector is subsequently adjusted by the A*-distance-guided action selection module to produce each agent’s intended action.

Once all agents have selected their intended actions, the conflict resolution rule pool detects and resolves conflicts among the AGVs’ joint action. The pool contains preset resolution rules (Rule 1, Rule 2, Rule 3) targeting three conflict types (head-on conflict, node conflict, and occupancy conflict), all resolved on a priority basis: the higher-priority AGV retains its intended action unchanged, while the lower-priority AGV re-selects the action with the highest adjusted Q-value from the current set of safe actions, allowing it to avoid the conflict while still moving as close to its goal as possible. After arbitration, each agent submits its finalized local Q-value for the current time step to the mixing network.

In the mixing network stage, the local Q-values output by all agents, $\{Q_{1}(τ^{1}, α^{1}), \dots, Q_{n}(τ^{n}, α^{n})\}$, are fed into the mixing network. At the same time, a hypernetwork takes the current global state $s_{t}$ as input and uses forward propagation to generate the weight matrices and bias vectors for each layer of the mixing network. Its output layer applies an absolute-value activation function to ensure all weights are non-negative, which structurally enforces the monotonicity constraint: any increase in a local Q-value will not cause the global joint value to decrease. The mixing network uses these weights to perform a nonlinear weighted combination of the agents’ local Q-values, and outputs the global joint value $Q_{tot}(τ, α)$.

$Q_{tot}(τ, α)$ is the core output of the entire framework. It drives parameter updates across the whole network by minimizing the temporal-difference mean squared error against a target network. The A*-QMIX training procedure follows the standard DQN temporal-difference learning paradigm, with a loss function defined by the mean squared error between the predicted Q-values and the target Q-values.
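The monotonic mixing step can be illustrated with a small, dependency-free sketch. It is a simplification under stated assumptions rather than the paper's network: each hypernetwork is reduced to a single linear map of the state, the ELU nonlinearity is borrowed from the original QMIX formulation, and the names `mix`, `random_params`, and the `hyper_*` keys are hypothetical.

```python
import math
import random

def elu(x):
    # Monotonically increasing nonlinearity (assumed here, as in the original QMIX).
    return x if x > 0 else math.exp(x) - 1.0

def linear(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def mix(agent_qs, state, params):
    """Combine per-agent Q-values into a scalar Q_tot, monotone in every input."""
    n, hidden = len(agent_qs), params["hidden"]
    # Hypernetworks map the global state to mixing weights and biases.
    # The absolute value keeps every mixing weight non-negative.
    w1 = [abs(w) for w in linear(params["hyper_w1"], state)]  # n * hidden values
    b1 = linear(params["hyper_b1"], state)                     # hidden values
    w2 = [abs(w) for w in linear(params["hyper_w2"], state)]  # hidden values
    b2 = linear(params["hyper_b2"], state)[0]                  # scalar bias
    layer = [
        elu(sum(agent_qs[i] * w1[i * hidden + j] for i in range(n)) + b1[j])
        for j in range(hidden)
    ]
    return sum(layer[j] * w2[j] for j in range(hidden)) + b2

def random_params(n_agents, state_dim, hidden, seed=0):
    """Random hypernetwork weights, for demonstration only."""
    rng = random.Random(seed)

    def mat(rows):
        return [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(rows)]

    return {"hidden": hidden, "hyper_w1": mat(n_agents * hidden),
            "hyper_b1": mat(hidden), "hyper_w2": mat(hidden), "hyper_b2": mat(1)}

params = random_params(n_agents=3, state_dim=4, hidden=8)
state = [0.5, -1.0, 2.0, 0.1]
base = mix([1.0, -0.5, 2.0], state, params)
bumped = mix([1.5, -0.5, 2.0], state, params)  # raise one local Q-value
print(bumped >= base)  # True
```

Because the weights are non-negative and the nonlinearity is increasing, raising any single local Q-value can never lower the mixed value, which is the property that lets each agent act greedily on its own Q-values at execution time.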

3. A*-Distance-Guided Exploration Strategy

This section first details the agent design for the multi-AGV path planning task, then presents improvements to the base algorithm from two perspectives: reward signal design and action selection. The A*-distance serves as the central element of the distance-guided exploration strategy. By incorporating A*-distance into both the reward signal and the action selection mechanism, agents can more accurately perceive the true reachable distance to their goals during training and preferentially execute directionally rational actions during exploration, thereby reducing futile exploration and improving training quality.

3.1. Agent Design

Based on the Dec-POMDP model defined in Section 2.2, we detail below the design of agents, observation space, action space, reward function, and conflict resolution strategy.

3.1.1. Agent

Each AGV is treated as an independent agent. All agents share the same network architecture, action space, and observation capability, and move at a constant speed across the grid map. The agent set $N = \{1, \dots, n\}$ consists of $n$ AGVs simultaneously performing path planning tasks in the environment.

3.1.2. Observation Space

Under the Dec-POMDP framework, each AGV can only access local environmental information. The observation space of $AGV_{i}$ is denoted by $Ω^{i}$, which represents the set of all possible local observations available to it. At each time step $t$, the observation function $O$ maps the global state $s_{t}$ to a local observation $o_{t}^{i} = O(s_{t}, i) \in Ω^{i}$, which consists of three components: position information, priority information, and the local field of view (FOV).

Position information includes the start and goal positions. Priority information encodes the agent’s ordering: $AGV_{i}$ has higher priority than $AGV_{j}$ if $i < j$. A static priority scheme is used, assigning each AGV a fixed priority number for the duration of the episode.

The local FOV is a square region centered on the AGV, ensuring the AGV always occupies the center of its observation space. As shown in Figure 4, we use a 5 × 5 FOV that covers a radius of two grid cells around the agent. Data within the FOV is organized into five separate input channels: obstacles, the current agent’s goal, higher-priority AGVs, lower-priority AGVs, and the goals of other AGVs.
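As an illustration of the five-channel layout, the sketch below builds a 5 × 5 binary FOV for one agent. The function name `local_fov`, the channel names, the `(x, y)` coordinate convention, and the choice to mark off-map cells as obstacles are all assumptions made for this example; the paper does not specify them.

```python
def local_fov(obstacles, size, self_id, positions, goals, radius=2):
    """Build five (2*radius+1) x (2*radius+1) binary channels centred on one agent.

    obstacles: set of blocked (x, y) cells; size: (width, height).
    positions, goals: dicts mapping agent id -> (x, y).
    A lower id means higher priority, following the static scheme in the text.
    """
    n = 2 * radius + 1
    names = ("obstacles", "own_goal", "higher_priority",
             "lower_priority", "other_goals")
    fov = {name: [[0] * n for _ in range(n)] for name in names}
    cx, cy = positions[self_id]

    def mark(name, cell):
        # Set the cell in the given channel if it lies inside the window.
        dx, dy = cell[0] - cx, cell[1] - cy
        if abs(dx) <= radius and abs(dy) <= radius:
            fov[name][dy + radius][dx + radius] = 1

    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = cx + dx, cy + dy
            outside = not (0 <= x < size[0] and 0 <= y < size[1])
            if outside or (x, y) in obstacles:
                fov["obstacles"][dy + radius][dx + radius] = 1

    mark("own_goal", goals[self_id])
    for other, pos in positions.items():
        if other != self_id:
            mark("higher_priority" if other < self_id else "lower_priority", pos)
            mark("other_goals", goals[other])
    return fov

fov = local_fov(
    obstacles={(2, 1)}, size=(10, 10), self_id=2,
    positions={1: (2, 2), 2: (3, 2), 3: (9, 9)},
    goals={1: (4, 4), 2: (0, 0), 3: (2, 3)},
)
print(fov["higher_priority"][2][1])  # 1: agent 1 sits one cell to the left
```

The agent always occupies the centre cell `[radius][radius]` of every channel, which matches the centred-FOV property stated above.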

By combining position information, priority information, and the local FOV, each AGV can effectively perceive its surroundings and execute path planning and conflict resolution using only local information, in full compliance with the decentralized execution requirement of the Dec-POMDP framework.

3.1.3. Action Space

Although AGV motion is continuous in reality, within a single time step it can be treated as a single-step move on the grid map. The motion of each AGV is thus translated into a sequence of discrete actions. Assuming constant speed and no diagonal movement, the action space comprises five discrete actions, $A = \{up, down, left, right, stop\}$, corresponding to moving one cell in each cardinal direction or remaining stationary. The grid dynamics are expressed as:

(x^{'}, y^{'}) = \begin{cases} (x, y + 1), & a = \text{up} \\ (x, y - 1), & a = \text{down} \\ (x - 1, y), & a = \text{left} \\ (x + 1, y), & a = \text{right} \\ (x, y), & a = \text{stop} \end{cases}

(1)
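Equation (1) translates directly into a lookup table of offsets. The sketch below is illustrative only; the names `MOVES` and `step` are hypothetical, and leaving the agent in place when a move is blocked is an assumption, since the text does not say how blocked moves are handled.

```python
# Offsets follow Equation (1): up increases y, right increases x.
MOVES = {
    "up": (0, 1),
    "down": (0, -1),
    "left": (-1, 0),
    "right": (1, 0),
    "stop": (0, 0),
}

def step(pos, action, obstacles, size):
    """Apply one action; return (new_position, blocked_flag).

    Assumption: a move into an obstacle or off the map leaves the agent
    where it is and is reported as blocked.
    """
    dx, dy = MOVES[action]
    nxt = (pos[0] + dx, pos[1] + dy)
    inside = 0 <= nxt[0] < size[0] and 0 <= nxt[1] < size[1]
    if not inside or nxt in obstacles:
        return pos, True
    return nxt, False

print(step((1, 1), "up", obstacles=set(), size=(3, 3)))    # ((1, 2), False)
print(step((0, 0), "left", obstacles=set(), size=(3, 3)))  # ((0, 0), True)
```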

3.1.4. Reward Function

The reward function evaluates action quality to guide AGVs toward optimal decisions. In addition to baseline reward and penalty terms, we introduce a distance-change-based reward/penalty mechanism grounded in A*-distance. This mechanism integrates A*-distance directly into the reward signal: by quantifying the directional rationality of each action, it provides agents with more precise feedback that accelerates convergence and improves exploration efficiency. Full details are given in Section 3.2.

3.1.5. Conflict Resolution Strategy

Conflicts that arise during multi-AGV path planning fall into three main categories: Head-on Conflict, Node Conflict, and Occupancy Conflict. Because all AGVs travel at the same speed, Pursuit Conflicts do not occur. The three relevant conflict types are illustrated in Figure 5.

We address these conflicts using a priority-based resolution strategy, whose workflow is shown in Figure 6. When an AGV detects through its local FOV that a conflict is imminent, the higher-priority AGV keeps its intended action unchanged, while the lower-priority AGV selects the action with the highest Q-value from the current set of safe actions.

The key advantage of this strategy is that, rather than forcing the lower-priority AGV to stop and wait, it picks the best available safe action, allowing the AGV to continue progressing toward its goal while avoiding the conflict. This prevents excessive conservatism from delaying routes and effectively improves the overall task completion efficiency of the multi-AGV system.

The above workflow is applied identically to all agents and across all experimental configurations (QMIX, A*-Distance Reward only, A*-Guided Selection only, and A*-QMIX (Full)). It acts as a fixed safety layer that is activated only when an imminent conflict is predicted, and then resolves it through a priority-based rule: the higher-priority AGV is allowed to proceed toward its goal, while the lower-priority AGV is barred from the conflicting action and re-selects the best remaining safe action under its learned (guiding-function-adjusted) Q-values. This design aligns with two common practices in multi-agent systems: enforcing collision constraints via a permissible action set [27], and employing prioritized conflict-resolution rules that decouple safety constraints from strategic decision-making [28]. Removing clearly infeasible actions reduces unnecessary training overhead, thereby lowering training time and difficulty. Because the same workflow is used in both baseline and proposed training, it introduces no hand-coded policy bias. Hence, any performance improvement observed is due to the A*-distance guidance, not to the conflict resolution rules.
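A compact way to picture the priority rule is to process agents in priority order and let each one claim its target cell. The sketch below is a simplified interpretation, not the paper's rule pool: the conflict definitions (a claimed target cell for node and occupancy conflicts, a position swap for head-on conflicts), the fallback to `stop` when nothing is safe, and the names `resolve_conflicts`, `MOVES`, and `blocked` are all assumptions.

```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0),
         "right": (1, 0), "stop": (0, 0)}

def resolve_conflicts(positions, intended, q_values, blocked=frozenset()):
    """Return the final action per agent; a lower id means higher priority.

    positions: {agent: (x, y)}, intended: {agent: action},
    q_values: {agent: {action: adjusted Q-value}}, blocked: impassable cells.
    """
    final, claimed = {}, set()

    def target(agent, action):
        (x, y), (dx, dy) = positions[agent], MOVES[action]
        return (x + dx, y + dy)

    def is_safe(agent, action):
        cell = target(agent, action)
        if cell in blocked or cell in claimed:  # node or occupancy conflict
            return False
        # Head-on conflict: swapping cells with an already-decided agent.
        return not any(
            cell == positions[other] and target(other, act) == positions[agent]
            for other, act in final.items()
        )

    for agent in sorted(positions):  # highest priority first
        action = intended[agent]
        if not is_safe(agent, action):
            options = [a for a in MOVES if is_safe(agent, a)]
            # Assumption: stop if no safe action remains; a full system
            # would need a stronger fallback for that case.
            action = max(options, key=q_values[agent].get) if options else "stop"
        final[agent] = action
        claimed.add(target(agent, action))
    return final

q = {"up": 0.1, "down": 0.4, "left": 0.9, "right": 0.2, "stop": 0.8}
print(resolve_conflicts(
    positions={1: (0, 0), 2: (1, 0)},
    intended={1: "right", 2: "left"},   # head-on conflict
    q_values={1: q, 2: q},
))  # {1: 'right', 2: 'down'}
```

In the head-on example, agent 2 cannot stay put either, because agent 1 has claimed its cell, so it takes the best remaining safe action instead of waiting.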

3.2. A*-Distance-Based Reward Function

Conventional reward functions measure AGV proximity to the goal using Manhattan distance. However, Manhattan distance considers only horizontal and vertical coordinate differences, completely ignoring the actual distribution of obstacles in the map. In obstacle-dense warehouse environments, this often severely underestimates the true movement cost required for an AGV to reach its goal.

Figure 7 illustrates this discrepancy: when the AGV is forced upward to avoid an obstacle, the Manhattan distance increases, causing the reward and Q-value to be forced down. To overcome this limitation, we replace Manhattan distance with A*-distance as the distance metric. A*-distance is computed by running the A* algorithm on the actual map topology, yielding the true shortest-path length from the AGV’s current position to its goal under obstacle constraints, which accurately reflects actual traversability in the current environment. Substituting A*-distance for Manhattan distance in reward computation ensures that the reward signal remains consistent with the traversability conditions of the environment: when an AGV advances toward its goal along a viable path, the A*-distance decreases monotonically, and the reward is positive; when an AGV takes an action that increases the true detour cost, the A*-distance grows, and a penalty is correctly applied. This prevents agents from misreading directions in obstacle-dense regions due to inaccurate distance estimates, improving both reward signal quality and overall training quality. The reward function is defined as follows:

r_{i}(t) = r_{1} + r_{2} + r_{3} + r_{4}

(2)

where $r_{i}(t)$ is the total reward; $r_{1}$ is the collision penalty, applied when the chosen action would cause the AGV to collide with an obstacle or another AGV, $r_{1} = -0.5$; $r_{2}$ is the goal-arrival reward, a one-time bonus awarded when an AGV first reaches its goal, $r_{2} = 1$; $r_{3}$ is the step penalty, a small penalty applied at each time step (including waiting) to encourage AGVs to reach their goals as quickly as possible, $r_{3} = -0.01$; $r_{4}$ is the distance reward/penalty, defined as $r_{4} = d_{i}^{t} - d_{i}^{t+1}$, where $d_{i}^{t}$ is the A*-distance from $AGV_{i}$ to its goal at time $t$. If an action reduces the distance to the goal, the AGV receives a positive reward; otherwise, a penalty is applied. The distance-difference reward is a commonly used shaping technique in goal-directed navigation tasks. It provides dense learning signals for sparse-reward problems and guides the learning of effective navigation policies [29,30].
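The four terms of Equation (2) can be written out as follows. This is a minimal sketch: the constants come from the text, while the function name `step_reward` and the assumption that the A*-distances before and after the move are computed elsewhere and passed in are choices made for the example.

```python
R_COLLISION = -0.5   # r1, applied only on a collision
R_GOAL = 1.0         # r2, one-time bonus on first arrival
R_STEP = -0.01       # r3, applied at every time step

def step_reward(d_before, d_after, collided, first_arrival):
    """Total reward r1 + r2 + r3 + r4 for one agent and one time step.

    d_before and d_after are the agent's A*-distances to its goal before
    and after the action (assumed to be supplied by the caller).
    """
    r1 = R_COLLISION if collided else 0.0
    r2 = R_GOAL if first_arrival else 0.0
    r3 = R_STEP
    r4 = d_before - d_after  # positive when the action shortens the path
    return r1 + r2 + r3 + r4

print(round(step_reward(8, 7, collided=False, first_arrival=False), 2))  # 0.99
print(round(step_reward(7, 8, collided=True, first_arrival=False), 2))   # -1.51
```

A move that shortens the obstacle-aware path by one cell nets 0.99 after the step penalty, whereas a colliding move that lengthens it by one cell costs 1.51.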

3.3. A*-Distance-Guided Action Selection Strategy

Deep RL algorithms such as QMIX typically rely on the ε-greedy strategy for action selection: with probability ε, a random action is chosen from the action space to encourage exploration; with probability 1 − ε, the action with the highest expected return is selected. Formally, with $p$ a random number drawn uniformly from [0, 1]:

π_{i}(τ_{i}) = \begin{cases} \arg\max Q_{i}(τ_{i}, a_{i}^{t}), & p < 1 - ε \\ a_{random}, & \text{otherwise} \end{cases}

(3)

We combine the ε-greedy strategy with A*-distance guidance to govern action selection during learning, a scheme we call the A*-Distance-Guided Action Selection Strategy. Building on conventional ε-greedy, we inject an A*-distance guiding function that evaluates candidate actions according to their directional rationality. By rewarding actions that bring the agent closer to its goal and penalizing those that move it farther away, the strategy guides AGVs to plan paths more efficiently in complex environments, reducing unnecessary exploratory behavior and improving training quality. The improved strategy is:

π_{i}(τ_{i}) = \begin{cases} \arg\max [Q_{i}(τ_{i}, a_{i}^{t}) + G(p_{i}(t), a)], & p < 1 - ε \\ a_{random}, & \text{otherwise} \end{cases}

(4)

where $Q_{i}(τ_{i}, a_{i}^{t})$ is the local Q-value of $AGV_{i}$ for the candidate action $a = a_{i}^{t}$ given trajectory history $τ_{i}$, and $G(p_{i}(t), a)$ is the guiding function value for taking action $a$ in the current state, defined as:

G(p_{i}(t), a) = \frac{d_{i}^{t} - d_{i}^{t+1}(a)}{\max(1, d_{i}^{t})} \times η

(5)

where $d_{i}^{t}$ is the A*-distance from $AGV_{i}$’s current position to its goal at time $t$; $d_{i}^{t+1}(a)$ is the A*-distance from the position reached by executing action $a$ to the goal; and $η$ is a small positive constant. When $G(p_{i}(t), a) > 0$, action $a$ moves the AGV closer to its goal; when $G(p_{i}(t), a) < 0$, it moves the AGV farther away.
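The guiding function and the guided ε-greedy rule can be sketched together as below. Several things here are assumptions rather than details from the paper: the value η = 0.1, the rule that a blocked move leaves the agent in place, a reachable goal, and the names `grid_distance`, `guidance`, and `select_action`. To keep the example short, the path distance is computed with breadth-first search instead of A*; on a unit-cost 4-connected grid both return the same shortest-path length.

```python
import random
from collections import deque

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0),
         "right": (1, 0), "stop": (0, 0)}

def grid_distance(free, start, goal):
    """Shortest-path length over the set of free cells (BFS stand-in for A*)."""
    seen, queue = {start: 0}, deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            return seen[cell]
        for dx, dy in MOVES.values():
            nxt = (cell[0] + dx, cell[1] + dy)
            if nxt in free and nxt not in seen:
                seen[nxt] = seen[cell] + 1
                queue.append(nxt)
    return float("inf")

def guidance(free, pos, goal, action, eta=0.1):
    """Guiding value: normalised change in path distance, scaled by eta."""
    d_now = grid_distance(free, pos, goal)
    dx, dy = MOVES[action]
    nxt = (pos[0] + dx, pos[1] + dy)
    if nxt not in free:
        nxt = pos  # assumption: a blocked move leaves the agent in place
    d_next = grid_distance(free, nxt, goal)
    return (d_now - d_next) / max(1, d_now) * eta

def select_action(q_values, free, pos, goal, epsilon, eta=0.1, rng=random):
    """Guided epsilon-greedy: greedy on Q + G with probability 1 - epsilon."""
    if rng.random() < 1 - epsilon:
        return max(MOVES, key=lambda a: q_values[a] + guidance(free, pos, goal, a, eta))
    return rng.choice(list(MOVES))

# 4 x 3 map with a wall at y = 1 for x = 0..2; the goal lies just beyond it.
free = {(x, y) for x in range(4) for y in range(3)} - {(0, 1), (1, 1), (2, 1)}
flat_q = {a: 0.0 for a in MOVES}  # an untrained agent with no preference
print(select_action(flat_q, free, (0, 0), (0, 2), epsilon=0.0))  # right
```

With uninformative Q-values the guiding term alone breaks the tie in favor of `right`, the only move that shortens the obstacle-aware path, even though the goal lies straight up in Manhattan terms.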

By using this guiding function to augment action selection, agents at every decision step can simultaneously draw on their learned Q-values and assess each candidate action’s contribution to the global path objective, significantly reducing wasteful exploration and path detours. The A*-QMIX is trained using the standard temporal-difference (TD) learning paradigm, with network parameters updated by minimizing the global Q-value prediction error:

L(θ) = \sum_{m=1}^{b} \left[ \left( y_{m}^{tot} - Q_{tot}(τ, a_{t}, s_{t}; θ) \right)^{2} \right]

(6)

y_{m}^{tot} = r_{t} + γ \max_{a_{t}^{'}} Q^{'}(τ^{'}, a_{t}^{'}, s_{t}^{'}; θ^{'})

(7)

where $L(θ)$ is the loss function; $b$ is the mini-batch size sampled from the experience replay buffer; $τ$ and $τ^{'}$ denote the trajectory histories at the current and next time steps; $a_{t}$ and $a_{t}^{'}$ are the actions at the current and next steps; $s_{t}$ and $s_{t}^{'}$ are the global states at the current and next steps; $θ$ and $θ^{'}$ are the parameters of the current and target mixing networks, respectively; $y_{m}^{tot}$ is the target global value; $r_{t}$ is the immediate reward; $γ$ is the discount factor; $Q_{tot}(τ, a_{t}, s_{t}; θ)$ is the current global Q-value; and $Q^{'}(τ^{'}, a_{t}^{'}, s_{t}^{'}; θ^{'})$ is the target global Q-value.
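The target and loss computations in Equations (6) and (7) reduce to a few lines once the network outputs are available. The sketch below works on plain lists of numbers and is only illustrative: the names `td_targets` and `td_loss` are hypothetical, the maximized target-network values are assumed to be precomputed, and the terminal-state mask (`dones`) is a common convention added here that does not appear in Equation (7).

```python
def td_targets(rewards, next_q_tot_max, dones, gamma):
    """Targets y = r + gamma * max_a' Q'_tot, one per sampled transition.

    next_q_tot_max holds the target network's maximised joint value for the
    next step. Zeroing the bootstrap term on terminal transitions is an
    assumption of this sketch.
    """
    return [
        r + gamma * (0.0 if done else q_next)
        for r, q_next, done in zip(rewards, next_q_tot_max, dones)
    ]

def td_loss(q_tot, targets):
    """Sum of squared TD errors over the mini-batch, as in Equation (6)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_tot))

targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_tot_max=[2.0, 3.0],
    dones=[False, True],
    gamma=0.9,
)
print([round(y, 2) for y in targets])              # [2.8, 0.0]
print(round(td_loss([2.0, 1.0], targets), 2))      # 1.64
```

Dividing the sum by the batch size would give the mean squared error mentioned in the text; the two differ only by a constant factor for a fixed batch size.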

By adding the A*-distance guiding function to the Q-value during action selection, the conventional ε-greedy strategy is meaningfully augmented. The Q-value captures the agent’s accumulated strategic knowledge from long-term learning, while the guiding function provides an immediate assessment of each candidate action’s directional rationality based on the true reachable distance to the goal. Together, they ensure that AGVs preferentially select actions that are genuinely beneficial during exploration, effectively suppressing counterproductive choices, reducing unnecessary detours, and improving overall training quality.

4. Experiments and Evaluations

To validate the proposed A*-distance-guided exploration strategy and to isolate the contribution of each of its two components, we conduct comparative and ablation experiments against the baseline QMIX algorithm in four warehouse environments of increasing difficulty: simple, complex, large-scale and congested environments. Because reinforcement learning is sensitive to random initialization, every configuration in every environment is trained over five independent random seeds; learning curves report the cross-seed mean with a ±1 standard-deviation band, summary metrics are reported as mean ± standard deviation, and the statistical significance of differences between methods is reported as p-values.

Four configurations are compared, forming a 2 × 2 ablation over the two components: QMIX (neither component), A*-Distance Reward only (A*-distance replaces Manhattan distance in the reward term), A*-Guided Selection only (A*-Distance-Guided Action Selection Strategy), and A*-QMIX (Full) (both components).

Training behavior is characterized by three metrics: global average reward (the mean episode reward over the entire training period), final reward (the mean reward over the last 100 episodes), and episodes to 90% success rate (the first episode at which the window-100 success rate reaches 90%, reflecting sample efficiency). Hyperparameter settings are listed in Table 1. Test-phase and computational metrics (success rate, arrival rate, collision rate, deadlock rate, path length, A* call count, and training time) are summarized in Table 2.
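The three training metrics can be computed from per-episode logs as sketched below. The function names are hypothetical, and one detail is an assumption: the success-rate window is only evaluated once a full 100 episodes are available, which the text does not state explicitly.

```python
def global_average_reward(rewards):
    """Mean episode reward over the entire training run."""
    return sum(rewards) / len(rewards)

def final_reward(rewards, last=100):
    """Mean reward over the last `last` episodes."""
    tail = rewards[-last:]
    return sum(tail) / len(tail)

def episodes_to_success(successes, window=100, threshold=0.9):
    """First 1-based episode at which the trailing-window success rate
    reaches the threshold, or None if it never does.

    successes is a list of 0/1 flags, one per episode.
    """
    running = 0
    for i, flag in enumerate(successes):
        running += flag
        if i >= window:
            running -= successes[i - window]  # drop the episode leaving the window
        if i + 1 >= window and running / window >= threshold:
            return i + 1
    return None

# 50 failed episodes followed by 150 successful ones.
log = [0] * 50 + [1] * 150
print(episodes_to_success(log))  # 140
```

In the toy log the window first contains 90 successes at episode 140, so that is the value reported as episodes to 90% success rate.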

4.1. Ablation Experiments Across Different Environments

4.1.1. Simple Obstacle Environment

In the simple obstacle environment, the map is a 20 × 20 grid with 96 regularly distributed obstacles (an obstacle density of 24%) simulating a typical warehouse layout. Five AGVs start from different positions across the map and execute their assigned tasks.

Figure 8 reports the training reward, loss, success-rate, and exploration-rate curves, all under the same linear ε-decay schedule. The final reward is approximately 93.4 for all four configurations. The global average reward is 55.78 ± 1.00 for QMIX, 55.15 ± 2.70 for A*-Distance Reward only, 61.36 ± 0.51 for A*-Guided Selection only, and 61.30 ± 0.46 for A*-QMIX (Full); the improvement of A*-QMIX over QMIX (+9.9%) is statistically significant (p < 10⁻⁴). The number of episodes to reach a 90% success rate is 1176 ± 29, 1183 ± 72, 858 ± 29, and 858 ± 21 for the four configurations, respectively, so A*-QMIX reaches the milestone 318 episodes (27%) earlier than QMIX (p < 10⁻⁶).
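
As a quick arithmetic check, the relative figures quoted for the simple obstacle environment follow directly from the reported means. The helper name `relative_gain` is hypothetical; the numbers are those stated above.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


reward_gain = relative_gain(55.78, 61.30)          # global average reward, QMIX vs. A*-QMIX
episodes_saved = 1176 - 858                        # episodes to 90% success, QMIX vs. A*-QMIX
episodes_saved_pct = episodes_saved / 1176 * 100.0

print(f"{reward_gain:.1f}% {episodes_saved} {episodes_saved_pct:.0f}%")  # 9.9% 318 27%
```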

Figure 9, Figure 10, Figure 11 and Figure 12 show the converged path trajectories of the four configurations on a representative seed, each in both 2D planar and 3D space-time representations. Under every configuration, all five AGVs successfully reach their assigned goals.

4.1.2. Complex Obstacle Environment

In the complex obstacle environment, the map is again a 20 × 20 grid but contains 96 irregularly distributed obstacles (an obstacle density of 24%), simulating a more challenging warehouse scenario, and five AGVs start from different positions to execute their tasks.

Figure 13 reports the training reward, loss, success-rate, and exploration-rate curves, all under the same linear ε-decay schedule. The final reward is 68.26 ± 0.06 for QMIX, 76.08 ± 0.27 for A*-Distance Reward only, 68.23 ± 0.07 for A*-Guided Selection only, and 76.25 ± 0.11 for A*-QMIX (Full). The global average reward is 37.66 ± 1.65 for QMIX, 44.89 ± 1.90 for A*-Distance Reward only, 42.75 ± 0.52 for A*-Guided Selection only, and 50.48 ± 0.46 for A*-QMIX (Full); the improvement of A*-QMIX over QMIX (+34.0%) is statistically significant (p < 10⁻⁴). The number of episodes to reach a 90% success rate is 989 ± 50, 998 ± 38, 795 ± 11, and 786 ± 32 for the four configurations, respectively, so A*-QMIX reaches the milestone 203 episodes (21%) earlier than QMIX (p < 10⁻³). In the success-rate curve, the A*-Distance Reward-only configuration exhibits a transient dip in the late training phase before recovering.

Figure 14, Figure 15, Figure 16 and Figure 17 show the converged path trajectories of the four configurations on a representative seed, each in both 2D planar and 3D space-time representations. Under every configuration, all five AGVs successfully reach their assigned goals.

4.1.3. Large-Scale Environment

In the large-scale environment, the map is expanded to a 30 × 30 grid containing 216 irregularly distributed obstacles (an obstacle density of 24%, the same as in the simple and complex environments), and the number of AGVs is increased to 16, simulating a larger and more congested warehouse scenario. Owing to the larger number of AGVs and the higher task difficulty, the learning rate is reduced from 1 × 10⁻⁴ to 1 × 10⁻⁵ in this environment.

Figure 18 reports the training reward, loss, success-rate, and exploration-rate curves, all under the same linear ε-decay schedule. The final reward is 147.46 ± 3.31 for QMIX, 152.77 ± 2.35 for A*-Distance Reward only, 154.30 ± 3.97 for A*-Guided Selection only, and 154.43 ± 9.82 for A*-QMIX (Full). The global average reward is 57.66 ± 3.02 for QMIX, 65.19 ± 3.86 for A*-Distance Reward only, 101.66 ± 0.58 for A*-Guided Selection only, and 106.98 ± 1.15 for A*-QMIX (Full); the improvement of A*-QMIX over QMIX (+85.5%) is statistically significant (p < 10⁻⁶). A*-QMIX and A*-Guided Selection only reach a 90% success rate at 791 ± 18 and 804 ± 26 episodes, respectively, whereas QMIX and A*-Distance Reward only peak near 50% around episode 1500 and then decline, without reaching a 90% success rate within the 2000 training episodes.

Figure 19, Figure 20, Figure 21 and Figure 22 show the converged path trajectories of the four configurations on a representative seed, each in both 2D planar and 3D space-time representations. A*-QMIX (Full) and A*-Guided Selection only route all 16 AGVs to their goals, whereas under QMIX and A*-Distance Reward only some AGVs become trapped in deadlocks and fail to reach their goals.

4.1.4. Congested Environment

In the congested environment, the map is a 31 × 31 grid with 480 regularly placed obstacles (an obstacle density of approximately 50%, roughly twice that of the preceding three environments). The traversable space forms narrow, single-cell-wide corridors, which induce a large number of conflicts. The number of AGVs is 16, simulating a larger and more crowded warehouse scenario. As in the large-scale environment, the learning rate is lowered from 1 × 10⁻⁴ to 1 × 10⁻⁵ owing to the larger number of AGVs and the higher task difficulty.
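
The obstacle densities quoted for the four environments can be reproduced from the grid sizes and obstacle counts given in the text. The dictionary and function names below are hypothetical.

```python
# (grid side length, number of obstacles) as stated for each environment
ENVIRONMENTS = {
    "simple": (20, 96),
    "complex": (20, 96),
    "large-scale": (30, 216),
    "congested": (31, 480),
}


def obstacle_density(side, obstacles):
    """Fraction of grid cells occupied by obstacles on a side x side map."""
    return obstacles / (side * side)


for name, (side, obstacles) in ENVIRONMENTS.items():
    print(f"{name}: {obstacle_density(side, obstacles):.1%}")
# simple: 24.0%
# complex: 24.0%
# large-scale: 24.0%
# congested: 49.9%
```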

Figure 23 reports the training reward, loss, success-rate, and exploration-rate curves, all under the same linear ε-decay schedule. The final reward is 103.74 ± 2.22 for QMIX, 109.38 ± 4.34 for A*-Distance Reward only, 97.68 ± 22.71 for A*-Guided Selection only, and 96.54 ± 11.88 for A*-QMIX (Full). The global average reward is −33.93 ± 6.69 for QMIX, −37.64 ± 9.30 for A*-Distance Reward only, 28.83 ± 2.66 for A*-Guided Selection only, and 28.21 ± 0.83 for A*-QMIX (Full). A*-QMIX and A*-Guided Selection only reach a 90% success rate at 819 ± 71 and 736 ± 40 episodes, respectively, whereas QMIX and A*-Distance Reward only peak near 20% and 30% around episode 1500 and then decline, without reaching a 90% success rate within the 2000 training episodes.

Figure 24, Figure 25, Figure 26 and Figure 27 show the converged path trajectories of the four configurations on a representative seed, each in both 2D planar and 3D space-time representations. A*-QMIX (Full) and A*-Guided Selection only route all 16 AGVs to their goals, whereas under QMIX and A*-Distance Reward only some AGVs become trapped in deadlocks and fail to reach their goals.

4.1.5. MAPF Performance in Ablation Experiments

This section summarizes the Multi-Agent Path Finding (MAPF) evaluation metrics of all four configurations across the four environments. Each model is evaluated under a deterministic greedy rollout (ε = 0). As listed in Table 2, the success rate is the fraction of episodes in which all AGVs reach their goals, whereas the arrival, deadlock, and collision rates are per-AGV averages over the five seeds. The path length (makespan) is computed over successful episodes only (marked “N/A” when no seed solves the task); training time and A* calls are recorded during training and averaged over the five seeds.
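
The metric definitions above can be made concrete with a short sketch. The `Rollout` record and the `summarize` function are hypothetical, one rollout per seed is assumed, and collisions are left out for brevity. The example in the usage note is an assumed per-seed breakdown chosen to be consistent with the aggregate values reported for A*-QMIX (Full) in the congested environment; the actual per-seed outcomes are not given in the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    n_agvs: int       # number of AGVs in the episode
    arrived: int      # AGVs that reached their goals
    deadlocked: int   # AGVs that ended the episode in a deadlock
    makespan: int     # steps until the last AGV arrives (used only on success)


def summarize(rollouts):
    total_agvs = sum(r.n_agvs for r in rollouts)
    solved = [r for r in rollouts if r.arrived == r.n_agvs]
    return {
        # episode-level: every AGV must reach its goal
        "success_rate": len(solved) / len(rollouts),
        # per-AGV rates pooled over all rollouts
        "arrival_rate": sum(r.arrived for r in rollouts) / total_agvs,
        "deadlock_rate": sum(r.deadlocked for r in rollouts) / total_agvs,
        # makespan averaged over successful episodes only
        "path_length": mean([r.makespan for r in solved]) if solved else None,
    }
```

With four seeds in which all 16 AGVs arrive at makespan 14 and one seed in which two AGVs deadlock, `summarize` returns a success rate of 0.8, an arrival rate of 0.975, a deadlock rate of 0.025, and a path length of 14.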

In the simple and complex environments, all four configurations achieve 100% success with no deadlocks, but the two configurations employing A*-guided action selection yield the shortest path lengths. The gap becomes pronounced with more AGVs and more conflicts: in both the large-scale and congested environments, A*-Guided Selection only and A*-QMIX (Full) achieve high success rates, whereas QMIX and A*-Distance Reward only perform poorly. In the congested environment, A*-QMIX (Full) routes almost all AGVs to their goals (97.5% arrival) at the same path length (14.0) as A*-Guided Selection only, with a single seed failing (80% success). This is a minor side effect of the A*-Distance Reward, which measures distance on the static obstacle map alone and can therefore occasionally guide agents toward paths that are prone to conflicts in extremely congested environments. It costs only 2.5% of arrivals and leaves path length unchanged, indicating that this side effect has a negligible impact on overall performance. Across all environments, the two configurations employing A*-guided action selection generally issue the most A* calls yet train the fastest, reflecting the impact of the A*-distance-guided action selection strategy: despite numerous A* calls, it guides agents toward their goals along feasible paths, thereby ending episodes much earlier, and the saved simulation steps far outweigh the cost of computing A* distances.
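
To make the notion of an A*-distance on the static obstacle map concrete, the following sketch computes the shortest 4-connected path length on a grid with A* and a Manhattan heuristic. It is a generic textbook A* rather than the implementation used in the experiments; the function names and the small 3 × 3 example map are assumptions for illustration. As in the discussion above, other AGVs are not part of the map, so the distance ignores dynamic conflicts.

```python
import heapq


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def astar_distance(grid, start, goal):
    """Shortest 4-connected path length from start to goal, or None if unreachable.

    grid[r][c] == 1 marks a static obstacle; cells are (row, col) tuples.
    """
    rows, cols = len(grid), len(grid[0])
    open_heap = [(manhattan(start, goal), 0, start)]   # entries are (f, g, cell)
    best = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            return g
        if g > best[cell]:
            continue                                   # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best.get(nxt, float("inf")):
                    best[nxt] = ng
                    heapq.heappush(open_heap, (ng + manhattan(nxt, goal), ng, nxt))
    return None


GRID = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(manhattan((0, 0), (2, 0)), astar_distance(GRID, (0, 0), (2, 0)))  # 2 6
```

On this map, stepping right from (0, 0) to (0, 1) increases the Manhattan distance to the goal from 2 to 3 but decreases the A*-distance from 6 to 5, which is the kind of move a Manhattan-based signal would wrongly discourage.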

4.2. Comparison with Existing Methods

This section compares the proposed A*-QMIX with five other algorithms: QMIX, Value-Decomposition Networks (VDN), Multi-Agent Proximal Policy Optimization (MAPPO), Conflict-Based Search (CBS), and LaCAM. The training performance of A*-QMIX, QMIX, VDN, and MAPPO is compared as shown in Figure 28, Figure 29, Figure 30 and Figure 31.

Evaluating both final reward and final training success rate, we compare A*-QMIX with QMIX, VDN, and MAPPO across the four environments. The simple obstacle environment is easy enough that all four methods achieve rewards close to 93 and success rates near 100%, with no meaningful gap between them. In the complex obstacle environment, A*-QMIX attains the highest final reward (76.25 ± 0.11), compared to 68.26 ± 0.06, 68.23 ± 0.06, and 67.34 ± 0.33 for QMIX, VDN, and MAPPO, respectively, while all methods still maintain success rates near 100%.

This advantage becomes more pronounced as the number of AGVs increases and the environments become more complex and congested. In the large-scale environment, A*-QMIX achieves the highest reward (154.43 ± 9.82) with an 88.0% final training success rate, whereas QMIX falls to 26.0%, MAPPO falls to 69.6%, and VDN fails to complete the task. In the congested environment, A*-QMIX’s success rate still reaches 88.6%, while QMIX, VDN, and MAPPO all fail to complete the task.

Overall, the algorithms perform comparably in the simple environment, but as AGV count and obstacle density increase, A*-QMIX’s lead becomes increasingly significant, and it is the only method that reliably solves the large-scale and congested tasks.

The success rate and path length of the MARL algorithms above, together with those of the search-based solvers CBS and LaCAM, are evaluated in each environment; Table 3 summarizes the results for all six methods.

In the simple and complex obstacle environments, all methods reach a 100% success rate, and A*-QMIX gives the shortest paths among the MARL methods, matching CBS and LaCAM (22.0 and 20.0). The gap widens as the AGV count and obstacle density rise. In the large-scale environment, QMIX reaches a 20% success rate, and VDN and MAPPO fail entirely. In the congested environment, all three competing MARL methods fail. A*-QMIX still completes both tasks. In the large-scale environment, its path length is 19.0 ± 2.2, slightly longer than CBS (16.0 ± 0.0) and LaCAM (18.0 ± 0.0). In the congested environment, it is 14.0 ± 0.0, slightly longer than CBS (13.0 ± 0.0) but far shorter than LaCAM (24.0 ± 0.0). Overall, A*-QMIX is the only learning-based method that reliably solves the two hardest tasks while keeping near-optimal path length.
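
The path-length gaps in the two hardest environments can be expressed as relative overheads with respect to each search-based solver. The sketch below only re-uses the mean path lengths reported in Table 3; the names `PATH_LENGTH` and `overhead_vs` are hypothetical.

```python
# Mean path lengths from Table 3 for the two hardest environments
PATH_LENGTH = {
    "large-scale": {"A*-QMIX": 19.0, "CBS": 16.0, "LaCAM": 18.0},
    "congested": {"A*-QMIX": 14.0, "CBS": 13.0, "LaCAM": 24.0},
}


def overhead_vs(reference, env, method="A*-QMIX"):
    """Relative path-length overhead of `method` over `reference`, in percent."""
    lengths = PATH_LENGTH[env]
    return (lengths[method] - lengths[reference]) / lengths[reference] * 100.0


for env in PATH_LENGTH:
    print(env, f"{overhead_vs('CBS', env):+.1f}%", f"{overhead_vs('LaCAM', env):+.1f}%")
# large-scale +18.8% +5.6%
# congested +7.7% -41.7%
```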

5. Conclusions

This paper addresses the problems of low exploration efficiency and poor exploration quality that arise in multi-AGV path planning, and proposes an A*-distance-guided improvement method whose effectiveness is confirmed through simulation experiments. The main findings and future directions are summarized as follows.

(1): We incorporated A*-distance into the $ε$-greedy exploration strategy within the QMIX framework to design the A*-Distance-Guided Action Selection Strategy for multi-AGV path planning. By rewarding actions that bring agents closer to their goals and penalizing those that move them farther away, the strategy effectively reduces unnecessary exploration and improves learning efficiency. Replacing Manhattan distance with A*-distance in reward computation further provides a more accurate, obstacle-aware feedback signal.
(2): Simulation results show the following. In the simple obstacle environment, the proposed method achieves a global average reward of 61.3 vs. 55.8 for standard QMIX (+9.9%) and reaches the 90% success rate approximately 318 episodes earlier; both algorithms converge to a comparable final reward of approximately 93.4 with comparable path quality. In the complex obstacle environment, the proposed method achieves a global average reward of 50.5 vs. 37.7 for standard QMIX (+34.0%), reaches the 90% success rate approximately 203 episodes earlier, and achieves a notably higher converged final reward (76.3 vs. 68.3). In the large-scale environment, the proposed method achieves a global average reward of 107.0 vs. 57.7 for standard QMIX (+85.5%) and reaches a 90% success rate within 2000 episodes, whereas standard QMIX does not. In the congested environment, the proposed method substantially raises the global average reward from −33.9 for standard QMIX to 28.2 and reaches a 90% success rate within 2000 episodes, whereas standard QMIX does not. Ablation comparisons indicate that the A*-distance-guided action selection is the primary source of these improvements and is decisive for convergence in the large-scale and congested environments, whereas the A*-distance reward plays a supporting role by providing a more accurate reward signal and raising the converged final reward. These results validate the effectiveness of the A*-distance-guided improvement strategy in enhancing exploration efficiency and policy quality for multi-AGV path planning, with improvements becoming more pronounced as environment scale and obstacle density increase. Compared with other algorithms, A*-QMIX is the only learning-based method that reliably solves the large-scale and congested tasks (100% and 80% test success rate), whereas QMIX, VDN, and MAPPO largely fail (≤20%). Its path length (19.0 and 14.0) remains close to that of the search-based solvers CBS (16.0 and 13.0) and LaCAM (18.0 and 24.0).
(3): By combining A*-distance guidance with multi-agent reinforcement learning, this work offers a solution for multi-AGV cooperative path planning in unmanned warehouses that balances convergence speed and planning quality. Future work may explore dynamic task assignment, even larger AGV fleets, and the impact of uncertainty in real-world warehouse settings, with the aim of extending the practical applicability of the proposed approach.

Author Contributions

Conceptualization, P.M., P.W. and Y.F.; writing—original draft preparation, Y.F.; writing—review and editing, Y.F. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Peiyan Mao is employed by Anghua (Shanghai) Automation Engineering Co., Ltd. Author Pengfei Wang is employed by Anghua (Shanghai) Automation Engineering Co., Ltd. All other authors declare that there are no commercial or financial relationships that could be regarded as potential conflicts of interest regarding this research.

Figure 1. Mapping physical warehouse layouts to grid-based abstraction.

Figure 2. Diagram of the experimental environments. (a) Simple Obstacle Environment; (b) Complex Obstacle Environment; (c) Large-Scale Environment; (d) Congested Environment.

Figure 3. Overview of the A*-QMIX framework.

Figure 4. Agent local field of view (5 × 5 FOV).

Figure 5. Types of conflict in multi-AGV path planning. (a) Head-on Conflict; (b) Node Conflict; (c) Occupancy Conflict.

Figure 6. Priority-based conflict resolution workflow.

Figure 7. Comparison of actual movement paths under Manhattan distance versus A*-distance in an obstacle avoidance scenario.

Figure 8. Training curve comparison in the simple obstacle environment. (a) Training Reward Comparison. (b) Training Loss Comparison. (c) Training Success Rate Comparison. (d) Exploration Rate Decay.

Figure 9. Converged path trajectories of QMIX in the simple obstacle environment. (a) 3D Trajectory in the simple obstacle environment. (b) 2D Trajectory in the simple obstacle environment.

Figure 10. Converged path trajectories of A*-Distance Reward only in the simple obstacle environment. (a) 3D Trajectory in the simple obstacle environment. (b) 2D Trajectory in the simple obstacle environment.

Figure 11. Converged path trajectories of A*-Guided Selection only in the simple obstacle environment. (a) 3D Trajectory in the simple obstacle environment. (b) 2D Trajectory in the simple obstacle environment.

Figure 12. Converged path trajectories of A*-QMIX (Full) in the simple obstacle environment. (a) 3D Trajectory in the simple obstacle environment. (b) 2D Trajectory in the simple obstacle environment.

Figure 13. Training curve comparison in the complex obstacle environment. (a) Training Reward Comparison. (b) Training Loss Comparison. (c) Training Success Rate Comparison. (d) Exploration Rate Decay.

Figure 14. Converged path trajectories of QMIX in the complex obstacle environment. (a) 3D Trajectory in the complex obstacle environment. (b) 2D Trajectory in the complex obstacle environment.

Figure 15. Converged path trajectories of A*-Distance Reward only in the complex obstacle environment. (a) 3D Trajectory in the complex obstacle environment. (b) 2D Trajectory in the complex obstacle environment.

Figure 16. Converged path trajectories of A*-Guided Selection only in the complex obstacle environment. (a) 3D Trajectory in the complex obstacle environment. (b) 2D Trajectory in the complex obstacle environment.

Figure 17. Converged path trajectories of A*-QMIX (Full) in the complex obstacle environment. (a) 3D Trajectory in the complex obstacle environment. (b) 2D Trajectory in the complex obstacle environment.

Figure 18. Training curve comparison in the large-scale environment. (a) Training Reward Comparison. (b) Training Loss Comparison. (c) Training Success Rate Comparison. (d) Exploration Rate Decay.

Figure 19. Converged path trajectories of QMIX in the large-scale environment. (a) 3D Trajectory in the large-scale environment. (b) 2D Trajectory in the large-scale environment.

Figure 20. Converged path trajectories of A*-Distance Reward only in the large-scale environment. (a) 3D Trajectory in the large-scale environment. (b) 2D Trajectory in the large-scale environment.

Figure 21. Converged path trajectories of A*-Guided Selection only in the large-scale environment. (a) 3D Trajectory in the large-scale environment. (b) 2D Trajectory in the large-scale environment.

Figure 22. Converged path trajectories of A*-QMIX (Full) in the large-scale environment. (a) 3D Trajectory in the large-scale environment. (b) 2D Trajectory in the large-scale environment.

Figure 23. Training curve comparison in the congested environment. (a) Training Reward Comparison. (b) Training Loss Comparison. (c) Training Success Rate Comparison. (d) Exploration Rate Decay.

Figure 24. Converged path trajectories of QMIX in the congested environment. (a) 3D Trajectory in the congested environment. (b) 2D Trajectory in the congested environment.

Figure 25. Converged path trajectories of A*-Distance Reward only in the congested environment. (a) 3D Trajectory in the congested environment. (b) 2D Trajectory in the congested environment.

Figure 26. Converged path trajectories of A*-Guided Selection only in the congested environment. (a) 3D Trajectory in the congested environment. (b) 2D Trajectory in the congested environment.

Figure 27. Converged path trajectories of A*-QMIX (Full) in the congested environment. (a) 3D Trajectory in the congested environment. (b) 2D Trajectory in the congested environment.

Figure 28. Training performance comparison in the simple obstacle environment. (a) Total Reward Comparison. (b) Success Rate Comparison.

Figure 29. Training performance comparison in the complex obstacle environment. (a) Total Reward Comparison. (b) Success Rate Comparison.

Figure 30. Training performance comparison in the large-scale environment. (a) Total Reward Comparison. (b) Success Rate Comparison.

Figure 31. Training performance comparison in the congested environment. (a) Total Reward Comparison. (b) Success Rate Comparison.

Table 1. Algorithm hyperparameter settings.

Parameter	Value
Discount factor ($γ$)	0.95
Initial $ε$ value	1
$ε$ decay rate	0.5 × 10⁻³
Mini-batch size ($b$)	64
Learning rate ($l$)	1 × 10⁻⁴ (simple, complex)/1 × 10⁻⁵ (large-scale, congested)
Target network update frequency ($c$)	100 episodes
Total training episodes ($k$)	2000
Guiding coefficient ($η$)	0.1
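
The settings in Table 1 can be collected in a small configuration object, together with a sketch of the linear ε-decay schedule they imply. Several points are assumptions rather than statements from the table: the decay is taken to be subtractive and applied once per episode, and the floor `epsilon_min = 0.0` is hypothetical. Under these assumptions ε reaches zero exactly at the end of the 2000 training episodes. The class and function names are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    gamma: float = 0.95              # discount factor
    epsilon_start: float = 1.0       # initial epsilon
    epsilon_decay: float = 0.5e-3    # epsilon decay rate
    batch_size: int = 64             # mini-batch size
    learning_rate: float = 1e-4      # 1e-5 in the large-scale and congested environments
    target_update_every: int = 100   # target network update frequency, in episodes
    total_episodes: int = 2000       # total training episodes
    eta: float = 0.1                 # guiding coefficient


def epsilon_at(episode, hp=Hyperparameters(), epsilon_min=0.0):
    """Linear decay, assumed to be applied per episode and clipped at epsilon_min."""
    return max(epsilon_min, hp.epsilon_start - hp.epsilon_decay * episode)
```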

Table 2. MAPF evaluation metrics across environments.

Environments	Method	Success Rate	Arrival Rate	Collision Rate	Deadlock Rate	Path Length	Training Time (s)	A* Calls (Training)
Simple (20 × 20, 5 AGVs)	QMIX	100.0%	100.0%	0.0%	0.0%	25.2 ± 6.6	6017	0
	A*-Distance Reward only	100.0%	100.0%	0.0%	0.0%	22.2 ± 0.4	6026	1,449,654
	A*-Guided Selection only	100.0%	100.0%	0.0%	0.0%	22.0 ± 0.0	5464	3,231,204
	A*-QMIX (Full)	100.0%	100.0%	0.0%	0.0%	22.0 ± 0.0	5589	4,529,552
Complex (20 × 20, 5 AGVs)	QMIX	100.0%	100.0%	0.0%	0.0%	20.4 ± 0.9	5476	0
	A*-Distance Reward only	100.0%	100.0%	0.0%	0.0%	20.2 ± 0.4	5583	1,322,456
	A*-Guided Selection only	100.0%	100.0%	0.0%	0.0%	20.0 ± 0.0	5106	2,718,567
	A*-QMIX (Full)	100.0%	100.0%	0.0%	0.0%	20.0 ± 0.0	5130	3,892,883
Large-scale (30 × 30, 16 AGVs)	QMIX	20.0%	90.0%	5.0%	10.0%	30.0	17,129	0
	A*-Distance Reward only	0.0%	80.0%	2.5%	20.0%	N/A	18,529	5,828,229
	A*-Guided Selection only	100.0%	100.0%	6.2%	0.0%	17.6 ± 2.6	10,823	7,644,380
	A*-QMIX (Full)	100.0%	100.0%	5.0%	0.0%	19.0 ± 2.2	11,317	11,269,181
Congested (31 × 31, 16 AGVs)	QMIX	0.0%	75.0%	0.0%	25.0%	N/A	15,465	0
	A*-Distance Reward only	0.0%	90.0%	0.0%	10.0%	N/A	15,233	5,285,939
	A*-Guided Selection only	100.0%	100.0%	0.0%	0.0%	14.0 ± 0.0	9104	5,104,937
	A*-QMIX (Full)	80.0%	97.5%	0.0%	2.5%	14.0 ± 0.0	9462	8,359,040

Table 3. Comparison with existing methods across environments.

Method	Simple (20 × 20, 5 AGVs)		Complex (20 × 20, 5 AGVs)		Large-Scale (30 × 30, 16 AGVs)		Congested (31 × 31, 16 AGVs)
Method	Success Rate	Path Length	Success Rate	Path Length	Success Rate	Path Length	Success Rate	Path Length
A*-QMIX	100.0%	22.0 ± 0.0	100.0%	20.0 ± 0.0	100.0%	19.0 ± 2.2	80.0%	14.0 ± 0.0
QMIX	100.0%	25.2 ± 6.6	100.0%	20.4 ± 0.9	20.0%	30.0	0.0%	N/A
VDN	100.0%	22.4 ± 0.5	100.0%	20.8 ± 1.8	0.0%	N/A	0.0%	N/A
MAPPO	100.0%	24.6 ± 5.3	100.0%	20.2 ± 0.4	0.0%	N/A	0.0%	N/A
CBS	100.0%	22.0 ± 0.0	100.0%	20.0 ± 0.0	100.0%	16.0 ± 0.0	100.0%	13.0 ± 0.0
LaCAM	100.0%	22.0 ± 0.0	100.0%	20.0 ± 0.0	100.0%	18.0 ± 0.0	100.0%	24.0 ± 0.0
