1. Introduction
The automated guided vehicle (AGV) is widely used for material transport in manufacturing workshops because of its efficient transportation mode, and it has become an essential and indeed indispensable part of the workshop [1,2]. Path planning is the central problem in AGV research: it computes a feasible path from the start position to the goal position on the map and ensures that the AGV does not collide with static or dynamic obstacles while moving. How to obtain collision-free, conflict-free, and efficient paths for AGVs in dynamic environments remains one of the main focuses of AGV system research.
Many researchers have applied representative algorithms to the AGV path planning problem, including graph-based search algorithms such as Dijkstra [3], A* [4,5], and D* [6]. These algorithms are simple and easy to implement and have been widely used in AGV path planning. However, when the environment is large and the obstacles are complex, the state space grows exponentially and the time efficiency of these algorithms is low. To address this problem, other families of algorithms have been proposed, such as Rapidly-exploring Random Trees and intelligent bionic algorithms. The Rapidly-exploring Random Tree (RRT) [7] is a sampling-based algorithm for solving the AGV path planning problem. Its advantage is efficiency and speed in high-dimensional path search; its disadvantage is that random sampling of the environment often yields paths that are not optimal. Intelligent bionic algorithms are inspired by biological processes; typical examples include the genetic algorithm (GA) [8,9], the ant colony algorithm (ACA) [10], and particle swarm optimization (PSO) [11]. When dealing with path planning under complex dynamic environment information, such nature-inspired methods often work well. However, in complex map environments they still suffer from slow computation and a tendency to fall into local optima.
Almasri et al. [12] proposed a route-following and obstacle-avoidance technique relying on low-cost infrared sensors, which can easily be used in real-time microcontroller control applications. Matveev et al. [13] proposed an integrated navigation strategy to safely guide a robot to its target through an environment full of moving non-convex obstacles. Tanveer et al. [14] defined the path to be followed as a curve on the plane and fed it to a path-tracking controller; obstacles modify a Gaussian-function model of the original curve to generate the final safe path, and the method shows strong robustness and safety. These three methods can quickly obtain a feasible, obstacle-avoiding path; however, the resulting path may be suboptimal.
In recent years, with the rapid development of deep learning (DL) and reinforcement learning (RL), RL applications have achieved significant breakthroughs in many fields, such as games [15,16], healthcare [17], and robotics [18]. Reinforcement learning is an artificial intelligence method in which agents adapt to complex environments by constantly interacting with them to learn the best strategies. Because RL learns actively and adapts to complex dynamic environments, it provides a new way to solve path planning problems, and more and more scholars have introduced learning-based methods into the path planning field [19,20] to overcome the limitations of classical path planning methods.
RL algorithms can be classified into model-free and model-based reinforcement learning according to whether an environment model is used. Model-free algorithms directly improve the policy or estimate the value function from data sampled through agent-environment interaction; examples include classical Q-Learning [21], Sarsa, and other temporal-difference algorithms. Among them, the Q-Learning algorithm is widely used in path planning [19,22,23] because it learns without requiring an environment map. Many researchers use the Euclidean distance in the fitness and reward functions of Q-Learning. For example, the Euclidean distance between the current position of a mobile robot and the target position has been used to evaluate the fitness function in a multi-robot environment [24]. Low et al. [25] added a distance metric to Q-Learning to guide agents toward the target position.
As the map environment becomes more complex, the Q-Learning algorithm exhibits low learning efficiency and slow convergence. Dyna-Q [26] is a classical model-based reinforcement learning algorithm that integrates planning and direct RL. The direct RL component improves the value function and policy using the real experience generated by agent-environment interaction; at the same time, the algorithm learns an environment model and uses this model to assist policy improvement and value estimation, which gives Dyna-Q high efficiency. Viet et al. [27] evaluated the efficiency of five reinforcement learning algorithms for path planning, namely Q-Learning, Sarsa, Q(λ), Sarsa(λ), and Dyna-Q. Their experimental results show that Dyna-Q converges faster when solving the path planning problem of mobile robots.
In order to further improve the convergence performance of Dyna-Q, a common approach is to combine Dyna-Q with other algorithms. An improved Dyna-H algorithm was proposed in [28]: during the planning process of the Dyna-Q algorithm, the actions that lead to the worst trajectories are chosen for simulated policy updates, integrating the advantages of heuristic search into Dyna agents. Dabouni et al. [29] proposed a direct heuristic dynamic programming algorithm based on Dyna (Dyna-HDP); by combining neural networks with Dyna, the optimal policy can be obtained in the fewest trials. Pei et al. [30] proposed a new action-selection policy combined with a simulated annealing mechanism; it combines a heuristic reward function and heuristic actions to address the exploration-exploitation dilemma and improve the convergence and learning efficiency of path planning.
Wang et al. [31] introduced an exploration factor that applies ant colony techniques to the selection of candidates in indirect learning, proposing a hybrid exploration strategy to improve the sample efficiency of the Dyna-Q algorithm. Zajdel [32] proposed an epoch-incremental learning method that uses breadth-first search to determine, at the end of each episode, the shortest distances from the observed non-terminal states to the target state and performs additional policy updates based on these distances.
Although Dyna-Q has shown excellent performance in path planning, difficulties and challenges remain in large dynamic environments. First, when the map is large, the sparse reward problem makes it difficult for the AGV to find the target location, which leads to many iterations and low learning efficiency; improving the search efficiency and convergence speed of the Dyna-Q algorithm is still a challenge. Second, a collision-free and efficient path must be planned in an environment with multiple dynamic obstacles.
In order to address these difficulties, this research proposes an Improved Dyna-Q algorithm for path planning in dynamic environments. Unlike the original Dyna-Q algorithm, the Improved Dyna-Q algorithm adopts a novel global guidance mechanism based on a heuristic graph and combines it with an improved dynamic reward function and an improved heuristic planning strategy. The global guidance mechanism provides long-term guidance toward the target position but does not require the AGV to follow it strictly; the advantage is that the AGV is encouraged to explore all potential solutions and thus avoids falling into locally optimal solutions.
The Improved Dyna-Q algorithm proposed in this study is applied to AGV path planning in dynamic environments, with the following main contributions:
Propose a global path guidance mechanism based on a heuristic graph for large, complex dynamic environments, which effectively reduces the path search space and improves the efficiency of searching for the optimal path.
Propose a new dynamic reward function and action selection method, which provide denser feedback signals and more efficient action selection for AGV path planning, effectively alleviating the slow convergence caused by the sparse rewards of Dyna-Q.
Experimental results on random obstacle maps and on maps with different numbers of dynamic obstacles show that the proposed algorithm converges quickly in unknown static environments; moreover, it can effectively solve the AGV path planning problem in environments with different numbers of dynamic obstacles, without obtaining or predicting the trajectories of the dynamic obstacles.
The rest of this paper is organized as follows: Section 2 introduces the relevant background, including the path planning problem in dynamic environments and the basic concepts of reinforcement learning and the Dyna-Q algorithm. Section 3 discusses the details of the proposed algorithm and describes the concept of the heuristic graph in detail. Section 4 presents simulation experiments and verifies the efficiency of our method through comparative experiments. Finally, Section 5 summarizes the work of this paper.
2. Related Background
Reinforcement learning learns how to optimize an objective through trial-and-error interaction with the environment. The theoretical framework of RL is usually based on a Markov Decision Process (MDP). In this section, we first describe the AGV path planning problem in dynamic environments; then, we briefly summarize the theory of RL and MDPs, and finally introduce the Dyna-Q algorithm.
2.1. Problem Description
This work considers an undirected grid graph environment $G=(V,E)$ of size $W \times H$, where $V$ is the set of vertices of the graph $G$, $V=\{v=(x,y) \mid 1 \le x \le W, 1 \le y \le H\}$, the length of each grid cell is $L$, and $E$ is the set of edges of $G$. The set of static obstacles in the environment is defined as $O_s=\{o_1, o_2, \ldots, o_{N_s}\} \subset V$, where $o_i$ represents a static obstacle vertex and $N_s$ is the total number of static obstacles. The set of free cells excluding static obstacles is defined as $V_{free}=V \setminus O_s$. Following the traditional 4-way connectivity rule, for each vertex $v=(x,y) \in V$, its neighborhood is $N(v)=\{v'=(x',y') \in V \mid |x-x'|+|y-y'|=1\}$. The set of dynamic obstacles $O_d^t=\{d_1^t, d_2^t, \ldots, d_{N_d}^t\}$ denotes the positions of the dynamic obstacles at time $t$, where $d_i^t \in V_{free}$.
Definition 1. (Path $P$) For an AGV with start position $v_{start} \in V_{free}$ and goal position $v_{goal} \in V_{free}$, where $V_{free}$ is the set of free cells, a path is defined as a sequence of vertices $P=(p_0, p_1, \ldots, p_T)$ satisfying (i) $p_0=v_{start}$; (ii) $p_T=v_{goal}$; (iii) $p_{t+1} \in N(p_t)$ or $p_{t+1}=p_t$, where $p_t$ denotes the path node at time step $t$.
Definition 2. (A conflict-free path $P$) A path $P=(p_0, p_1, \ldots, p_T)$, where $p_t$ denotes the path node at each time step $t$, is conflict-free if it satisfies (i) $p_t \notin O_s, \forall t$, where $O_s$ is the set of static obstacles (no collision with static obstacles); (ii) $p_t \notin O_d^t, \forall t$, where $O_d^t$ is the set of positions of the dynamic obstacles at time step $t$ (no vertex conflict with any dynamic obstacle); (iii) $\neg\,(p_{t+1}=d_i^t \wedge p_t=d_i^{t+1}), \forall t, \forall i$, where $p_t$ is the path node at time step $t$ and $d_i^t$ is the position of the $i$th dynamic obstacle at time step $t$ (no edge conflict with any dynamic obstacle).
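To make Definitions 1 and 2 concrete, a minimal Python sketch of the corresponding checks is given below. The data structures (tuples for cells, a per-time-step dictionary of dynamic obstacle positions) and the function names are our own illustrative assumptions, not code from the paper.

```python
# Illustrative check of Definitions 1-2 on a 4-connected grid (sketch only).
# path: list of (x, y) cells, one per time step.
# dynamic_obstacles[t]: dict mapping obstacle id -> its cell at time step t.

def is_valid_path(path, start, goal):
    # Definition 1: starts at the start cell, ends at the goal cell, and each
    # move is either a wait or a step to a 4-connected neighbor.
    if not path or path[0] != start or path[-1] != goal:
        return False
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) > 1:
            return False
    return True


def is_conflict_free(path, static_obstacles, dynamic_obstacles):
    for t, cell in enumerate(path):
        # (i) no collision with static obstacles
        if cell in static_obstacles:
            return False
        # (ii) no vertex conflict with a dynamic obstacle at time t
        if t < len(dynamic_obstacles) and cell in dynamic_obstacles[t].values():
            return False
        # (iii) no edge conflict: AGV and a dynamic obstacle swap cells between t and t+1
        if t + 1 < len(path) and t + 1 < len(dynamic_obstacles):
            for i, pos_t in dynamic_obstacles[t].items():
                if path[t + 1] == pos_t and cell == dynamic_obstacles[t + 1].get(i):
                    return False
    return True
```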
The AGV path planning problem can thus be expressed as an optimization problem: the objective is to minimize the path length of the AGV from the start position to the target position, subject to the constraint that the path avoids collisions with static and dynamic obstacles.
Since obstacles are usually irregular in shape, a grid cell that contains any part of an obstacle is treated as an obstacle grid. Dynamic obstacles move randomly over the vertices of graph G. We consider a scenario similar to [33], where a camera installed above the AGV workspace captures dynamic obstacle information. This paper assumes that the AGV can obtain its global position in the environment while moving and that it knows the locations of all static obstacles. However, it does not assume that the AGV knows the trajectories of the dynamic obstacles; the AGV can only observe the current positions of dynamic obstacles within its local field of view.
Specifically, to illustrate the AGV path planning problem considered in this study, Figure 1 shows an example of AGV path planning in a dynamic environment.
2.2. MDP of AGV Path Planning
Standard reinforcement learning is a learning process based on trial-and-error interaction with the environment within a Markov Decision Process framework. In general, an MDP is represented as a five-tuple $(S, A, P, R, \gamma)$, where $S$ is the set of environment states, $A$ is the set of agent actions, $P$ is the state transition probability function, $R$ is the reward function, and $\gamma \in [0,1]$ is the discount rate. At each step, the RL agent takes an action $a_t \in A$ in some state $s_t \in S$; the environment then transitions to state $s_{t+1}$ and returns a reward $r_{t+1}$. The agent continuously learns by interacting with the environment and eventually learns an optimal policy to reach the goal.
Definition 3. (State set $S$) In this work, the map environment is modeled as a grid map, so the state set $S$ is defined as $S=\{s=(x,y) \mid 1 \le x \le W, 1 \le y \le H\}$, where $W$ and $H$ are the width and height of the grid map.

Definition 4. (Action set $A$) The action set is the set of all possible actions the AGV can take to move from the current state to the next state. In this study, the action set of the AGV is defined as $A=\{a_0, a_1, a_2, a_3, a_4\}$, whose elements, respectively, denote waiting for one time step in the current state and moving forward, backward, left, and right.
Definition 5. (Reward function $R$) The typical reward function $R$ is assigned as

$R = \begin{cases} r_{goal} > 0, & \text{if the AGV reaches the target position} \\ r_{obstacle} < 0, & \text{if the AGV collides with a static or dynamic obstacle} \\ 0, & \text{otherwise.} \end{cases}$

That is, the typical reward function gives a positive reward when the AGV reaches the target position, a negative reward when it collides with a static or dynamic obstacle, and a reward of 0 in all other cases.
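As a simple illustration of Definitions 3-5, the state set, action set, and typical sparse reward can be written as follows. The map size and the reward magnitudes (+1/-1) are placeholders of ours, not values taken from the paper.

```python
# Grid MDP components for Definitions 3-5 (illustrative sketch).
W, H = 30, 30                                   # example map width and height
STATES = [(x, y) for x in range(1, W + 1) for y in range(1, H + 1)]

# Action set: wait, forward, backward, left, right, encoded as (dx, dy) offsets.
ACTIONS = {"wait": (0, 0), "forward": (0, 1), "backward": (0, -1),
           "left": (-1, 0), "right": (1, 0)}

def typical_reward(next_state, goal, obstacles):
    """Sparse reward of Definition 5: positive at the goal, negative on a
    collision, and 0 everywhere else (magnitudes are placeholders)."""
    if next_state == goal:
        return 1.0
    if next_state in obstacles:
        return -1.0
    return 0.0
```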
Given the MDP definition, the reinforcement learning objective is to learn an optimal policy that maximizes the expected sum of future rewards [34]. A policy $\pi(a \mid s)$ is a mapping from states to probabilities of selecting each possible action. The discounted return $G_t$ is defined by

$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},$

where $r_t$ denotes the immediate reward at time $t$ and $\gamma \in [0,1]$ is a discount factor indicating how strongly future rewards affect the present. Therefore, the reinforcement learning objective can be expressed as follows [35]:

$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \Big[ \sum_{t=0}^{T} \gamma^{t} r_t \Big]. \quad (4)$

In Equation (4), $T$ is the time step at the end of each episode and $\tau$ is the episode trajectory (the sequence of observations and actions obtained throughout the episode). The goal is to learn an optimal policy $\pi^{*}$ that maximizes the expected sum of future rewards.
2.3. Dyna-Q Algorithm
Q-Learning is a classical model-free RL algorithm for solving MDPs; its main advantage is that it uses temporal-difference updates to achieve off-policy learning.
Definition 6. (Q-value $Q^{\pi}(s,a)$) The Q-value $Q^{\pi}(s,a)$ is defined as the expected return of the state-action pair $(s,a)$ under the policy $\pi$:

$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right].$

The core idea of the Q-Learning algorithm is that the Q-value of the next state-action pair is generated by the policy being evaluated rather than by the policy currently used to collect experience. The final policy of Q-Learning is obtained by iterating the state-action value function $Q(s,a)$.
Definition 7. (Q-value updating) A Q-value is updated when the AGV is in state $s$ and takes an action $a$. The Q-function, or Q-value, is updated using

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right], \quad (6)$

where $Q(s,a)$ is the Q-value of executing action $a$ in state $s$, $\max_{a'} Q(s',a')$ is the highest Q-value among all actions in the next state $s'$, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.

The Dyna-Q algorithm is a model-based RL method that integrates planning and direct RL. Planning optimizes the policy by using simulated experience randomly generated from the learned environment model, while direct RL improves the value function and policy by using the real experience generated through interaction with the environment, as shown in Figure 2. The Dyna agent interacts with the real environment to obtain experience and improves its policy by iterating Equation (6); at the same time, an environment model is learned from the real experience, and the simulated data generated by the model are then used to update the action-value function with the same Q-Learning update rule.
In an actual path planning task, real experience obtained by interacting with the real environment is usually expensive and time-consuming to collect, while generating simulated experience from the learned environment model is fast and convenient, which reduces the training time of the algorithm.
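The classical Dyna-Q loop described above can be sketched in a few lines of tabular Python. The environment interface assumed here (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) is our own illustrative convention and not part of the original algorithm description.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n_planning, alpha=0.1, gamma=0.95, eps=0.1):
    """Minimal tabular Dyna-Q sketch: direct RL + model learning + random planning."""
    Q = defaultdict(float)          # Q[(s, a)] -> action value
    model = {}                      # model[(s, a)] -> (reward, next_state)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # direct RL: epsilon-greedy action selection and Q-Learning update
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            model[(s, a)] = (r, s2)             # learn the environment model
            # planning: replay n randomly chosen simulated transitions
            for _ in range(n_planning):
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q
```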
When the Dyna-Q algorithm is applied to path planning in large-scale dynamic environments, the following problems arise:
In the path planning problem, the traditional reward function provides a reward update only when the destination is reached or an obstacle is encountered. In large and complex environments there is a large amount of negative or invalid search space, so the traditional reward function suffers from sparse rewards; that is, the AGV rarely obtains positive rewards during exploration, which results in slow learning or even failure to learn.
In the early stage of Dyna-Q, every step of the policy is exploratory. In addition, when the Dyna architecture plans with the environment model, it selects the state and action to update at random, which is somewhat blind. Therefore, applying the Dyna-Q algorithm to path planning in large-scale dynamic environments suffers from low learning efficiency and long training times.
3. Improved Dyna-Q
In order to address the sparse rewards, slow convergence, and long training time of the Dyna-Q algorithm in large-scale dynamic environments, this paper proposes an Improved Dyna-Q algorithm. In this section, we first describe the basic concept of the heuristic graph and then introduce the details of our method.
3.1. Heuristic Graph
In this study, the AGV has only a local field of view; that is, it can only perceive the information of its current grid position. In the early stage of exploration, every step of the RL strategy is exploratory, and in large, complex maps it is difficult to reach the target location. Therefore, path planning in large and complex maps needs to rely on additional information to enable the AGV to reach the target position quickly. Researchers have previously proposed distance metrics [25], whose quantitative value is evaluated based on the distance from the current position to the next state and to the target position. A distance metric provides information about the distance from the current position to the target position, but it does not provide information about obstacles in the environment. Therefore, when the obstacles are complex, the drawback of using a distance metric is that the AGV may be led into a dead end [25].
In this study, a novel heuristic graph is proposed that contains the normalized shortest-path distance from every location in the map to the target location, giving the AGV a heuristic distance to the target. The purpose of the heuristic graph is to provide sufficient information to the AGV, to serve as global guidance that leads the AGV to explore its target position quickly, and to keep it out of map regions that lead to dead ends.
Definition 8. (Heuristic graph $H_G$) The heuristic graph is a grid structure of size $W \times H$, the same size as the environment map. The heuristic value of each position in the heuristic graph is defined by Equation (7), and the heuristic value of selecting action $a$ in the current state $s$ is given by Equation (8):

$H_G(s) = \begin{cases} 1, & \text{if } s \in O_s \\ d_{BFS}(s, v_{goal}) / (W + H), & \text{otherwise,} \end{cases} \quad (7)$

$H_G(s,a) = H_G(s'), \quad (8)$

where $s'$ is the position of the neighbor node reached by selecting action $a$ at position $s$, and $d_{BFS}(s, v_{goal})$ is the shortest-path distance from $s$ to the target on the static map. $H_G(s,a)$ represents the heuristic distance information of selecting action $a$ in the current state $s$, that is, the normalized path distance from the adjacent node to the target node. The shortest-path distance of each location in the map is computed by performing a Breadth-First Search (BFS) over the entire static map starting from the target location; the heuristic graph therefore implicitly encodes the static obstacle information. The pseudocode for computing the heuristic graph is given in Algorithm 1 below.
Algorithm 1 Get heuristic graph
Input: Goal, Map[Height, Width]
Output: heuristic graph $H_G$
1: Initialization: Visited ← ∅, q ← Queue()
2: Goal.len ← 0
3: q.push(Goal)
4: Visited.add(Goal)
5: while not q.empty() do
6:   current ← q.pop()
7:   for next in get_neighbors(current) do
8:     if next ∉ Visited then
9:       next.len ← current.len + len(current, next)
10:      q.push(next)
11:      Visited.add(next)
12:    end if
13:  end for
14: end while
15: for each s in Map do
16:   if Map[s] is Obstacle then
17:     $H_G$[s] ← 1
18:   else
19:     $H_G$[s] ← Visited[s].len / (Width + Height)
20:   end if
21: end for
The heuristic graph can be computed before the algorithm trains its policy; it is computed only once and remains unchanged. During execution, the heuristic value is obtained by looking up the current position of the AGV in the heuristic graph, so the impact of computing the heuristic graph on execution time is negligible.
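For concreteness, a compact Python version of Algorithm 1 might look as follows. The grid encoding (a 2D list with 1 for obstacle cells and 0 for free cells) and the choice to also assign the value 1 to unreachable free cells are assumptions of ours, not specifications from the paper.

```python
from collections import deque

def heuristic_graph(grid, goal):
    """BFS from the goal over free cells; returns H_G with values normalized
    by (Width + Height). Obstacle (and unreachable) cells keep the value 1."""
    height, width = len(grid), len(grid[0])
    dist = {goal: 0}                      # goal is an (x, y) tuple
    q = deque([goal])
    while q:
        x, y = q.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height \
                    and grid[ny][nx] == 0 and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                q.append((nx, ny))
    H_G = [[1.0] * width for _ in range(height)]
    for (x, y), d in dist.items():
        H_G[y][x] = d / (width + height)  # normalized shortest-path distance
    return H_G
```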
Definition 9. (Best heuristic action set $A_H(s)$) The best heuristic action set is defined as the set of actions with the minimum heuristic value in the current state, $A_H(s)=\{a \in A \mid H_G(s,a)=\min_{a'} H_G(s,a')\}$. In other words, executing any action in this set at the current position moves the AGV to a next position that is closer to the target position.
Figure 3 visualizes the heuristic graph of a map instance. Figure 3a shows the example map, where 'S' represents the starting position of the AGV, 'G' represents the target position, and gray squares represent static obstacles. Figure 3b visualizes the heuristic graph of the map example on the left; the color of each square, from dark to light, represents the magnitude of the heuristic value. The darker the color of a square, the larger its heuristic value, that is, the farther it is from the target position.
3.2. Improved Dynamic Reward Function
The reward function determines the value of an action. The AGV interacts with the environment according to the reward function and continuously optimizes its action policy through the reward values. The reward function must be designed to correspond to the objective being optimized, so its design is crucial to the efficiency of solving the problem. In most previous research on reinforcement learning path planning, the reward is a static value obtained only when the agent reaches the target position or encounters an obstacle. Such a reward design easily reaches the target position in a small map environment and then converges to a shorter path, but it suffers from sparse rewards in long-distance navigation. Exploiting the characteristics of the path planning problem, some scholars have proposed reward functions related to the Euclidean or Manhattan distance. Compared with the traditional reward function, these methods guide the exploration of the mobile robot and improve the convergence speed to a certain extent. However, such reward functions do not take obstacle information into account, which easily leads to locally optimal solutions.
In this study, a novel dynamic reward function based on the heuristic graph is proposed to provide the AGV with dense reward values. The dynamic reward function encourages the AGV to explore all potential solutions while encouraging it, at any location, to move closer to the target location. Because the heuristic graph contains static obstacle information, it can effectively guide the AGV around concave obstacle areas and thus speed up the convergence of the algorithm.
Definition 10. (Dynamic reward function $R_t$) The dynamic reward function guides the AGV towards the goal in the direction of decreasing heuristic value while avoiding collisions with dynamic and static obstacles. In detail, the dynamic reward function based on the heuristic graph is defined as

$R_t(s,a) = \begin{cases} r_{goal}, & \text{if } s' = v_{goal} \\ r_{collision}, & \text{if } s' \text{ collides with a static or dynamic obstacle} \\ r_{heuristic}, & \text{if } a \in A_H(s) \\ r_{step}, & \text{otherwise,} \end{cases}$

where $s'$ is the next position reached when the AGV takes action $a$ at time $t$, and $R_t(s,a)$ denotes the immediate reward of the action $a$ taken by the AGV at time $t$. More specifically, at each time step the dynamic reward function provides the following: (1) a small negative reward $r_{step}$ when the AGV reaches a free grid that does not have the minimum heuristic value; (2) a large negative reward $r_{collision}$ when the AGV conflicts with a static or dynamic obstacle; (3) a positive reward $r_{heuristic}$ when the action performed by the AGV is a best heuristic action; (4) a large positive reward $r_{goal}$ when the AGV reaches the goal position.
Combined with the heuristic graph, the dynamic reward mechanism provides dense feedback signals for the AGV. In this way, the AGV is not required to strictly follow a global path but is encouraged to explore all potential solutions, avoiding locally optimal solutions; while being guided toward the target position, it also avoids collisions with dynamic obstacles and finally converges to a collision-free, conflict-free optimal policy.
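A minimal sketch of such a dynamic reward is shown below. The four reward magnitudes are placeholders of ours (the paper's actual values are not reproduced here), the function name is hypothetical, and the sketch assumes that the next cell lies inside the grid (the environment is assumed to clip or penalize out-of-bounds moves).

```python
def dynamic_reward(s, s_next, H_G, goal, static_obstacles, dynamic_positions,
                   r_goal=1.0, r_collision=-1.0, r_heuristic=0.1, r_step=-0.01):
    """Heuristic-graph-based reward in the spirit of Definition 10 (placeholder magnitudes)."""
    if s_next == goal:
        return r_goal
    if s_next in static_obstacles or s_next in dynamic_positions:
        return r_collision
    # best heuristic value reachable from s (wait + 4-connected moves)
    x, y = s
    neighbors = [(x + dx, y + dy) for dx, dy in ((0, 0), (0, 1), (0, -1), (-1, 0), (1, 0))]
    best = min(H_G[ny][nx] for nx, ny in neighbors
               if 0 <= nx < len(H_G[0]) and 0 <= ny < len(H_G))
    # positive reward if the chosen move is a best heuristic action, small penalty otherwise
    return r_heuristic if H_G[s_next[1]][s_next[0]] == best else r_step
```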
3.3. Improved Heuristic Planning
In order to reduce complexity, heuristic search incorporates domain-specific knowledge to improve search efficiency. It uses information related to the current problem as heuristic information to guide the search. Heuristic strategies can guide the search toward branches that are more likely to produce successful results, narrowing the search scope and reducing the complexity of the problem [28].
A model-based RL algorithm is divided into two parts: direct learning and planning. In the planning process, the RL algorithm randomly generates simulated experience from the model learned from real experience and then indirectly improves the value function and policy. In the path planning problem, randomly generating simulated experience is somewhat blind. Inspired by heuristic search, previous work [30] argued that using heuristic planning to generate simulated experience, and preferentially updating the states that are closer to the target position, improves the learning efficiency of the algorithm and thus speeds up its convergence.
This paper uses the heuristic graph $H_G$ as the heuristic function. During each planning step, the heuristic action with the minimum heuristic value $H_G(s,a)$ in the current state is selected; when there are multiple actions with the minimum heuristic value, one of them is selected at random, as sketched below. Through the combination of the heuristic graph and heuristic planning, the most promising Q-values are updated, which further accelerates the learning of the algorithm and reduces the training time of the Dyna-Q algorithm.
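The following sketch illustrates this heuristic selection step, building on the heuristic_graph output sketched earlier; the helper name and the action encoding (a dictionary of (dx, dy) offsets) are our own assumptions.

```python
import random

def heuristic_action(s, H_G, actions):
    """Return an action with the minimum heuristic value H_G(s, a) in state s,
    breaking ties randomly (used in the planning phase)."""
    x, y = s
    height, width = len(H_G), len(H_G[0])
    values = {}
    for name, (dx, dy) in actions.items():
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            values[name] = H_G[ny][nx]
    best = min(values.values())
    return random.choice([a for a, v in values.items() if v == best])
```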
3.4. Improved Dyna-Q for Path Planning
In this paper, an Improved Dyna-Q algorithm for AGV path planning in dynamic environments is proposed. The basic idea is as follows: a global path guidance mechanism called the heuristic graph is introduced and combined with the dynamic reward function to provide dense rewards, guide the AGV toward the target position, and solve the sparse reward problem of long-distance path planning. Then, in the planning process of the Dyna architecture, the heuristic graph information is used for heuristic search, which improves the learning efficiency of the algorithm and reduces its training time.
Algorithm 2 gives the complete flow of the Improved Dyna-Q algorithm. Here, Model(s, a) is the learned environment model, which is used to predict the next state and the reward of the state-action pair (s, a), and n is a hyperparameter representing the number of simulated planning steps. After the AGV interacts with the environment and performs one Q-Learning update, Dyna-Q performs n planning updates; when n is 0, Dyna-Q reduces to the classical Q-Learning algorithm.
Algorithm 2 Improved Dyna-Q algorithm
1: Initialize Q(s, a) = 0, Model(s, a) = 0
2: $H_G$ ← get heuristic graph (Algorithm 1)
3: repeat
4:   s ← initial state
5:   repeat
6:     a ← ε-greedy(s, Q)
7:     execute a; observe reward r and next state s'
8:     Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
9:     Model(s, a) ← r, s'
10:    for i = 1 to n do
11:      s ← random previously observed state; a ← heuristic action with minimum $H_G$(s, a)
12:      if (s, a) ∉ Model then
13:        s ← random previously observed state;
14:        a ← random action previously taken in state s;
15:      end if
16:      r, s' ← Model(s, a);
17:      Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
18:    end for
19:    s ← s'
20:  until s is terminal
21: until the total number of episodes N is reached
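Putting the pieces together, a sketch of the training loop of Algorithm 2 is given below. It reuses the heuristic_graph, dynamic_reward, and heuristic_action helpers sketched earlier, and the environment interface (env.reset(), env.step(a) returning (next_state, reward, done), with the dynamic reward applied inside the environment) is our own assumption rather than the authors' implementation.

```python
import random
from collections import defaultdict

def improved_dyna_q(env, H_G, actions, episodes, n_planning,
                    alpha=0.1, gamma=0.95, eps=0.1):
    """Improved Dyna-Q sketch: Q-Learning on real experience plus heuristic planning."""
    Q = defaultdict(float)                    # Q[(s, a)] -> action value
    model = {}                                # model[(s, a)] -> (reward, next_state)
    action_names = list(actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # direct RL: epsilon-greedy action, Q-Learning update, model learning
            if random.random() < eps:
                a = random.choice(action_names)
            else:
                a = max(action_names, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)         # env returns the dynamic reward
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in action_names) - Q[(s, a)])
            model[(s, a)] = (r, s2)
            # heuristic planning: prefer the best heuristic action of a visited state,
            # falling back to a random observed (state, action) pair if it is unmodeled
            observed_states = list({k[0] for k in model})
            for _ in range(n_planning):
                ps = random.choice(observed_states)
                pa = heuristic_action(ps, H_G, actions)
                if (ps, pa) not in model:
                    ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in action_names) - Q[(ps, pa)])
            s = s2
    return Q
```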