Article

Improved Dyna-Q: A Reinforcement Learning Method Focused via Heuristic Graph for AGV Path Planning in Dynamic Environments

1 Key Laboratory of Networked Control Systems, Chinese Academy of Sciences, Shenyang 110016, China
2 Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
3 Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
4 Kunshan Intelligent Equipment Research Institute, Kunshan 215300, China
5 University of Chinese Academy of Sciences, Beijing 100049, China
6 School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China
7 College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
* Authors to whom correspondence should be addressed.
Drones 2022, 6(11), 365; https://doi.org/10.3390/drones6110365
Submission received: 14 October 2022 / Revised: 10 November 2022 / Accepted: 16 November 2022 / Published: 19 November 2022

Abstract

Dyna-Q is a reinforcement learning method widely used in AGV path planning. However, in large, complex dynamic environments, the sparse reward function of Dyna-Q and the large search space cause low search efficiency, slow convergence, and even failure to converge, which seriously reduces the performance and practicality of the method. To solve these problems, this paper proposes an Improved Dyna-Q algorithm for AGV path planning in large, complex dynamic environments. First, to address the large search space, this paper proposes a global path guidance mechanism based on a heuristic graph, which effectively reduces the path search space and thus improves the efficiency of obtaining the optimal path. Second, to address the sparse reward function of Dyna-Q, this paper proposes a novel dynamic reward function and an action selection method based on the heuristic graph, which provide denser feedback and more efficient action decisions for AGV path planning and effectively improve the convergence of the algorithm. We evaluated our approach in scenarios with static and dynamic obstacles. The experimental results show that the proposed algorithm obtains better paths more efficiently than other reinforcement-learning-based methods, including the classical Q-Learning and Dyna-Q algorithms.

1. Introduction

The automated guided vehicle (AGV) is widely used for material transport tasks in manufacturing workshops because of its efficient transportation mode. As an essential part of the manufacturing workshop, the AGV plays an indispensable role [1,2]. Path planning is a central problem in AGV research: it computes a feasible path from the starting position to the goal position on the map and ensures that the AGV does not collide with static or dynamic obstacles while moving. How to obtain collision-free, conflict-free, and efficient paths for AGVs in dynamic environments is still one of the main focuses of AGV system research.
Many representative algorithms have been applied to the AGV path planning problem, including graph-based search algorithms such as Dijkstra [3], A* [4,5], and D* [6]. These algorithms are simple and easy to implement and have been widely used in AGV path planning. However, when the environment is large and the obstacles are complex, the state space grows exponentially and their time efficiency is low. To address these problems, many algorithms have been proposed, such as Rapidly exploring Random Trees and intelligent bionic algorithms. Rapidly exploring Random Trees (RRT) [7] is a sampling-based path planning algorithm; its advantages are high efficiency and speed in high-dimensional path search, while its disadvantage is that it searches the sampled environment randomly, so the resulting paths are often not optimal. Intelligent bionic algorithms are inspired by biological processes; typical examples include the genetic algorithm (GA) [8,9], the ant colony algorithm (ACA) [10], and particle swarm optimization (PSO) [11]. These nature-inspired methods often work well for path planning with complex dynamic environment information. However, in complex map environments they still suffer from slow computation and a tendency to fall into local optima.
Almasri et al. [12] proposed a route-following and obstacle avoidance technique relying on low-cost infrared sensors, which can be easily used in real-time microcontroller control applications. Matveev et al. [13] proposed an integrated navigation strategy to safely guide a robot to its target through an environment full of moving non-convex obstacles. Tanveer et al. [14] defined the path to be followed as a curve on the plane and fed it to a path tracking controller; obstacles modify a Gaussian model of the original curve to generate the final safe path, and the method has strong robustness and safety. These three methods can quickly obtain a feasible path and avoid obstacles; however, the resulting paths may be suboptimal.
In recent years, with the rapid development of deep learning (DL) and reinforcement learning (RL), RL applications have achieved significant breakthroughs in many fields, such as games [15,16], healthcare [17], and robotics [18]. Reinforcement learning is an artificial intelligence method in which agents adapt to complex environments by continually interacting with them to learn the best strategies. Because RL learns actively and adapts to complex dynamic environments, it provides a new way to solve path planning problems, and more and more scholars have introduced learning-based methods into the field of path planning [19,20] to overcome the limitations of classical path planning methods.
RL algorithms can be classified into model-free and model-based reinforcement learning according to whether an environment model is learned. Model-free reinforcement learning algorithms improve the policy or estimate the value function directly from data sampled through agent–environment interaction; examples include classical Q-Learning [21], Sarsa, and other temporal-difference algorithms. Among them, the Q-Learning algorithm is widely used in path planning problems [19,22,23] because it can learn without requiring an environment map. Many researchers incorporate the Euclidean distance into the fitness function or reward function of Q-Learning. For example, the Euclidean distance between the current position of the mobile robot and the target position is used to evaluate the fitness function in a multi-robot environment [24]. Low et al. [25] added a distance metric to Q-Learning to guide agents toward the target position.
As the map environment becomes more complex, the Q-Learning algorithm exhibits low learning efficiency and slow convergence. Dyna-Q [26] is a classical model-based reinforcement learning algorithm that integrates planning and direct RL. The direct RL method improves the value function and policy using the real experience generated by the agent's interaction with the environment; at the same time, the algorithm also learns an environment model and uses this model to assist the agent in policy improvement or value estimation, so Dyna-Q is highly efficient. Viet et al. [27] evaluated the efficiency of five reinforcement learning algorithms in path planning, including Q-Learning, Sarsa, Q(λ), Sarsa(λ), and Dyna-Q. Their experimental results show that Dyna-Q converges faster when solving the path planning problem of mobile robots.
A common way to further improve the convergence performance of Dyna-Q is to combine it with other algorithms. An improved Dyna-H algorithm was proposed in [28]: in the planning process of the Dyna-Q algorithm, the actions leading to the worst trajectories are selected for simulated policy updates, integrating the advantages of heuristic search into the Dyna agent. Al Dabooni and Wunsch [29] proposed a direct heuristic dynamic programming algorithm based on the Dyna approach (Dyna-HDP); by combining neural networks with Dyna, the optimal policy can be obtained with the fewest trials. Pei et al. [30] proposed a new action selection policy that combines the ϵ-greedy policy with a simulated annealing mechanism; it combines a heuristic reward function and heuristic actions to resolve the exploration–exploitation dilemma and to improve the convergence and learning efficiency of path planning.
Hwang et al. [31] introduced an exploration factor that applies ant colony techniques to the selection of candidates in indirect learning, proposing a hybrid exploration strategy to improve the sample efficiency of the Dyna-Q algorithm. Zajdel [32] proposed an epoch-incremental learning method that uses breadth-first search to determine the shortest distances between the non-terminal states observed in each episode and the target state, and performs additional policy updates based on these distances.
Although Dyna-Q has shown excellent performance in path planning, difficulties and challenges remain in large dynamic environments. First, when the map is large, the sparse reward problem makes it difficult for the AGV to find the target location, which leads to many iterations and low learning efficiency; improving the search efficiency and convergence speed of the Dyna-Q algorithm is still a challenge. Second, it is difficult to plan a collision-free and efficient path in an environment with multiple dynamic obstacles.
To overcome these difficulties, this research proposes an Improved Dyna-Q algorithm for path planning in dynamic environments. Unlike the original Dyna-Q algorithm, the Improved Dyna-Q algorithm adopts a novel global guidance mechanism based on a heuristic graph and combines it with an improved dynamic reward function and an improved heuristic planning strategy. The global guidance mechanism steers the AGV toward the target position over the long term but does not require the AGV to follow it strictly; the advantage is that the AGV is encouraged to explore all potential solutions and thus avoid falling into locally optimal solutions.
The Improved Dyna-Q algorithm proposed in this study is applied to AGV path planning in dynamic environments, with the following main contributions:
  • We propose a global path guidance mechanism based on a heuristic graph for large, complex dynamic environments, which effectively reduces the path search space and improves the efficiency of searching for the optimal path.
  • We propose a new dynamic reward function and action selection method, which provide denser feedback signals and more efficient action selection for AGV path planning, effectively alleviating the slow convergence caused by the sparse rewards of Dyna-Q.
  • Experimental results on random obstacle maps and on maps with different numbers of dynamic obstacles show that the proposed algorithm converges quickly in unknown static environments; moreover, it can effectively solve the AGV path planning problem in environments with different numbers of dynamic obstacles without obtaining or predicting the trajectories of the dynamic obstacles.
The rest of this paper is arranged as follows: Section 2 introduces the relevant background, including the path planning problem in dynamic environments, and the basic concepts of reinforcement learning and the Dyna-Q algorithm. In Section 3, we discuss the details of the proposed algorithm and describe the concept of heuristic graph in detail. In Section 4, we discuss simulation experiments and verify the efficiency of our method through some comparative experiments. Finally, Section 5 summarizes the work of this paper.

2. Related Background

Reinforcement learning learns to optimize an objective by interacting with the environment through trial and error. The theoretical framework of RL is usually based on the Markov Decision Process (MDP). In this section, we first describe the AGV path planning problem in dynamic environments; we then briefly summarize the theory of RL and MDPs and finally introduce the Dyna-Q algorithm.

2.1. Problem Description

This work considers an undirected grid graph environment G(V, E) of size W × H, where V is the set of vertices of the graph G, V = {(i, j) | 1 ≤ i ≤ W, 1 ≤ j ≤ H}. The length of each grid cell is L, and E is the set of edges of G. The set of N_s static obstacles in the environment is defined as V_s = {s_1, …, s_{N_s}}, s_i ∈ V, where s_i denotes a static obstacle vertex and N_s is the total number of static obstacles. The set of free cells excluding static obstacles is defined as V_f = V \ V_s. Following the traditional 4-way connectivity rule, each vertex v = (i, j) ∈ V has the neighborhood N(v) = {(i − 1, j), (i + 1, j), (i, j − 1), (i, j + 1)} ∩ V_f. The set of dynamic obstacles V_d(t) = {d_1(t), …, d_{N_d}(t)} denotes the positions of the N_d dynamic obstacles at time t, where d_i(t) = (i, j) ∈ V_f.
Definition 1. 
(Path P)   For an AGV A with start position α ∈ V_f and goal position β ∈ V_f, β ≠ α, where V_f is the set of free cells, a path is defined as a sequence of T + 1 vertices P = (p_0, …, p_T) satisfying (i) p_0 = α; (ii) p_T = β; (iii) ∀ 1 ≤ t ≤ T, p_{t−1} = p_t or p_{t−1} ∈ N(p_t), where p_t denotes the path node at time step t.
Definition 2. 
(A conflict-free path P)   For a path P to be conflict-free, ∀ 1 ≤ t ≤ T and p_t ∈ P, where p_t denotes the path node at time step t, it must satisfy (i) p_t ∉ V_s, where V_s is the set of static obstacles (no collision with static obstacles); (ii) p_t ∉ V_d(t), where V_d(t) is the set of positions of the dynamic obstacles at time step t (no vertex conflict with any dynamic obstacle); (iii) ∀ 1 ≤ i ≤ N_d, (p_{t−1}, p_t) ≠ (d_i(t), d_i(t−1)), where d_i(t) is the position of the ith dynamic obstacle at time step t (no edge conflict with any dynamic obstacle).
The AGV path planning problem can thus be expressed as an optimization problem: the objective is to minimize the length of the AGV's path from the starting position to the target position, subject to the constraint that the path avoids collisions with static and dynamic obstacles.
Since obstacles are usually irregular in shape, if a grid cell contains any obstacle area, the entire cell is treated as an obstacle cell. Dynamic obstacles can move randomly over the vertices of graph G. We consider a scenario similar to that in [33], where a camera installed above the AGV workspace captures dynamic obstacle information. This paper assumes that the AGV can obtain its global position in the environment while moving and that it knows the positions of all static obstacles. However, it does not assume that the AGV knows the trajectories of the dynamic obstacles; the AGV can only obtain the current positions of dynamic obstacles within its local field of view.
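To make the notation concrete, the following Python sketch (an illustration, not the authors' implementation; the function and variable names are assumptions) encodes the 4-way neighborhood and the conflict conditions of Definitions 1 and 2.

```python
# Minimal sketch of the grid model and the conflict checks in Definitions 1 and 2.
# Cell coordinates, obstacle sets, and the 4-way neighborhood follow the notation above;
# all names are illustrative, not taken from the paper's code.
from typing import List, Set, Tuple

Cell = Tuple[int, int]

def neighbors(v: Cell, W: int, H: int, static_obstacles: Set[Cell]) -> List[Cell]:
    """4-way neighborhood N(v), restricted to in-bounds free cells."""
    i, j = v
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(x, y) for (x, y) in candidates
            if 1 <= x <= W and 1 <= y <= H and (x, y) not in static_obstacles]

def is_conflict_free(path: List[Cell],
                     static_obstacles: Set[Cell],
                     obstacle_traj: List[List[Cell]]) -> bool:
    """Definition 2: obstacle_traj[i][t] is the cell of dynamic obstacle i at time step t."""
    for t in range(1, len(path)):
        p_prev, p_t = path[t - 1], path[t]
        if p_t in static_obstacles:                          # (i) collision with a static obstacle
            return False
        for traj in obstacle_traj:
            if p_t == traj[t]:                               # (ii) vertex conflict
                return False
            if p_prev == traj[t] and p_t == traj[t - 1]:     # (iii) edge (swap) conflict
                return False
    return True
```

Note that is_conflict_free evaluates a candidate path against given obstacle trajectories; in the problem itself the AGV does not know these trajectories in advance, so the check only describes what a valid solution must satisfy.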
Specifically, to illustrate the AGV path planning problem in this study, Figure 1 shows an example of AGV path planning in a dynamic environment.

2.2. MDP of AGV Path Planning

Standard reinforcement learning is a process of learning through trial-and-error interaction with the environment within the Markov Decision Process framework. In general, an MDP is represented as a five-tuple M = {S, A, P, R, γ}, where S is the set of environment states s ∈ S, A is the set of agent actions a ∈ A, P is the state transition probability function, R is the reward function, and γ is a discount rate with 0 ≤ γ ≤ 1. The RL agent takes an action a_t ∈ A in some state s_t ∈ S; the environment then transitions to state s_{t+1} ∈ S and returns a reward r_{t+1}. The agent learns continuously by interacting with the environment and eventually obtains an optimal policy for reaching the goal.
Definition 3. 
(State set S)   In this work, the map environment is modeled as a grid map; so, the state set S is defined as
S = \{ s = (x, y) \mid 1 \le x \le W,\ 1 \le y \le H \}
where W and H are the width and height of the grid map.
Definition 4. 
(Action set A)   The action set is the set of all possible actions the AGV can take to move from the current state to the next state. In this study, the action set of the AGV is defined as A = {a_0, a_1, a_2, a_3, a_4}, whose elements denote, respectively, waiting in the current cell for one time step and moving forward, backward, left, and right.
Definition 5. 
(Reward function R)   The typical reward function R is assigned as
R(s, a) = \begin{cases} r_1, & s' \in \mathrm{Goal} \\ r_2, & s' \in V_s \cup V_d \\ 0, & \text{otherwise} \end{cases}
The typical reward function gives a positive reward r 1 > 0 when the AGV reaches the target position, a negative reward r 2 < 0 when it collides with a static or dynamic obstacle, and a reward value of 0 in other cases.
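As an illustration of Definitions 4 and 5, the sketch below encodes the five actions as grid offsets and the typical sparse reward. The particular direction assigned to each action id and the values r_1 = 10, r_2 = −0.5 (taken from Equation (12) in Section 4) are assumptions made only for illustration.

```python
# Illustrative sketch of the action set (Definition 4) and the typical sparse reward
# (Definition 5). The direction assigned to each action id and the default reward values
# are assumptions, not the authors' exact encoding.
ACTIONS = {
    0: (0, 0),    # a0: wait in the current cell for one time step
    1: (0, 1),    # a1: forward
    2: (0, -1),   # a2: backward
    3: (-1, 0),   # a3: left
    4: (1, 0),    # a4: right
}

def step(state, action):
    """Deterministic grid transition s -> s' (bounds and obstacle handling omitted)."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)

def sparse_reward(next_state, goal, obstacle_cells, r1=10.0, r2=-0.5):
    """Typical reward: positive at the goal, negative on a collision, zero elsewhere."""
    if next_state == goal:
        return r1
    if next_state in obstacle_cells:
        return r2
    return 0.0
```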
Given the MDP definition, the reinforcement learning objective is to learn an optimal policy that maximizes the expected sum of future rewards [34]. Policy π is a mapping from states to probabilities of selecting each possible action. The discounted return G t is defined by
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \quad \gamma \in [0, 1]
where r t denotes the immediate reward at time t and γ is a discount factor indicating the magnitude of the effect of future rewards on the present. Therefore, the reinforcement learning objective can be expressed as follows [35]:
J(\pi) = \max_{\pi} \mathbb{E}[G_t] = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=1}^{T} r(s_t, a_t) \right]
In Equation (4), T is the time step at the end of each episode and τ is the episode trajectory (the sequence of observations and actions obtained throughout the episode). The goal is to learn an optimal policy π * , which maximizes the expected sum of future reward.

2.3. Dyna-Q Algorithm

Q-Learning is a classical model-free RL algorithm for solving MDP, whose main advantage is the utilization of a temporal-difference algorithm to achieve off-policy learning.
Definition 6. 
(Q-value Q(s, a))   Q^π(s, a) is defined as the expected return of the state–action pair (s, a) under policy π:
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s, a_t = a \right]
The core idea of Q-Learning is that the Q-value of the next state–action pair used in the update comes from the greedy policy being evaluated (the maximum over all actions) rather than from the action actually chosen by the current behavior policy, which makes Q-Learning an off-policy method. The final policy of Q-Learning is obtained by iterating the state–action value function Q(s, a).
Definition 7. 
(Q-value updating)   A Q-value Q(s, a) is assigned when the AGV is located at state s and action a is taken. The Q-function, or Q-value, is updated using
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
where Q(s, a) is the Q-value when action a is executed at state s, max_{a'} Q(s', a') is the highest Q-value among all actions at the next state s', α is the learning rate, and γ is the discount factor.
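A minimal tabular sketch of the update in Equation (6) follows; the dictionary-based Q-table is an implementation assumption, and the default learning rate and discount factor are the values listed in Table 1.

```python
# Tabular sketch of the Q-Learning update of Equation (6); the defaultdict Q-table and
# the default parameter values (alpha = 0.2, gamma = 0.95, as in Table 1) are assumptions.
from collections import defaultdict

ACTIONS = [0, 1, 2, 3, 4]
Q = defaultdict(float)          # Q(s, a), initialized to 0

def td_update(s, a, r, s_next, alpha=0.2, gamma=0.95):
    """Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```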
The Dyna-Q algorithm is a model-based RL method that integrates planning and direct RL. Planning optimizes the policy using simulated experiences randomly generated from the learned environment model, while direct RL improves the value function and policy using real experience generated by interaction with the environment, as shown in Figure 2. The Dyna agent interacts with the real environment to obtain empirical data and improves its policy by iterating Equation (6); at the same time, an environment model is learned from the real experiences, and the simulated data generated by the model are then used to update the action-value function with the Q-Learning update rule.
In the actual path planning task, real experiences obtained by interacting with the real environment are usually expensive and time-consuming, while using the learned environment model to generate simulation experiences is fast and convenient, thus reducing the training time of the algorithm.
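The sketch below illustrates one Dyna-Q step as described above: a direct RL update on the real transition, a model update, and n simulated updates sampled from the learned model. The deterministic dictionary model is an assumption, and td_update refers to the helper defined in the previous sketch.

```python
# Sketch of one Dyna-Q step: direct RL, model learning, and n planning updates.
# The deterministic dictionary model and td_update (from the previous sketch) are
# illustrative assumptions, not the authors' exact implementation.
import random

model = {}                                       # (s, a) -> (r, s_next)

def dyna_q_step(s, a, r, s_next, n=10):
    td_update(s, a, r, s_next)                   # direct RL on the real experience
    model[(s, a)] = (r, s_next)                  # model learning
    for _ in range(n):                           # planning with simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        td_update(ps, pa, pr, ps_next)
```

With n = 0 the planning loop disappears and the step reduces to plain Q-Learning, which matches the relationship between the two algorithms noted in Section 3.4.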
When the Dyna-Q algorithm is applied to path planning in large-scale dynamic environments, the following problems arise:
  • In the path planning problem, the traditional reward function updates the reward only when the AGV reaches the destination or collides with an obstacle. In large and complex environments, there is a large amount of unproductive search space, so the traditional reward function suffers from sparse rewards—that is, the AGV can hardly obtain positive rewards during exploration, which results in slow learning or even failure to learn.
  • In the early stage of Dyna-Q, every step of the policy is exploratory. In addition, when the Dyna architecture plans with the environment model, it randomly selects the state and action to update, which is somewhat blind. Therefore, applying the Dyna-Q algorithm to path planning in large-scale dynamic environments suffers from low learning efficiency and long training time.

3. Improved Dyna-Q

To solve the problems of sparse rewards, slow convergence, and long training time of the Dyna-Q algorithm in large-scale dynamic environments, this paper proposes an Improved Dyna-Q algorithm. In this section, we first describe the basic concept of the heuristic graph and then introduce the details of our method.

3.1. Heuristic Graph

In this study, the AGV has only a local field of view—that is, it can only perceive the information at its current grid position. In the early stage of exploration, every step of the RL policy is exploratory, and in large and complex maps in particular it is difficult to reach the target location. Path planning in large and complex maps therefore needs to rely on additional information to enable the AGV to reach the target position quickly. Researchers have previously proposed distance metrics [25], whose value is evaluated from the distance between the current position, the next state, and the target position. A distance metric provides information about how far the current position is from the target position, but it provides no information about obstacles in the environment. Therefore, when the obstacles are complex, an AGV guided by a distance metric may fall into a dead end [25].
In this study, a novel heuristic graph is proposed that contains the normalized shortest-path distance from every location in the map to the target location, allowing the AGV to obtain a heuristic distance to the target. The purpose of the heuristic graph is to provide sufficient information to the AGV, to act as global guidance that leads the AGV to explore its target position quickly, and to keep it out of map regions that lead to dead ends.
Definition 8. 
(Heuristic graph H)   A graph structure H(S, A) of size W × H, the same size as the environment map. The heuristic value of each position s = (x, y), 1 ≤ x ≤ W, 1 ≤ y ≤ H, in the heuristic graph is defined by Equation (7), and the heuristic value of selecting action a in the current state s is given by Equation (8).
V(s) = \begin{cases} 0, & s = \mathrm{Goal} \\ 1, & s \in V_s \\ \mathrm{shortest\_path\_length}(s) / (W + H), & s \in V_f \end{cases}
H(s, a) = V(s')
where s' is the position of the neighboring node reached by selecting action a at position s.
H(s, a) represents the heuristic distance information of selecting action a in the current state s—that is, the normalized path-distance information from the adjacent node to the target node. The shortest-path distance of each location in the map is calculated by performing a Breadth-First Search (BFS) over the entire static map, starting from the target location; the heuristic graph therefore implicitly encodes the static obstacle information. The pseudocode for computing the heuristic graph is given in Algorithm 1 below.
Algorithm 1 get heuristic graph H
Input: Goal, Map[Height, Width]
Output: H
  1: Initialization: Visited ← ∅, q ← Queue()
  2: Goal.len ← 0
  3: q.push(Goal)
  4: Visited.add(Goal)
  5: while not q.empty() do
  6:    current = q.pop()
  7:    for next in get_neighbors(current) do
  8:        if next ∉ Visited then
  9:         next.len ← current.len + len(current, next)
  10:         q.push(next)
  11:         Visited.add(next)
  12:      end if
  13:    end for
  14: end while
  15: for each s in Map do
  16:    if Map[s] is Obstacle then
  17:       V [ s ] = 1
  18:    else
  19:       V [ s ] = Visited[s].len/(Width+Height)
  20:    end if
  21: end for
The heuristic graph can be computed before the algorithm trains its policy; it is computed only once and remains unchanged thereafter. During execution, the heuristic value is obtained by looking up the AGV's current position in the heuristic graph. Therefore, the impact of computing the heuristic graph on execution time can be ignored.
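A Python sketch of Algorithm 1 is given below: a backward BFS from the goal over the 4-connected free cells, with the resulting distances normalized by (W + H) and obstacle cells assigned the value 1, as in Equation (7). How free cells that are unreachable from the goal should be valued is not specified by Algorithm 1, so the treatment here is an assumption.

```python
# Sketch of Algorithm 1: heuristic values V(s) from a backward BFS rooted at the goal.
# Treating free cells unreachable from the goal as infinitely far is an assumption.
from collections import deque

def heuristic_values(goal, W, H, static_obstacles):
    dist = {goal: 0}                              # shortest path length to the goal
    q = deque([goal])
    while q:
        x, y = q.popleft()
        for nxt in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if (1 <= nxt[0] <= W and 1 <= nxt[1] <= H
                    and nxt not in static_obstacles and nxt not in dist):
                dist[nxt] = dist[(x, y)] + 1
                q.append(nxt)
    V = {}
    for x in range(1, W + 1):
        for y in range(1, H + 1):
            s = (x, y)
            if s in static_obstacles:
                V[s] = 1.0                        # obstacle cells, as in Equation (7)
            else:
                V[s] = dist.get(s, float("inf")) / (W + H)
    return V
```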
Definition 9. 
(Best heuristic action set A H )   The best heuristic action set is defined as the set of actions with the minimum heuristic value at the current state.
A_H(s) = \arg\min_{a \in A} H(s, a)
The best heuristic action set contains those actions which, when executed at the current position, move the AGV to a next position that is closer to the target position.
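Equations (8) and (9) can be read as the following sketch, which looks up the heuristic value of the neighbor reached by each action and collects the actions attaining the minimum; the action-to-offset mapping and the treatment of out-of-bounds neighbors as obstacle-like cells are assumptions.

```python
# Sketch of Equations (8) and (9): H(s, a) = V(s') and the best heuristic action set A_H(s).
# The action-to-offset mapping and treating out-of-bounds neighbors like obstacles
# (value 1) are illustrative assumptions.
ACTIONS = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 0), 4: (1, 0)}

def heuristic(V, s, a):
    """H(s, a): the heuristic value of the neighbor s' reached by taking action a in s."""
    dx, dy = ACTIONS[a]
    return V.get((s[0] + dx, s[1] + dy), 1.0)

def best_heuristic_actions(V, s):
    """A_H(s): every action whose heuristic value attains the minimum at state s."""
    values = {a: heuristic(V, s, a) for a in ACTIONS}
    h_min = min(values.values())
    return [a for a, v in values.items() if v == h_min]
```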
Figure 3 visualizes the heuristic graph of a map instance. Figure 3a shows a 10 × 10 map, where 'S' represents the starting position of the AGV, 'G' represents the target position of the AGV, and gray squares represent static obstacles. Figure 3b visualizes the heuristic graph of the map example on the left. The colors of the squares, from dark to light, encode the heuristic value: the darker the cell, the greater its value in the heuristic graph—that is, the farther it is from the target position.

3.2. Improved Dynamic Reward Function

The reward function determines the value of an action. The AGV interacts with the environment according to the reward function and continuously optimizes its action policy through the reward values it receives; the reward function must therefore be designed to correspond to the objective being optimized, and its design is crucial to the efficiency of solving the problem. In most previous research on reinforcement learning path planning, the reward is a static value obtained only when the agent reaches the target position or encounters an obstacle. Such a reward design easily reaches the target position in a small map environment and then converges to a shorter path, but it suffers from the sparse reward problem in long-distance navigation. Based on the characteristics of the path planning problem, some scholars have proposed reward functions related to the Euclidean or Manhattan distance; compared with the traditional reward function, these have a guiding effect on the exploration of the mobile robot and improve the convergence speed to a certain extent. However, such reward functions do not take obstacle information into account, which easily leads to locally optimal solutions.
In this study, a novel dynamic reward function based on the heuristic graph is proposed to provide the AGV with dense reward values. The dynamic reward function encourages the AGV to explore all potential solutions while, at every location, encouraging it to move closer to the target location. Because the heuristic graph encodes static obstacle information, it can effectively guide the AGV around concave obstacle regions, thereby speeding up the convergence of the algorithm.
Definition 10. 
(Dynamic reward function r_H(s, a))   The dynamic reward function r_H(s, a) guides the AGV toward the goal in the direction of decreasing values in the heuristic graph while avoiding collisions with dynamic and static obstacles. In detail, the dynamic reward function based on the heuristic graph is defined as follows:
r_H(s, a) = \begin{cases} r_1, & s' \in V_f \wedge a \notin A_H(s) \\ r_1 + r_2, & s' \in V_s \cup V_d \\ r_1 + r_3, & a \in A_H(s) \\ r_4, & s' \in \mathrm{Goal} \end{cases}
where s' is the next position reached when the AGV takes action a at time t, and r_H(s, a) denotes the immediate reward of the action a taken by the AGV at time t.
More specifically, at each time step the dynamic reward function provides the following: (1) a small negative reward r_1 < 0 when the AGV reaches a free cell whose heuristic value is not the minimum; (2) a large negative reward r_1 + r_2 when the AGV conflicts with a static or dynamic obstacle, where r_2 < r_1 < 0; (3) a positive bonus r_3 when the action performed by the AGV is a best heuristic action, where r_3 = −r_1, so that the total reward is r_1 + r_3; (4) a large positive reward r_4 when the AGV reaches the goal position.
Combined with the heuristic graph, the dynamic reward function provides dense feedback signals for the AGV. In this way, the AGV is not required to strictly follow a global path but is encouraged to explore all potential solutions, avoiding locally optimal solutions; while being guided toward the target position, it also avoids collisions with dynamic obstacles and finally converges to a collision-free and conflict-free optimal policy.
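A sketch of Equation (10) is given below, reusing best_heuristic_actions from the previous sketch; the default values r_1 = −0.1, r_2 = −0.5, r_3 = +0.1, and r_4 = 10 are those of Equation (13) in Section 4, and the obstacle_cells_now argument is an assumed way of passing the static obstacles together with the dynamic obstacle positions at the current step.

```python
# Sketch of the dynamic reward of Equation (10). The default values follow Equation (13);
# obstacle_cells_now is assumed to contain the static obstacles plus the dynamic
# obstacle positions at the current time step.
def dynamic_reward(V, s, a, s_next, goal, obstacle_cells_now,
                   r1=-0.1, r2=-0.5, r3=0.1, r4=10.0):
    if s_next == goal:
        return r4                        # reached the goal
    if s_next in obstacle_cells_now:
        return r1 + r2                   # step cost plus collision penalty
    if a in best_heuristic_actions(V, s):
        return r1 + r3                   # step cost offset by the heuristic bonus (r3 = -r1)
    return r1                            # free cell, non-heuristic action: small step cost
```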

3.3. Improved Heuristic Planning

To cope with high problem complexity, heuristic search adds domain-specific knowledge to improve search efficiency: it uses information related to the current problem as heuristic guidance for the search. Heuristic strategies can steer the search toward branches that are more likely to lead to success, narrowing the search scope and reducing the complexity of the problem [28].
A model-based RL algorithm consists of two parts: direct learning and planning. In the planning process, the RL algorithm randomly generates simulated experiences from the model learned from real experience and then indirectly improves the value function and policy. In the path planning problem, randomly generating simulated experience is somewhat blind. Inspired by heuristic search, previous work [30] argued that using heuristic planning to generate simulated experience—preferentially updating the states that are closer to the target position—improves the learning efficiency of the algorithm and thus speeds up its convergence.
This paper uses the heuristic graph H(S, A) as the heuristic function. During each planning step, the heuristic action with the minimum heuristic value H(s, a) in the current state is selected; when several actions attain the minimum, one of them is selected at random.
a_H(s) = \arg\min_{a} H(s, a)
By combining the heuristic graph with heuristic planning, the most promising Q-values are updated first, which further accelerates the learning of the algorithm and reduces the training time of the Dyna-Q algorithm.
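In code, the planning-time action choice of Equation (11), with ties broken uniformly at random, can be sketched as follows (best_heuristic_actions as in the earlier sketch).

```python
# Sketch of Equation (11): during planning, pick a minimum-heuristic action for the
# current state, breaking ties at random (best_heuristic_actions as sketched earlier).
import random

def heuristic_planning_action(V, s):
    return random.choice(best_heuristic_actions(V, s))
```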

3.4. Improved Dyna-Q for Path Planning

In this paper, an Improved Dyna-Q algorithm for AGV path planning in dynamic environments is proposed. The basic idea is as follows: a global path guidance mechanism called the heuristic graph is introduced and combined with a dynamic reward function to provide dense rewards, guide the AGV toward the target position, and solve the sparse reward problem of long-distance path planning. Then, in the planning process of the Dyna architecture, the heuristic graph is used for heuristic search to improve the learning efficiency of the algorithm and reduce its training time.
Algorithm 2 gives the complete Improved Dyna-Q procedure. Here, Model(s, a) is the learned environment model, which predicts the next state and reward of the state–action pair (s, a), and n is a hyperparameter denoting the number of simulated planning steps. After the AGV interacts with the environment and executes one Q-Learning update, Dyna-Q performs n planning updates; when n = 0, Dyna-Q reduces to the classical Q-Learning algorithm.
Algorithm 2 Improved Dyna-Q algorithm
  1: Initialize Q(s, a) = 0, Model(s, a) = ∅
  2: get heuristic graph H (Algorithm 1)
  3: repeat
  4:    s ← initial state
  5:    repeat
  6:       a ← ϵ-greedy(s, Q)
  7:       execute a; observe s′ and r_H(s, a)
  8:       Q(s, a) ← Q(s, a) + α[r_H(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a)]
  9:       Model(s, a) ← s′, r_H(s, a)
  10:      for i = 1 : n do
  11:         ã ← a_H(s)
  12:         if (s, ã) ∉ Model then
  13:            s ← random previously observed state
  14:            ã ← random action previously taken in state s
  15:         end if
  16:         s̃, r ← Model(s, ã)
  17:         Q(s, ã) ← Q(s, ã) + α[r + γ max_{a′} Q(s̃, a′) − Q(s, ã)]
  18:         s ← s̃
  19:      end for
  20:      s ← s′
  21:    until s is terminal
  22: until the total number of episodes N is reached
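For reference, a compact Python rendering of Algorithm 2 is sketched below. It reuses the illustrative helpers from the earlier sketches (Q, ACTIONS, model, td_update, heuristic_planning_action); env_step(s, a), which returns the next state, the reward r_H(s, a), and a termination flag, is an assumed environment interface rather than part of the paper.

```python
# Sketch of Algorithm 2 (illustrative, not the authors' code). Q, ACTIONS, model,
# td_update, and heuristic_planning_action are the helpers sketched earlier;
# env_step(s, a) -> (s_next, r, done) is an assumed environment interface.
import random

def improved_dyna_q(env_step, start, V, episodes=1000, n=10, epsilon=0.1):
    for _ in range(episodes):
        s, done = start, False
        while not done:
            # epsilon-greedy action selection on the learned Q-values (Algorithm 2, line 6)
            if random.random() < epsilon:
                a = random.choice(list(ACTIONS))
            else:
                a = max(ACTIONS, key=lambda b: Q[(s, b)])
            s_next, r, done = env_step(s, a)         # real experience with reward r_H(s, a)
            td_update(s, a, r, s_next)               # direct RL (line 8)
            model[(s, a)] = (r, s_next)              # model learning (line 9)
            ps = s
            for _ in range(n):                       # heuristic planning (lines 10-19)
                pa = heuristic_planning_action(V, ps)
                if (ps, pa) not in model:            # fall back to a random observed pair
                    ps, pa = random.choice(list(model.keys()))
                pr, ps_next = model[(ps, pa)]
                td_update(ps, pa, pr, ps_next)
                ps = ps_next                         # continue planning from the simulated state
            s = s_next
```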

4. Simulations and Results

In this section, the Improved Dyna-Q algorithm is compared with the Dyna-Q and Q-Learning algorithms in a static environment through simulation experiments. The proposed algorithm is then evaluated on maps with different numbers of dynamic obstacles to show its advantages.

4.1. Environment and Parameter Configuration

All algorithms were implemented using Python 3.8. All simulation experiments were performed on Ubuntu 20.04 with NVIDIA RTX 1050 GPU and 16 GB RAM. The proposed algorithm is verified in a random obstacle map. The environment map is a 2-D grid world of size 50 × 50 cells, the static obstacles are randomly generated according to an approximately normal distribution, and the random static obstacle ratio is 10 % .
The algorithms used in this comparative experiment are Q-Learning, the original Dyna-Q, Dyna-Q with only the improved dynamic reward function proposed in this paper (Dyna-Q with dynamic reward), and Dyna-Q with both improvements proposed in this paper (Improved Dyna-Q). The four algorithms use the same parameter values; their specific configurations are shown in Table 1.
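For reference, the Table 1 settings can be collected in a single configuration object; the class and field names below are illustrative only.

```python
# The hyperparameters of Table 1, gathered in one illustrative configuration object.
from dataclasses import dataclass

@dataclass
class DynaQConfig:
    gamma: float = 0.95     # discount factor
    alpha: float = 0.2      # learning rate
    epsilon: float = 0.1    # greedy factor for epsilon-greedy exploration
    n: int = 10             # planning steps per real step
    episodes: int = 1000    # number of training episodes
```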

4.2. Simulation Results

4.2.1. Static Environment

First, in the static map environment, we compare how the path length of the Dyna-Q algorithm evolves with the number of iterations when using the traditional reward function and the improved dynamic reward function. The traditional reward function used in this paper is Equation (12), and the dynamic reward function is Equation (13).
r(s, a) = \begin{cases} 10, & s' \in \mathrm{Goal} \\ -0.5, & s' \in V_s \cup V_d \\ 0, & \text{otherwise} \end{cases}
r_H(s, a) = \begin{cases} -0.1, & s' \in V_f \wedge a \notin A_H(s) \\ -0.1 + (-0.5), & s' \in V_s \cup V_d \\ -0.1 + (+0.1), & a \in A_H(s) \\ 10, & s' \in \mathrm{Goal} \end{cases}
We train for 1000 episodes in the static environment, with the maximum number of AGV exploration steps per episode set to 2000. As shown in Figure 4, during the first 200 episodes with the traditional reward function the path length remains at 2000 steps, which indicates that the sparse reward problem in complex environments prevents the AGV from effectively updating its policy; only after extensive exploration can the AGV reach the target position, which lengthens the running time of the algorithm. The dynamic reward function proposed in this study effectively guides the AGV to explore toward the target location; intuitively, it converges much faster than the traditional reward function.
Second, simulation experiments with the four algorithms—Q-Learning, Dyna-Q, Dyna-Q with dynamic reward, and Improved Dyna-Q—are carried out on the static environment map. Figure 5 shows the planned trajectories obtained after 1000 training episodes with the four algorithms. The trajectory plots show that Q-Learning did not find the shortest path, while the paths of the three Dyna-Q variants are clearly better than that of Q-Learning; intuitively, the Improved Dyna-Q algorithm proposed in this paper is superior to the other algorithms in distance and direction. Figure 6 shows the learning curves of the four algorithms on the static environment map. Q-Learning converges very slowly and, in fact, never finds the optimal policy. The convergence speed of Dyna-Q is greatly improved over Q-Learning, mainly because Dyna-Q learns an environment model and uses it to assist policy improvement. The Improved Dyna-Q algorithm proposed in this paper converges fastest; compared with Dyna-Q, both the improved dynamic reward function alone and the full Improved Dyna-Q algorithm greatly improve the convergence speed.
Although the convergence curve of Dyna-Q with only the improved dynamic reward function follows the same trend as that of the Improved Dyna-Q algorithm with both improvements, the running time of the algorithm proposed in this paper is the shortest, as shown in Table 2. Table 2 compares the performance of the four algorithms trained for 1000 episodes on the static map, with each algorithm repeated 30 times. On the path length metric, the minimum path length of Dyna-Q, Dyna-Q with dynamic reward, and Improved Dyna-Q is 82, indicating that all three found the optimal policy; because of its slow convergence, Q-Learning failed to find the optimal policy within 1000 episodes. The average computation time with the improved dynamic reward function is reduced by 81.83%, because the dynamic reward function not only solves the sparse reward problem but also guides the AGV toward the target location, significantly reducing useless search. Furthermore, compared with Dyna-Q with dynamic reward, the average running time of the Improved Dyna-Q algorithm, which uses both the improved heuristic planning and the improved dynamic reward function, is reduced by a further 80.01%, showing that the improved heuristic planning helps shorten the running time of the algorithm. Overall, the average path length of the proposed algorithm is the shortest, and its average computation time is 95.67% less than that of the Dyna-Q algorithm.

4.2.2. Dynamic Environment

In this paper, 10, 20, and 30 dynamic obstacles are randomly added to the random static obstacle map; the three resulting maps are shown in Figure 7. Each dynamic obstacle moves back and forth between its starting point and its target point, and the shortest path of each dynamic obstacle is generated by the A* algorithm.
Table 3 compares the performance of the four algorithms trained for 1000 episodes in the three dynamic obstacle environments, with each algorithm repeated 30 times. Compared with the static obstacle environment, the average path length of all algorithms increases, mainly because the algorithms must also avoid the dynamic obstacles. On the three maps with different numbers of dynamic obstacles, the Improved Dyna-Q algorithm differs little from Dyna-Q in average path length; however, in training time the Improved Dyna-Q algorithm is clearly superior to the other three algorithms.

5. Conclusions

In recent years, reinforcement learning has been favored in path planning research. Dyna-Q is a reinforcement learning method widely used in AGV path planning; however, it suffers from low search efficiency and slow convergence. This paper proposes an Improved Dyna-Q algorithm to solve the AGV path planning problem in dynamic environments. The research first proposes a global path guidance mechanism based on a heuristic graph, which effectively reduces the path search space and improves the convergence speed of the algorithm. Second, based on the heuristic graph, this paper proposes a new dynamic reward function and a heuristic action selection method that provide denser feedback signals and more efficient action selection for AGV path planning.
To demonstrate the merits of the proposed algorithm, different simulation experiments were carried out. The numerical results show that, compared with the classical Dyna-Q algorithm, the proposed algorithm significantly improves the convergence speed and computation time, and that the Improved Dyna-Q algorithm can effectively solve the AGV path planning task in environments with different numbers of dynamic obstacles.

Author Contributions

Conceptualization, Y.L. and S.Y.; methodology, S.Y.; software, S.Y.; validation, S.Y. and Y.L.; formal analysis, Y.L.; investigation, S.Y.; resources, Y.L.; data curation, Y.Z. and F.L.; writing—original draft preparation, S.Y.; writing—review and editing, S.Y., C.S., and Y.L.; visualization, S.Y. and Y.Z.; supervision, C.S., Y.Z., and F.L.; project administration, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by the National Key R&D Program of China (Grant 2021YFB3301400), LiaoNing Revitalization Talents Program (Grant XLYC1907057), and by the Nature Science Foundation of Liaoning Province (Grant 2021-MS-030) and State Key Laboratory of Robotics [2022-Z03].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bechtsis, D.; Tsolakis, N.; Vlachos, D.; Iakovou, E. Sustainable supply chain management in the digitalisation era: The impact of Automated Guided Vehicles. J. Clean. Prod. 2017, 142, 3970–3984.
  2. Patricio, R.; Mendes, A. Consumption patterns and the advent of automated guided vehicles, and the trends for automated guided vehicles. Curr. Robot. Rep. 2020, 1, 145–149.
  3. Sun, Y.; Fang, M.; Su, Y. AGV path planning based on improved Dijkstra algorithm. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2021; Volume 1746, p. 012052.
  4. Yang, X.; Wushan, C. AGV path planning based on smoothing A* algorithm. Int. J. Softw. Eng. Appl. 2015, 6, 1–8.
  5. Wang, C.; Wang, L.; Qin, J.; Wu, Z.; Duan, L.; Li, Z.; Cao, M.; Ou, X.; Su, X.; Li, W.; et al. Path planning of automated guided vehicles based on improved A-Star algorithm. In Proceedings of the 2015 IEEE International Conference on Information and Automation, Lijiang, China, 8–10 August 2015; pp. 2071–2076.
  6. Setiawan, Y.D.; Pratama, P.S.; Jeong, S.K.; Duy, V.H.; Kim, S.B. Experimental comparison of A* and D* lite path planning algorithms for differential drive automated guided vehicle. In AETA 2013: Recent Advances in Electrical Engineering and Related Sciences; Springer: Berlin, Germany, 2014; pp. 555–564.
  7. Jin, X.; Yan, Z.; Yang, H.; Wang, Q.; Yin, G. A Goal-biased RRT Path Planning Approach for Autonomous Ground Vehicle. In Proceedings of the 2020 4th CAA International Conference on Vehicular Control and Intelligence (CVCI), Hangzhou, China, 18–20 December 2020; pp. 743–746.
  8. Hu, Y.; Yang, S.X. A knowledge based genetic algorithm for path planning of a mobile robot. In Proceedings of the IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, New Orleans, LA, USA, 26 April–1 May 2004; Volume 5, pp. 4350–4355.
  9. Lamini, C.; Benhlima, S.; Elbekri, A. Genetic algorithm based approach for autonomous mobile robot path planning. Procedia Comput. Sci. 2018, 127, 180–189.
  10. Wang, X.; Shi, H.; Zhang, C. Path planning for intelligent parking system based on improved ant colony optimization. IEEE Access 2020, 8, 65267–65273.
  11. Praserttaweelap, R.; Kaitwanidvilai, S.; Aoyama, H. Safety path planning with obstacle avoidance using particle swarm optimization for AGV in manufacturing layout. Int. J. Innov. Comput. Inf. Control 2019, 15, 351–368.
  12. Almasri, M.M.; Alajlan, A.M.; Elleithy, K.M. Trajectory planning and collision avoidance algorithm for mobile robotics system. IEEE Sens. J. 2016, 16, 5021–5028.
  13. Matveev, A.S.; Wang, C.; Savkin, A.V. Real-time navigation of mobile robots in problems of border patrolling and avoiding collisions with moving and deforming obstacles. Robot. Auton. Syst. 2012, 60, 769–788.
  14. Tanveer, M.H.; Recchiuto, C.T.; Sgorbissa, A. Analysis of path following and obstacle avoidance for multiple wheeled robots in a shared workspace. Robotica 2019, 37, 80–108.
  15. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
  16. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
  17. Raghu, A.; Komorowski, M.; Ahmed, I.; Celi, L.; Szolovits, P.; Ghassemi, M. Deep reinforcement learning for sepsis treatment. arXiv 2017, arXiv:1711.09602.
  18. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274.
  19. Konar, A.; Chakraborty, I.G.; Singh, S.J.; Jain, L.C.; Nagar, A.K. A deterministic improved Q-learning for path planning of a mobile robot. IEEE Trans. Syst. Man Cybern. Syst. 2013, 43, 1141–1153.
  20. Lv, L.; Zhang, S.; Ding, D.; Wang, Y. Path planning via an improved DQN-based learning policy. IEEE Access 2019, 7, 67319–67330.
  21. Watkins, C.J.C.H. Learning from delayed rewards. Ph.D. Thesis, King’s College, Cambridge, UK, 1989.
  22. Li, C.; Zhang, J.; Li, Y. Application of artificial neural network based on q-learning for mobile robot path planning. In Proceedings of the 2006 IEEE International Conference on Information Acquisition, Bratislava, Slovakia, 24–27 June 2006; pp. 978–982.
  23. Low, E.S.; Ong, P.; Cheah, K.C. Solving the optimal path planning of a mobile robot using improved Q-learning. Robot. Auton. Syst. 2019, 115, 143–161.
  24. Das, P.; Behera, H.; Panigrahi, B. Intelligent-based multi-robot path planning inspired by improved classical Q-learning and improved particle swarm optimization with perturbed velocity. Eng. Sci. Technol. Int. J. 2016, 19, 651–669.
  25. Low, E.S.; Ong, P.; Low, C.Y.; Omar, R. Modified Q-learning with distance metric and virtual target on path planning of mobile robot. Expert Syst. Appl. 2022, 199, 117191.
  26. Sutton, R.S. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bull. 1991, 2, 160–163.
  27. Viet, H.H.; Kyaw, P.H.; Chung, T. Simulation-based evaluations of reinforcement learning algorithms for autonomous mobile robot path planning. In IT Convergence and Services; Springer: Berlin, Germany, 2011; pp. 467–476.
  28. Santos, M.; López, V.; Botella, G. Dyna-H: A heuristic planning reinforcement learning algorithm applied to role-playing game strategy decision systems. Knowl.-Based Syst. 2012, 32, 28–36.
  29. Al Dabooni, S.; Wunsch, D. Heuristic dynamic programming for mobile robot path planning based on Dyna approach. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 3723–3730.
  30. Pei, M.; An, H.; Liu, B.; Wang, C. An improved dyna-q algorithm for mobile robot path planning in unknown dynamic environment. IEEE Trans. Syst. Man Cybern. Syst. 2021, 52, 4415–4425.
  31. Hwang, K.S.; Jiang, W.C.; Chen, Y.J. Pheromone-Based Planning Strategies in Dyna-Q Learning. IEEE Trans. Ind. Inform. 2017, 13, 424–435.
  32. Zajdel, R. Epoch-incremental Dyna-learning and prioritized sweeping algorithms. Neurocomputing 2018, 319, 13–20.
  33. Yang, Q.; Lian, Y.; Xie, W. Hierarchical planning for multiple AGVs in warehouse based on global vision. Simul. Model. Pract. Theory 2020, 104, 102124.
  34. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
  35. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
Figure 1. An example of AGV path planning with 10 dynamic obstacles. The grid map size is 30 × 30 , “S” is the start position of the AGV, and “G” is the goal position of the AGV. The red circle is the AGV, the orange circles are the dynamic obstacles, and the gray rectangles indicate the static obstacles.
Figure 2. Dyna architecture.
Figure 3. Visualization of a map example’s heuristic graph.
Figure 4. Comparison of the curves of path length variation during iterations of the Dyna-Q algorithm using the traditional reward function and the improved dynamic reward function.
Figure 5. Planned trajectories obtained using the four algorithms after 1000 training episodes.
Figure 6. Comparison of learning curves of four algorithms in the static obstacle map.
Figure 7. Maps with different numbers of dynamic obstacles.
Table 1. Simulation parameter configurations.

Parameter         Symbol   Value
Discount factor   γ        0.95
Learning rate     α        0.2
Greedy factor     ϵ        0.1
Planning steps    n        10
Episode number    N        1000
Table 2. Performance comparison of four algorithms in static obstacle environments.

                              Q-Learning   Dyna-Q   Dyna-Q with Dynamic Reward   Improved Dyna-Q
path length (unit)   min      124          82       82                           82
                     max      211          89       91                           88
                     mean     171.03       86       86.59                        82.37
train time (s)       min      34.86        275.28   59.29                        12.3
                     max      36.53        321.31   70.68                        13.96
                     mean     35.64        298.8    64.88                        12.94
Table 3. Performance comparison of four algorithms in environments with different numbers of dynamic obstacles.

                                                   Q-Learning   Dyna-Q   Dyna-Q with Dynamic Reward   Improved Dyna-Q
10 dynamic obstacles   path length (unit)   min    125          83       82                           82
                                            max    167          85       86                           86
                                            mean   146.3        83.7     84.1                         83.9
                       train time (s)       min    42.86        302.38   65.54                        13.29
                                            max    44.08        325.78   71.88                        14.64
                                            mean   43.4         309.21   68.09                        14.18
20 dynamic obstacles   path length (unit)   min    127          83       82                           82
                                            max    172          90       90                           89
                                            mean   155.6        84.4     85.4                         84.9
                       train time (s)       min    43.12        321.41   66.12                        14.04
                                            max    44.76        368.21   82.78                        16
                                            mean   43.95        341.46   77.44                        15.06
30 dynamic obstacles   path length (unit)   min    130          82       82                           82
                                            max    173          84       84                           89
                                            mean   146.8        82.6     82.9                         85.4
                       train time (s)       min    47.68        316.51   65.54                        14.61
                                            max    49.9         340.07   74.68                        17.09
                                            mean   48.59        328.01   70.57                        15.39
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Liu, Y.; Yan, S.; Zhao, Y.; Song, C.; Li, F. Improved Dyna-Q: A Reinforcement Learning Method Focused via Heuristic Graph for AGV Path Planning in Dynamic Environments. Drones 2022, 6, 365. https://doi.org/10.3390/drones6110365

