# A Real-Time Path Planning Algorithm Based on the Markov Decision Process in a Dynamic Environment for Wheeled Mobile Robots

Published in *Actuators*, 2023.

## Abstract


## 1. Introduction

## 2. Algorithm Principle

#### 2.1. Markov Decision Process

At each time step t, the agent observes the state S_t of the environment and uses it to choose an action A_t. Then, in the next step, the agent receives a reward R_{t+1}.

This interaction produces a sequence S_0, A_0, R_1, S_1, A_1, R_2, etc. Assuming this sequence is finite, the random variables R_t and S_t have defined probability distributions. The distribution depends only on the previous states and actions. We can say that, given the previous state and action, for specific values s′ ∈ S and r ∈ R of the random variables, the probability of occurrence at step t is:

p(s′, r | s, a) = Pr{S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a}. (1)

The agent's goal is to maximize the reward received. It is not the immediate reward that needs to be maximized, but the long-term accumulated reward. For this purpose, we define a total return function G_t:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯ = Σ_{k=0}^{∞} γ^k R_{t+k+1}, (3)

where γ ∈ [0, 1] is the discount factor. A policy π(a|s) gives the probability of taking action A_t = a at state S_t = s. The value function of state s under policy π, denoted as v_π(s), refers to the expected return from starting in state s and following policy π thereafter. In MDP fashion, we can define v_π(s) as the following:

v_π(s) = E_π[G_t | S_t = s], (4)

where E_π is the expected value of the random variable when the agent follows policy π. We call the function v_π the state-value function of policy π. Similarly, the value of taking action a in state s under policy π, denoted as q_π(s, a), is the expected return from starting in state s, taking action a, and thereafter following policy π:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]. (5)

We call q_π the action-value function of policy π. Substituting G_t from (3) into Equation (4), for any policy π and any state s we have

v_π(s) = Σ_a π(a|s) Σ_{s′, r} p(s′, r | s, a) [r + γ v_π(s′)]. (6)

This is the Bellman equation for v_π. It expresses the relationship between the value of a state and the values of its successor states. The Bellman equation averages over all possibilities, weighting each by its probability of occurrence: the value of a state must equal the discounted expected value of the next state plus the expected reward along the way.

#### 2.2. MDP-Based Path Planning

The map is discretized into a set of states S, and each state s is assigned a value v_π(s). The higher the value of the value function, the closer the location is to the end point. The agent has u actions, A = {a_1, a_2, …, a_u}. However, the agent's actions are not completely reliable: sometimes the agent performs an incorrect action with an unexpected result. For example, when ordered to "go right", the agent may actually move to the "upper right" position. Although we cannot guarantee which action the agent will ultimately perform, we at least know all the actions the agent may perform after an order is given and their probabilities; that is, given π and every possible a, the corresponding probability is π(a|s).
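A minimal sketch of such an unreliable action model, assuming eight grid directions and illustrative slip probabilities (the 0.8/0.1/0.1 split is an assumption, not a value from the paper):

```python
import random

# Hypothetical unreliable 8-direction action model: the commanded move is
# executed with probability 0.8 and slips to one of the two adjacent
# directions with probability 0.1 each (illustrative numbers).
MOVES = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def outcome_distribution(a):
    """Map a commanded action index a (0..7) to {executed_action: probability}."""
    left, right = (a - 1) % 8, (a + 1) % 8
    return {a: 0.8, left: 0.1, right: 0.1}

def step(pos, a, rng=random):
    """Execute action a from pos, sampling the actually performed action."""
    dist = outcome_distribution(a)
    actions, probs = zip(*dist.items())
    executed = rng.choices(actions, weights=probs)[0]
    dx, dy = MOVES[executed]
    return (pos[0] + dx, pos[1] + dy)
```

With this model, commanding "go right" (action 0) from (0, 0) lands on (1, 0), (1, 1), or (1, −1), mirroring the "ordered right, ran upper right" situation described above.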

#### 2.3. Cost to Obstacles and Walls

To keep the agent away from walls, a cost is applied to the value function:

z(s) = k_z · (d_max − d(s))^α if d_max − d(s) > 1, and z(s) = k_z otherwise,

where k_z < 0 is the basic cost, d(s) is the distance from the position of state s to the nearest wall, α ≥ 1 is used to adjust the magnitude of the influence near the wall, and d_max is the critical value of d(s). If d_max − d(s) ≤ 1, the attenuation of the value function is only the basic cost k_z; on the other hand, if the position is too close to the wall so that d_max − d(s) > 1, this cost increases with the power of α.

Dynamic obstacles must also be considered. Suppose each obstacle o is located at (x_o, y_o), the set O = {o_1, o_2, …, o_w} represents this group, and the number of obstacles is w. We then define c(o, s) as the influence of obstacle o on state s and use it to modify Equation (11); in the resulting expression, k_d < 0 is the obstacle cost and r_max is the critical value of r(o, s), the distance from obstacle o to state s.
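The two cost terms can be sketched as follows. The function names are ours, and the linear decay inside r_max in `obstacle_cost` is an assumption, since the text specifies only k_d and the critical radius r_max:

```python
def wall_cost(d_s, k_z=-3e-3, alpha=1.0, d_max=3.0):
    """Wall cost z(s) for a state at distance d_s from the nearest wall:
    the basic cost k_z when d_max - d_s <= 1, amplified by the power alpha
    when the wall is closer."""
    gap = d_max - d_s
    return k_z * gap ** alpha if gap > 1.0 else k_z

def obstacle_cost(r_os, k_d=-0.1, r_max=3.0):
    """Influence c(o, s) of an obstacle at distance r_os from state s; zero
    beyond the critical radius r_max. (The linear decay inside r_max is our
    assumption -- the text gives only k_d and r_max.)"""
    return k_d * (r_max - r_os) if r_os < r_max else 0.0
```

The total attenuation at a state would then be z(s) plus the sum of c(o, s) over all obstacles o ∈ O, which is what the modified update in Equation (13) folds into the value function.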

#### 2.4. Dynamic Programming

An initial value function v_0 is initialized arbitrarily, and each subsequent approximation is then calculated by using the modified Bellman equation in Equation (13) as the update rule for all s ∈ S. v_k = v* is a fixed point under this rule, because the Bellman equation for v* guarantees the equality sign. As long as v* is guaranteed to exist under the same conditions, the sequence {v_k} converges to v* as k → ∞. This algorithm is called value iteration. To obtain each successor v_{k+1} from v_k, the iteration performs the same operation for every state s: it replaces the old value of s with a new value computed from the old values of its successor states and the expected immediate reward (an "expected update"). Each iteration updates the value of every state, producing a new approximate value function v_{k+1}.
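For intuition, here is a minimal value-iteration sketch on a toy one-dimensional corridor. The environment, rewards, and discount are illustrative assumptions, and the wall and obstacle costs of Equation (13) are omitted for brevity:

```python
# Value iteration on a 1-D corridor: states 0..4, deterministic left/right
# moves, reward 1 for reaching the terminal goal state 4 (toy example).
GAMMA = 0.9
N, GOAL = 5, 4

def expected_update(v, s):
    """One expected update: max over actions of r + gamma * v(s')."""
    if s == GOAL:                        # terminal state keeps value 0
        return 0.0
    best = float("-inf")
    for ds in (-1, +1):                  # actions: move left or right
        s2 = min(max(s + ds, 0), N - 1)  # clamp at the corridor ends
        r = 1.0 if s2 == GOAL else 0.0
        best = max(best, r + GAMMA * v[s2])
    return best

def value_iteration(tol=1e-9):
    """Sweep expected updates over all states until the change is below tol."""
    v = [0.0] * N
    while True:
        new = [expected_update(v, s) for s in range(N)]
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new

v_star = value_iteration()
```

The values converge to v*(3) = 1, v*(2) = 0.9, v*(1) = 0.81, v*(0) = 0.729, each a factor of γ below its successor toward the goal, matching the fixed-point argument above.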

Once the value function v_π has been determined, for some state s we want to know whether the policy should be changed to choose an action a which is not π(s). One way to answer this is to consider choosing action a in state s and thereafter following the existing policy π. The value obtained in this way is:

q_π(s, a) = Σ_{s′, r} p(s′, r | s, a) [r + γ v_π(s′)].

The key criterion is whether this value is greater or less than v_π(s). If it is greater, it is better to choose action a once in state s and follow policy π thereafter than to follow π all the time; in that case, we can expect that it is better to choose action a every time state s is reached. If the following condition is satisfied:

q_π(s, π′(s)) ≥ v_π(s) for all s ∈ S, (17)

then the policy π′ is as good as or better than π. The greedy policy assigns, in each state, the action that maximizes q_π(s, a) under this policy. In other words, considering a new greedy policy π′:

π′(s) = argmax_a q_π(s, a) = argmax_a Σ_{s′, r} p(s′, r | s, a) [r + γ v_π(s′)]. (19)

The greedy policy is built from v_π and takes the action that looks best in the short term. If it satisfies Equation (17), this policy is the same as or better than the original one. The process of constructing a new policy that improves on the original, by acting greedily with respect to the value function of the original policy, is called policy improvement.

Once a policy π has been improved using v_π to produce a better policy π′, we can compute v_{π′} and improve the policy again to obtain an even better policy π″, thus obtaining a sequence of continually improving policies and value functions. The optimal behavior at every location could be obtained by applying Formula (19) to all states, but it is not necessary to solve all states: because the value function changes with the environment, the optimal behavior obtained at this moment may not fit the future. In fact, it suffices to substitute only the state s_c (x_sc = x_c, y_sc = y_c) corresponding to the agent position c_k(x_c, y_c) into Equation (19); this yields the policy π_k(s_c), and the agent then moves to s_c′. Substituting s_c′ into Formula (19) again gives the policy corresponding to the position s_c′, forming an iterative loop that is solved step by step. If π_k(s_c′), s_c″, π_k(s_c″), s_c‴, π_k(s_c‴), s_c^{(4)}, … are obtained continuously in the same way at a single time point, the loop eventually reaches a state s_c^{(g)} at the end point. The coordinates corresponding to [s_c, s_c′, …, s_c^{(g)}] are then the path the agent can take to the end point at this time, guiding the agent there, and the corresponding series of policies is [π_k(s_c), π_k(s_c′), …, π_k(s_c^{(g − 1)})]. If instead s_c′ is substituted into Equation (19) only at the next time point to obtain π_k(s_c′), the whole process becomes a guidance mechanism for the agent: the agent obtains a policy corresponding to its position at each time point, and this policy changes with the map, becoming [π_k(s_c), π_{k+1}(s_c′), …, π_{k+h−1}(s_c^{(g − 1)})], but it still eventually leads the agent to the end point. Since the latest value is used every time a new policy is established, and the value function incorporates the influence of the agent's behavioral ability as well as of static and dynamic obstacles, this method achieves good performance in complex environments.
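The step-by-step guidance loop described above (substitute the agent's current state into the greedy policy, move to the chosen successor, and repeat until the end point) can be sketched as follows; the grid size, value function, and helper names are illustrative choices, not the paper's:

```python
# Greedy path extraction from a precomputed value function: from each state,
# move to the in-map neighbor with the highest value, Equation (19) style.
MOVES = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def greedy_path(value, start, goal, max_steps=100):
    """Follow the greedy policy over `value` from start to goal."""
    path, s = [start], start
    while s != goal and len(path) <= max_steps:
        candidates = [(s[0] + dx, s[1] + dy) for dx, dy in MOVES
                      if (s[0] + dx, s[1] + dy) in value]
        s = max(candidates, key=lambda n: value[n])  # greedy step
        path.append(s)
    return path

# Toy 3x3 map whose value increases toward the goal at (2, 2).
value = {(x, y): -((2 - x) ** 2 + (2 - y) ** 2)
         for x in range(3) for y in range(3)}
path = greedy_path(value, start=(0, 0), goal=(2, 2))
```

On this toy map the extracted path is the diagonal [(0, 0), (1, 1), (2, 2)]. In the dynamic case, the value dictionary would be refreshed by one value-iteration sweep between steps, so the greedy choice tracks moving obstacles.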

#### 2.5. Time Complexity Analysis

The time complexity of planning the first path is O(n²), far higher than the average time complexity O(n log n) of the A* algorithm, because the value function must be iterated until its value at the starting point becomes greater than zero. From the second path onward, however, each execution requires only one value iteration and one path finding, and the time complexity drops to O(n). This feature makes the algorithm well suited to execution in a dynamic environment.

## 3. Simulation Results

#### 3.1. Parameter Settings

The simulation parameters are set as k_z = −3 × 10⁻⁵, α = 3, and d_max = 5.

#### 3.2. Real-Time Obstacle Avoidance and Path Planning

#### 3.3. Comparison and Discussion of Different γ Values

## 4. Experimental Results

In the experiments, the parameters are set as k_z = −3 × 10⁻³, k_d = −0.1, α = 1, d_max = 3, and r_max = 3.

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References


**Figure 2.** Behavior number: 1 to 8 correspond to the eight adjacent directions; 9 represents the position of the agent.

**Figure 15.** The path planning result in the environment with an unknown moving obstacle by the A* algorithm.

| | Completeness | Optimality | Time Complexity | Suitable for Dynamic Environments |
|---|---|---|---|---|
| Proposed algorithm | yes | yes | First path: O(n²); after second path: O(n) | yes |
| A* algorithm | yes | yes | Average: O(n log n); best: O(n) | no |

| | Value Iteration (ms) | Path Finding (ms) | Total Time after 2nd Path (ms) |
|---|---|---|---|
| Proposed algorithm | 6.86 (every iteration after 2nd path) | 0.12 | 6.98 (6.86 + 0.12) |
| A* algorithm | -- | 17.50 | 17.50 |

| | Value Iteration (ms) | Path Finding (ms) | Total Time after 2nd Path (ms) |
|---|---|---|---|
| Proposed algorithm | 7.20 (every iteration after 2nd path) | 0.09 | 7.29 (7.20 + 0.09) |
| A* algorithm | -- | 17.54 | 17.54 |

| | Value Iteration (ms) | Path Finding (ms) | Total Time after 2nd Path (ms) |
|---|---|---|---|
| Proposed algorithm | 7.30 (every iteration after 2nd path) | 0.11 | 7.41 (7.30 + 0.11) |
| A* algorithm | -- | 18.10 | 18.10 |

| | Value Iteration (ms) | Path Finding (ms) | Total Time after 2nd Path (ms) |
|---|---|---|---|
| Proposed algorithm | 7.43 (every iteration after 2nd path) | 0.19 | 7.62 (7.43 + 0.19) |
| A* algorithm | -- | 18.13 | 18.13 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Chen, Y.-J.; Jhong, B.-G.; Chen, M.-Y.
A Real-Time Path Planning Algorithm Based on the Markov Decision Process in a Dynamic Environment for Wheeled Mobile Robots. *Actuators* **2023**, *12*, 166.
https://doi.org/10.3390/act12040166
