Indoor Emergency Path Planning Based on the Q-Learning Optimization Algorithm

The internal structure of buildings is becoming increasingly complex. Providing a scientific and reasonable evacuation route for trapped persons in a complex indoor environment is important for reducing casualties and property losses. In emergency and disaster relief environments, indoor path planning involves great uncertainty and higher safety requirements. Q-learning is a value-based reinforcement learning algorithm that can complete path planning tasks through autonomous learning without establishing mathematical models and environmental maps. Therefore, we propose an indoor emergency path planning method based on a Q-learning optimization algorithm. First, a grid environment model is established. A discount rate of the exploration factor is used to optimize the Q-learning algorithm, and the exploration factor in the ε-greedy strategy is dynamically adjusted before random actions are selected, accelerating the convergence of the Q-learning algorithm in a large-scale grid environment. An indoor emergency path planning experiment based on the Q-learning optimization algorithm was carried out using simulated data and real indoor environment data. The proposed Q-learning optimization algorithm essentially converges after 500 iterative learning rounds, roughly 2000 rounds fewer than the classic Q-learning algorithm requires, while the SARSA algorithm shows no obvious convergence trend in 5000 iterations of learning. The results show that the proposed Q-learning optimization algorithm is superior to the SARSA algorithm and the classic Q-learning algorithm in terms of solving time and convergence speed when planning the shortest path in a grid environment. The convergence speed of the proposed Q-learning optimization algorithm is approximately five times faster than that of the classic Q-learning algorithm. In the grid environment, the proposed Q-learning optimization algorithm can successfully plan the shortest path that avoids obstacle areas in a short time.


Introduction
In recent years, with the advancement of urbanization, the internal structure of urban buildings has become more complex and variable. The hidden danger of urban disasters is aggravated by the highly concentrated urban population and resources [1]. Urban disasters are complex and occur suddenly, and the intricate spatial structure of urban buildings significantly affects emergency rescue [2,3]. Summaries and analyses of many emergency cases show that the probability of casualties caused by improper evacuation is increasing [3]. In actual disaster scenarios, most casualties are caused by a lack of timely and effective rescue. It is therefore of great significance to scientifically and reasonably analyze the internal structure of the indoor environment, quickly determine the dynamic changes in the evacuees' positions, and realize the rapid and safe evacuation and rescue of personnel in emergencies [4]. Considering the increasing frequency of disaster events and the growing demand for disaster prevention and mitigation, new technologies and theories such as deep learning and reinforcement learning should be applied scientifically and rationally to design reasonable emergency evacuation path planning according to the internal environment of the disaster area. Reasonable path planning can arrange the orderly transfer of affected people and effectively shorten the evacuation time; it is one of the pressing issues in public security and a research hotspot for scholars worldwide [5].
Since the mid-twentieth century, many scholars have carried out extensive research on path planning in emergency rescue situations [6]. Around the 1960s, with the rapid development of computer science, a wide variety of path planning algorithms emerged. Path planning algorithms have developed from the original traditional and graph-based algorithms to bionic search and artificial intelligence algorithms. These algorithms have different characteristics at each stage of development, and their scopes of application and scenarios differ. In practical applications, the problem to be solved and the characteristics of the algorithm are considered comprehensively, and an appropriate path planning algorithm is selected [7]. In recent years, with the development of intelligence science and the boom in artificial intelligence, artificial intelligence path planning technology has rapidly become a research focus for experts and scholars, and reinforcement learning algorithms have received particular attention [8]. Reinforcement learning can be used to solve obstacle avoidance, path planning, and other problems collaboratively without establishing mathematical models and environmental maps for path planning problems. Lu et al. [9] proposed a neural network based on a reinforcement learning algorithm, conducted local path planning experiments, and obtained path planning results in an environment without prior knowledge. He et al. [10] proposed a combination of Q-learning and fuzzy logic technology to achieve self-learning of mobile robots and path planning in uncertain environments. Hyansu et al. [11] proposed a combination of deep Q-learning and CNN so that the robot can move flexibly and efficiently in various environments. Maw et al.
[12] proposed a hybrid path planning algorithm that uses the path planning algorithm of the time graph for global planning and deep reinforcement learning for local planning so that unmanned aerial vehicles (UAVs) can avoid collisions in real time. Junior et al. [13] proposed a Q-learning algorithm based on a reward matrix to meet the route planning requirements of marine robots. Due to its characteristics, reinforcement learning has been widely used for path planning, especially for local path planning in unknown environments. However, reinforcement learning has the inherent problem of balancing exploration and utilization. In reinforcement learning, the environment is unknown to the agent. Excessive exploration of the environment by the agent will reduce the efficiency of the solution, and excessive use of the environment will cause the agent to miss the optimal solution. Therefore, the balance of exploration and utilization is an important research topic in reinforcement learning. Jaradat et al. [14] applied the Q-learning algorithm for the navigation of mobile robots in a dynamic environment and controlled the size of the Q value table to increase the speed of the navigation algorithm. Wang et al. [15] combined the two algorithms based on the better final performance of the Q-learning algorithm and the faster convergence of the SARSA (State-Action-Reward-State-Action) algorithm and proposed a reverse Q-learning algorithm, which improved the learning rate and algorithm performance. Zeng et al. [16] proposed a supervised reinforcement learning algorithm based on nominal control and introduced supervision into the Q-learning algorithm, thereby accelerating the algorithm convergence. Fang et al. [17] proposed a heuristic reinforcement learning algorithm based on state backtracking, which improved the action selection strategy of reinforcement learning, removed meaningless exploration steps, and greatly improved the learning rate. Song et al. 
[18] established a mapping relationship between existing or learned environmental information and the initial values of the Q value table and accelerated learning by adjusting the initial values in the Q value table [19,20]. Zhang et al. [21] used an enhanced exploration strategy to replace ε-greedy in the traditional Q-learning algorithm and proposed a self-adaptive reinforcement exploration Q-learning (SARE-Q) algorithm to improve the exploration efficiency. Zhuang et al. [22] proposed a multi-destination global path planning algorithm based on the optimal obstacle value. Based on the Q-learning algorithm, the parameters of the reward function were optimized to improve the path planning efficiency of a mobile robot driving to multiple destinations. Soong et al. [23] introduced the concept of partially guided Q-learning and initialized the Q-table through the flower pollination algorithm (FPA) to accelerate the convergence of Q-learning. The ε-greedy strategy is a common method to balance exploration and utilization. On the basis of Q-learning combined with ε-greedy, Li C et al. [24] proposed a parameter dynamic adjustment strategy and a trial-and-error action deletion mechanism, which not only realized an adaptive balance between exploration and utilization in the learning process but also improved the exploration efficiency of the agent. Yang T et al. [25] proposed an ε-greedy strategy that adaptively adjusts the exploration factor, which improves the quality of the strategy learned by the agent and better balances exploration and utilization.
The abovementioned scholars have made helpful advances in improving the efficiency of reinforcement learning algorithms. However, in large and complex emergency environments, it is difficult for reinforcement learning algorithms to achieve the desired results. Since there is no prior learning knowledge, the agent can only randomly select actions for blind search, which leads to low learning efficiency and slow convergence in complex environments. Therefore, this paper proposes a path planning algorithm based on a grid environment and optimizes the Q-learning algorithm by introducing the calculation of the discount rate of the exploration factor. The discount rate of the exploration factor is calculated before the agent selects random actions to solve the problem of blind searching in the learning process. The main contributions are summarized as follows:

1. Aimed at the path planning problem of indoor complex environments in disaster scenarios, a grid environment model is established, and the Q-learning algorithm is adopted to implement path planning in the grid environment.

2. Aimed at the problems of slow convergence speed and low accuracy of the Q-learning algorithm in a large-scale grid environment, the exploration factor in the ε-greedy strategy is dynamically adjusted, and the discount rate variable of the exploration factor is introduced. Before random actions are selected, the discount rate of the exploration factor is calculated to optimize the Q-learning algorithm in the grid environment.

3. An indoor emergency path planning experiment based on the Q-learning optimization algorithm is carried out using simulated data and real indoor environment data of an office building. The results show that the Q-learning optimization algorithm is better than both the SARSA algorithm and the Q-learning algorithm in terms of solving time and convergence when planning the shortest path in a grid environment. The Q-learning optimization algorithm has a convergence speed that is approximately five times faster than that of the classic Q-learning algorithm. In the grid environment, the Q-learning optimization algorithm can successfully plan the shortest path to avoid obstacles in a short time.
The rest of the paper is organized as follows: Section 1 introduces indoor emergency path planning based on the proposed Q-learning optimization algorithm in a grid environment. Section 2 presents the algorithm simulation experiment and the indoor emergency path planning experiment based on the proposed Q-learning optimization algorithm. Section 3 concludes the paper and outlines future work related to our studies.

Indoor Emergency Path Planning Method
Based on the advantage that Q-learning in reinforcement learning can solve obstacle avoidance, path planning, and other problems in a unified way without establishing a mathematical model and environment map, this paper proposes an indoor emergency path planning method based on Q-learning. First, the grid graph method is used to model the grid environment. Next, the path planning strategy of the Q-learning algorithm is designed based on the grid environment, and then the Q-learning algorithm based on the grid environment is optimized by the dynamic adjustment of exploration factors.

Grid Environment Modeling
The environmental modeling problem refers to how to effectively express environmental information through specific models. Environmental modeling is necessary to perform before path planning. Before global path planning, it is necessary to model the environment where the emergency personnel are located and obtain complex environmental information. This allows the emergency personnel to know the location of fixed obstacles in the environment in advance, which is an essential step in path planning.
Common methods used in environmental modeling include the viewable method [26], the cell tree method [27], the link graph method [28], and the grid graph method [29]. The advantages and disadvantages of these four methods are shown in Table 1. According to the characteristics of the emergency environment and a comparison of the advantages and disadvantages of the modeling methods, this paper adopts the grid graph method for environmental modeling. The principle is to rasterize the environmental information and use different color features to represent different environmental information. As shown in Figure 1, the black grid represents the impassable area, denoted by "1"; the white grid represents the free, accessible area, denoted by "0".

After learning the initial position and the target position, the agent explores and learns in the black-and-white grid environment to obtain the shortest obstacle-avoidance path plan. The theoretical knowledge of the grid graph method is concise and easy to understand, which is convenient for program coding and operation. The grid graph information can be represented by a matrix; the matrix corresponding to the grid graph in Figure 1 is given in Equation (1).
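As a concrete illustration of this representation, the sketch below encodes a small hypothetical grid (the layout is invented for illustration, not the map of Figure 1) as a matrix of 0/1 values in the manner of Equation (1):

```python
# Hypothetical 5 x 5 grid in the style of Figure 1 (the layout is invented):
# 0 = free (white) cell, 1 = obstacle (black) cell.
GRID = [
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

def is_free(x, y):
    """A cell is passable if it lies inside the map and is not an obstacle."""
    return 0 <= x < len(GRID) and 0 <= y < len(GRID[0]) and GRID[x][y] == 0
```

Cells outside the matrix bounds are treated as impassable, which mirrors how walls bound the modeled environment.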

Q-Learning Optimization Algorithm
Reinforcement learning is a machine learning method whose essence is to find an optimal decision through continuous interaction with the environment [30]. The idea of reinforcement learning is as follows: the agent affects the environment by performing actions; when the environment receives a new action, it generates a new state and gives reward feedback on the agent's action; finally, the agent chooses the next action to perform according to the new state and the reward feedback. The reinforcement learning model is shown in Figure 2.

Q-learning, proposed by Watkins in 1989, is a landmark in the development of reinforcement learning. Q-learning obtains the optimal strategy by continuously estimating the state value function and optimizing the Q function [31,32]. Q-learning differs to some extent from the common temporal-difference (TD) method in that it performs iterative calculations on the Q(s, a) function of the state-action pair. In the learning process of the agent, it is necessary to check whether the corresponding behavior is reasonable in order to ensure that the final result converges [33,34].


Q-Learning Algorithm
Q-learning is a reinforcement learning algorithm based on the finite Markov decision process and is mainly composed of the agent, states, actions, and the environment. The state at a certain moment is represented by s, and the action to be performed by the agent is represented by a. Q-learning initializes the function Q(s, a) and the state s, and selects an action a according to a strategy such as ε-greedy [35] to obtain the next state s′ and the immediate return r. Next, the Q value is updated according to the update rules [34]. When the agent reaches the destination during the movement, the algorithm completes one iteration. The agent then returns to the initial node to continue the iteration cycle until the iterative learning process is complete [36,37]. In the Q-learning process, the optimal value function is determined and approximated by the iterative calculation of the Q(s, a) function. The update rules of the function are shown in Equations (2) and (3), where γ represents the discount factor, α represents the learning rate, and a′ represents the next action. The Q-learning process includes many episodes, and every episode repeats the following calculation process. When the agent is at time t:

1. Observe the state s_t at this time;
2. Select the action a_t to perform next;
3. Observe the next state s_t+1;
4. Obtain the immediate reward r_t;
5. Update Q(s_t, a_t) according to the update rule;
6. t ← t + 1; proceed to the next moment.
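The per-step update can be sketched in code. This is a minimal tabular implementation under assumed names (`Q`, `q_update`); the α and γ values are the ones reported later in the parameter settings:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.01, 0.9     # learning rate and discount factor (paper's settings)
ACTIONS = [0, 1, 2, 3]       # the four movement directions
Q = defaultdict(float)       # Q[(state, action)], implicitly initialized to 0

def q_update(s, a, r, s_next):
    """One application of the Q-learning update rule:
    move Q(s, a) toward r + γ · max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```

Because the table is a `defaultdict`, unvisited state-action pairs read as 0, which matches a zero-initialized Q value table.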
The Q function can be represented and implemented by a lookup table or a neural network. When a lookup table is used, the number of elements in the Cartesian product S × A determines the size of the table. When the state set S and the set A of possible actions in the environment are relatively large, a huge amount of storage space is occupied, and the learning efficiency is greatly reduced, which is a shortcoming for practical applications.
When using a neural network, s_t = [s_t1, s_t2, . . . , s_tm] is the state vector given as the network input. Each network output corresponds to the Q value of one action, so the neural network stores the input-output correspondence [33]. The Q function definition is shown in Equation (5). Equation (5) is effective only when the optimal strategy is obtained. In the learning process, Equation (6) applies, where Q(s_t+1, a_t+1) represents the Q value of the next state. The purpose of ∆Q is to reduce the error. The weight adjustment calculation is shown in Equation (7). The specific algorithm is as follows:

1. Initialize the Q network;
2. Select the state s_t at time t;
3. Compute the network outputs Q(s_t, a_i) for the candidate actions;
4. Select the next action a_i according to the updated Q(s_t, a_i);
5. Perform action a_i, and obtain the new state s_t+1 and the immediate reward r_t;
6. Calculate the error ∆Q_i;
7. Adjust the weights of the Q network to minimize the error ∆Q_i, as shown in Equation (8);
8. Go to step 2.
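A minimal sketch of the network-based variant, assuming a linear approximator in place of the unspecified network architecture (the paper gives no network details, so the model and all names here are illustrative):

```python
# Minimal linear Q-approximator sketch. The paper does not specify the network
# architecture, so this linear model and all names here are illustrative.
M = 2                                    # length of the state vector s_t = [s_t1, ..., s_tm]
ACTIONS = [0, 1, 2, 3]
ALPHA, GAMMA = 0.01, 0.9                 # learning rate and discount factor
weights = {a: [0.0] * M for a in ACTIONS}

def q_value(state, action):
    """Network output for one action: a dot product in the linear case."""
    return sum(w * x for w, x in zip(weights[action], state))

def train_step(state, action, reward, next_state):
    """Compute the error ΔQ and adjust the weights to reduce it:
    a gradient step on the squared error."""
    target = reward + GAMMA * max(q_value(next_state, a) for a in ACTIONS)
    delta_q = target - q_value(state, action)    # the error to minimize
    for j in range(M):
        weights[action][j] += ALPHA * delta_q * state[j]
    return delta_q
```

The weight step is the usual stochastic-gradient correction toward the bootstrapped target; a multi-layer network would replace `q_value` and the weight loop with forward and backward passes.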

Path Planning Strategy
When the agent uses the Q-learning algorithm to plan a path in an unknown obstacle environment, experience must be accumulated by continuously exploring the environment. The agent uses the ε-greedy strategy for action selection and obtains immediate rewards when performing state transitions. During each iteration, when the agent reaches the target location, the Q value table is updated. The agent's position is instantly transferred to the starting point for another loop iteration until the value function tends to converge, which means the learning process has finished. To improve the exploration efficiency and the convergence speed of the value function, the agent is given a negative reward if an obstacle is encountered during the learning process. The agent can then change direction to explore other positions, which reduces the probability of falling into a local optimal solution and reduces the total number of episodes consumed in trial-and-error learning.

1. Action and state of the agent
If the agent is regarded as a particle, its occupied area and size do not need to be taken into account in the experimental analysis. The grid occupied by the particle is the current position of the agent, and the coordinates (x_i, y_i) represent the corresponding state information. In the grid environment, the agent moves one grid per step. The agent can move in four directions: up, down, left, and right. The action space A = {0, 1, 2, 3} corresponds to the four directions of movement.
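The action space and the resulting state transitions can be sketched as follows; the (row, column) convention and the particular `MOVES` mapping are assumptions for illustration, since the paper does not state which index corresponds to which direction:

```python
# Actions of the space A = {0, 1, 2, 3} mapped to one-cell moves. The
# (row, column) convention (0 = up, 1 = down, 2 = left, 3 = right) is assumed.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def apply_action(state, action):
    """Move the agent one grid in the chosen direction."""
    x, y = state
    dx, dy = MOVES[action]
    return (x + dx, y + dy)
```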

2. Set the reward function

The reward function is the value feedback that the agent receives when exploring the environment. If the agent performs the optimal action, a larger reward is obtained; if the agent performs a poor action, a smaller reward is obtained. Actions with a high reward value have an increased chance of being selected, while actions with a low reward value have a decreased chance of being selected. In this section, the path planning strategy in the grid environment is to select the path along which the agent obtains the largest cumulative reward in the learning process. The specific reward function is defined by a nonlinear piecewise function, as shown in Equation (9): r = 1 on reaching the target position, r = 0 on reaching any other free position, and r = −1 on reaching an obstacle position. In the learning process, when the agent reaches the target location while exploring the environment, a reward value of r = 1 is obtained and training continues with the next episode. When the agent moves in the free zone, the feedback reward is r = 0. When the agent encounters an obstacle area, the reward is r = −1.
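The piecewise reward of Equation (9) translates directly into code; the obstacle set and target coordinates below are hypothetical placeholders:

```python
# Reward of Equation (9). The obstacle set and target cell are hypothetical.
OBSTACLES = {(1, 1), (1, 2)}
TARGET = (4, 4)

def reward(cell):
    if cell == TARGET:
        return 1      # reached the target position
    if cell in OBSTACLES:
        return -1     # entered an obstacle area
    return 0          # passed through a free cell
```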

3. Action strategy selection
The ε-greedy algorithm is used to select the action strategy. With probability 1 − ε, the action with the maximum state-action value is selected; with probability ε, a random action is selected. Finally, the strategy with the largest cumulative reward value is selected. The calculation of the ε-greedy strategy is shown in Equation (10), where prob(a(t)) represents the agent's choice of action strategy.
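A minimal sketch of this selection rule, with an assumed `q_row` mapping from actions to their current Q values:

```python
import random

ACTIONS = [0, 1, 2, 3]

def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability ε pick a random action; otherwise pick the greedy one.
    q_row maps each action to its current Q value."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)                  # explore
    return max(ACTIONS, key=lambda a: q_row[a])     # exploit
```

Setting ε = 0 makes the rule purely greedy, and ε = 1 makes it purely random, which is the exploration-utilization dial discussed above.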

4. Q value table

The agent selects the action sequence with the maximum reward value as the optimal path in the final Q value table.

Dynamic Adjustment of Exploration Factors
The Q-learning algorithm adopts the ε-greedy exploration strategy, which determines the decision the agent makes each time. ε is the exploration factor, which ranges from 0 to 1. As ε approaches 1, the agent is more inclined to explore the environment, i.e., to try random actions. However, if the agent is always inclined to explore the environment, random actions are not suitable for finding the final goal [35]. As ε approaches 0, the agent tends to take advantage of the external environment and choose the action with the largest action value function. In this case, the value function may not converge effectively, and the result will be influenced by the environment. Thus, the optimal solution can be easily missed and the final solution may not be obtained. The ε value is closely related to the agent's exploration strategy, which determines the accuracy and efficiency of the final solution. Thus, selection of the ε value is vital [38].
This paper optimizes the Q-learning algorithm by dynamically adjusting the exploration factor in the ε-greedy strategy: the discount rate of the exploration factor is introduced and is calculated before random actions are selected, as shown in Equation (11), where i is the episode (number of iterations). When i is 0, the initial value of ε is 1 according to the formula, so exploration is at its maximum. As many explorations are conducted by randomly choosing actions and the Q function is continually trained, the agent becomes increasingly confident in the estimated Q values. At the same time, as the number of iterations increases, the exploration factor gradually decreases, so the agent makes more use of the external environment in subsequent selections, which improves the convergence speed.
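Equation (11) itself is not reproduced in this text, so the schedule below is only an assumption consistent with the stated behaviour (ε = 1 at episode i = 0, decreasing as episodes accumulate); the decay constant is illustrative:

```python
DECAY = 0.995   # assumed per-episode discount rate of the exploration factor

def epsilon_for_episode(i, decay=DECAY):
    """ε is 1 when i = 0 and shrinks geometrically as episodes accumulate."""
    return decay ** i
```

Any monotonically decreasing schedule with ε(0) = 1 would reproduce the qualitative effect described: heavy random exploration early, increasing exploitation later.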

Algorithm Flow
First, the relevant parameters are initialized and i is set to 0. Before a random action is chosen, the discount rate of the exploration factor is calculated. Then, an action is performed according to the ε-greedy action strategy, and the agent moves to the next position to obtain the corresponding state s and immediate reward r. The Q value is updated according to the update rules of the value function calculation formula. While the number of training episodes has not reached the preset value, the loop continues to iterate. Once the requirement is met, the Q values corresponding to all states are output, and the algorithm's learning is finished.
A flow chart based on the Q-learning optimization algorithm in the grid environment is shown in Figure 3.
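The flow described above can be sketched end to end. The tiny 3 × 3 grid, the per-episode step cap, and the ε schedule are illustrative assumptions rather than the paper's actual experimental setup:

```python
import random
from collections import defaultdict

# Compact sketch of the optimized training loop. The grid, start/goal cells,
# and ε schedule are illustrative assumptions.
GRID = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]                      # 0 = free, 1 = obstacle
START, GOAL = (0, 0), (2, 2)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right
ALPHA, GAMMA, DECAY = 0.1, 0.9, 0.99

def step(s, a):
    """Apply one move; bumping a wall or obstacle leaves the agent in place
    with reward -1, and entering the goal yields reward +1."""
    x, y = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    if not (0 <= x < 3 and 0 <= y < 3) or GRID[x][y] == 1:
        return s, -1
    return (x, y), (1 if (x, y) == GOAL else 0)

def train(episodes=300, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for i in range(episodes):
        eps = DECAY ** i                      # discounted exploration factor
        s = START
        for _ in range(200):                  # step cap per episode
            if rng.random() < eps:
                a = rng.choice(list(MOVES))                  # explore
            else:
                a = max(MOVES, key=lambda a2: Q[(s, a2)])    # exploit
            s2, r = step(s, a)
            best_next = max(Q[(s2, a2)] for a2 in MOVES)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
            if s == GOAL:
                break
    return Q
```

After training, a greedy walk over the learned table yields the planned path; on this toy map it reaches the goal along a shortest obstacle-avoiding route.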

Experiment and Analysis
In the algorithm simulation experiment, the Q-learning algorithm, the SARSA algorithm, and the proposed Q-learning optimization algorithm are used to plan the emergency path, and the experimental results are compared and analyzed. Then, based on a real indoor environment corresponding to the simulation scene, an indoor emergency path planning analysis using the proposed Q-learning optimization algorithm is performed.

Parameter Settings
After many experiments, the parameters of the proposed Q-learning optimization algorithm were set as follows: learning rate α = 0.01, exploration probability ε = 0.9, and discount factor γ = 0.9. In the algorithm simulation experiment, the number of training episodes is set to 5000. The Q-learning algorithm and SARSA algorithm parameters are set to be consistent with the parameter settings of the proposed Q-learning optimization algorithm. In the simulation scene experiment, the number of training episodes is increased to 10,000 due to the increased grid size.

Algorithm Simulation Experimental Analysis
TD methods can be divided into two types: the on-policy control algorithm SARSA and the off-policy control algorithm Q-learning. The largest difference between the Q-learning and SARSA algorithms is the method of updating the Q value. The Q-learning algorithm is bolder in its choice of actions: when updating, it uses the behavior with the maximum action value rather than the behavior actually selected by the current strategy. The SARSA algorithm is more conservative in its selection and updates the Q value according to its own learning rhythm [39]. In the following, the Q-learning algorithm, the SARSA algorithm, and the proposed Q-learning optimization algorithm are used to determine the shortest path under the grid environment model, and the experimental results are compared and analyzed.
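The difference between the two updates can be made concrete: both targets share the form r + γ · (bootstrap) and differ only in which next-state Q value they bootstrap from. A sketch:

```python
GAMMA = 0.9

def sarsa_target(r, q_next_taken, gamma=GAMMA):
    """SARSA (on-policy): bootstrap on the action actually taken next."""
    return r + gamma * q_next_taken

def q_learning_target(r, q_next_all, gamma=GAMMA):
    """Q-learning (off-policy): bootstrap on the best next action."""
    return r + gamma * max(q_next_all)
```

Because Q-learning always bootstraps on the maximum, its target is never smaller than SARSA's for the same transition, which is the "bolder" behavior described above.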

Environmental Spatial Modeling
In this experiment, the Tkinter toolkit [40] was used to build the environmental model. In the algorithm simulation experiment, a grid environment with a cell size of 20 pixels and a total of 25 × 25 grids is constructed as an obstacle map. The agent selects and executes actions under this environmental model. The total number of grids in the environment is the number of states available to the agent; as shown in Figure 4, there are a total of 625 states. The agent is represented by a red circle and moves one grid on the map each time it performs an action. The initial point for this experiment is (1, 1), and the target point is the blue grid at the lower right corner, (21, 21). The white grid represents the passable area, and the black grid represents the obstacle area. In Tkinter, the position of a rectangle is represented by the coordinates of two diagonal points, and the circle representing the agent is the inscribed circle of the rectangle, which is convenient for representing locations. In the program, the map environment and the location of the agent can be designed by adjusting the grid coordinates. By observing the operation interface in Figure 4, the real-time position of the agent is known at each moment, so we can observe when the agent plans the shortest path, which provides a reference for setting and adjusting the parameters.

Comparison and Analysis of Experimental Results
The simulation result of path planning for the agent from the starting position to the target position using the SARSA algorithm in the grid environment is shown in Figure 5a. The shortest path was 102 steps, and the longest path was 3682 steps. The simulation results of path planning using the Q-learning algorithm are shown in Figure 5b. The shortest path was 42 steps, and the longest path was 1340 steps. The result of the proposed Q-learning optimization algorithm is shown in Figure 5c. The shortest path was also 42 steps, and the longest path was 1227 steps. There is little difference between the path planning results of the Q-learning algorithm and the proposed Q-learning optimization algorithm, but both are significantly better than the results of the SARSA algorithm.

In the training process, since there is no reward accumulation in the early stage of learning, a large amount of time is spent finding the path at the beginning, during which obstacles are constantly encountered. However, with continuous learning, the knowledge accumulated by the agent increases, and the number of steps required in the path-finding process gradually decreases.
Figure 6a shows that the SARSA algorithm has no obvious convergence trend over 5000 iterations of learning. Figure 6b shows that the Q-learning algorithm tends to converge around the 2500th round through continuous exploration of the environment and the accumulation of knowledge. Figure 6c shows that the convergence speed of the proposed Q-learning optimization algorithm is significantly faster when the exploration factor is optimized: it has basically converged by around the 500th round, nearly 2000 rounds earlier than before optimization. In the same environment, the total elapsed time of the SARSA algorithm is 164.86 s, that of the Q-learning algorithm is 68.692 s, and that of the proposed Q-learning optimization algorithm is 13.738 s.
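The exploration-factor optimization described above can be sketched as an ε-greedy action selection in which ε is multiplied by a discount rate before each random-action draw, so exploration shrinks as training proceeds. The decay rate and floor values here are illustrative assumptions, not the paper's parameters.

```python
import random


def choose_action(Q, state, actions, eps, eps_decay=0.999, eps_min=0.01):
    """epsilon-greedy selection with a decaying exploration factor.

    Before a random action may be drawn, eps is multiplied by a discount
    rate (eps_decay) and clipped at a floor (eps_min). Returns the chosen
    action and the updated eps, which the caller carries between steps.
    """
    eps = max(eps * eps_decay, eps_min)
    if random.random() < eps:                 # explore: random action
        return random.choice(actions), eps
    # exploit: pick the action with the largest Q value for this state
    best = max(actions, key=lambda a: Q.get((state, a), 0.0))
    return best, eps
```

Because ε shrinks every step, late-stage training spends almost all of its steps exploiting the learned Q values, which is what accelerates convergence in the large grid.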
The Q-learning algorithm continuously accumulates rewards during the learning process and takes maximizing the cumulative reward as its learning goal. At the beginning of learning, the agent selects actions randomly and easily hits obstacles. Hitting an obstacle yields a reward of −1, so the initial cumulative reward is negative or approximately 0. As the number of training rounds increases, the number of times the agent hits obstacles continues to decrease, and the accumulated reward gradually increases. As Figure 7a shows, as the number of training iterations increases, the cumulative reward of the SARSA algorithm remains approximately 0, with no obvious trend of change during training. As Figure 7b shows, the cumulative reward of the Q-learning algorithm increases gradually with the number of training rounds and finally approaches 10. As Figure 7c shows, the cumulative reward of the proposed Q-learning optimization algorithm changes more stably than that of the previous algorithms and approaches 10 at approximately 3000 rounds.
The above experimental results are summarized in Table 2. Based on the results shown in Table 2, the Q-learning algorithm performs better than the SARSA algorithm in terms of path selection and convergence.
Compared with the SARSA algorithm, the Q-learning algorithm is more suitable for emergency path planning in the grid environment.
At the same time, compared with the classic Q-learning algorithm, the proposed Q-learning optimization algorithm significantly improves solution efficiency and converges significantly faster. This demonstrates that the proposed Q-learning optimization algorithm is effective in the grid environment.

Simulation Scene Experiment Analysis
To verify the effectiveness of the proposed algorithm, a simulation experiment was carried out with an office building in Beijing as the environment. When a fire breaks out on one floor of an office building, it poses a serious threat to the lives and property of the people inside. Given changing fire obstacle information, the Q-learning optimization algorithm in the grid environment is used to plan a rescue path in order to evacuate the trapped people.

Experimental Data and Scene Construction
The simulation environment is the interior of an office building in Beijing. To visually display the movement of rescuers during the evacuation process, Glodon software is used to construct a 3D virtual environment map of the scene. The office building environment in the event of a fire is modeled as shown in Figure 8. The blue rectangle represents people who need to be rescued, and the red and yellow concentric circles show where the fire occurs. This floor of the office building has only one exit, in the top left corner. Rescuers start at the position shown by the cylinder in the upper left corner and move toward the target point. If rescuers want to evacuate the scene as soon as possible, they must consider not only distance but also the impact of different path sections and exit information on evacuation time. Therefore, the research content of this section can be verified through the 3D scene model shown in Figure 8. Based on the structural characteristics of the uniformly distributed indoor walls, the grid graph method is used to construct an indoor grid map, as shown in Figure 9. According to the office building scene, a 30 × 30 grid map is constructed. During construction of the grid map, each unit grid is set as a rectangle of equal length and width. Combining the uniformity and compactness of the building's length and width, the indoor grid environment shown in Figure 10 is constructed.
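The grid-map construction above can be sketched as building a 30 × 30 occupancy grid in which cells are marked as passable, wall, or fire. The cell encoding and the specific coordinates are placeholders for illustration, not the paper's data.

```python
SIZE = 30  # the office-building scene is discretized into a 30 x 30 grid


def build_grid(walls, fire_cells):
    """Build an occupancy grid: 0 = passable, 1 = wall, 2 = fire cell.

    Fire cells are treated as obstacles during path planning, just like
    walls, so the planned route avoids the fire area.
    """
    grid = [[0] * SIZE for _ in range(SIZE)]
    for (x, y) in walls:
        grid[y][x] = 1
    for (x, y) in fire_cells:
        grid[y][x] = 2
    return grid


def passable(grid, x, y):
    """A cell is passable if it lies on the map and holds no obstacle."""
    return 0 <= x < SIZE and 0 <= y < SIZE and grid[y][x] == 0
```

Distinguishing fire cells from walls (code 2 vs. 1) lets the same map be reused for the fire-free scenario by simply omitting the fire cells.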

Analysis of Experimental Results
The Q-learning optimization algorithm is used to carry out shortest-path planning experiments for the indoor environment both without a fire and when a fire occurs. The experimental results are shown in Figures 11 and 12.

Figure 11. Fire-free environment path planning.

During the training process, as shown in Figure 13a, the Q-learning optimization algorithm in the fire-free scenario gradually accumulates knowledge, and the number of steps required from the initial point to the target point gradually decreases, converging around the 4000th round. As shown in Figure 13b, the Q-learning optimization algorithm in the fire scenario converges at approximately the 6600th round as learning progresses. As the complexity of the indoor environment increases, the efficiency with which the agent explores the path decreases to some extent. However, there is little difference between the two scenarios, and the shortest path can still be obtained in a short time. As the agent continuously learns, the cumulative reward of the Q-learning optimization algorithm in the fire-free scenario continues to accumulate, in synchrony with the change in the number of steps. As shown in Figure 14a, at approximately the 4000th round the cumulative reward begins to increase significantly and finally approaches 10. In the fire scenario, the cumulative reward begins to increase at approximately the 6600th round and finally approaches 10, as shown in Figure 14b. Although the reward accumulates more slowly in the more complex environment, the convergence results are not affected.
The experimental results show that the algorithm achieves reasonable path planning in both environments and determines the shortest collision-free path from the starting point to the end point in a short time. The path in the fire environment (Figure 12) is obtained by retraining on the basis of the shortest path planned in the fire-free environment (Figure 11), so that the planned path completely avoids the obstacle sections in the fire area. The algorithm thus shows good adaptability to obstacle environments: it can identify obstacles in a short time, avoid them reasonably during path planning, and reach the destination. This illustrates the feasibility of the algorithm for path planning in indoor obstacle environments.

Conclusions
Aiming at the problem of indoor path planning in emergency disaster relief environments, this paper uses the Q-learning algorithm to propose an emergency path planning method based on a grid environment. Considering the lack of prior knowledge of indoor disaster environments, the Tkinter toolkit is used to build the grid map environment. The exploration factor in the ε-greedy strategy is adjusted dynamically: before a random action is selected, the Q-learning algorithm is optimized by applying a discount rate to the exploration factor, which improves the convergence speed of the Q-learning algorithm. The SARSA algorithm, the Q-learning algorithm, and the Q-learning optimization algorithm are compared and analyzed in the simulation experiment. The Q-learning algorithm outperforms the SARSA algorithm in path selection and convergence and is more suitable for path planning in the grid environment model. Compared with the SARSA algorithm and the Q-learning algorithm, the Q-learning optimization algorithm proposed in this paper greatly improves solving efficiency and accelerates convergence. Finally, taking a fire in an office building in Beijing and the resulting need for rescue as an example of an emergency, an indoor emergency path planning experiment was analyzed. According to the real indoor environment of the office building, a more complex grid obstacle environment was established as the indoor experimental scene. The experimental results verified the effectiveness of the proposed Q-learning optimization algorithm for path planning in indoor obstacle environments.
With the development of artificial intelligence, path planning algorithms based on reinforcement learning are constantly being updated and optimized. The indoor environment addressed in this paper is static; the complexity and variability of actual indoor environments, among other factors, can be taken into account in subsequent studies. When disasters occur, the disaster area may continue to spread, and path planning in dynamic and complex indoor environments may require further research.

Author Contributions: Shenghua Xu and Yang Gu designed the algorithm, wrote the paper, and performed the indoor emergency path planning experiment based on the Q-learning optimization algorithm. Xiaoyan Li, Cai Chen, and Yingyi Hu participated in the experimental analysis and revised the manuscript. Yu Sang and Wenxing Jiang revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.