Deep Q-Network for Optimal Decision for Top-Coal Caving

In top-coal caving, the control of the hydraulic support's window is a key issue for economic benefit. The window is driven by the electro-hydraulic control system, whose command is produced by the control model and the corresponding algorithm. However, the model of the window's control is hard to establish, so the optimal policy for the window's action cannot be calculated analytically. This paper studies the issue theoretically and, based on a 3D simulation platform, proposes a deep reinforcement learning method to regulate the window's action for top-coal caving. The window control of top-coal caving is formulated as a Markov decision process, for which the deep Q-network (DQN) method of reinforcement learning is employed to regulate the window's action effectively. In the DQN, the reward of each step is set as the control criterion of the window's action, and a four-layer fully connected neural network is used to approximate the optimal Q-value and obtain the optimal action of the window. The 3D simulation experiments validated the effectiveness of the proposed method: the reward of top-coal caving increases, yielding a better economic benefit.


Introduction
Coal is one of the most important energy sources in the world [1]. Even though its consumption has decreased in recent years, coal will remain the dominant primary energy source for the next several decades [2][3][4]. Sustainably improving coal mining technology to alleviate environmental damage is the preferred choice for countries that lack oil but are rich in coal [5,6]. Underground mining of thick coal seams is currently the most economical and environmentally friendly mode [7,8]. In the longwall workspace, top-coal caving is the most effective technology for exploiting coal seams thicker than 4 m [9][10][11]. In China especially, more than 40% of the coal lies in thick seams; hence, top-coal caving is the development direction for the coming years [12].
By this method, the lower part of the coal seam is first cut by the coal cutter; then the remaining top-coal falls under the combined action of gravity and pressure, and the falling coal is captured by pulling back the tail canopy of the hydraulic support. Therefore, the hundreds of hydraulic supports in the workspace are the key devices for security and top-coal capturing, and controlling the tail canopy of the hydraulic supports optimally is the critical issue for top-coal caving [13,14].
The tail-beam, named the "window" for capturing the top-coal, opens to capture the falling coal and closes to keep out the falling rocks as much as possible [15]. In this paper, a new 3D DEM simulation test platform based on the open-source framework YADE [48][49][50] is developed to analyze the dynamic process of top-coal caving. To obtain the optimal decision for the windows intelligently on this simulation platform, this paper, building on our preliminary work [51], introduces more information about the windows' action during top-coal caving as the state of the control system and employs the deep Q-network method of reinforcement learning to approximate the windows' optimal decision. The main contributions of this work are: (1) The optimal control of the window's action of the hydraulic support is transformed into a Markov decision process, and a new method based on the deep Q-network is proposed to regulate the optimal decision of the window's action. In this method, the state of the environment, the loss function of the optimizer, and the reward of each step are defined according to the process of top-coal caving. (2) A 3D discrete element method (DEM) simulation platform based on Yade is created to analyze the process of top-coal caving. On this platform, simulation experiments were carried out, and the results validate a viable way of applying intelligent methods to top-coal caving.
The rest of the article is organized as follows. In Section 2, the 3D simulation platform is introduced. The optimal decision of top-coal caving by deep reinforcement learning is presented in Section 3. In Section 4, the experiment and result analyses are given. The conclusion is shown in Section 5.

Top-Coal Caving 3D Simulation Platform
Most top-coal caving simulations about the optimal decision of the windows' action are based on DEM in two dimensions, as shown in Figure 1. In such a simulation, the boundary between rock and coal can be shown clearly, and the effect of drawing coal can be analyzed directly.

Figure 1. Top-coal caving 2D simulation based on DEM, from our early work developed in Matlab [51]. The particles in blue, red, and yellow are coal, immediate roof, and basic roof, respectively.

In this simulation platform, the controller obtains the states of the system and then calculates the optimal decision of the windows' action by the designed algorithm. The source code is available at https://github.com/YangYi-HPU/Reinforcement-learning-simulation-environment-for-top-coalcaving (for non-commercial uses only).
In most 2D simulations, the window's action is treated as instantaneous, i.e., the window changes its state from closed to open with no transition. Hence, even a regulation mechanism as simple as "close the window if rock emerges" can obtain a good result. However, in practice, it takes time to open and close the window: if rock emerges near the window, rock can fall into the drag conveyor during the closing process. Furthermore, 2D simulation fails to reflect the real scenario. The particles lie in a plane, so the movement of the coal on the shield-beam cannot be depicted, and the real boundary between coal and rock is not a line but a surface. Hence, the method to control the boundary should be extended to adjust the flatness of that surface.
Yade is a well-known open-source DEM framework driven by Python [49,52] on the Linux operating system. It is flexible enough to integrate the control algorithm of the window's action with the complex calculation. In this paper, we focus on the windows' optimal control from a control system perspective; hence, we demonstrate only five windows for the process of top-coal caving. The three scenarios are shown in Figure 2. The five hydraulic supports are simplified to the top-beam, shield-beam, and tail-beam. The tail-beam can rotate around the bottom of the shield-beam at a given speed, and its swaying range is restricted by lower and upper bounds. The parameters of the simulation platform are listed in Table 1, where $w_{sp}$ is the width of the workspace, $w_{hy}$ is the width of a hydraulic support, $h_{hy}$ is the height of a hydraulic support, $l_{sh}$ is the length of the shield-beam, $l_{ta}$ is the length of the tail-beam, $\theta_s$ is the angle between the shield-beam and the top-beam, $\theta_u$ is the upper angle between the tail-beam and the shield-beam, and $\theta_l$ is the lower angle between the tail-beam and the shield-beam. The heights of the rock layer and coal layer can be set as required. The other parameters, such as the height of the space, the boundary of the environment, and the location of each hydraulic support, can be calculated from the geometric relationships.
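As an illustration of how such derived quantities can be computed, the following Python sketch obtains the tail-beam tip position from the table parameters; the simplified planar geometry and all names here are our assumptions, not the platform's code.

```python
import math

def tail_beam_tip(h_hy, l_sh, theta_s, l_ta, theta_t):
    """Planar sketch (assumed geometry): returns (x, z) of the tail-beam tip,
    with x measured backward from the rear hinge of the top-beam and z upward
    from the floor. theta_s is the shield-beam/top-beam angle and theta_t the
    current tail-beam/shield-beam angle, both in degrees."""
    a_s = math.radians(theta_s)
    hx, hz = l_sh * math.cos(a_s), h_hy - l_sh * math.sin(a_s)  # shield-beam bottom
    a_t = a_s - math.radians(theta_t)      # tail-beam direction in the same plane
    return hx + l_ta * math.cos(a_t), hz - l_ta * math.sin(a_t)

def clamp_angle(theta, theta_l, theta_u):
    """Restrict the commanded tail-beam angle to the platform's swing bounds."""
    return max(theta_l, min(theta_u, theta))
```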
It should be noted that the parameters of the simulation are set as close to the real situation as possible; however, some parameters, such as the height of the rock layer, cannot be set to the real values because the huge computational cost of DEM would then require several days per run. This compromise does not hinder the validation of the algorithm.
In this platform, the material properties of rock and coal are listed in Table 2. They are set according to the Tashan coal mine in China.

Markov Process of Top-Coal Caving
The process of the window executing actions is a time series and satisfies the Markov property. The window action decision is considered as a control system whose input is the environment state and whose output is the window action. Hence, the optimal control of the window action is essentially a Markov decision process.
Let the state space of top-coal caving be denoted by $\mathbb{S} = \{s_1, s_2, \ldots, s_n\}$, where $s_i$ is a state of the environment, $i = 1, 2, \ldots, n$, and $n$ is the dimension of the state space; for top-coal caving, $s_i$ is a continuous variable. The window action space is $\mathbb{A} = \{a_1, a_2, \ldots, a_m\}$, where $a_j$ is a discrete action value, $j = 1, 2, \ldots, m$, and $m$ is the dimension of the action space. The Markov decision process is denoted by $M = \{S, A, R, P, \gamma\}$, where $S \in \mathbb{S}$, $A \in \mathbb{A}$, $R$ is the reward of an action under the given state, $P$ is the transition probability distribution from the current state to the next state under the action, and $\gamma \in (0, 1)$ is the discount factor.
The policy $\pi$ is a function of the state $s$. There are two kinds of policy. The first is the deterministic policy, $a = \pi(s)$: if the state is $s$, the action $a$ is chosen as the deterministic value. The second is the probabilistic policy, denoted by $\pi(a|s)$: the probability of executing action $a$ under the state $s$.
According to dynamic programming [33,53], the value function $v_\pi(s)$ at time $t$ with state $s$ under a given policy $\pi$ is defined as

$$v_\pi(s) = \mathbb{E}_\pi\Big[\, \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s \Big]. \qquad (1)$$
The purpose of reinforcement learning is to find an optimal policy that maximizes the value $v_\pi(s)$ in Equation (1). More formally, the action-value function is defined based on $v_\pi(s)$ as the cumulative reward at time $t$ if the action is chosen as $a$:

$$q_\pi(s, a) = \mathbb{E}_\pi\Big[\, \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a \Big]. \qquad (2)$$
By Equations (1) and (2), the optimal policy can be obtained from the optimal action-value function

$$q_*(s, a) = \max_{\pi} q_\pi(s, a). \qquad (3)$$
The deterministic policy that yields the optimal action for state $s$ is

$$\pi_*(s) = \arg\max_{a} q_*(s, a). \qquad (4)$$
Hence, to obtain the optimal action of the window, the action-value function $Q(s, a)$ must be trained. If the state is a discrete variable, $Q(s, a)$ can be stored as a table that depicts the relation between the state space and the action space. The Q-learning algorithm is a mechanism to train $Q(s, a)$; its iteration is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big], \qquad (5)$$
where $s'$ is the next state and $\alpha$ is the learning rate. However, for continuous states, it is hard or even impossible to enumerate all states in a table due to the curse of dimensionality [33]. Fortunately, the deep neural network [36,37,39] provides an effective way to approximate $Q(s, a)$; this is the deep Q-network (DQN) [32,40,41].
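For a discrete state space, Equation (5) can be written directly as a table update. A minimal Python sketch (the function and variable names are ours, for illustration):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update, Equation (5).
    Q is an (n_states, n_actions) array; s, a, s_next are integer indices."""
    td_target = r + gamma * np.max(Q[s_next])   # R + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q
```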

Deep Q-Network for Top-Coal Caving
The framework of the deep Q-network for top-coal caving is based on the work in [54]. Consider the state of the top-coal caving system, where $s_i$, $i = 1, 2, \ldots, n$, are continuous variables, and the actions $a_j$, $j = 1, 2, \ldots, m$, are discrete variables. The DQN employs a deep neural network to approximate $Q(s, a)$ in Equation (2). The Q-network is formalized as $Q(s, a; \theta)$, where $\theta$ is the parameter vector of the neural network to be trained. Hence, there are two important issues for training $\theta$: the samples used to train the neural network and the loss function that indicates the effect of the training.
During top-coal caving, the decision system obtains the state $s$ from the environment at each step, and the decision mechanism issues an action $a$ to the window. The window executes the action, and the decision system then obtains the next state $s'$ and calculates the reward $R$ from the captured coal and rock. Hence, each step produces a quadruple $\{s, a, R, s'\}$ that serves as a sample to train the Q-network. In the DQN, experience storage and replay are the key techniques for handling the samples. For top-coal caving, the quadruples $\{s, a, R, s'\}$ produced by each step are stored in the experience dataset, and then a batch of samples is selected randomly and replayed to train the parameters.
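A minimal sketch of such an experience store in Python (the class and method names are ours; the paper does not specify an implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores quadruples (s, a, R, s') and replays random minibatches."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)   # oldest experiences are evicted first

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation between consecutive steps.
        return random.sample(self.data, batch_size)

    def __len__(self):
        return len(self.data)
```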
The loss function for training the Q-network is based on Equation (5). Because dynamic programming [53] deduces the optimal decision backward from the destination, the term $R(s, a) + \gamma \max_{a'} Q(s', a')$ is the value related to the next state; hence, this term is regarded as the target that the current $Q(s, a)$ should approach. The target network gives

$$y = R(s, a) + \gamma \max_{a'} Q(s', a'; \theta^-), \qquad (6)$$

so the loss function is formalized as

$$L(\theta) = \mathbb{E}\Big[ \big( y - Q(s, a; \theta) \big)^2 \Big]. \qquad (7)$$
Training the parameters means driving the loss function toward zero. In practice, the target network uses the same architecture as the Q-network, and its parameter $\theta^-$ is not trained but is updated periodically from the Q-network parameter $\theta$. The Q-network is trained during the top-coal caving; at the beginning of the process, in order to cover as much of the state and action space as possible, the window action is chosen by the $\epsilon$-greedy algorithm [33], with $\epsilon$ set high to favor exploration. The details of the DQN for top-coal caving are given in Algorithm 1.
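A minimal $\epsilon$-greedy sketch in PyTorch (our choice of framework; the paper does not name one, and the function name is ours):

```python
import random
import torch

def select_action(q_net, s, eps):
    """Epsilon-greedy selection: explore with probability eps, otherwise
    act greedily on the Q-network (two window actions: 0 = close, 1 = open)."""
    if random.random() < eps:
        return random.randrange(2)
    with torch.no_grad():
        q = q_net(torch.as_tensor(s, dtype=torch.float32))
        return int(q.argmax())
```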

Algorithm 1: DQN for top-coal caving.
1:  Initialize N, the capacity of the experience dataset, and n = 0, the number of stored experiences
2:  Initialize M, the number of episodes for training; initialize K, the maximum number of steps per episode
3:  Initialize J, the number of windows
4:  Initialize γ and α, the discount factor and the learning rate
5:  Initialize C, the period after which the target network updates its parameters from the Q-network
6:  Initialize ε, the rate for exploring the relationship between state space and action space
7:  Initialize the parameter θ of the Q-network Q(s, a; θ) with constants
8:  Initialize the parameter θ⁻ of the target network Q(s, a; θ⁻) with θ⁻ = θ
9:  for m ← 1 to M do
10:     Load the initial top-coal caving scenario
11:     for k ← 1 to K do
12:         for j ← 1 to J do
13:             Get state s_k; with probability ε select a random action a_k, otherwise select a_k = arg max_a Q(s, a; θ)
14:             Window j executes a_k
15:             Observe reward R_k and next state s_{k+1}
16:             Store experience {s_k, a_k, R_k, s_{k+1}}; n = n + 1
17:             if n > N then
18:                 Sample a random minibatch of experiences {s_t, a_t, R_t, s_{t+1}} from the experience storage
19:                 Calculate the target according to Equation (6)
20:                 Calculate the loss according to Equation (7)
21:                 Train the Q-network on the loss by stochastic gradient descent [55]
22:             Every C steps, update θ⁻ ← θ
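The inner update of Algorithm 1 (steps 18-22) could look as follows in PyTorch; this is a sketch under our own naming, not the authors' code:

```python
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One SGD step on a minibatch of quadruples, per Equations (6) and (7)."""
    s, a, r, s_next = batch   # tensors: states (B, 2), actions (B,), rewards (B,)
    with torch.no_grad():
        # Equation (6): target computed with the frozen parameters theta_minus.
        y = r + gamma * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    loss = F.mse_loss(q, y)                                    # Equation (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    """Every C steps: theta_minus <- theta (step 22 of Algorithm 1)."""
    target_net.load_state_dict(q_net.state_dict())
```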

DQN Model of Top-Coal Caving
For the 3D simulation platform introduced in Section 2, the state space of each window is $S = \{s_1, s_2\}$, where $s_1$ is the total number of particles in the window's checking area and $s_2$ is the percentage of coal particles in $s_1$. The action space of the window is $A = \{a_1, a_2\}$, where $a_1 = 0$ and $a_2 = 1$ denote the close and open actions, respectively. Hence, we establish the Q-network as a four-layer fully connected neural network whose parameters are given in Table 3. The reward $R$ of each step depends on the captured particles:

$$R = n_c r_c + n_r r_r, \qquad (8)$$
where $n_c$ and $n_r$ are the numbers of captured coal and rock particles, respectively, and $r_c$ and $r_r$ are the rewards for capturing a coal particle and a rock particle, respectively. In this paper, $r_c = 1$ and $r_r = -3$.

Table 3. Architecture of the DQN.

                    Input Layer    Hidden Layer 1    Hidden Layer 2    Output Layer
Number of neurons   2              56                128               2
Initial θ           0.5            0.5               0.5               0.1

To distinguish which falling particle belongs to which window, four clapboards are added to the deposition area, as shown in Figure 3a. The area for checking the state is the stereo region from A to B, as shown in Figure 3b. In the simulation, if a particle's location is lower than the bottom of the tail-beam, it is considered captured by the window. The reward of each window is calculated by Equation (8).
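A sketch of this Q-network and of the reward of Equation (8), again in PyTorch (the activation functions, the handling of the per-layer constant initialization, and the function names are our assumptions):

```python
import torch.nn as nn

def make_q_net():
    """Four-layer fully connected Q-network per Table 3: 2 -> 56 -> 128 -> 2.
    The ReLU activations are our assumption; the paper does not state them."""
    net = nn.Sequential(
        nn.Linear(2, 56), nn.ReLU(),
        nn.Linear(56, 128), nn.ReLU(),
        nn.Linear(128, 2),   # one Q-value per action: close (0), open (1)
    )
    consts = [0.5, 0.5, 0.1]   # loosely following the "Initial theta" row of Table 3
    for layer, c in zip((m for m in net if isinstance(m, nn.Linear)), consts):
        nn.init.constant_(layer.weight, c)
        nn.init.constant_(layer.bias, c)   # bias treatment is also our assumption
    return net

def step_reward(n_c, n_r, r_c=1.0, r_r=-3.0):
    """Equation (8): reward from captured coal (n_c) and rock (n_r) particles."""
    return n_c * r_c + n_r * r_r
```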

Experiment and Result Analysis
Two algorithms were used to control the window: the proposed DQN and the criterion "close the window if rock emerges", denoted by "DQN" and "Cmethod", respectively. In the experiment, the thicknesses of the coal and rock layers were both set to 2 m, because the thickness is not a determining factor for validating the proposed algorithm. The radius of the particles was 0.15 m. The other parameters for training the DQN were batch size = 100, N = 10000, C = 300, α = 0.001, γ = 0.99, and ε = 0.9. For the DQN algorithm, the first step was to fill the experience dataset and then train the Q-network. The experience distribution shown in Figure 4 indicates that the states chosen for the Q-network are closely related to the reward.
Figure 4. Experience distribution: (a) relationship between $s_1$ and $R$; (b) relationship between $s_2$ and $R$. In (a), the absolute value of $R$ grows as $s_1$ increases; in (b), the outline of $R$ is proportional to $s_2$. These indicate that a network can be found to approximate the relationship between the particle number, the coal rate, and the reward.
The training process is shown in Figure 5. The parameters of the Q-network clearly converge to an optimal value; hence, the optimal action-value function $Q(s, a)$ can be approximated by a deep neural network. After the parameters of the Q-network were trained, 10 tests were carried out to validate the effectiveness of the proposed algorithm. A typical final scenario is shown in Figure 6. It is clear that, along the top of the shield-beam, the coal is covered by rock and the boundary is not a smooth line. Hence, when the hydraulic support moves forward in the next step, if the window control criterion were "close the window if rock emerges", the top-coal would be difficult to capture completely. The DQN can avoid this problem: in Figure 6a, the coal on the tail-beam is captured nearly completely. Furthermore, Figure 6b clearly shows the boundary of the rock layer on the tail-beam; the performance is the same as in the 2D simulation. The details of the experimental results are listed in Table 4. From the average value of each index, the captured coal particles and the reward of the DQN are clearly greater than those of the Cmethod, with a high coal rate and low rock rate. This means the economic effect of top-coal caving by the DQN would be better than by the Cmethod.

Conclusions
This paper aims to obtain the optimal decision for the window's action of the hydraulic support in top-coal caving by control theory. A DQN method based on the Markov decision process is proposed according to the properties of top-coal caving. The effectiveness of the method was validated on the created 3D simulation platform. The experimental results show that: (1) The DQN method obtains clearly more coal particles than the classical method at a very small price of an increased rock rate. In the 10 tests, the average numbers of coal particles for the classical method and the DQN were 658.7 and 682.3, respectively, while the rock rate of the DQN rose by only 0.001. (2) The reward of the window's action by the DQN is better than that of the classical method. In the 10 tests, the average reward of the DQN was 633.7, while that of the classical method was 613.1. This means the DQN can produce more benefit than the classical method.
This paper attempts to resolve the key problem of top-coal caving with artificial intelligence methodology, and the simulation results validate one way of applying deep learning to top-coal caving, meaning the optimal decision of the windows can be regulated adaptively and intelligently. Even though the simulation result is better than that of the classical method, several issues should be researched further in future work.
(1) The state of the DQN is selected as the total number of particles and the coal rate. At present, our method is used only in simulation; one obstacle to practical application is that the DQN needs the state data, which are difficult to obtain in practice. Hence, in future work, we will try to use a deep neural network to approximate the needed data from other geological information.
(2) The DQN obtains the optimal Q-value by training, which requires as much experience as possible, while in practice the experience obtained from top-coal caving is not as easy to collect as in simulation. Hence, in future work, the learning mechanism of the DQN will be studied to obtain a lightweight learning framework based on the state space.
At present, top-coal caving is mainly applied in China, so the serviceability of this method is currently limited to Chinese coal mines. If the above issues are resolved, this method could be applied in practice; it could clearly increase the benefit of top-coal caving and decrease the number of workers. Because the optimal decision problem for control equipment in other mineral resources is similar, the proposed method could also be useful for other equipment in coal mining, and even for the optimal decision of equipment in underground metal mines.