Deep Forest Reinforcement Learning for Preventive Strategy Considering Automatic Generation Control in Large-Scale Interconnected Power Systems

To reduce occurrences of emergency situations in large-scale interconnected power systems with large continuous disturbances, a preventive strategy for the automatic generation control (AGC) of power systems is proposed. To mitigate the curse of dimensionality that arises in conventional reinforcement learning algorithms, deep forest is applied to reinforcement learning. Therefore, deep forest reinforcement learning (DFRL) as a preventive strategy for AGC is proposed in this paper. The DFRL method consists of deep forest and multiple subsidiary reinforcement learning. The deep forest component of the DFRL is applied to predict the next systemic state of a power system, including emergency states and normal states. The multiple subsidiary reinforcement learning component, which includes reinforcement learning for emergency states and reinforcement learning for normal states, is applied to learn the features of the power system. The performance of the DFRL algorithm was compared to that of 10 other conventional AGC algorithms on a two-area load frequency control power system, a three-area power system, and the China Southern Power Grid. The DFRL method achieved the highest control performance. With this new method, both the occurrences of emergency situations and the curse of dimensionality can be simultaneously reduced.


Introduction
Over the past few decades, there has been a growing trend of connecting new and renewable resources to large-scale interconnected power systems [1,2]. Automatic generation control (AGC) aims to balance the active power between generators and system loads in such large-scale interconnected power systems [3]. Recently, numerous control algorithms have been proposed for the AGC of large-scale interconnected power systems. For example: an optimized sliding mode controller (SMC) [4] was proposed for the AGC of interconnected multi-area power systems in deregulated environments [5]; a two-layer active disturbance rejection controller (ADRC) was designed for the load frequency control (LFC) of interconnected power systems [6]; a fractional-order proportional-integral-derivative (FOPID) controller with two or three degrees of freedom was employed for AGC [7,8]; and optimized fuzzy logic control (FLC) was utilized for LFC in hydrothermal systems [9]. A modified cuckoo search algorithm [10] and an efficient and new modified differential evolution algorithm [11] were also proposed for hydrothermal power systems. Furthermore, many reinforcement learning algorithms, which can update their control strategies online, have been utilized for AGC, such as relaxed Q learning-based controllers. To reduce occurrences of emergency situations, a preventive strategy for AGC should satisfy two requirements: 1. The strategy should predict the next systemic state of the power system, and it should learn the features of the systemic frequency in the interconnected power system. That is to say, the preventive strategy should know whether the next state of the power system is an emergency state or a normal state, whereas conventional AGC without a preventive strategy cannot determine the next systemic state. 2. The strategy should provide an advanced generation command to the AGC unit based on the prediction of the next systemic state, which includes the emergency state and normal state.
Recently, Zhihua Zhou and Ji Feng proposed an alternative to the deep neural network method for classification; this new method is known as deep forest or multi-grained cascade forest (gcForest) [27]. In [27], deep forest achieved a highly competitive performance in numerous classification experiments, such as image categorization, face recognition, music classification, hand movement recognition, sentiment classification, and the classification of low-dimensional data. The deep forest algorithm has been improved and further applied. For example, a discriminative deep forest was proposed in combination with Euclidean and Manhattan distances [28]; transductive transfer learning was applied to a convex quadratic optimization problem with linear constraints [29]; hyperspectral image classification was integrated into local binary patterns and a Gabor filter to extract local/global image features [30]; a distributed deep forest was applied to the automatic detection of cash-out fraud [31]; a Siamese deep forest was proposed for the prevention of overfitting, which takes place in neural networks when only limited training data are available [32]. Therefore, as an efficient algorithm for classification with low-dimensional data, deep forest can be applied to predict the next systemic state in a large-scale interconnected power system.
To reduce occurrences of emergency situations and simultaneously mitigate the curse of dimensionality, deep forest reinforcement learning (DFRL) applied as a preventive strategy for AGC is proposed in this paper. The DFRL method consists of multiple subsidiary reinforcement learning and a deep forest. The multiple subsidiary reinforcement learning component of DFRL is applied to provide the generation command to the AGC unit of the large-scale interconnected power system, while the deep forest of DFRL is used to predict the next systemic state. Consequently, the major features of the DFRL method can be summarized as follows: 1. Since reinforcement learning is applied to DFRL, DFRL can update its control strategy online. 2. The systemic states of a power system, including emergency states and normal states, can be predicted by the deep forest of DFRL using low-dimensional data. 3. Since both the Q-value matrix and the action set of reinforcement learning are split into those of an emergency situation and a normal situation, calculation memory is reduced. Thus, the curse of dimensionality is mitigated.
This paper is organized as follows. The emergency state and automatic generation control are discussed in Section 2. Section 3 describes the basic principle of deep forest reinforcement learning. Simulation results obtained by the DFRL method for a two-area LFC power system, a three-area power system, and the China Southern Power Grid are presented in Section 4. Finally, Section 5 provides brief conclusions.

Emergency State
AGC not only aims to balance the active power between generators and system loads in a large-scale interconnected power system but is also designed to reduce the frequency deviation of the power system. That is to say, a high control performance of the AGC controller is critically important for a control area to maintain its frequency deviation within normal limits.
Generally, frequency control has three gradations, i.e., a primary control zone, a secondary control zone, and an emergency control zone (see Figure 1). The primary control zone, which is subject to primary frequency regulation (PFR), is automatically regulated by the generator in response to frequency changes. The secondary control zone, which corresponds to AGC or secondary frequency regulation (SFR), is regulated by a control algorithm. The emergency control zone or tertiary control zone results in the implementation of economic dispatch and unit commitment, which occur on longer timescales, e.g., 15 min for economic dispatch (ED) and 24 h for unit commitment (UC). In particular, the range of the area control error (ACE) for each gradation depends on the basic capacity of the power system. Each gradation has a different frequency deviation range and a different control performance standard (CPS) index range:

• The range of the frequency deviation for the dead zone is from 0 to ∆ f 1, where the frequency deviation ∆ f 1 is set to 0.025 Hz; the range of the CPS index for the dead zone is from k CPS(1) to 100%, where the CPS index k CPS(1) is set to 99% in this paper.
• The range of the frequency deviation for primary control is from ∆ f 1 to ∆ f 2, where the frequency deviation ∆ f 2 is set to 0.1 Hz; the range of the CPS index for primary control is from k CPS(2) to k CPS(1), where the CPS index k CPS(2) is set to 95% in this paper.
• The range of the frequency deviation for secondary control is from ∆ f 2 to ∆ f 3, where the frequency deviation ∆ f 3 is set to 0.5 Hz; the range of the CPS index for secondary control is from k CPS(3) to k CPS(2), where the CPS index k CPS(3) is set to 85% in this paper.
• The range of the frequency deviation for emergency control is from ∆ f 3 to ∆ f 4, where the frequency deviation ∆ f 4 is set to ∞ Hz; the range of the CPS index for emergency control is from k CPS(4) to k CPS(3), where the CPS index k CPS(4) is set to 0% in this paper.
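The zone boundaries above can be expressed as a small classifier. The following Python sketch maps a frequency deviation to its gradation; the function name and the convention that each boundary value belongs to the lower zone are our assumptions.

```python
def control_zone(delta_f, f1=0.025, f2=0.1, f3=0.5):
    """Map |delta_f| (Hz) to its frequency-control gradation,
    using the thresholds delta_f1..delta_f3 given in the text."""
    df = abs(delta_f)
    if df <= f1:
        return "dead zone"
    elif df <= f2:
        return "primary"
    elif df <= f3:
        return "secondary"
    return "emergency"
```

For instance, a deviation of 0.3 Hz falls into the secondary control zone, while 0.8 Hz is an emergency.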
To minimize emergency situations for a power system when a large load suddenly occurs, a preventive strategy for AGC is described below. The frequency deviation range from ∆ f 2 to ∆ f 3 was selected as the range for the preventive strategy. The preventive strategy aims to maintain the system frequency deviation ∆ f at a minimum value, i.e., ∆ f → 0 and |∆ f | < ∆ f e. The frequency deviation for the preventive strategy ∆ f e was set to 0.2 Hz in this study.

Framework of Automatic Generation Control
A basic AGC/LFC model contains two control areas. Each control area contains an AGC controller, a governor, a non-reheat turbine generator, and a system power flow load ∆P LA or ∆P LB [33]. Three major features required for the AGC controller in an interconnected power system can be summarized as follows.
1. The controller should provide generation commands to the AGC unit to balance the real-time active power flow between the generator and system loads; 2. The controller should reduce the frequency deviation in the control area; 3. The controller should decrease the scheduled tie-line power deviation between any two areas, i.e., mitigate the value of the ACE.
Therefore, the frequency deviation ∆ f , the ACE e ACE , the scheduled tie-line power deviation ∆P T , CPS indices (including CPS index k CPS , CPS1 index k CPS1 , and CPS2 index k CPS2 ) are the inputs to the AGC controller. The generation command provided to the AGC unit is then the output of the controller.

Control Objective of Automatic Generation Control
In each control area, the AGC controller aims to (i) minimize the systemic frequency deviation ∆ f , (ii) reduce the value of the ACE e ACE , and (iii) maximize the CPS index k CPS .
The value of the ACE e ACE can be calculated as

e ACE = ∆P T − 10B∆ f, (1)

where ∆P T is the scheduled tie-line power deviation; B is the frequency response coefficient of the control area (in MW/0.1 Hz); and ∆ f is the frequency deviation (in Hz). The CPS index k CPS, which includes the CPS1 index and the CPS2 index, is established by the North American Electric Reliability Council (NERC) [12,34,35]. The CPS index is a statistical index of ∆ f and e ACE over a long period of time, rather than of their real-time values. The CPS index k CPS can be calculated as

k CPS = N αCF=1 / (N αCF=1 + N αCF=0) × 100%, (2)

where N αCF=1 is the number of periods in which α CF = 1 and N αCF=0 is the number of periods in which α CF = 0. The variable α CF can be calculated as

α CF = 1, if α CF1 ≤ 1 and α CF2 < 1; α CF = 0, otherwise, (3)

where α CF1 and α CF2 can be calculated as follows:

α CF1 = (E AVE-1min · ∆F AVE-1min) / (−10B · ε 1min²), (4)

α CF2 = |E AVE-10min| / (1.65 ε 10min √((−10B)(−10B S))), (5)

where E AVE-1min is the clock-1-min average of the ACE; E AVE-10min is the clock-10-min average of the ACE; ∆F AVE-1min is the clock-1-min average of the frequency deviation; ε 1min is the targeted frequency bound for the CPS1 index with clock-1-min; ε 10min is the targeted frequency bound for the CPS1 index with clock-10-min; B represents the frequency bias of the control area, expressed in MW/0.1 Hz; and B S represents the frequency bias of the power system, expressed in MW/0.1 Hz. Furthermore, the CPS1 index evaluates the impact of the ACE deviation on the system frequency, while the CPS2 index restricts the ACE magnitude. They can be calculated as follows:

k CPS1 = (2 − {α CF1} T) × 100%, (6)

k CPS2 = N αCF2<1 / (N αCF2<1 + N αCF2≥1) × 100%, (7)

where {α CF1} T is the average value of the variable α CF1 over the period T, which is typically 1 year; N αCF2<1 is the number of periods in which α CF2 < 1; and N αCF2≥1 is the number of periods in which α CF2 ≥ 1.
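The CPS statistics can be sketched in Python. This is a hedged illustration assuming the standard NERC forms of the compliance factors; the function names and the L10 = 1.65 ε 10min √((−10B)(−10B S)) bound follow common practice and are not taken verbatim from this paper.

```python
import math

def cps1_index(ace_1min, df_1min, B, eps_1min):
    """CPS1 index (%) from paired clock-1-min averages of the ACE (MW)
    and the frequency deviation (Hz); B is the (negative) frequency
    bias in MW/0.1 Hz."""
    n = len(ace_1min)
    cf = sum(e * f / (-10 * B * eps_1min ** 2)
             for e, f in zip(ace_1min, df_1min)) / n
    return (2 - cf) * 100.0

def cps2_index(ace_10min, B, Bs, eps_10min):
    """CPS2 index (%): share of clock-10-min ACE averages that stay
    inside the L10 bound."""
    L10 = 1.65 * eps_10min * math.sqrt((-10 * B) * (-10 * Bs))
    ok = sum(1 for e in ace_10min if abs(e) <= L10)
    return 100.0 * ok / len(ace_10min)
```

With a zero ACE record, the CPS1 index evaluates to its maximum of 200%, matching Equation (6) with {α CF1} T = 0.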

Deep Forest
As a decision tree ensemble algorithm, deep forest can perform representation learning [27]. Actually, deep forest contains two procedures, i.e., cascade forest structure and multi-grained scanning ( Figure 2).
In the cascade forest structure procedure, deep forest cascades decision tree forests level-by-level. The input of each level is the feature information processed by the preceding level. The cascade forest structure of deep forest includes two types of forests, i.e., two completely-random tree forests ('Forest A') and two random forests ('Forest B') [27]. Both types of forests contain 500 trees. The completely-random trees, which randomly select a feature to split at each node of the tree, stop growing when each leaf node contains only instances of the same class. A random forest is an ensemble model containing a group of decision tree classifiers {h(X, Θ k)}, k = 1, 2, ..., n tree, where n tree is the number of trees; Θ k is a random vector; and the vectors Θ k are independent and identically distributed. The class that obtains the maximum vote is then used for prediction:

H(x) = arg max Y Σ i=1..n tree I(h i(x) = Y), (8)

where x is a sample; h i(x) is the classification result of the ith decision tree; Y denotes the target classes; and I(·) is the indicator function.
The random trees randomly select ⌊√d⌋ features as candidates and then choose the one with the best Gini coefficient for splitting, where d is the number of input features and ⌊·⌋ rounds down to the nearest integer. The Gini coefficient G can be calculated as

G = 1 − Σ i=1..n c p(i|t)², (9)

where p(i|t) is the conditional probability of the ith class at node t and n c is the number of classes. All the samples are of the same class when the Gini coefficient G = 0. When the set C is split into subset C 1 and subset C 2, the Gini coefficient of the split G split(C) can be calculated as

G split(C) = (n 1/n) G(C 1) + (n 2/n) G(C 2), (10)

where n 1 and n 2 are the numbers of samples of subset C 1 and subset C 2, respectively, and n = n 1 + n 2. To obtain the average of the final class vector, each instance is trained k − 1 times under k-fold cross-validation. The training performance of the whole cascade forest is estimated on the validation set when a new level is expanded. The total number of levels N is determined when the performance stops increasing significantly.
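As an illustration of the splitting criterion, here is a minimal Python sketch of the Gini impurity, the weighted Gini of a binary split, and the ⌊√d⌋ candidate-feature rule; the function names are ours.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity G = 1 - sum_i p_i^2 of a multiset of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini of splitting set C into subsets C1 and C2."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return n1 / n * gini(left) + n2 / n * gini(right)

def n_candidate_features(d):
    """Random forests draw floor(sqrt(d)) candidate features per node."""
    return math.floor(math.sqrt(d))
```

A pure split (each child holds a single class) yields G split = 0, consistent with the remark that G = 0 when all samples share one class; for the d = 20 raw features used later, ⌊√20⌋ = 4 candidates are drawn per node.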
In the multi-grained scanning procedure: (I) sliding windows are applied to scan the raw features of the input information; (II) if a window size of n f features is selected, an n f-dimensional feature vector is generated by the scanning window at each position (Figure 3); (III) (d − n f + 1) feature vectors are produced as transformed features for 'Forest A' and 'Forest B'; (IV) if the number of classes is n c, a 2n c(d − n f + 1)-dim transformed feature vector is produced as the input of the cascade forest structure. For example: (I) sliding windows of sizes n f(1), n f(2), ..., n f(n w) are applied to scan the d-dim raw features (Figure 2) if the number of classes is n c and the number of sliding windows is n w; (II) n w × 2n c(d − n f + 1)-dim transformed feature vectors are produced by the multi-grained scanning procedure for the cascade forest; (III) the deep forest cascade has N · n c levels of forests, where each n c level has 4n c + 2n c(d − n f(1) + 1), 4n c + 2n c(d − n f(2) + 1), ..., 4n c + 2n c(d − n f(n w) + 1) dimensional features, respectively, where the 4n c-dim features are 4 (2 'Forest A' + 2 'Forest B') times the n c-dim class vector and N is the number of each n w level; (IV) the deep forest terminates training when the model complexity is adequate.
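The window arithmetic of multi-grained scanning can be checked with a short sketch. Using the sizes that appear later in the case study (d = 20 raw features, a window of n f = 5, n c = 3 classes), one window produces 16 sub-vectors and a 2 · 3 · 16 = 96-dim transformed vector; the helper names are ours.

```python
def window_vectors(x, n_f):
    """Slide a size-n_f window over the raw feature vector x,
    yielding (len(x) - n_f + 1) sub-vectors."""
    return [x[i:i + n_f] for i in range(len(x) - n_f + 1)]

def transformed_dim(d, n_f, n_c):
    """Dimension of the concatenated class vectors fed to the cascade:
    each of the (d - n_f + 1) windows yields one n_c-dim class vector
    per forest type, and two forest types are used -> 2*n_c*(d-n_f+1)."""
    return 2 * n_c * (d - n_f + 1)
```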
Compared to the deep neural networks in [27], deep forest can obtain a higher performance with low-dimensional data, while deep neural networks perform well for high-dimensional data. In [27], convolutional neural networks, multilayer perceptrons, random forests, support vector machines, and k-nearest neighbors algorithms were compared to deep forest using low-dimensional data. These comparisons showed that deep learning algorithms have many hyperparameters and perform better on high-dimensional data, whereas deep forest performs better on low-dimensional data. Meanwhile, the prediction of the next systemic state in the AGC problem is a low-dimensional problem.

Reinforcement Learning
As one of the most famous methods for reinforcement learning, Q learning is a model-free control algorithm. The framework of Q learning contains a controller and an environment. The Q learning-based controller can update its strategy for an environment online. The inputs of the controller are the state value and reward value, while the output is an action for the environment. A controller that is based on Q learning provides an action a at the current state s in an environment based on the Q-value matrix Q and probability distribution matrix P. These two matrices can be subsequently updated as

Q(s, a) ← Q(s, a) + α[R(s, s′, a) + γ max a′ Q(s′, a′) − Q(s, a)], (11)

P(s, a i) ← P(s, a i) + β(1 − P(s, a i)), if a i = arg max a Q(s, a); P(s, a i) ← P(s, a i)(1 − β), otherwise, (12)

where α is the learning rate of Q learning; γ is the discount coefficient of Q learning; β is the probability coefficient of Q learning; s and s′ are the current state and the next state of the environment, respectively; R(s, s′, a) is the reward function obtained from the current state s to the next state s′ with a selected action a. Both the current state s and the next state s′ belong to the state set S, i.e., s ∈ S and s′ ∈ S. The selected action a belongs to the action set A, i.e., a ∈ A.
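A minimal tabular sketch of the two updates follows. The Q-value step is the standard Q-learning rule; the probability update shown is a common pursuit-style scheme and is our assumption of how matrix P is adjusted with the coefficient β.

```python
def q_update(Q, s, a, s_next, reward, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td = reward + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * td
    return Q

def p_update(P, s, a_greedy, beta=0.05):
    """Pursuit-style probability update (our assumption): push
    probability mass toward the current greedy action; the row sum
    stays 1 if it was 1 beforehand."""
    for a in range(len(P[s])):
        if a == a_greedy:
            P[s][a] += beta * (1.0 - P[s][a])
        else:
            P[s][a] *= (1.0 - beta)
    return P
```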
A particular process in the AGC problem of an interconnected power system is as follows: the frequency deviation ∆ f and the ACE are set as states; the generation command ∆P is set as an action; and the parameters of Q learning for the simulation described in this paper are given in Appendix A.
The action for the output is selected by the given strategy, which should balance the exploration and the exploitation of the search of the action space. Generally, a greedy strategy for action selection always selects the maximum probability from the probability distribution matrix. Thus, exploration is lost using the greedy strategy. Therefore, a selection strategy with both exploration and exploitation is applied to select an action from the action set. The action for the next iteration a′ is selected by a random probability p rand and the probability distribution matrix. The constraint of action selection can be described as follows:

Σ i=1..arg(a′)−1 P(s′, a i) < p rand ≤ Σ i=1..arg(a′) P(s′, a i), (13)

where arg(a′) is the index number of the selected action a′ in the action set.
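This cumulative-probability constraint amounts to roulette-wheel selection; a short sketch (the helper name is ours):

```python
import random

def select_action(P_s, p_rand=None):
    """Return the index of the first action whose cumulative
    probability reaches a uniform random draw p_rand
    (roulette-wheel selection over the row P(s', .))."""
    if p_rand is None:
        p_rand = random.random()
    cum = 0.0
    for idx, p in enumerate(P_s):
        cum += p
        if p_rand <= cum:
            return idx
    return len(P_s) - 1  # guard against floating-point rounding
```

Actions with larger probabilities are selected more often, yet every action with nonzero probability can still be explored.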
Since the states and actions of Q learning are discrete values, the number of states in the state set and the number of actions in the action set should be increased to improve the accuracy of Q learning. Thus, to arrive at the optimal policy, the calculation memory of the state set and the action set of Q learning should be increased. In particular, Q learning can be applied to discrete control processes, such as AGC, whose control period is 4 s. Furthermore, to obtain a higher convergence speed, other reinforcement learning algorithms have been employed for the AGC controller, such as Q(λ) learning [14] and R(λ) learning [15]. Since controllers based on these reinforcement learning algorithms can update the control strategy online without knowing the model of the control object, they can obtain a high control performance in time-delay dynamic systems, such as large-scale interconnected power systems.

Deep Forest Reinforcement Learning
To obtain a more accurate control performance, the number of states and the number of actions in the action set of a conventional reinforcement learning algorithm should be increased. However, to reduce the effect of the curse of dimensionality, which leads to calculation memory errors, the number of states and the number of actions in the action set of reinforcement learning should be reduced. To obtain a more accurate control performance and simultaneously mitigate the curse of dimensionality, a DFRL algorithm is proposed in this paper.
The DFRL-based controller contains a recorder for diachronic states and actions, a deep forest, and Q learning frameworks (Figure 4). To reduce the calculation memory, the Q-value matrix Q and the probability distribution matrix P of Q learning are split into n s Q-value matrices and n s probability distribution matrices, respectively. Thus, if both matrix Q and matrix P are n q × n q matrices and each is symmetrically split into n s submatrices of size (n q/n s) × (n q/n s), the calculation memory for matrix Q and matrix P is reduced to 1/n s of the original. Therefore, the curse of dimensionality of the framework of Q learning is reduced to 1/n s. For instance, n s is set to 3 in the simulations reported in this paper; thus, the curse of dimensionality of the framework of Q learning is reduced to 1/3.
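The 1/n s memory reduction follows from replacing one n q × n q matrix with n s diagonal blocks of size (n q/n s) × (n q/n s); a two-line check (assuming, for simplicity, that n s divides n q):

```python
def split_memory_ratio(n_q, n_s):
    """Memory of n_s submatrices of size (n_q/n_s) x (n_q/n_s),
    relative to one full n_q x n_q matrix: n_s*(n_q/n_s)^2 / n_q^2
    = 1/n_s."""
    full = n_q * n_q
    split = n_s * (n_q // n_s) ** 2
    return split / full
```

For the 33 actions and n s = 3 used later in the case study, the ratio is exactly 1/3.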
In the framework of conventional Q learning, the immediate state S t, which is the state at the t-th time, is applied to represent a power system. However, in the framework of the DFRL, both the diachronic states and the diachronic actions are utilized to represent a power system. The major reasons are that (i) the power system is a time-delay system and (ii) diachronic actions can affect the state of the power system. Therefore, deep forest is applied to represent the next state of a power system with diachronic states and diachronic actions. Consequently, the inputs to the deep forest are the diachronic states of the environment and the diachronic actions of the DFRL-based controller, while the output of the deep forest is the next state of the power system.
The major features of the proposed DFRL-based controller can be summarized as follows: since the calculation memory is reduced by splitting the matrices Q and P and the action set A, the curse of dimensionality is mitigated; thus, the number of actions in the action set and the number of states in the state set of the reinforcement learning of DFRL can be increased to obtain a more accurate control action.

Deep Forest Reinforcement Learning as a Preventive Strategy for Automatic Generation Control
In the framework of the DFRL-based controller as a preventive strategy for AGC, the Q-value matrix is split into three submatrices, i.e., n s = 3. The number of classes of next states is three, depending on the frequency deviation ∆ f, i.e., RL-I for (−∞, −0.1] Hz, RL-II for (−0.1, 0.1] Hz, and RL-III for (0.1, ∞) Hz. The control period of the controller of the preventive strategy for AGC is set to 4 s, i.e., ∆t = 4 s. The current state of the environment S t is defined as the frequency deviation of the power system ∆ f t, i.e., S t = ∆ f t. The action at the t-th time of the power system a t is defined as the generation command for the AGC unit at the t-th time ∆P Gt, i.e., a t = ∆P Gt (Figure 5). The reward value r t for the reward function can be described as

r t = λ 1, if |∆ f t| ≤ ∆ f s0; r t = λ 2, otherwise, (15)

where the reward value r t is the positive value λ 1 when |∆ f t| is less than or equal to ∆ f s0; λ 1 > 0, λ 2 < 0, and ∆ f s0 < 0.1 Hz. The variables λ 1, λ 2, and ∆ f s0 in this work were set to 10, −100, and 0.005 Hz, respectively. The pseudo-code of the proposed DFRL for the preventive strategy for AGC is given in Algorithm 1.
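A sketch of the reward described above; treating the penalty branch as the constant λ 2 (rather than, e.g., a deviation-proportional penalty) is our assumption.

```python
def reward(delta_f, lam1=10.0, lam2=-100.0, df_s0=0.005):
    """Reward of the preventive strategy as described in the text:
    a positive lam1 inside the |delta_f| <= delta_f_s0 band,
    otherwise the (assumed constant) penalty lam2."""
    return lam1 if abs(delta_f) <= df_s0 else lam2
```

With the values used in this work, a deviation of 0.004 Hz earns +10, while 0.05 Hz incurs −100, steering the learned policy toward small frequency deviations.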

Algorithm 1
Pseudo-code of the proposed deep forest reinforcement learning for the preventive strategy for automatic generation control
1: Initialize the parameters of the Q learning of the DFRL, i.e., α, γ, β
2: Initialize the system state s
3: Initialize the Q-value matrix Q and the probability distribution matrix P
4: while loop do
5:   Obtain the system state s from the environment as in Equations (1), (6), and (7)
6:   Save the system state s to the recorder of the DFRL
7:   Calculate the reward value R(s, s′, a) as in Equation (15)
8:   Select the subsidiary Q learning from the predicted next systemic state as in Equations (8)-(10)
9:   Update the Q-value matrix and the probability distribution matrix as in Equations (11) and (12), respectively
10:  Select the output action by P and a random probability as in Equation (13)
11:  Give the selected output action to the AGC unit and the recorder of the DFRL
12: end while

Figure 5. Flowchart of the deep forest reinforcement learning method for automatic generation control (AGC).

Pre-Training Process of Deep Forest Reinforcement Learning
DFRL and reinforcement learning are data-hungry algorithms. Generally, in the pre-training process of a reinforcement learning algorithm, the higher the control performance of the training data, the higher the convergence speed obtained. However, DFRL needs all the training data, including training data with both high and low control performance. The training data of DFRL originated from the simulation data of reinforcement learning in this work. In the training process of conventional reinforcement learning, training data with a low control performance are ignored (Figure 6a). However, in the training process of DFRL, training data with both high and low control performance are applied to the deep forest for learning the dynamic system (Figure 6b). Furthermore, the more training data available, the higher the control performance obtained by DFRL. Consequently, the major reasons that training data with both high and low control performance can be applied to DFRL can be summarized as follows: (i) more data, including data with high and low control performance, cover more of the state space of the power system; (ii) more data yield a more accurate representation in DFRL.

Case Study
The MATLAB/Simulink models and programs of the power systems in this study were developed on a computer with a 2.4 GHz Intel Core 8 Duo processor and 8 GB of RAM, running MATLAB version 9.1.0 (R2016b).
The numbers of forests in the multi-grained scanning and in the cascade of DFRL were set to 2 and 8, respectively, as default settings. The number of trees in each forest of DFRL was set to 500 by default. A larger n t means that more diachronic actions and states are recorded for the deep forest of DFRL, at the cost of more calculation time. After extensive training, the numbers of diachronic actions and states in this paper were both set to 10, i.e., n t = 10. The average calculation time of each iteration of DFRL is 0.423 s when n t is set to 10; the prediction results with n t = 10 are the same as those with n t = 30, whereas the average calculation time of each iteration of DFRL is 1.862 s when n t = 30. The dimension of the raw features was double the value of n t, i.e., d = 2n t = 20. The sliding window sizes were set to {⌊d/8⌋, ⌊d/4⌋, ⌊d/2⌋}, i.e., {2, 5, 10}. The maximum depth of each tree was 100. Since the number of classes of the next state n s was set to three, three subsidiary reinforcement learning algorithms were employed for DFRL. The learning rate α of these subsidiary reinforcement learning algorithms was set to 0.1 in this study; its value range should be α ∈ (0, 1). A small learning rate means a slow learning speed and is suitable for online application, while a large learning rate means a high learning speed and is suitable for offline training. Two different learning rates were configured for Q learning in [14], and a dynamic learning rate strategy was proposed by Junhong Nie and Simon Haykin [36]. The discount rate of reward γ of these subsidiary reinforcement learning algorithms was set to 0.9 in this simulation; its value range was set to γ ∈ (0, 1]. A larger discount rate gives greater importance to the Q-value history.
The constant of the probability distribution β of these subsidiary reinforcement learning algorithms was set to 0.05. A large value of β means a high speed for updating the selection probability. The total states of these reinforcement learning algorithms (i.e., RL-I, RL-II, and RL-III) of DFRL cover the range from −∞ to ∞ (Table 1). The state ranges of these reinforcement learning algorithms are divided for AGC, which is a special discrete control system (Table 1). The optimal control for an ideal discrete control system based on AGC is one in which ∆P i = −1 × e ACE. The number of actions in the action set of each subsidiary reinforcement learning algorithm of DFRL was set to 11 (Table 1), following [3]; thus, the total number of actions in these three subsidiary reinforcement learning algorithms of DFRL was 33, or 11 × 3. Meanwhile, the number of actions in the action set of conventional reinforcement learning was set to 33. Table 1. State ranges and action sets of RL-I, RL-II, and RL-III.

Numerous conventional AGC algorithms are compared with the proposed algorithm for the preventive strategy for AGC in this paper, i.e., proportional-integral (PI), proportional-integral-derivative (PID), sliding mode controller (SMC), active disturbance rejection control (ADRC), fractional-order PID (FOPID), fuzzy logic control (FLC), artificial neural network (ANN), Q learning, Q(λ) learning, and R(λ) learning. Both the DFRL and the conventional AGC algorithms were simulated on three power systems, i.e., an IEEE two-area power system, a three-area power system, and the China Southern Power Grid (Figure 7). The parameters of these conventional AGC algorithms are given in Appendix A. These parameters were obtained by a genetic algorithm with a simple configuration, i.e., a population size of 100 and a maximum of 100 generations.
Both conventional AGC algorithms and the proposed DFRL-based controller were applied to 'Area A' in these three power systems. The China Southern Power Grid contains four areas of China, i.e., 'Area A' for the Guangdong Power Grid, 'Area B' for the Guangxi Power Grid, 'Area C' for the Guizhou Power Grid, and 'Area D' for the Yunnan Power Grid [37]. In these three power systems: each control area contains a governor 1/(1 + sT g ), a generator 1/(1 + sT t ), and a frequency response model K p /(1 + sT p ); the frequency response coefficient and the droop coefficient are B i and R i , respectively. Thus, the mathematical model of each generation unit can be described as follows.
G i(s) = 1 / [(1 + sT g)(1 + sT t)],

where T g and T t are the time constants of the governor and generator of each control area, respectively. Also, the generation rate constraint k i GRC and the adjustable capacity constraint P i max are included in the real China Southern Power Grid model. In these three power systems, the ith area is connected to the jth area with the alternating current tie-line response model 2πT ij/s. The parameters of these three power systems are given in Appendix B, Appendix C, and Appendix D, respectively. Two large continuous disturbances were designed for these power systems, i.e., 'Case 1' and 'Case 2' (Figure 8). These two large continuous disturbances may lead to the occurrence of emergency situations: (i) a large disturbance continuously covering all active power values of the system load space was designed for 'Case 1'; (ii) since the power systems are delay systems, a large system load with delay was designed for 'Case 2'.
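Each first-order block 1/(1 + sT) in the generation unit can be discretized for simulation. The sketch below uses explicit Euler steps; the time constants and step size are illustrative assumptions, not the appendix values.

```python
def first_order_step(y, u, T, dt):
    """One explicit-Euler step of the first-order lag 1/(1 + sT),
    i.e., T*dy/dt + y = u."""
    return y + dt / T * (u - y)

def generation_unit_step(xg, xt, u, Tg=0.08, Tt=0.3, dt=0.001):
    """Cascade of governor 1/(1+sTg) and turbine 1/(1+sTt):
    the governor output feeds the turbine (constants hypothetical)."""
    xg = first_order_step(xg, u, Tg, dt)
    xt = first_order_step(xt, xg, Tt, dt)
    return xg, xt
```

Driving the cascade with a unit step command, the turbine output settles to the commanded value, as expected for two unity-gain lags in series.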
The control periods of the algorithms in all cases were set to 4 s, i.e., the algorithms were executed once every 4 s. The number of iterations of the DFRL training process was set to 300 (Figure 9). The simulation results show that the proposed DFRL method obtains the best control performance with the smallest frequency deviation. The major reasons for this superior control performance are that: (i) DFRL can predict the next state of a power system, such as the next states of the three-area power system in 'Case 1' (Figure 10); (ii) without such a prediction, the ∆f obtained by conventional Q learning may be larger than 0.03 Hz (Figure 10). The calculation memory of Q learning is 17,688 bytes, while that of DFRL is 6072 bytes; thus, in these simulations, DFRL requires only 34.328% of the calculation memory of Q learning. The training time for the deep forest of the DFRL algorithm is 35.14 min, whereas the training times of the ANN, Q learning, Q(λ) learning, and R(λ) learning algorithms are each less than 10 min. The statistical simulation results obtained by the DFRL and conventional AGC algorithms are shown in Table 2 and Figure 11. Note that, in Table 2 and Figure 11, the frequency deviation ∆f and the ACE e_ACE are the average absolute values over all areas in all cases of the two-area power system, the three-area power system, and the China Southern Power Grid, respectively. These statistical results (Table 2 and Figure 11) show that:
1. The average absolute values of the frequency deviations obtained by DFRL are less than ∆f_2 (i.e., 0.1 Hz), while those obtained by the 10 conventional AGC algorithms may be larger than 0.1 Hz in both Case 1 and Case 2 (Table 2);
2. Compared to the 10 conventional AGC algorithms, DFRL obtains the highest control performance, with a smaller absolute frequency deviation ∆f and a larger CPS index k_CPS (Figure 11);
3. Since the deep forest of DFRL performs representation learning on the states of the power system, the preventive strategy for AGC can be considered effective for a large-scale interconnected power system.
Therefore, both the occurrences of emergency situations in a large-scale interconnected power system and the curse of dimensionality can be reduced simultaneously by the proposed DFRL.
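The statistics behind Table 2 can be reproduced with a short helper. This is a sketch under assumed array layouts (rows = samples, columns = areas); the 0.1 Hz threshold ∆f_2 and the memory figures are taken from the text.

```python
import numpy as np

def agc_statistics(delta_f, ace, f_threshold=0.1):
    """Average absolute frequency deviation |Δf| (Hz) and |ACE| (MW) over
    all areas and samples, compared against the limit Δf_2 = 0.1 Hz."""
    mean_df = float(np.mean(np.abs(delta_f)))
    mean_ace = float(np.mean(np.abs(ace)))
    return mean_df, mean_ace, mean_df < f_threshold

# memory figures reported in the text: 6072 bytes (DFRL) vs 17,688 bytes
# (Q learning); the ratio shows DFRL needs about 34.3% of the memory
memory_ratio = 6072 / 17688
```

Note that 6072/17688 ≈ 0.343, i.e., DFRL's calculation memory is about one third of Q learning's.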

Conclusions
To reduce occurrences of emergency situations in power systems and mitigate the curse of dimensionality of reinforcement learning, a DFRL algorithm is proposed as a preventive strategy for AGC in large-scale interconnected power systems. Both the state set and the action set of reinforcement learning are split among multiple subsidiary reinforcement learning components to mitigate the curse of dimensionality. A deep forest is then introduced into the subsidiary reinforcement learning to forecast the next state of the power system. Two disturbance cases on three power systems (i.e., a two-area power system, a three-area power system, and the China Southern Power Grid) were simulated with the DFRL and 10 conventional AGC algorithms. The simulation results show that DFRL achieves the highest control performance. The major contributions of the DFRL algorithm can be summarized as follows:
1. After pre-training on the data of reinforcement learning, the deep forest of DFRL can effectively forecast the next state of a power system. Different from conventional applications of deep forest, the deep forest of DFRL is incorporated into the control algorithm;
2. Since the subsidiary reinforcement learning algorithms of DFRL can update their strategies online, DFRL can effectively provide generation commands to the controller as a preventive strategy for AGC in power systems. Compared to conventional AGC, this preventive strategy can predict the next systemic state of the power system;
3. Since the next systemic state can be predicted and the calculation memory is reduced by the DFRL method, the proposed DFRL-based controller can effectively reduce occurrences of emergency situations in large-scale interconnected power systems while simultaneously mitigating the curse of dimensionality. More generally, the conventional framework of reinforcement learning can be divided into multiple subsidiary structures to mitigate the curse of dimensionality.
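The overall control loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and method names are hypothetical, the state/action discretisation is assumed, and a simple |∆f| threshold stands in for the trained deep forest classifier that predicts the next systemic state.

```python
import numpy as np

class DFRLControllerSketch:
    """Sketch of DFRL: a state-class predictor (stand-in for the deep
    forest) selects which subsidiary Q-learning agent issues the
    generation command, so each agent only covers part of the
    state/action space."""

    def __init__(self, n_states=10, n_actions=5, alpha=0.1, gamma=0.9):
        # one subsidiary Q-table per predicted systemic state class
        self.q = {"normal": np.zeros((n_states, n_actions)),
                  "emergency": np.zeros((n_states, n_actions))}
        self.alpha, self.gamma = alpha, gamma

    def predict_class(self, features):
        # stand-in for the deep forest: threshold on |Δf| (Hz)
        return "emergency" if abs(features[0]) > 0.1 else "normal"

    def act(self, state_idx, features, epsilon=0.1):
        # epsilon-greedy action from the matching subsidiary agent
        q = self.q[self.predict_class(features)]
        if np.random.random() < epsilon:
            return int(np.random.randint(q.shape[1]))
        return int(np.argmax(q[state_idx]))

    def update(self, cls, s, a, reward, s_next):
        # standard online Q-learning update for one subsidiary agent
        q = self.q[cls]
        q[s, a] += self.alpha * (reward + self.gamma * np.max(q[s_next]) - q[s, a])
```

Splitting the Q-tables by predicted state class is what keeps each table small, which is the source of the memory reduction reported in the simulations.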
Author Contributions: All authors contributed equally to this work.
Funding: This research was funded by the National Natural Science Foundation of China (51777078, 51477055).

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: