Bacteria Foraging Reinforcement Learning for Risk-Based Economic Dispatch via Knowledge Transfer

Abstract: This paper proposes a novel bacteria foraging reinforcement learning method with knowledge transfer for risk-based economic dispatch, in which the economic dispatch is integrated with risk assessment theory to represent the uncertainties of active power demand and contingencies during power system operations. Moreover, a multi-agent collaboration is employed to accelerate the convergence of the knowledge matrix, which is decomposed into several lower-dimension sub-matrices via a knowledge extension, so that the curse of dimensionality can be effectively avoided. Besides, the convergence rate of bacteria foraging reinforcement learning is increased dramatically through a knowledge transfer after obtaining the optimal knowledge matrices of source tasks in pre-learning. The performance of bacteria foraging reinforcement learning has been thoroughly evaluated on the IEEE RTS-79 system. Simulation results demonstrate that it can outperform conventional artificial intelligence algorithms in terms of global convergence and convergence rate.


Introduction
In recent years, the interconnection of regional power grids and high-voltage, long-distance, bulk-capacity transmission [1] have become new trends in power systems integrated with large-scale renewable energy sources such as wind and solar energy [2][3][4][5], which however may pose severe challenges to the secure and stable operation of power grids. In order to obtain an appropriate trade-off between system security and economical operation, risk assessment theory has been introduced into automatic generation control (AGC) [6] so as to improve the economic dispatch (ED) in the presence of various operation risks [7].
The security constrained optimal power flow (SCOPF) is an extension of the conventional optimal power flow (OPF), in which the operation constraints of assumed contingencies are employed to enhance the ED security [8,9]. Since the 1990s, several studies have discussed the challenges and future trends of SCOPF [10,11]. With the development of SCOPF theory, the 'N−1' deterministic security regulations have been widely adopted as a well-known benchmark of SCOPF. However, such a method is inadequate for quantitatively analysing the operation risks, and may sometimes produce an over-conservative result. To remedy this flaw, the probabilistic risk-based OPF and relevant algorithms were developed in [12,13]. Meanwhile, some researchers investigated the application of binding contingency identification and post-contingency model approximation, such that the size of SCOPF can be considerably reduced [14]. In addition, [15] proposed a novel risk-based security-constrained ED, in which a risk index was adopted to accurately describe the overall power system security level. However, the mathematical models presented in the work mentioned above are based on direct current power flow calculation, which normally ignores the influence of node voltage deviation. As a result, their assessment of the operation risk is inadequate for real power systems. In addition, the actual active power demand is constantly fluctuating. Accordingly, generation control should be adjusted in real time with changes of load level [16]. Nevertheless, these existing studies focused only on a single load level, which could not satisfy the stricter requirements of practical operation.
Energies 2017, 10, 638
To deal with the aforementioned issues, this paper introduces an advanced risk index that considers the risk of both line overload and node voltage deviation under normal and fault conditions, and which is based on nonlinear power flow calculations. The two objectives of risk-based economic dispatch (RBED), that is, the fuel costs of generators and the operation risk index, are both calculated in the presence of inner connections under different time scenarios during a day. As the fluctuation of load level is considered, 96 scenarios are uniformly selected in a day (24 h) to evaluate the risk-based dispatch, with an interval of 15 min between two consecutive scenarios.
Generally, RBED is a complex mixed nonlinear programming problem. Conventional optimization algorithms, such as nonlinear programming [17], the gradient descent method [18], the interior point method [19], and the Newton method [20], may be easily trapped in a local optimum. Besides, an accurate system model and appropriate feasible initial solutions are needed to achieve good results, so that software built on these methods (Gurobi [21] and CPLEX [22]) is not flexible enough and is inapplicable to some complex problems. Hence, their application is relatively difficult and usually time-consuming due to the large number of constraints under multiple operation conditions in RBED.
So far, an enormous variety of artificial intelligence (AI) algorithms, including the genetic algorithm (GA) [23], quantum genetic algorithm (QGA) [24], artificial bee colony (ABC) [25], particle swarm optimization (PSO) [26] and bacteria foraging optimization (BFO) [27,28], have been successfully applied to optimal power system operation due to their merits of global convergence, model-free operation, and applicability to discrete nonlinear problems. In particular, an optimization task can be tackled using only the variables, the objective functions and the number of unsatisfied constraints. However, these algorithms usually need a long optimization period for each new task as no prior knowledge is exploited. Since 96 sub-tasks need to be executed in RBED, the total computation time is considerable. If either the system is large or a large number of faults occur, the time limit of RBED is very difficult to meet.
Recently, transfer learning [29,30] has become a very powerful tool to accelerate optimization based on machine learning. It is inspired by the fact that many practical engineering issues are related to historical ones, with which they often share plenty of similar features in essence. Therefore, the optimization of a new task can be dramatically accelerated by appropriately exploiting the similarities from the experience (prior knowledge) of historical tasks (also called source tasks). Transfer learning has been widely applied to various problems, such as reactive power optimization [31], decentralized optimal carbon-energy combined-flow [32], cross-domain activity recognition [33], and pedestrian detection [34]. The Q-learning algorithm, as one of the most widely used reinforcement learning methods, can be adopted for transfer learning. However, it merely employs a single agent to update the Q-value matrix, which leads to a relatively low convergence rate and sometimes even causes the curse of dimensionality in complex problems. Furthermore, a large number of iterations may be required due to the time-consuming trial-and-error mechanism of Q-learning.
In order to resolve the above disadvantages, this paper proposes a novel bacteria foraging reinforcement learning (BFRL) method associated with knowledge transfer to handle RBED, which is developed from the BFO and Q-learning algorithms [35]. A Q-value matrix is chosen as the knowledge matrix. The learning mode of BFO is introduced in BFRL to achieve a multi-agent collaboration, which can considerably accelerate the knowledge matrix update and reduce the number of iterations. Then, the knowledge extension is employed to dramatically reduce the dimension of the knowledge matrix, such that the curse of dimensionality can be effectively avoided. Through pre-learning, the knowledge matrices first save the optimal prior knowledge from source tasks, on which the initial knowledge matrices of new tasks are developed thereafter. As a consequence, RBED can be rapidly resolved by properly exploiting the similarity between source tasks and new tasks. Hence, BFRL is adequate to satisfy the fast calculation of RBED in practice, and its global convergence and stability on new tasks can also be guaranteed through the knowledge transfer from source tasks. Finally, BFRL is applied to RBED of 96 scenarios on the RTS-79 system, where it achieves better performance than some typical algorithms.
The following are the main motivations and innovations of this paper:

• The conventional economic dispatch usually focuses only on the fuel costs of generators. In contrast, the proposed RBED is implemented to obtain a proper trade-off between economical operation and system security, which can simultaneously reduce the fuel costs and the operation risk of power systems.
• Compared to the conventional method, which merely considers the line overload in the SCOPF [9], the proposed approach quantitatively evaluates the risks of both line overload and node voltage deviation based on the nonlinear power flow calculation. In addition, it is resolved under various load scenarios and is thus applicable to the load changes in practice.
• The conventional optimization algorithms might be easily trapped at a local optimum due to their dependence on an accurate system model and feasible initial solutions. In contrast, no accurate system model is required by BFRL, such that it can be easily implemented for a much broader range of practical issues, e.g., nonlinear objective functions and different complex constraints.
• The knowledge learning of BFO and the trial-and-error of Q-learning can effectively cooperate in BFRL. In particular, the knowledge matrix can significantly reduce the blindness of the random search performed by the cooperating bacteria. In turn, the update efficiency of the knowledge matrix can be improved greatly via the multi-agent (i.e., the bacteria) collaboration. Besides, the dimension of the knowledge matrix can also be reduced by knowledge extension. These merits accelerate the learning process and make it more feasible in practice.
• The existing AI algorithms are usually incapable of knowledge storage or knowledge transfer, which may easily lead to a high computation burden as many iterations and a large population size are needed to obtain a high-quality optimal solution. This cannot satisfy the requirement of the RBED period (less than 15 min). In contrast, BFRL employs the Q-value matrix as the knowledge matrix to save the optimal knowledge in pre-learning, and then the prior knowledge obtained from the similar source tasks can be fully exploited for the new tasks. Therefore, the convergence of BFRL can be dramatically accelerated and takes less than 15 min in practical implementation.
• The simulation results verify the excellent performance of BFRL, especially in terms of the convergence rate, which can be 9 to 20 times faster than that of other AI algorithms, while a high-quality optimal solution and a high convergence stability can also be guaranteed.

Operation Risk Assessment
The operation risk assessment of power systems is a comprehensive evaluation of the probability and severity of contingencies [36]. The risk index $I_R$ under the current condition $X_f$ is calculated from the probability $P_r(E_i)$ and the severity $S_{ev}(E_i)$ of each contingency, where $E_i$ represents the $i$th contingency. The current condition $X_f$ represents the current operating condition of a power system, which is associated with the operation risk Equations (2)-(8); thus it can be encoded with the output power of each generator $P_G$, each node voltage $U_i$, the power flow of each transmission line $T_i$, the load demand of each node $P_{Di}$, and the topology of the power grid.
According to statistical data, the failures of alternating current (AC) transmission line $i$ in a certain time interval $\Delta t$ follow the Poisson distribution, from which its cumulative failure probability $P_r(E_{Fi})$ is obtained, where $E_{Fi}$ and $\lambda_i$ denote the fault and the failure rate of the $i$th transmission line, respectively. Assuming there are $m$ transmission lines in a power system with a single fault (the fault of the $i$th line) occurring at time $t$, the probability of this fault $P_r(E_{S\_Fi})$ is calculated as in [37], where $E_{S\_Fi}$ represents a single line fault in the system and $S_{UN}$ is the set of all transmission lines in normal operation. The outage of a transmission line may result in a sudden line overload or a severe node voltage deviation in a power system, whose effect is usually described by a linear function. However, a linear risk index may not be capable of effectively distinguishing between a minor fault and a severe fault. As a consequence, a nonlinear utility function is employed so as to fully describe different faults.
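The two probabilities described above can be illustrated with a minimal sketch. It assumes the usual Poisson result that a line with failure rate λ fails at least once within Δt with probability 1 − exp(−λΔt), and it assumes that lines fail independently, so a single fault on line i requires every other line to survive. The function names are hypothetical and the expressions are not taken from [37].

```python
import math

def line_failure_probability(failure_rate, dt):
    """Probability of at least one failure within dt for a Poisson failure rate."""
    return 1.0 - math.exp(-failure_rate * dt)

def single_fault_probability(i, failure_rates, dt):
    """Probability that only line i fails within dt (assumes independent lines)."""
    p = line_failure_probability(failure_rates[i], dt)
    for j, rate in enumerate(failure_rates):
        if j != i:
            p *= math.exp(-rate * dt)  # survival probability of line j
    return p
```

Under these assumptions the single-fault probability is always smaller than the failure probability of the line taken alone, since the remaining lines must stay in service.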
The overload $\omega_{Li}$ of the $i$th faulty line is defined from $L_i$, the ratio of the actual transmission power to the transmission power constraint of the $i$th line, and the threshold $L_0$, which is set to 0.9. Specifically, if $L_i$ is less than $L_0$, the risk of the $i$th line overload is set to zero. The severity of the $i$th line overload $S_{ev}(\omega_{Li})$ is described by a nonlinear function in which $a$, $b$ and $c$ are all positive constants. Meanwhile, the first-order and second-order derivatives of $S_{ev}$ are also positive, which means that the severity increases monotonically, and at a growing rate, with the overload.
Similarly, the voltage deviation $\omega_{Vi}$ of the $i$th node is defined from the voltage amplitude $U_i$ of the $i$th node and its upper and lower bounds $U_{imax}$ and $U_{imin}$, and the severity of the $i$th node voltage deviation is denoted by $S_{ev}(\omega_{Vi})$. Hence, the global operation risk index $I_R$ of the power system is developed by combining the total risk index of line overload $I_{RL}$ and that of node voltage deviation $I_{RV}$ with the corresponding weight coefficients $\mu_1$ and $\mu_2$, where $\mu_1 + \mu_2 = 1$.
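A rough sketch of how such a risk index could be assembled is given below. Several parts are assumptions rather than the paper's definitions: the overload is taken as the excess of $L_i$ over $L_0$, the severity uses a hypothetical convex form a·(exp(b·ω) − 1) that is zero at ω = 0 and has positive first and second derivatives, each total risk is a probability-weighted sum of severities, and the voltage deviations are supplied directly as inputs.

```python
import math

L0 = 0.9  # overload threshold stated in the text

def overload(loading_ratio, threshold=L0):
    """Excess of the loading ratio over the threshold; zero below it (assumed form)."""
    return max(0.0, loading_ratio - threshold)

def severity(w, a=1.0, b=1.0):
    """Hypothetical convex severity: zero at w = 0, increasing at a growing rate."""
    return a * (math.exp(b * w) - 1.0)

def global_risk(fault_probs, line_overloads, voltage_devs, mu1=0.5, mu2=0.5):
    """Weighted combination mu1 * I_RL + mu2 * I_RV with mu1 + mu2 = 1.

    fault_probs[k] is the probability of contingency k; line_overloads[k] and
    voltage_devs[k] list the overloads and voltage deviations it causes.
    """
    i_rl = sum(p * sum(severity(w) for w in ws)
               for p, ws in zip(fault_probs, line_overloads))
    i_rv = sum(p * sum(severity(w) for w in ws)
               for p, ws in zip(fault_probs, voltage_devs))
    return mu1 * i_rl + mu2 * i_rv
```

With a convex severity, one heavily overloaded line contributes more risk than several lightly overloaded lines with the same total overload, which is the distinction a linear index cannot make.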

Multi-Objective Risk Economic Dispatch
The aim of RBED is to considerably reduce the fuel costs of generators and the operation risk of power systems while all the security constraints are satisfied. To simplify the problem, the RBED constraints are replaced by outer penalty functions. With this method, the likelihood of infeasibility can be minimized. In a future study, the barrier method will be employed to guarantee a feasible solution. Since the outer penalty function $C_V$ can be optimized throughout the $n$-dimensional real space, an initial solution outside the feasible region is acceptable, which can effectively reduce the difficulty of finding a feasible initial solution. The economic objective and the security objective are integrated into a single objective function $f$ through the linear weight method, subject to the power flow and operating constraints given in [38], where $F_C$ is the fuel costs of generators; $C_V$ represents the value of total constraint violations obtained under normal operation; $M$ is the penalty factor; $\mu_3$ and $\mu_4$ are the weight coefficients, with $\mu_3 \in [0, 1]$, $\mu_4 \in [0, 1]$, and $\mu_3 + \mu_4 = 1$; the state vector $x = [U, \theta, P_G, Q_G, T]^T$ comprises the node voltage amplitude, node voltage phase angle, active and reactive power of generators, and the apparent power of lines, respectively; $z_1$ and $z_2$ are the normalization references; $P_{Di}$ and $Q_{Di}$ are the active and reactive power of the $i$th load, respectively; $\theta_{ij}$ is the voltage phase angle difference between the $i$th and the $j$th nodes; $G_{ij}$ and $B_{ij}$ are the conductance and susceptance of line $i$-$j$, respectively; $P_{Gimax}$ and $P_{Gimin}$ are the upper and lower bounds of the generator active power, while $Q_{Gimax}$ and $Q_{Gimin}$ are the upper and lower bounds of the generator reactive power, respectively; $T_{imax}$ denotes the power limit of the $i$th line; $S_G$, $S_D$ and $S_L$ are the sets of generators, load buses, and lines, respectively. The fuel costs $F_C$ are determined by the fuel cost coefficients $\zeta_{0i}$, $\zeta_{1i}$ and $\zeta_{2i}$ of each generator. The value of total constraint violations $C_V$ is defined over the $N_v$ variables, where $P_{Gs}$ is the generator active power on the slack bus and $N_v$ is the number of variables.
It can be found that the integrated objective function $f$ is the linear sum of the two objective functions (i.e., the fuel costs $F_C$ and the global operation risk index $I_R$) with the linear weights, and the total constraint violations $C_V$ with the penalty factor $M$. Hence, the quality of an obtained optimal solution is determined by the integrated objective function $f$, instead of the fuel costs $F_C$ or the global operation risk index $I_R$ alone. In general, smaller fuel costs $F_C$ will not always lead to a smaller $f$ due to an inevitable conflict between the fuel costs $F_C$ and the global operation risk index $I_R$. In other words, a smaller $F_C$ may even result in a larger $f$.
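The structure of the integrated objective can be sketched as follows. This is an illustration under assumptions: the fuel cost is taken to be the common quadratic form in the generator output with coefficients $\zeta_{0i}$, $\zeta_{1i}$, $\zeta_{2i}$, and the normalization is assumed to divide each objective by its reference $z_1$ or $z_2$. The function names and the numbers in the usage note are hypothetical.

```python
def fuel_cost(p_g, coeffs):
    """Assumed quadratic fuel cost: sum of z0 + z1 * P + z2 * P**2 over generators."""
    return sum(z0 + z1 * p + z2 * p * p for p, (z0, z1, z2) in zip(p_g, coeffs))

def integrated_objective(f_c, i_r, c_v, mu3=0.5, mu4=0.5, z1=1.0, z2=1.0, penalty=1.0):
    """Weighted sum of normalized cost and risk plus the penalized violations."""
    return mu3 * f_c / z1 + mu4 * i_r / z2 + penalty * c_v
```

For instance, with $z_1 = 10$ and $z_2 = 2$, a dispatch with $F_C = 10$ and $I_R = 2$ gives $f = 1.0$, whereas a cheaper dispatch with $F_C = 8$ but $I_R = 4$ gives $f = 1.4$, which illustrates how a smaller $F_C$ can produce a larger $f$.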

Standard BFO Algorithm
The standard BFO algorithm is inspired by the foraging behaviour of bacteria, which normally have three typical modes: chemotactic mode, dispersal mode and reproduction mode [28].
Normally, the local search of BFO is enhanced through the chemotactic mode, in which a bacterium moves from its current position by one swimming step, where $\psi^i(j, k, l)$ represents the position of the $i$th bacterium during the $l$th dispersal, the $k$th reproduction and the $j$th chemotactic step; $C^i_k$ is the swimming step of the $i$th bacterium at the $k$th iteration; and $\Delta$ is a unit vector in the direction of swimming.
Here a nonlinearly decreasing inertia step $C^i_k$ is introduced to replace the fixed step in standard BFO so as to balance the global and local search, where $C^i_{start}$ and $C^i_{end}$ are the initial and the final steps, respectively, and $Iter_{max}$ is the maximum iteration number.
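A minimal sketch of a chemotactic move with a decreasing step is given below. The quadratic decay from the initial step to the final step is an assumed schedule chosen only for illustration, and the random unit direction follows the common BFO convention. The function names and default values are hypothetical.

```python
import math
import random

def step_size(k, iter_max, c_start=0.1, c_end=0.01):
    """Assumed nonlinear (quadratic) decay of the step from c_start to c_end."""
    return c_end + (c_start - c_end) * (1.0 - k / iter_max) ** 2

def chemotactic_move(position, step, rng=random):
    """Move the position by `step` along a random unit direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in position]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    return [x + step * d / norm for x, d in zip(position, direction)]
```

Large early steps favour global exploration, while the small final steps refine the search around good positions.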
In the reproduction mode, the bacteria are first ranked according to their energy intensity. Millions of years of struggle in harsh environments have driven bacteria to gradually evolve an optimal survival pattern for the overall benefit of the whole species: the superiors (those with the highest energy intensity) are eligible to freely and rapidly reproduce, while the inferiors (those with the lowest energy intensity) inevitably die out. Assuming the number of employed bacteria in standard BFO to be $N_P$, the number of bacteria to be eliminated is $N_P/2$. The bacteria whose energy intensity ranks in the lower half are then replaced by the bacteria of the upper half.
In this paper, the reproduction mode is improved by introducing a crossover to increase the diversity of bacteria. The new bacteria that replace the eliminated ones are generated by this crossover, where $i \in [1, N_P/2]$ and $r_1 \in [0, 1]$ is a random number. The global convergence is improved via the dispersal mode, which occurs with a given probability $P_{r\_ed}$. When the dispersal probability is satisfied, the positions of the bacteria change randomly.
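The improved reproduction can be sketched as follows. The crossover used here, a convex combination of two surviving bacteria weighted by $r_1$, is an assumption made for illustration and may differ from the paper's operator. The function name and the `energy` callback are hypothetical.

```python
import random

def reproduce(population, energy, rng=random):
    """Keep the better half and replace the worse half with crossover offspring."""
    ranked = sorted(population, key=energy, reverse=True)  # highest energy first
    half = len(ranked) // 2
    survivors = ranked[:half]
    children = []
    for parent in survivors:
        mate = rng.choice(survivors)  # crossover partner among the survivors
        r1 = rng.random()
        children.append([r1 * x + (1.0 - r1) * y for x, y in zip(parent, mate)])
    return survivors + children
```

Because the offspring lie between two good parents rather than duplicating one of them, the population keeps some diversity after elimination.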

Knowledge Matrix
Q-learning is one of the most famous and widely used reinforcement learning techniques, and contains three key elements: state $s$, action $a$, and reward $R$. The state-action value function $Q$ is the knowledge matrix of all the state-action pairs, i.e., $Q(s, a)$. The agent of Q-learning updates the knowledge matrix using the reward $R$ fed back by the environment after it takes an action $a$ in the current state $s$. Each element of the matrix represents the knowledge of the corresponding state-action pair, which is used to estimate the discounted sum of future rewards starting from the current state and action policy. Note that the letter 'Q' in Q-learning is part of the name of the algorithm, while the symbol $Q$ denotes the knowledge matrix.
Since the knowledge matrix saves the optimization policy, it can be treated as the brain of the agent. The knowledge matrix in the Q-learning algorithm is updated by a single agent through trial and error. The agent tries an action $a$ and obtains a reward $R$ from the environment in state $s$. Thus the corresponding knowledge value of the state-action pair $Q(s, a)$ can be updated. Then, in a certain state $s$, the agent will prefer to choose the action related to a large knowledge element $Q(s, a)$. Hence, the knowledge matrix will gradually converge.
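For reference, the trial-and-error update described above corresponds to the standard tabular Q-learning rule, sketched here with a learning rate `alpha` and a discount factor `gamma`. The table layout (a list of rows indexed by state, then action) and the default values are assumptions for illustration.

```python
def q_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on q[state][action]; returns the new value."""
    best_next = max(q[s_next])  # value of the greedy action in the next state
    q[s][a] += alpha * (reward + gamma * best_next - q[s][a])
    return q[s][a]
```

Only the single element `q[s][a]` changes per step, which is why a single agent needs many trials before the whole table converges.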
To accelerate the update of the knowledge matrix, the bacteria are employed as multiple agents of BFRL. The bacteria (i.e., agents) in different modes can change their positions (i.e., select actions $a$) according to Equations (13)-(15), (18) and (19) and acquire energy (i.e., get rewards $R$) from the solution space (i.e., the environment).
As shown in Figure 1, a bacterium can obtain an action policy under a given state from the knowledge matrix and update its prior knowledge by the feedback of reward, which helps to boost the accumulated energy intensity of bacteria during foraging. The bacteria can obtain higher energy intensity from the red area of the figure.
Basically, the knowledge matrix of Q-learning is a lookup table with the size of $|S_{pa}| \times |A|$, where $S_{pa}$ is the state space and $A$ is the action space. If Q-learning is used for solving RBED, the actions, namely the values of the controlled variables, are independent of each other [38]. Assuming there are $k$ variables and $N_i$ available actions for the $i$th variable, the size of the action space is $|A| = N_1 \times N_2 \times \dots \times N_{k-1} \times N_k$. It is obvious that the curse of dimensionality may emerge if the action space is too large.
As illustrated in Figure 2, BFRL employs a knowledge extension in order to considerably reduce the dimension of the original knowledge matrix $Q$. $Q$ is divided into several knowledge sub-matrices $Q_i$, which are in one-to-one correspondence with the variables. Furthermore, the elements of neighboring sub-matrices are defined as related knowledge, which means that the action space of $Q_i$, i.e., the range of the $i$th variable, is the same as the state space of $Q_{i+1}$. In other words, the value of the $(i+1)$th controlled variable cannot be selected until the $i$th variable has been determined. Note that the original high-dimension knowledge matrix is decomposed into multiple low-dimension sub-matrices through extension chains between related knowledge.
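The saving obtained by the decomposition can be checked with a short count of table entries. The sketch assumes that the first sub-matrix has `n_states` rows and that each later sub-matrix has as many rows as the previous variable has actions, following the chain described above. The function names are hypothetical.

```python
import math

def joint_table_size(n_states, actions_per_variable):
    """Entries of a single lookup table over the joint action space."""
    return n_states * math.prod(actions_per_variable)

def chained_table_size(n_states, actions_per_variable):
    """Total entries of the chained sub-matrices Q_1, ..., Q_k."""
    total, rows = 0, n_states
    for n in actions_per_variable:
        total += rows * n
        rows = n  # the action space of Q_i is the state space of Q_{i+1}
    return total
```

With three variables of ten actions each and one initial state, the joint table holds 1000 entries while the chained sub-matrices hold 210, and the gap widens multiplicatively as variables are added.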
The knowledge matrix is updated by only a single agent in Q-learning. As a result, only one element can be updated in each cycle, which leads to a relatively slow convergence. In contrast, the multi-agent collaboration is adopted in BFRL, where the bacteria share the same knowledge sub-matrices. Consequently, multiple elements can be updated in a single iteration, which significantly accelerates the learning. The knowledge sub-matrix $Q_i$ is updated as in [39], where $i$ denotes the $i$th knowledge sub-matrix and $j$ denotes the $j$th bacterium.
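The collaboration described above can be sketched as a loop in which every bacterium updates the shared sub-matrices along its own state-action path. The update rule is assumed to be the standard Q-learning form applied to each sub-matrix, with the next state taken to be the chosen action and one reward shared by all variables of a bacterium. This is an illustration rather than the update given in [39], and all names are hypothetical.

```python
def collaborative_update(sub_matrices, paths, rewards, alpha=0.1, gamma=0.9):
    """Update shared sub-matrices with the experience of every bacterium.

    sub_matrices[i][s][a] is the knowledge of sub-matrix i, paths[j] lists the
    (state, action) pair of bacterium j for each variable, and rewards[j] is
    the reward that bacterium j received.
    """
    for path, reward in zip(paths, rewards):
        for i, (s, a) in enumerate(path):
            if i + 1 < len(sub_matrices):
                future = max(sub_matrices[i + 1][a])  # next state is the chosen action
            else:
                future = 0.0
            q = sub_matrices[i]
            q[s][a] += alpha * (reward + gamma * future - q[s][a])
```

With several bacteria following different paths, several elements of each sub-matrix change within one iteration instead of one.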

Knowledge Learning
The search pattern of BFO is completely random, which usually leads to blind and inefficient problem solving. Unlike the random exploration of BFO, BFRL can search the solution space according to the knowledge matrix and update the knowledge matrix using the received reward, such that a more informative and meaningful exploration can be realized.
As illustrated in Figure 3, each bacterium is in either chemotactic mode or dispersal mode at the beginning of each iteration. In a given iteration, the modes are assigned to the bacteria in a certain proportion. Then the learning of bacteria in the two modes is conducted in different ways, while each bacterium receives a reward and updates the knowledge matrix accordingly. Finally, all the bacteria move to the reproduction mode, which marks the end of each iteration. As described in Section 3.1, the bacteria either reproduce or die out according to the ranking of their obtained rewards.
In the next iteration, the modes of bacteria are reassigned. The bacteria with higher rewards are assigned to the chemotactic mode while the others are assigned to the dispersal mode.

In BFRL, the knowledge learning of bacteria in dispersal mode is guided by the knowledge matrix, which is different from that of standard BFO. For a given state, a larger knowledge element means a higher reward value obtained under the corresponding action. In other words, the information belonging to superiors has been saved with the update of the knowledge matrix. Furthermore, a roulette wheel selection is used based on the state-action probability matrix $O_i$ when the dispersal probability $P_{r\_ed}$ is satisfied. Otherwise, the action with the largest knowledge element, $\arg\max_{a_i} Q_i(s_i, a_i)$, is selected. For a controlled variable, the action of each bacterium is selected according to a random number $r_2 \in [0, 1]$, where $a_s$ denotes a random global action determined by the distribution of the state-action probability matrix $O_i$. In the update of $O_i$, $\beta$ is the divergence factor used to magnify the divergence of $Q_i$, and $e_i$ is the transition matrix introduced in the calculation.
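The selection rule described above can be sketched as follows. The greedy and roulette wheel branches follow the text, but the way the probability row is built is an assumption: a softmax with factor `beta` stands in for the paper's update of $O_i$. The function names and the default `beta` are hypothetical.

```python
import math
import random

def probability_row(q_row, beta=2.0):
    """Assumed softmax stand-in for one row of the state-action probability matrix."""
    top = max(q_row)
    weights = [math.exp(beta * (q - top)) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]

def select_action(q_row, prob_row, p_ed, rng=random):
    """Roulette wheel draw when dispersal triggers, otherwise the greedy action."""
    if rng.random() < p_ed:
        return rng.choices(range(len(prob_row)), weights=prob_row, k=1)[0]
    return max(range(len(q_row)), key=q_row.__getitem__)
```

A larger `beta` concentrates the roulette wheel on actions with large knowledge elements, so even the random branch is biased towards what the bacteria have already learned.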

Knowledge Transfer
When BFRL has to complete multiple similar tasks, the efficiency of solving new tasks can be improved greatly via knowledge transfer.
As shown in Figure 4, knowledge transfer can accelerate the learning of new tasks based on the existing ones. If the state space and the action space remain constant, the optimal knowledge matrices of the source tasks can be treated as the initial knowledge matrices of the target tasks, which are called the prior knowledge [31]. The source tasks need to be executed during the pre-learning to obtain the optimal knowledge matrices, from which the prior knowledge is exploited for the relevant new tasks. Then the optimal knowledge matrices of source tasks $Q_S$ are transferred to become the prior knowledge matrices of new tasks $Q_N$ in transfer learning.

Here, the similarities among different tasks are contained in the prior knowledge matrices, together with some unrelated knowledge. As a result, a malignant negative transfer may sometimes emerge. To handle this, the extraction of closely relevant knowledge and the identification of similarities among different tasks are emphasized during the transfer learning of BFRL.

Convergence Characteristics
Firstly, it is important to note that conventional Q-learning converges to the optimal Q-value matrix $Q^*$ provided that all the actions are sufficiently explored in each state, while the global optimum can be determined from $Q^*$; a detailed proof can be found in [40]. Moreover, the learning mode of BFRL is the same as that of Q-learning, and the two main improvements of BFRL over Q-learning can be summarized as: (1) knowledge transfer and (2) exploration and exploitation based on the bacteria foraging mechanism. Specifically, the first one only changes the initial Q-value matrix, so that it starts close to the optimal Q-value matrix of the current optimization task. Besides, the second one only improves the update efficiency of the Q-value matrix. Therefore, BFRL only accelerates the convergence compared with Q-learning, while the convergence remains guaranteed as long as all the actions are sufficiently explored in each state.

BFRL Structure
RBED differs from the conventional AC optimal power flow. To evaluate the objective function, the risk index of the power system must be calculated first, so the AC power flow calculations under the normal condition and under all the fault conditions need to be executed within the AI algorithms. When $N_f$ faults are included in the contingency set, the number of power flow calculations in RBED is $(N_f + 1)$ times that of the conventional AC optimal power flow. Therefore, RBED requires a much longer time. Assuming there are $N_{ed}$ dispersals, $N_{re}$ reproductions and $N_c$ chemotactic steps in BFO for RBED, as well as a maximum swimming number $N_s$, the total number of power flow calculations becomes $N_{ed} \times N_{re} \times N_c \times N_s \times (N_f + 1)$, which leads to an extremely slow calculation. In contrast, the optimization efficiency can be dramatically improved in BFRL because the nested cycles are removed.
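As a rough illustration of this count, the following sketch evaluates the product stated above. The function name and the loop sizes are hypothetical; only the number of faults (seven, i.e., five 'N−1' and two 'N−2' faults) is taken from the case study.

```python
def bfo_power_flow_count(n_ed, n_re, n_c, n_s, n_f):
    """Number of AC power flow runs implied by the nested BFO loops for RBED."""
    return n_ed * n_re * n_c * n_s * (n_f + 1)

# Hypothetical loop sizes; n_f = 7 matches the contingency set of the case study.
runs_per_fitness = 7 + 1  # normal condition plus each fault condition
total_runs = bfo_power_flow_count(n_ed=2, n_re=4, n_c=50, n_s=4, n_f=7)
print(runs_per_fitness, total_runs)
```

Even with these modest loop sizes the count exceeds ten thousand power flow runs, which shows why removing the nested cycles matters.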

Design of State and Action
The active power of the generators on the PV nodes is selected as the controlled variable. The action space $A(A_{PG1}, A_{PG2}, \ldots, A_{PGN_q})$ is consistent with the controlled variable space, namely the positions of the bacteria, where $N_q$ is the number of controlled variables. Besides, the action space of each variable is the state space of the next one. The knowledge sub-matrices corresponding to the state-action pairs of the variables are denoted as $Q_{PG1}, Q_{PG2}, \ldots, Q_{PGN_q}$, respectively.
In general, the penalty factor M should be chosen appropriately: if it is too small, the minimum of the penalty function lies far from the optimal solution, which results in a low efficiency; if it is too large, the minimization of the penalty function becomes very slow [41]. Since $C_V$ is large enough compared with the normalized fuel costs and risk index, M is chosen to be 1.

Knowledge Transfer
The core task of improving the learning efficiency is to extract the similarities between the source tasks and the new tasks. The optimization of RBED is mainly determined by the power flow of the power system. In practice, it depends closely on the active power demand, as the topology and the operation conditions are relatively steady over a short time. Thus, the active power deviation is defined as the similarity between the source tasks and the new tasks. The active power demand is divided into multiple load intervals as follows:

$$[P_{Ds1}, P_{Ds2}),\ [P_{Ds2}, P_{Ds3}),\ \ldots,\ [P_{Ds(i-1)}, P_{Dsi}),\ \ldots,\ [P_{Ds(n-1)}, P_{Dsn}) \tag{21}$$

where $[P_{Ds(i-1)}, P_{Dsi})$ is a half-open load interval and $P_{Dsi}$ represents the power demand of the ith load interval in the source tasks, with $P_{Ds1} < P_{Ds2} < \cdots < P_{Dsi} < \cdots < P_{Ds(n-1)} < P_{Dsn}$. Moreover, the closely related knowledge of the source tasks should be exploited with priority for a new task in order to enhance the effectiveness of transfer learning.
Assuming the power demand of a new task x is represented by $P_{Dx}$, with $P_{Dj} < P_{Dx} < P_{Dk}$, the similarities between the new task and the two source tasks j and k are quantified by the similarity coefficients $\eta_1$ and $\eta_2$ of Equation (22), with $\eta_1 + \eta_2 = 1$.
The knowledge matrix of the new task x can be obtained by a linear transfer, which yields:

$$Q_x^i = \eta_1 Q_j^i + \eta_2 Q_k^i \tag{23}$$

where $Q_x^i$, $Q_j^i$ and $Q_k^i$ denote the knowledge sub-matrices of the ith variable in new task x, source task j and source task k, respectively.
The overall knowledge transfer can be summarized as follows:
Step 1 Select several scenarios from the daily load curve at a fixed time interval as the source tasks.
Step 2 Execute the pre-learning and save the knowledge in the knowledge matrices of the source tasks.
Step 3 Calculate the similarities between the new tasks and the closest source tasks based on the active power deviation.
Step 4 Obtain the initial knowledge matrices of the new tasks by the linear transfer.
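The steps above can be sketched as follows. This is a minimal illustration under a stated assumption: the similarity coefficients are taken to be linear-interpolation weights on the active power demand, which satisfies $\eta_1 + \eta_2 = 1$ but is not confirmed to be the exact form used in the paper. The function names and the toy sub-matrices are hypothetical; the demand values are those quoted for scenarios 4, 5 and 26 in the case study.

```python
import numpy as np

def similarity_coefficients(p_dx, p_dj, p_dk):
    """Assumed linear weights: eta1 for source task j, eta2 for source task k."""
    if not p_dj < p_dx < p_dk:
        raise ValueError("new-task demand must lie strictly between the source demands")
    span = p_dk - p_dj
    eta1 = (p_dk - p_dx) / span  # closer to j gives a larger eta1
    eta2 = (p_dx - p_dj) / span  # closer to k gives a larger eta2
    return eta1, eta2

def transfer_knowledge(q_j, q_k, eta1, eta2):
    """Linear transfer applied to each knowledge sub-matrix (one per controlled variable)."""
    return [eta1 * qj + eta2 * qk for qj, qk in zip(q_j, q_k)]

# Scenario 4 (1887 MW) lies between scenario 5 (1858 MW) and scenario 26 (1916 MW).
eta1, eta2 = similarity_coefficients(1887.0, 1858.0, 1916.0)

# Toy sub-matrices standing in for the optimal matrices found in pre-learning.
q_source_j = [np.zeros((2, 2)), np.ones((2, 2))]
q_source_k = [np.ones((2, 2)), np.ones((2, 2))]
q_new = transfer_knowledge(q_source_j, q_source_k, eta1, eta2)
```

In this example the new task sits exactly midway between the two source tasks, so both source matrices contribute equally to the initial knowledge matrix.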

Execution Procedure of BFRL for RBED
The execution procedure of BFRL for RBED is shown in Figure 5.

Parameters Setting
In BFRL, the crucial parameters include the population size $N_P$, the dispersal probability $P_{r\_ed}$, the learning factor α, the discount factor γ and the maximum number of iterations $Iter_{max}$ [42]. These parameters should be set according to the following guidelines:

• A larger population size $N_P$ may increase the probability of approaching the global optimum at the cost of a longer computation time; here $N_P \geq 1$.

• The dispersal probability $P_{r\_ed}$ determines the trade-off between exploration and exploitation. A larger $P_{r\_ed}$ means the roulette wheel selection is preferred, with $0 < P_{r\_ed} < 1$.

• The learning factor α influences the learning rate. A larger α tends to accelerate the learning, although the algorithm may then converge prematurely.

• The discount factor γ discounts the future rewards in the knowledge matrix. A smaller γ means the current reward is more important.

• $Iter_{max}$ is the maximum number of iterations, which determines the quality of the optimal solutions and the calculation time. In this paper, $Iter_{max}$ is selected from given values such as 50, 100, 150, 200 and 250, and is tuned via trial-and-error to balance the quality of the optimal solutions against the calculation time. Generally, a larger $Iter_{max}$ results in higher-quality optimal solutions but consumes more time. The trial-and-error results show that the objective function obtained by BFRL reaches a stable minimum value, or fluctuates within a very small range, when the number of iterations is larger than 150. Hence, $Iter_{max}$ is set to 150, which is large enough to ensure stable optimal solutions while keeping the calculation time short.

Through extensive trial-and-error, the optimal parameters are listed in Table 1.

Case Studies
The simulation is undertaken on an AMAX server with an Intel Xeon E5-2670 CPU at 2.3 GHz and 64 GB of RAM. The power flow calculation is based on the Matpower 6.0 toolbox in MATLAB R2014a. The performance of BFRL for RBED has been evaluated on the IEEE RTS-79 system [43] and compared with that of other algorithms, i.e., GA [23], QGA [24], ABC [25], PSO [26], BFO [27,28] and Q-learning [40]. For each algorithm, there are both feasible and infeasible solutions to the proposed RBED problem. If an algorithm finds an infeasible solution, the power flow calculation may not converge or the controlled variables may violate the constraints. Then $C_V$ is greater than 0 and the fitness function becomes larger than that of the other individuals, so that the individual is forced to find another solution and the infeasible solution is eliminated. If an algorithm finds a feasible solution, then $C_V$ becomes 0 and the fitness function is smaller, which leads the other individuals to approach it. When an algorithm completes the preset maximum number of iterations, the best individual has the smallest fitness function, with a converged power flow and all the operation constraints satisfied. Therefore, each AI algorithm finally converges to a feasible solution as long as the population size and the maximum number of iterations are set sufficiently large.
In this paper, the main parameters of each algorithm have been determined via trial-and-error. Therefore, the simulation results obtained by these algorithms achieve a proper trade-off between the quality of the optimal solutions and the calculation time. To shorten the time spent on fitting the parameters, the uniform design is adopted [44]. For example, there are four crucial parameters which may have a great influence on the performance of GA in different optimization tasks. Assume that the value of each parameter is divided into 10 discrete levels, e.g., the mutation probability can be quantized into 10 discrete levels as [0.05, 0.1, . . ., 0.5]; then 10 × 10 × 10 × 10 = $10^4$ experiments would have to be executed to fit all the parameters, which results in an extremely high computational burden. In contrast, only 10 experiments are needed with the uniform design. The main parameters of the other algorithms are listed in Table 2.
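The saving can be illustrated with a short sketch. It assumes one common way of building a uniform design table, the good lattice point construction, which is not confirmed to be the table used in [44]; the function name is hypothetical. Each of the 10 rows is one experiment, and each parameter takes every one of its 10 levels exactly once.

```python
from math import gcd

def good_lattice_design(n_levels, n_factors):
    """Level indices (0-based) of an n_levels-run design, one column per factor."""
    generators = [h for h in range(1, n_levels) if gcd(h, n_levels) == 1][:n_factors]
    if len(generators) < n_factors:
        raise ValueError("not enough generators coprime to n_levels")
    # Row i, column j uses level (i * h_j) mod n_levels; coprimality makes each column a permutation.
    return [[(i * h) % n_levels for h in generators] for i in range(1, n_levels + 1)]

mutation_levels = [round(0.05 * (k + 1), 2) for k in range(10)]  # 0.05, 0.10, ..., 0.50
design = good_lattice_design(n_levels=10, n_factors=4)

full_factorial_runs = 10 ** 4
uniform_design_runs = len(design)

# Mutation probability used in each experiment, taking the first column as that factor.
mutation_per_run = [mutation_levels[row[0]] for row in design]
```

The design covers all levels of every parameter with 10 runs instead of the 10,000 runs of the full grid, at the price of not testing every combination.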

Simulation Scheme
The detailed simulation scheme is given in Table 3.
Table 3. The detailed simulation scheme of the proposed technique.

Step    Detailed Simulation Scheme
Step 1: Calculate the fault probability of each line in the RTS-79 system according to Equations (1)-(3). Then five 'N−1' line faults and two 'N−2' line faults are selected as the contingencies.
Step 2: Choose a typical load curve and divide it into 96 optimization tasks.
Step 3: Determine the source tasks via trial-and-error; their number is usually chosen to be as small as possible while still ensuring the effectiveness of knowledge transfer.
Step 4: Select the output active power of the generators as the controlled variable, and then determine the action space A of BFRL.
Step 5: Determine the parameters used in the pre-learning of BFRL via trial-and-error and carry out the pre-learning; the optimal knowledge matrices obtained under the selected 21 source tasks are then saved.
Step 6: Develop the initial knowledge matrices of the new tasks from the prior optimal knowledge matrices according to Equation (23) and select the optimal parameters.
Step 7: Choose the optimal parameters of the other AI algorithms for RBED via trial-and-error.
Step 8: Implement each algorithm for RBED in 10 runs with 96 new tasks to compare their performance, including computation time, convergence time, quality of the obtained optimal solution, and convergence stability.
Step 9: Analyse the simulation results and draw conclusions.

Simulation Model
The IEEE RTS-79 is a typical benchmark with a base capacity of 100 MVA, including 24 buses, 34 transmission lines/transformers and 32 generators, the configuration of which is illustrated in Figure 6 [45]. Here, bus 21 is chosen as the slack bus as it has the largest capacity. Furthermore, the generator active power of the other buses is chosen as the controlled variables. The fuel cost coefficients of the RTS-79 system can be found in [46].
The typical daily load curve approximately represents the load trend of each day over a period of time (e.g., a month or a season). Based on this typical daily load curve, the operators make an optimal operating schedule for the power system. The typical daily load curve in Figure 7 is modelled on an actual provincial grid in southern China. As illustrated in Figure 7, a typical daily load curve can be divided into 96 scenarios of 15 min each. In order to evaluate the adaptability of BFRL under different load levels, case studies are carried out in all scenarios, which leads to 96 tasks.
The contingencies are listed in Table 4, which includes the five 'N−1' transmission line faults and the two 'N−2' transmission line faults that are most likely to occur, together with the related outage modes. The probability of each fault can be calculated according to Equation (3).

The Pre-Learning
Before the online learning of BFRL, several appropriate scenarios need to be chosen as source tasks in the pre-learning, on which the initial knowledge matrices of the new tasks can be based. Figure 7 shows that the active power demand in the 96 scenarios ranges from 1685 MW to 2850 MW; 21 scenarios are sampled at a load interval of about 55 MW, ranked from low to high as 16, 21, 24, 5, 26, 2, 1, 31, 32, 94, 56, 51, 36, 88, 40, 42, 68, 70, 80, 79 and 77, respectively.
In addition, the convergence of BFRL obtained under scenario 1 is presented in Figure 8 and compared with that of BFO. BFRL nearly reaches the minimal fitness function in around 270 s, whereas BFO needs about 768.8 s. BFRL thus converges nearly 2.8 times as fast as BFO, and to a better optimal solution. It should be noted that the searching efficiency of BFRL is not important in the pre-learning process, thus a large population size and a large number of iterations are adopted to ensure its global convergence. Even so, the searching rate of BFO is still slower than that of BFRL due to the nested cycles. Besides, the random search in BFO is relatively blind. To handle this obstacle, the $P_{r\_ed}$-greedy rule and multi-mode exploration are integrated into BFRL. As a result, the optimal objective function of BFRL is 33% smaller than that of BFO.

Transfer Learning
The optimal action policies of the source tasks are obtained through pre-learning and saved in the knowledge matrices, which are then transferred to the initial knowledge matrices of the new tasks according to their similarities. For example, the power demand of scenario 4 is 1887 MW, and scenarios 5 and 26 are the two closest source tasks, with power demands of 1858 MW and 1916 MW, respectively. The initial knowledge matrix of scenario 4 can then be developed from the linear weighted sum of the optimal matrices of scenarios 5 and 26.
The convergence times of the algorithms for the 4th new task are given in Figure 9 and Table 5. In Tables 5-7, the best results among all the algorithms are shown in bold. Note that the convergence time of BFRL is only 46 s thanks to the knowledge transfer, which is about 5.6% to 10% of that of the other algorithms. Furthermore, compared with the convergence time in pre-learning, the convergence rate of BFRL is increased nearly tenfold, which verifies the efficiency of transfer learning. Since the time period of RBED for each scenario optimization is about 15 min, BFRL is still fast enough to meet such a time limit even if more faults are considered. Moreover, the reinforcement learning needs to undergo the whole Markov process before convergence. As illustrated in Figure 10, the fuel costs of the generators grow with the power demand, as the generators should increase their output to balance the load increase. On the other hand, the line overload and node voltage deviation are more severe at a higher load level. Compared with Figure 7, the variations of the fuel costs and the risk index are consistent with the daily load curve. This demonstrates that the prior knowledge is effectively exploited.

The daily optimal objective function of the RTS-79 system obtained by each algorithm is illustrated in Figure 11. The objective function curve of BFRL is only slightly higher than that of GA and lower than that of the other algorithms, which verifies the superior global convergence ability of BFRL.
In general, AI algorithms are stochastic in finding an optimal solution, i.e., the obtained optimal solution may vary between runs. To further compare the optimization performance, each AI algorithm is executed in 10 runs, and in each run the optimization is carried out under 96 scenarios. So for each algorithm, the total number of optimizations is 10 runs × 96 scenarios = 960, which is considered adequate to evaluate the convergence stability of each algorithm [32]. Furthermore, the performance comparison covers the calculation time, the convergence time, the quality of the obtained optimal solution, and the distribution statistics of the obtained objective function (i.e., the convergence stability). In Table 6, the calculation time of each algorithm is the total execution time to solve the 96 new tasks, while the convergence time is the average optimization time of a single load scenario, which clearly describes the optimization efficiency of the algorithm. The fuel costs of the generators, the risk index and the optimal objective function are summed over the 96 new tasks, and are the statistical data of the obtained optimal solutions. Note that the quality of an obtained optimal solution is determined only by the integrated objective function, rather than by the fuel costs or the global operation risk index alone. Although QGA achieves lower fuel costs $F_C$ than BFRL, it has a larger integrated objective function f due to a much larger global operation risk index $I_R$. Hence, BFRL outperforms QGA as it obtains a higher-quality optimal solution with a smaller f, as shown in Figure 12.
Table 7 gives the statistical data of the convergence stability of each algorithm obtained in 10 runs. The data in the first two columns of the table are the worst and the best objective functions obtained by each algorithm in 10 runs. The variance is the expectation of the squared deviation of the objective functions from their mean value, which measures how far the results of the 10 runs are spread out from the mean value. The standard deviation is the arithmetic square root of the variance. The ratio of the standard deviation to the mean value of the objective functions is the relative standard deviation (RSD), which is used to indicate the precision of the simulation results. The variance of BFRL is the smallest among all the algorithms. In particular, the relative standard deviation of BFRL is only 52.5% of that of PSO. Although the RSDs of ABC and QGA are smaller than those of the others, their optimal solutions are not satisfactory.
Figures 13 and 14 are the distribution boxplots of the fuel costs of the generators and the risk index, respectively. From top to bottom, the five horizontal lines are the maximum, upper quartile, median, lower quartile and minimum of the convergence results obtained in 10 runs. One can readily find that the box of BFRL is the shortest, which means its variation is the smallest. Besides, its median is also located at a relatively low position, which verifies that BFRL has a strong global convergence ability with stable convergence.
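The stability statistics described for Table 7 and the five-number summary drawn in the boxplots can be computed as in the following sketch. The function names are hypothetical, and the sample values are made up for illustration; they are not results from the paper. Since the objective is minimized, the worst result is the largest value.

```python
import numpy as np

def stability_stats(objectives):
    """Worst, best, variance, standard deviation and RSD of objective values over several runs."""
    x = np.asarray(objectives, dtype=float)
    variance = x.var()        # expectation of the squared deviation from the mean
    std = np.sqrt(variance)   # arithmetic square root of the variance
    return {
        "worst": x.max(),
        "best": x.min(),
        "variance": variance,
        "std": std,
        "rsd": std / x.mean(),  # relative standard deviation
    }

def five_number_summary(objectives):
    """Minimum, lower quartile, median, upper quartile and maximum, as shown in a boxplot."""
    x = np.asarray(objectives, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return x.min(), q1, median, q3, x.max()

# Made-up objective values from a handful of runs of one algorithm.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = stability_stats(sample)
summary = five_number_summary(sample)
```

A smaller RSD and a shorter box both indicate that the algorithm returns similar solutions from run to run.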

Figure 14. The boxplot of the risk index distribution.

Efficiency and Effectiveness of BFRL
From the above simulation results, it can be concluded that the comprehensive performance of BFRL is the best among all the algorithms, which includes the optimization rate, the quality of obtained optimal solution and the convergence stability.
Compared with the other algorithms, BFRL can save more than 11 h in total when solving the 96 new RBED tasks of a day. Moreover, it exploits the prior knowledge when initializing the knowledge matrix for a new task, i.e., the initial knowledge matrix of a new task can be effectively developed from the optimal knowledge matrices of the related source tasks with the highest similarity, thus the knowledge matrix can converge in just a few iterations (less than 50 s). In this way, BFRL improves its efficiency tenfold through knowledge transfer.
The integrated objective function f obtained by BFRL is the second smallest among all the algorithms, being larger only than that of GA. This is due to the following features:

• Deep local search: To approximate a high-quality local optimum with a smaller integrated objective function f, a chemotactic mode is adopted to search for new solutions around the current optimal solution, while a reproduction mode is employed to eliminate the bacteria with low-quality local optima.

• Wide global search: To increase the probability of obtaining a global optimum, several bacteria are assigned to a random search in the action space with the dispersal mode.

• Proper balance between local search and global search: Each bacterium implements a new action based on the common knowledge matrices according to Equations (18) and (19): a greedy action (i.e., a local search) is selected if the random number is larger than the dispersal probability; otherwise a non-greedy action (i.e., a global search) is chosen. As a consequence, a proper trade-off between local search and global search can be achieved.
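The selection rule in the last item can be sketched as follows. This is a minimal illustration, not the paper's Equations (18) and (19), which are not reproduced in this section. The greedy branch follows the text directly; the non-greedy branch is assumed to be a roulette wheel whose weights are the shifted knowledge values, and the function name and numerical values are hypothetical.

```python
import numpy as np

def select_action(q_row, p_r_ed, rng):
    """Greedy action if the random number exceeds p_r_ed, otherwise a roulette-wheel draw."""
    q_row = np.asarray(q_row, dtype=float)
    if rng.random() > p_r_ed:
        return int(np.argmax(q_row))        # local search: exploit the best known action
    weights = q_row - q_row.min() + 1e-12   # assumed weighting: shift so all weights are positive
    probs = weights / weights.sum()
    return int(rng.choice(len(q_row), p=probs))  # global search: explore

knowledge_row = [0.1, 0.9, 0.3]  # hypothetical knowledge values of one state

# With p_r_ed = 0 the greedy branch is taken; with p_r_ed = 1 the roulette wheel is always used.
greedy_choice = select_action(knowledge_row, p_r_ed=0.0, rng=np.random.default_rng(0))
explored_choice = select_action(knowledge_row, p_r_ed=1.0, rng=np.random.default_rng(0))
```

A larger dispersal probability therefore shifts the search towards exploration, which matches the parameter guideline given for $P_{r\_ed}$.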
As for the convergence stability, the other heuristic algorithms are incapable of knowledge transfer, so they may easily show a low convergence stability, i.e., significantly different optima are obtained in different runs. In contrast, BFRL can extract the optimal knowledge matrices from the source tasks, so the blind random search can be effectively avoided by utilizing these approximately optimal knowledge matrices, and a stable convergence to a high-quality optimum can be realized.

Conclusions
In this paper, a novel model-free BFRL associated with transfer learning is proposed for RBED, which can be applied to discontinuous convex or nonconvex problems with multiple constraints. It transforms the useful information of source tasks into the state-action pair value function to accelerate the online optimization of a new task. Moreover, BFRL has a relatively simple structure and can converge to higher-quality solutions in a short period of time. The main contributions are summarized as follows:

• The bacteria are regarded as multiple agents to accelerate the update of the knowledge matrix in BFRL, while the high dimension of the knowledge matrix is considerably reduced by knowledge extension, such that the curse of dimensionality can be avoided;

• The active power deviation is defined as the similarity between source tasks and new tasks, and the online learning is accelerated significantly through transfer learning, so that an online dynamic RBED can be achieved. Moreover, BFRL is able to handle large-scale problems;

• The multi-objective RBED is transformed into a single-objective problem via the linear weighted method, and future research will investigate multi-objective algorithms associated with transfer learning for RBED. Moreover, this paper assumes that the active power deviation is the only difference between source tasks and new tasks, which reduces the difficulty of transfer learning. In practice, the power grid topology, unit commitment and fault type may vary dramatically, so how to extract these similarities is worth studying;

• RBED is based on the static ED, while the dynamic multi-period coupled constraints are not considered. Hence, ongoing studies will also focus on the extension from single-scenario static optimization to dynamic multi-scenario optimization.
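The transfer and scalarization steps summarized above can be illustrated with a minimal sketch. It assumes, hypothetically, that the similarity between a source task and the new task decreases with their absolute active power deviation, and that the initial knowledge matrix of the new task is the similarity-weighted average of the source knowledge matrices. The function names, the inverse-deviation weighting, the objective weights and all numbers are illustrative assumptions, not the exact formulation used in this paper.

```python
def transfer_weights(source_demands, new_demand, eps=1e-6):
    """Normalized similarity weights from active power deviation.

    Assumption: similarity is the inverse of the absolute deviation.
    """
    deviations = [abs(d - new_demand) for d in source_demands]
    similarities = [1.0 / (dev + eps) for dev in deviations]
    total = sum(similarities)
    return [s / total for s in similarities]


def initial_knowledge(source_matrices, weights):
    """Similarity-weighted average of the source knowledge matrices."""
    n_rows = len(source_matrices[0])
    n_cols = len(source_matrices[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, source_matrices))
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


def weighted_objective(fuel_cost, risk_index, w_cost=0.5, w_risk=0.5):
    """Linear weighted sum of the two objectives (weights are assumed)."""
    return w_cost * fuel_cost + w_risk * risk_index


# Hypothetical source tasks: demand levels (MW) and 2x2 knowledge matrices.
source_demands = [2000.0, 2400.0, 2850.0]
source_matrices = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.0, 1.0], [1.0, 0.0]],
]

weights = transfer_weights(source_demands, new_demand=2350.0)
q0 = initial_knowledge(source_matrices, weights)
```

In this sketch the source task whose demand is closest to the new task dominates the initial knowledge matrix, so the online search starts near a previously learned solution instead of from a blank matrix.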

Figure 1. Interaction between a bacteria's knowledge matrix and the environment.

the reward of a transition from state $s_k^{ij}$ to state $s_{k+1}^{ij}$ under a selected action $a_k^{ij}$ in the $k$th iteration; $\alpha$ and $\gamma$ are the learning factor and discount factor, respectively.

Figure 3. Knowledge learning of BFRL associated with chemotaxis and dispersal.

Figure 4. The procedure of knowledge transfer.

Figure 9. Online optimization process obtained by each algorithm under scenario 4.

Figure 11. Convergence results of objective function of 96 scenarios obtained by each algorithm.

Figure 12. The histogram of optimization results obtained by different algorithms.

Figure 13. The boxplot of the fuel costs distribution.

Figure 14. The boxplot of the risk index distribution.

Table 1. Optimal BFRL parameters obtained through trial-and-error.

Table 2. The main parameter settings of each algorithm.

Table 4. The contingencies used in the studied power system.

Table 5. Convergence time of each algorithm under scenario 4.

Table 6. Average optimization results of 96 scenarios obtained by each algorithm in 10 runs.

Table 7. Convergence performances of objective function of 96 scenarios obtained by each algorithm in 10 runs.