Multi-Agent Cooperation Based Reduced-Dimension Q(λ) Learning for Optimal Carbon-Energy Combined-Flow

Abstract: This paper builds an optimal carbon-energy combined-flow (OCECF) model to optimize the carbon emissions and energy losses of power grids simultaneously. A novel multi-agent cooperative reduced-dimension Q(λ) (MCR-Q(λ)) algorithm is proposed for solving the model. Firstly, on the basis of the traditional single-objective Q(λ) algorithm, the solution space is effectively reduced to shrink the size of the Q-value matrices. Then, based on the concept of ant cooperation, multiple agents are used to update the Q-value matrices iteratively, which can significantly improve the updating rate. Simulations on the IEEE 118-bus system indicate that, compared with conventional Q(λ), the proposed technique can accelerate convergence by hundreds of times while keeping a high global stability, which makes it very suitable for dynamic OCECF in large and complex power grids.


Introduction
With the increasing impact of the greenhouse effect on the environment, the low-carbon economy has gradually become a key development direction for energy-consuming industries. As the largest CO2 emitter, the electric power industry will play an important role in low-carbon economic development [1]. All kinds of energy-consuming enterprises have also begun to focus on the control of carbon emissions, especially the power industry, which accounts for approximately 40% of CO2 emissions worldwide [2]. Generally speaking, low-carbon power involves four sectors: generation, transmission, distribution and consumption. Therefore, how to reduce the carbon emissions of the transmission and distribution sectors of the power grid has become a pressing issue [3,4].
Up to now, numerous scholars have carried out research on all aspects of low-carbon power, including optimal power flow (OPF) [5][6][7], economic emission dispatching [8,9], low-carbon power system dispatch [10], unit commitment [11,12], carbon storage and capture [13,14] and other issues. However, the previous studies mainly focused on the carbon emissions of the generation side, with a lack of research on how to reduce the carbon emissions of the power network (i.e., the transmission and distribution sides). Therefore, the optimal carbon-energy combined-flow (OCECF) model, which can reflect the energy flow and carbon flow distribution of the power grid, is further established in this paper. Basically, the OCECF model is based on the conventional reactive power optimization model, which should not only attempt to minimize the power loss and voltage deviation, but also aim to minimize the carbon flow losses of the power network.

Low-Carbon Power
To achieve a low-carbon operation of a power system, extensive studies have been devoted to the environmental economic dispatch (EED). In EED, the minimization of emissions [25] is generally designed as one part of the objective function. To further improve the operating economy, the uncertainty of wind power was considered in [26,27], in which the power output of a wind turbine was evaluated based on a probability distribution function of the wind speed. Besides, a modified EED combining heat and power economic dispatch was presented in [28], which can achieve an optimal operation of the heat and power systems simultaneously. Furthermore, a coordinated operation of an integrated regional energy system with various energy sources (e.g., CO2-capture-based power) was proposed in [29], in which the demand response was also introduced into EED. To further reduce carbon emissions, the CO2 emission trading system has been combined into the daily operation of energy systems. In [30], a decentralized economic dispatch was proposed that considers carbon capture power plants with carbon emission trading. Moreover, the power uncertainty of wind and photovoltaic energy was fully taken into account in [31,32] based on carbon emission trading. To clarify the internal relation between energy consumption and carbon emissions of power grids, the concept of carbon emission flow was put forward for the first time in [33]. On this basis, the authors of [34-36] carried out theoretical analyses and case verifications of the carbon emission flow calculation and the carbon flow tracking of power systems, respectively.

Application of Meta-Heuristic Algorithms
In fact, the optimal low-carbon operation of a power system faces various complex and difficult optimization problems, e.g., EED. Hence, various meta-heuristic algorithms have been employed for these problems due to their strong searching ability and high application flexibility. In [25], an improved PSO combined with differential evolution was designed for EED. In [26], a so-called exchange market algorithm was used for EED due to its fast convergence and strong global searching ability. In [27], a population-based honey bee mating optimization with an online learning mechanism was presented. Inspired by the well-known tag-team game played in India, the novel Kho-Kho optimization algorithm [28] with excellent optimization performance was proposed for EED. To achieve a distributed optimization for real-time power dispatch, a novel adaptive distributed auction-based algorithm with a varying swap size was proposed in [37]. On the other hand, reinforcement learning-based optimization has attracted many investigations into the optimal operation of power systems. In [23], a distributed multi-step Q(λ) learning was proposed for the complex OPF of a large-scale power system. To satisfy the requirement of multi-objective optimization, an approximate ideal multi-objective solution Q(λ) learning was presented in [36] via the design of multiple Q matrices for different objective functions.

Carbon-Energy Combined-Flow
The carbon-energy combined-flow (CECF) of the power grid is a comprehensive network flow [36] that combines the power flow of the grid with the carbon emission flow attached to it. Among them, the energy flow is the actual network flow, while the carbon emission flow is a virtual network flow, which can be referred to as the carbon flow of the power system. Carbon flow originates in power generation and represents the concept that carbon emissions are transferred from the generation side to the demand side. Like the energy flow, the carbon flow transfers from the sending end to the receiving end; however, unlike the energy flow, only a power supply that produces carbon emissions at the sending end can be called a carbon source, as shown in Figure 1. For a given carbon source, the carbon emission is equivalent to the product of the energy flow and the carbon emission rate of the corresponding generation side [35].
Energy flow is the transmission of electric energy in the power grid. In the process of transmission, there are power losses, commonly known as network losses, which are generally described as follows:

$$P_{loss} = \sum_{(i,j) \in N_L} g_{ij}\left(V_i^2 + V_j^2 - 2V_iV_j\cos\theta_{ij}\right) \qquad (1)$$

where $V_i$ and $V_j$ are the voltage amplitudes of the interconnected nodes $i$ and $j$, respectively; $\theta_{ij}$ is the voltage phase angle difference between nodes $i$ and $j$; $g_{ij}$ denotes the conductance between nodes $i$ and $j$; and $N_L$ denotes the branch set of the power network.
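To make the loss formula concrete, the following minimal Python sketch evaluates Equation (1) over a branch list. The data layout (tuples of node indices and conductance) and all numerical values are illustrative assumptions, not the paper's implementation:

```python
import math

def network_loss(branches, V, theta):
    """Total active network loss per Equation (1).

    branches: list of (i, j, g_ij) tuples (hypothetical layout);
    V, theta: per-node voltage magnitudes (p.u.) and phase angles (rad).
    """
    loss = 0.0
    for i, j, g_ij in branches:
        loss += g_ij * (V[i] ** 2 + V[j] ** 2
                        - 2.0 * V[i] * V[j] * math.cos(theta[i] - theta[j]))
    return loss

# Tiny two-bus illustration with made-up per-unit values.
print(network_loss([(0, 1, 0.02)], V=[1.05, 0.98], theta=[0.0, -0.05]))
```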
In the process of power transmission, the energy flow bears a corresponding amount of carbon flow losses. The tracking of the grid carbon emission flow is based on load flow tracking, and the source of network loss is traced in light of the proportional sharing rule [35]. The ratio of the wth generator to the whole active power injected at node j is

$$\beta_{wj} = \frac{a_{jw}^{(-1)} P_{sw}}{P'_{nj}} \qquad (2)$$

where $P_{sw}$ is the active output of the wth generator; $P'_{nj}$ represents the whole active power injection of node j in the equivalent lossless network; and $a_{jw}^{(-1)}$ is the active power injection weight of the wth generator at node j, whose specific derivation can be found in [23]. The proportion of the wth generator on the outgoing lines at node j is the same, and the line loss is decomposed according to the utilization share of each carbon source on the line. Hence, $\beta_{wj}$ is also the share of the wth generator in the active power losses of line i-j, which can be expressed as follows:

$$P^{w}_{loss,ij} = \beta_{wj} P_{loss,ij}, \quad w \in W \qquad (3)$$

where W denotes the generator set. Therefore, the total carbon flow losses of the power grid can be described by

$$C_{ds} = \sum_{w \in W} \delta_{sw} \sum_{(i,j) \in N_L} \beta_{wj} P_{loss,ij} \qquad (4)$$

where $\delta_{sw}$ denotes the carbon emission rate of the wth generator.
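The following minimal sketch evaluates the decomposition of Equations (3)-(4): each line loss is split among generators by their shares β and weighted by the generators' emission rates. The dictionary/list layout and all numbers are illustrative assumptions:

```python
def carbon_flow_losses(line_losses, beta, delta):
    """Total carbon flow losses C_ds per Equations (3)-(4): generator w
    bears the share beta[w][line] of each line loss, weighted by its
    carbon emission rate delta[w]."""
    return sum(rate * beta[w].get(line, 0.0) * p_loss
               for w, rate in enumerate(delta)
               for line, p_loss in line_losses.items())

# Two generators and two lines (illustrative numbers only).
line_losses = {("i", "j"): 0.8, ("j", "k"): 0.5}     # MW
beta = [{("i", "j"): 0.6, ("j", "k"): 0.3},          # generator 0
        {("i", "j"): 0.4, ("j", "k"): 0.7}]          # generator 1
delta = [0.9, 0.1]                                   # tCO2/MWh
print(carbon_flow_losses(line_losses, beta, delta))
```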

OCECF Model
The OCECF model aims to reduce the network losses and carbon flow losses as much as possible while satisfying the constraints of the power grid and maintaining the voltage stability of the power system. Therefore, the OCECF model can be described as follows [23,36]:

$$\min F(x) = \mu_1 f_1(x) + \mu_2 f_2(x) + (1 - \mu_1 - \mu_2) V_d \qquad (5)$$

where the nonlinear functions $f_1(x)$ and $f_2(x)$ are the carbon flow loss and active power loss components, respectively; $V_d$ is the voltage stability component; $\mu_1$ and $\mu_2$ are the weight coefficients; and the control variable vector $x = [V, \theta, k_t, Q_C]^T$ corresponds to the voltage value of each node of the power grid V, the phase angle of each node θ, the on-load tap changer (OLTC) ratio $k_t$ and the reactive power compensation $Q_C$. The remaining variables can be found in the nomenclature, and $V_d$ can be described as [23]

$$V_d = \sum_{j=1}^{n} \left( \frac{2V_j - V_{jmax} - V_{jmin}}{V_{jmax} - V_{jmin}} \right)^2 \qquad (6)$$

where n represents the number of load nodes; $V_j$ is the node voltage of load node j; and $V_{jmax}$ and $V_{jmin}$ denote the maximal and minimal voltage limits of load node j, respectively.
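A minimal sketch of the weighted objective follows, assuming the reconstructed forms of Equations (5)-(6) above; the equal 1/3 weights match the paper's case studies, while the component values and bounds are made up for illustration:

```python
def voltage_stability_component(V, Vmin, Vmax):
    """V_d per the reconstructed Equation (6): normalized squared
    deviation of each load-node voltage from the middle of its range."""
    return sum(((2.0 * v - vmax - vmin) / (vmax - vmin)) ** 2
               for v, vmin, vmax in zip(V, Vmin, Vmax))

def ocecf_objective(c_ds, p_loss, v_d, mu1=1.0 / 3, mu2=1.0 / 3):
    """Weighted single objective of Equation (5); equal weights of 1/3
    are used in the paper's case studies."""
    return mu1 * c_ds + mu2 * p_loss + (1.0 - mu1 - mu2) * v_d

v_d = voltage_stability_component(V=[1.03, 0.97], Vmin=[0.95, 0.95],
                                  Vmax=[1.05, 1.05])
print(ocecf_objective(c_ds=0.74, p_loss=1.30, v_d=v_d))
```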

Q(λ) Learning
Multi-step backtrack Q(λ) learning is a conventional RL algorithm in which Q-learning is combined with the idea of multi-step TD(λ) returns [38] and the eligibility trace is introduced, such that the convergence speed of the algorithm can be improved to a certain extent. The eligibility trace can be described as [38]

$$e_k(s,a) = \begin{cases} \gamma\lambda e_{k-1}(s,a) + 1, & \text{if } (s,a) = (s_k, a_k) \\ \gamma\lambda e_{k-1}(s,a), & \text{otherwise} \end{cases} \qquad (7)$$

where $e_k(s,a)$ stands for the eligibility trace of the state-action pair (s, a) at the kth iteration; $(s_k, a_k)$ denotes the actual state-action pair of the kth iteration; γ is the discount factor; and λ is the trace-decay factor. The eligibility trace uses a "backward estimation" mechanism to approximate the optimal value function matrix Q*. Setting $Q_k$ as the kth iterative estimate of Q*, the value function of the algorithm can be updated iteratively as follows [39]:

$$Q_{k+1}(s,a) = Q_k(s,a) + \alpha \delta_k e_k(s,a) \qquad (10)$$

$$\delta_k = R(s_k, s_{k+1}, a_k) + \gamma Q_k(s_{k+1}, a_g) - Q_k(s_k, a_k) \qquad (11)$$

where α is the learning factor; $R(s_k, s_{k+1}, a_k)$ is the reward function value at the kth iteration as the environment transits from state $s_k$ to $s_{k+1}$ through the selected action $a_k$; and $a_g$ is the greedy action strategy, i.e., the action corresponding to the highest Q-value in the current state, which can be written as [39]

$$a_g = \arg\max_{a \in A} Q_k(s_{k+1}, a) \qquad (12)$$

where A represents the action set, which is also the alternative action set for each variable.
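A minimal NumPy sketch of one Q(λ) backup follows; the tabular state/action sizes, parameter values and function name are illustrative assumptions:

```python
import numpy as np

def q_lambda_step(Q, E, s, a, s_next, r, alpha=0.1, gamma=0.9, lam=0.8):
    """One Q(lambda) backup per Equations (7) and (10)-(12): decay every
    eligibility trace, accumulate the visited pair, then move all traced
    entries toward the TD target."""
    a_g = int(np.argmax(Q[s_next]))                 # greedy action, Eq (12)
    delta = r + gamma * Q[s_next, a_g] - Q[s, a]    # TD error, Eq (11)
    E *= gamma * lam                                # trace decay, Eq (7)
    E[s, a] += 1.0
    Q += alpha * delta * E                          # backup, Eq (10)
    return Q, E

Q = np.zeros((4, 3))                                # 4 states, 3 actions
E = np.zeros_like(Q)
Q, E = q_lambda_step(Q, E, s=0, a=1, s_next=2, r=1.0)
print(Q[0, 1])                                      # 0.1
```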

Reduced-Dimension of Solution Space
As shown in Figure 2, the traditional single-objective Q(λ) algorithm does not decompose the action space of the variables. Assuming that the ith variable $x_i$ has $m_i$ alternative solutions, the number of action set elements is $|A| = m_1 m_2 \cdots m_n$; when the number of variables n is large, the number of alternative action combinations increases accordingly, which leads to slow convergence and difficulties in the iterative calculation. Up to now, the most common way to tackle this "dimension disaster" is hierarchical reinforcement learning (HRL) [40]. However, it is difficult to determine the hierarchical design and connections, which usually causes the algorithm to converge to a local optimum.

Under the framework of the proposed MCR-Q(λ) learning algorithm, each variable has a corresponding value function matrix $Q_i$, and the action set is divided into $(A_1, A_2, \cdots, A_n)$ with $|A_i| = m_i$. In the iterative optimization of each $Q_i$ matrix, the difficulty of optimization is greatly reduced because the action space is much smaller. Meanwhile, the action space of each variable serves as the state space of the next variable, which strengthens the internal relationship between the variables, as illustrated in Figure 2. The state space of the first variable is divided according to the load scenario.
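The scale of this reduction can be checked with a two-line sketch. The per-variable set sizes below are hypothetical but match the five compensation levels and three OLTC grades used in the IEEE 118-bus case study later in the paper:

```python
from math import prod

# Hypothetical per-variable action-set sizes m_i for n = 8 variables.
m = [5, 5, 5, 3, 3, 3, 3, 3]

print(prod(m))   # single Q(lambda): |A| = m1*m2*...*mn = 30375 combinations
print(sum(m))    # MCR-Q(lambda): n small sets totalling 30 actions
```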

Figure 2. Difference between Q(λ) and MCR-Q(λ).

Multi-Agent Cooperative Search
In the iterative optimization of Q(λ) learning, which employs only a single agent for exploration and exploitation, the Q matrix is updated inefficiently, with just one element per iteration. In contrast, MCR-Q(λ) learning uses multiple agents for exploration and exploitation at the same time, so that multiple elements of each Q matrix can be updated at each iteration and the updating speed of the Q matrices is greatly improved. Here, the value function of MCR-Q(λ) learning can be updated iteratively as follows [23]:

$$Q^i_{k+1}(s^i, a^i) = Q^i_k(s^i, a^i) + \alpha \delta^i_k e^i_k(s^i, a^i) \qquad (13)$$

$$\delta^i_k = R(s^i_k, s^i_{k+1}, a^i_k) + \gamma Q^i_k(s^i_{k+1}, a^i_g) - Q^i_k(s^i_k, a^i_k) \qquad (14)$$

where the superscript i represents the ith variable or the ith Q-value matrix; the superscript j represents the jth objective; and $e^i_k(s^i, a^i)$ and $a^i_g$ are similar to Equations (7) and (12), respectively. As with the Ant-Q algorithm, MCR-Q(λ) does not calculate the global reward function until each individual has selected all the variables, i.e., moved from the start to the end, as shown in Figure 2. The reward function value can be calculated as follows [24]:

$$R(s^i_k, s^i_{k+1}, a^i_k) = \begin{cases} W / L_{Best}, & \text{if } (s^i_k, a^i_k) \in SA_{Best} \\ 0, & \text{otherwise} \end{cases} \qquad (17)$$

where $L_{Best}$ represents the objective function value of the best individual, i.e., the one with the lowest objective function value at the kth iteration; W is a positive constant; and $SA_{Best}$ denotes the set of state-action pairs executed by that best individual at the kth iteration.
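A minimal sketch of one cooperative iteration follows, assuming the per-variable Q and trace matrices are stored as lists of NumPy arrays; the trajectory layout, parameter values and helper name are illustrative, not the authors' code:

```python
import numpy as np

def mcr_q_lambda_iteration(Qs, Es, paths, L_best, sa_best,
                           alpha=0.1, gamma=0.9, lam=0.8, W=100.0):
    """One cooperative iteration per Equations (13)-(14) and (17): every
    agent's visited pair in every per-variable matrix Q^i is backed up,
    and only pairs on the best individual's path earn W / L_best."""
    for path in paths:                    # one (s, a, s_next) path per agent
        for i, (s, a, s_next) in enumerate(path):
            r = W / L_best if (i, s, a) in sa_best else 0.0
            a_g = int(np.argmax(Qs[i][s_next]))
            delta = r + gamma * Qs[i][s_next, a_g] - Qs[i][s, a]
            Es[i] *= gamma * lam
            Es[i][s, a] += 1.0
            Qs[i] += alpha * delta * Es[i]
    return Qs, Es

n_vars, n_states, n_actions = 2, 3, 3
Qs = [np.zeros((n_states, n_actions)) for _ in range(n_vars)]
Es = [np.zeros((n_states, n_actions)) for _ in range(n_vars)]
paths = [[(0, 1, 1), (1, 2, 2)],          # agent 1
         [(0, 0, 0), (0, 1, 1)]]          # agent 2
Qs, Es = mcr_q_lambda_iteration(Qs, Es, paths, L_best=2.5,
                                sa_best={(0, 0, 1), (1, 1, 2)})
print(Qs[0])
```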

Action Selections
As all individuals explore and learn, they face action selections. When individual j prepares to determine the variable $x_i$, its action selection is based on the following equation [41]:

$$a^i_k = \begin{cases} a_s = \arg\max_{a \in A_i} Q^i_k(s^i_k, a), & \text{if } q \le q_0 \\ \text{roulette-wheel selection based on } P^i_k, & \text{otherwise} \end{cases} \qquad (18)$$

where q is a random number; $q_0$ is a positive constant determining the probability of the pseudo-random selection; and $a_s$ denotes the action determined by the pseudo-random selection. In this paper, the roulette-wheel selection method is adopted to determine the action according to the distribution $P^i_k$ of the action probability matrix, which is calculated as follows:

$$P^i_k(s, a) = \frac{Q^i_k(s, a)}{\sum_{a' \in A_i} Q^i_k(s, a')} \qquad (19)$$

When an individual finds the best objective function value, the probability of the corresponding state-action pair is increased, which attracts other individuals to perform the same action. When the algorithm converges, all individuals perform the same state-action pairs when selecting the variables from the start to the end.
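The selection rule of Equations (18)-(19) can be sketched as below, assuming non-negative Q-values (as in Ant-Q) so the normalization yields a valid probability row; the function name and parameter values are illustrative:

```python
import random

def select_action(Q_row, q0=0.9):
    """Action choice per Equations (18)-(19): exploit greedily with
    probability q0, otherwise roulette-wheel sample from the Q-derived
    probability row."""
    if random.random() <= q0:
        return max(range(len(Q_row)), key=lambda a: Q_row[a])
    probs = [q / sum(Q_row) for q in Q_row]          # Eq (19)
    r, cum = random.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if r <= cum:
            return a
    return len(Q_row) - 1                            # guard for rounding

print(select_action([0.2, 0.5, 0.3]))
```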

Design of State and Action
As mentioned above, the action space of each variable is designed to be the state space of the next variable, and the state space of the first variable is designed to be the state set of the environment (i.e., the power grid). For OCECF, the power grid load scenario can be designed as the state of the first variable, where load scenarios are divided every 15 min and scenarios with similar loads are assigned to the same state; e.g., the power grid load scenarios with different loads at 11:00 a.m. and 11:15 a.m. are regarded as two different states.
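A minimal sketch of this state design follows: a 15-min load snapshot is binned into a discrete state so that similar loads share a state. The thresholds and numbers are made-up assumptions:

```python
def load_scenario_state(load_mw, bounds):
    """Map a 15-min load snapshot to a discrete state index, so that
    scenarios with similar total load share a state. `bounds` are
    hypothetical MW thresholds separating the scenarios."""
    for state, upper in enumerate(bounds):
        if load_mw <= upper:
            return state
    return len(bounds)

bounds = [3500, 3800, 4100, 4400]         # made-up thresholds, 5 scenarios
print(load_scenario_state(3950, bounds))  # -> state 2
```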
In addition, OCECF mainly optimizes the carbon emissions on the power grid side, and the variables in the model are mainly divided into two categories: (a) the reactive power compensation devices and (b) the OLTC ratios. Thus, the action set corresponding to each variable is a discrete set of optional actions for the reactive power compensation quantity or the transformer tap ratio.

Design of Reward Function
As shown in Equation (17), $L_{Best}$ represents the optimal objective function value among all individuals. According to the OCECF model described by Equation (5), the inequality constraints are brought into the objective function as a penalty, and the objective function value obtained by individual j then becomes [41]

$$L_j = F(x_j) + \sigma N_j, \quad j = 1, 2, \cdots, J \qquad (20)$$

where $N_j$ denotes the number of unsatisfied inequality constraints calculated by the power flow after individual j determines the variables; σ is a positive penalty coefficient; and J is the number of groups.
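A one-function sketch of this penalty scheme follows; `sigma` is a hypothetical penalty coefficient introduced for illustration, not a value from the paper:

```python
def penalized_objective(F_x, n_violations, sigma=10.0):
    """Constraint handling sketch per the reconstructed Equation (20):
    inflate the raw objective F(x) of individual j by sigma for every
    unsatisfied inequality constraint, so infeasible individuals rarely
    become L_Best. `sigma` is a hypothetical penalty coefficient."""
    return F_x + sigma * n_violations

# Feasible vs. two-violation individual with the same raw objective.
print(penalized_objective(1.8, 0), penalized_objective(1.8, 2))
```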

Parameter Setting
In MCR-Q(λ) learning, six parameters, γ, λ, α, q0, J and W, have a great influence on the performance of the algorithm [36]. After a large number of trial-and-error simulation tests, the parameters are set as indicated in Table 1.

Algorithm Flow of the OCECF
Generally speaking, the algorithm flow of OCECF based on MCR-Q(λ) learning is shown in Algorithm 1.

Algorithm 1. MCR-Q(λ) learning-based OCECF.
1: Initialization: Q-value matrices Q^i, action probability matrices P^i and eligibility trace matrices e^i, i = 1, 2, ..., n;
2: Input the power flow calculation result;
3: Calculate the fitness values of all individuals;
4: Set k := 0;
5: WHILE k < k_max
6: FOR i = 1 to n
7: According to Equations (18) and (19), individual j selects the corresponding action a^i_k of each variable in turn and records the next state;
8: Calculate the power flow for all variables x determined by the individuals;
9: END FOR
10: According to Equations (1) and (4)-(6), respectively calculate the line loss P_loss, the carbon loss C_ds, the number N of unsatisfied inequality constraints, and the voltage stability component V_d;

Table 1. Parameter settings of MCR-Q(λ) learning (parameter, range and value).

Case Studies
For the purpose of testing the optimization performance of MCR-Q(λ) learning, the simulation results of Q(λ) learning, Q-learning [41], the quantum genetic algorithm (QGA) [42], GA [43], PSO [44], the ant colony system (ACS) [45], the group search optimizer (GSO) [46] and the artificial bee colony (ABC) [47] are also introduced for comparison. Note that the weight coefficients in Equation (5) can be adjusted according to the preference for the different components of the objective function. In the simulation analysis, since the three components of the objective function in Equation (5) have the same preference, the weight coefficients in Equation (5) are set to 1/3. Both the IEEE 118-bus and IEEE 300-bus test systems are taken from the MATPOWER tool [48], and their detailed parameters can be found in [49]. Besides, it is assumed that both the wind and solar energy outputs can be accurately acquired using effective forecasting techniques, e.g., a deep long short-term memory recurrent neural network [50]. All the algorithms are simulated and tested in Matlab 2016b on a personal computer with an Intel Core i5-4210 CPU at 2.6 GHz and 8 GB of RAM.

Simulation Model
According to the different generator types, the carbon emission rate δ_sw of each unit in the IEEE 118-bus system is summarized in Table 2. Besides, this paper adopts the same benchmark model of the IEEE 118-bus system in all case studies; the related detailed parameters can be found in [36]. Moreover, the system load of the IEEE 118-bus system is divided into five scenarios, as shown in Table 3. In particular, scenarios 1 to 5 represent the system under different load demands, where the load demand gradually increases from scenario 1 to scenario 5 for all the nodes presented in Table 3. As mentioned above, Tables 2 and 3 are obtained under the same benchmark model of the IEEE 118-bus system [36].
In fact, reactive power compensation can be installed at nodes with generators or load demand to provide adequate reactive power, while the OLTC ratio can be selected for lines connecting two different voltage nodes. According to this rule, the reactive power compensation of nodes 45, 79 and 105, and the OLTC ratios of lines 8-5, 26-25, 30-17, 63-59 and 64-61 are selected as the controllable variables, defined in sequence as $(x_1, x_2, \cdots, x_8)$, where: (1) the reactive power compensation is divided into five configurations, {−40%, −20%, 0%, 20%, 40%}, relative to its reference value; (2) the OLTC ratio is divided into three grades, {0.98, 1.00, 1.02}.
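The action sets above can be laid out directly in code. The sketch below decodes one action index per variable into a physical setpoint; the `decode` helper and index layout are illustrative assumptions:

```python
# Action sets of the eight IEEE 118-bus controls: three Q_C nodes
# (45, 79, 105) and five OLTC lines (8-5, 26-25, 30-17, 63-59, 64-61).
q_c_levels = [-0.40, -0.20, 0.00, 0.20, 0.40]   # relative to reference Q_C
oltc_grades = [0.98, 1.00, 1.02]
action_sets = [q_c_levels] * 3 + [oltc_grades] * 5

def decode(action_indices):
    """Map indices for (x1, ..., x8) to setpoints, e.g. before handing
    the result to a power-flow solver (solver call not shown)."""
    return [action_sets[i][a] for i, a in enumerate(action_indices)]

print(decode([4, 2, 0, 1, 1, 2, 0, 1]))
```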
Hence, the optimization variables of the IEEE 118-bus system can be found in Table 4, where the variables are divided into two types, i.e., the reactive power compensation and the OLTC ratio; the "no. of bus" represents the location of each variable in the power network; the "action space" denotes the set of alternative control actions for each variable; and the "variable number" is the number of all the optimization variables.

Figure 3 illustrates the convergence process of the Q-value deviation of Q(λ) learning and MCR-Q(λ) learning under scenario 1, where the Q-value deviation is defined as the 2-norm of the matrix $(Q_{k+1} - Q_k)$, i.e., $\|Q_{k+1} - Q_k\|_2$. As shown in Figure 3a, since the Q matrix of single-objective Q(λ) learning is large and its updating speed is slow, the algorithm converges to the optimal Q* matrix only after a variety of trial-and-error explorations, with a convergence time of about 530 s. In contrast, after reducing the dimension of the solution space in MCR-Q(λ) learning, the $Q_i$ matrix corresponding to each variable is very small, and 20 agents update the matrices at the same time. The optimization speed is more than 100 times that of Q(λ) learning, converging after about 3.5 s, as shown in Figure 3b. Moreover, the convergence of the objective function values in Figure 4 shows that the optimization speed of MCR-Q(λ) learning is much faster, while both algorithms can converge to the global optimal solution. When MCR-Q(λ) learning converges, the value function matrix $Q_i$ and probability matrix $P_i$ corresponding to all variables will prefer one state-action pair, and all individuals will tend to be consistent in selecting the action, as demonstrated in Figure 5.

Comparative Analysis of Simulation Results
For the purpose of evaluating the optimization capability of MCR-Q(λ) learning, this section applies all the algorithms to solve the OCECF model for 10 repetitions. For each method, the objective function value is taken directly to evaluate the quality of a solution during the searching process, which is the most crucial index of optimization performance. Table 5 shows the average convergence results of the 10 repetitions for the different algorithms, from which it can be found that: (a) the optimal solutions obtained by Q-learning and Q(λ) learning are the best, but their optimization times are also the longest, which shows the strong ergodicity of RL; (b) the convergence objective values of MCR-Q learning and MCR-Q(λ) learning are the closest to those of Q-learning and Q(λ) learning, while their convergence times are the shortest, with a convergence speed about 100 times that of single-objective Q-learning and Q(λ) learning; (c) RL improves the algorithmic speed by up to 37.13% with the introduction of the eligibility trace (λ) returns mechanism; (d) with the increase in the load scenario, the line losses and carbon losses of the power grid also increase correspondingly; however, since the power system has a sufficient reactive power supply, its voltage stability component changes only slightly.

Figure 6 compares the results of the different methods, where each value is the average of the sum over the five scenarios in 10 runs. It is obvious that the result obtained by GA is the worst among all the methods due to its premature convergence. On the other hand, although the proposed MCR-Q(λ) learning achieves only a slight improvement in each index compared with the other methods, it obtains the lowest total carbon flow loss and objective function value. This verifies that the proposed method can effectively satisfy the low-carbon requirement from the viewpoint of power networks.

Lastly, Table 6 gives the statistical convergence results of the 10 repetitions for the different algorithms, from which it can be found that: (a) Q-learning and Q(λ) learning have the highest convergence stability and converge to the global optimal solution every time; (b) the statistical variance and standard deviation of MCR-Q(λ) learning are the closest to those of Q-learning and Q(λ) learning, indicating a relatively high convergence stability; (c) apart from RL, the other algorithms are more likely to be trapped at a local optimum because of their parameter settings and lack of learning ability.

According to the different generator types, the carbon emission rate δ_sw of each unit in the IEEE 300-bus system is summarized in Table 7. Besides, 96 different load scenarios are designed to simulate the different optimization tasks in a day for the IEEE 300-bus system, as shown in Figure 7. Moreover, the optimization variables are given in Table 8.

Figure 7. The load scenarios of the IEEE 300-bus system.

Comparative Analysis of Simulation Results
For the purpose of evaluating the optimization capability of MCR-Q(λ) learning, this section applies all the algorithms to solve the OCECF model for 10 runs. Since the number of optimization variables of the IEEE 300-bus system increases dramatically, the conventional Q and Q(λ) algorithms cannot perform the optimization due to the dimension disaster. Figure 8 compares the results of the different methods, where each value is the average of the sum over a day in 10 runs. It can be found that the proposed MCR-Q(λ) learning significantly outperforms the other methods in terms of the total carbon flow loss, total power loss, voltage stability component and objective function. Hence, the MCR-Q(λ) learning-based OCECF can achieve a low-carbon operation of the power network. In particular, the values obtained by MCR-Q(λ) learning are 2.0%, 3.4%, 45.9% and 10.3% lower than those obtained by GSO. This verifies that the optimization performance of MCR-Q(λ) becomes much better than that of other conventional meta-heuristic algorithms as the system scale increases. Besides, Table 9 gives the distribution statistics of the objective function under the different algorithms in the IEEE 300-bus system, where each value is the sum of the objective function over a day in 10 runs; the best, worst, variance and standard deviation (Std. Dev.) are calculated to evaluate the convergence stability [51]. It can be seen from Table 9 that the convergence stability of MCR-Q(λ) learning is the highest among all the methods, with the smallest variance and standard deviation of the objective function.

Conclusions
This paper builds an OCECF model to optimize the carbon emissions and energy losses of power grids simultaneously and proposes a new MCR-Q(λ) learning algorithm to solve it, with the following four contributions/novelties: (1) the OCECF model carefully considers the distribution of carbon flow in the power grid, which effectively addresses carbon emission optimization on the power grid side; (2) MCR-Q(λ) learning is proposed for the first time; it largely reduces the dimension of the solution space and significantly accelerates the updating rate of the Q-value matrices via multi-agent cooperative exploration learning, such that the optimization speed is considerably increased; (3) compared with Q(λ) learning, the convergence speed of MCR-Q(λ) learning is increased by about 100 times while a higher global convergence stability is guaranteed, which makes it very suitable for solving dynamic OCECF in a large and complex power grid; (4) like ACO, MCR-Q(λ) learning is also suitable for solving various other complex optimization problems.
To further improve the operation benefit of power grids, future works can focus on the carbon trading system-based optimal power flow and the Pareto-based multi-objective learning methods, while a decentralized optimization will be studied for high operation privacy and reliability.

Conflicts of Interest:
The authors declare no conflict of interest.

Nomenclature
P_Gi, Q_Gi  active and reactive power generation of the ith node
P_Di, Q_Di  active and reactive power demand of the ith node
V_i, V_j  voltage magnitudes of the ith and jth nodes
b_ij  susceptance of line i-j
S_i  apparent power flow of the ith transmission line
N_i  node set
N_L  set of branches of the power network
N_G  set of units
N_H  set of hydro units
N_B  set of PQ nodes
N_C  set of compensation equipment
N_K  set of on-load transformers
k_t  on-load tap changer ratio
Q_C  reactive power compensation
θ  phase angle of each node
V_d  voltage stability component
V_jmin, V_jmax  minimum and maximum voltage limits of load node j
μ_1, μ_2  weight coefficients
W  generator set
(s_k, a_k)  actual state-action pair of the kth iteration
δ_k, ρ_k  estimates of the Q-function errors
R(s_k, s_{k+1}, a_k)  reward function value at the kth iteration as the environment transits from state s_k to s_{k+1} through the selected action a_k
a_g  greedy action strategy
A  action set
L_Best  objective function value of the best individual, i.e., the one with the lowest objective function value at the kth iteration
SA_Best  state-action pair set of the best individual executed at the kth iteration
γ  discount factor
λ  trace-decay factor
α  learning factor
J  number of groups

Abbreviations
OCECF  optimal carbon-energy combined-flow
OLTC  on-load tap changer
MCR-Q(λ)  multi-agent cooperative reduced-dimension Q(λ)
HRL  hierarchical reinforcement learning
EED  environmental economic dispatch