Pruning the Communication Bandwidth between Reinforcement Learning Agents through Causal Inference: An Innovative Approach to Designing a Smart Grid Power System

Electricity demand is increasing significantly, and the traditional power grid system is facing huge challenges. As the desired next-generation power grid system, the smart grid can provide secure and reliable power generation and consumption, and can also realize coordinated and intelligent power distribution. Coordinating grid power distribution usually requires mutual communication between power distributors. However, the power network is complex, the network nodes are far apart, and communication bandwidth is often expensive. Therefore, reducing the communication bandwidth required by the cooperative power distribution task is crucially important. One way to tackle this problem is to build mechanisms that send communications selectively, allowing distributors to send information only at key moments and states. The distributors in the power grid are modeled as reinforcement learning agents, and the communication bandwidth in the power grid can be reduced by optimizing the communication frequency between agents. Therefore, in this paper, we propose a model for deciding whether to communicate based on causal inference, the Causal Inference Communication Model (CICM). CICM regards whether to communicate as a binary intervention variable and determines which intervention is more effective by estimating the individual treatment effect (ITE). It offers an optimal communication strategy about whether to send information while ensuring task completion. This method effectively reduces the communication frequency between grid distributors while maximizing the power distribution effect. In addition, we test the method in StarCraft II and Habitat 3D environment experiments, which further demonstrates its effectiveness.


Introduction
Context, Literature and Limitations: With the gradual increase in electricity demand, the power grid system must meet the needs of a more diversified market while ensuring safe, reliable and stable performance, which brings great challenges to the traditional power grid [1][2][3]. Therefore, the next-generation power grid system, the smart grid, is designed to meet these challenges [4]. It is based on a more intelligent power distribution method to ensure the stability of users' electricity consumption [5], and requires the simultaneous cooperation of multiple distributors on the network to meet customers' needs [6]. To achieve this, researchers have begun to use multi-agent reinforcement learning algorithms to solve such coordination problems. In causal inference, the difference in outcomes between communicating and not communicating can be predicted by estimating the individual treatment effect (ITE) [29][30][31]. Therefore, we propose a causal inference communication model to address the communication bandwidth problem among agents. By employing causal inference, we regard whether or not to communicate as an intervention variable. Based on the intervention variable, we can estimate the effect of communication on the outcome and decide whether the agents need to communicate. The learned causal model allows agents to send messages selectively, which reduces the amount of information exchanged between agents, thus reducing the communication bandwidth.
This paper proposes a collaborative multi-agent model with causal inference, referred to as the Causal Inference Communication Model (CICM), to help each reinforcement learning agent in the system decide whether to communicate or not. Specifically, we implement CICM using the following steps. First, to connect reinforcement learning with a causal model, CICM embeds the control problem into a graphical model via the Bayes formula, which is at the same time a causal model. Then, we parameterize the causal graph as a latent variable model based on Variational Autoencoders (VAE) [32]. By evaluating the ITE [30], we can determine whether communication is beneficial. Thus, agents communicate only at critical moments and necessary states, reducing the communication bandwidth in the smart grid system. A summary of our objectives is listed below: (1) To apply the causal model to optimize communication selection in the smart grid problem, as well as in collaborative agent tasks. (2) To formulate a graphical model that connects the control problem and causal inference and decides the necessity of communication. (3) To prove empirically that the proposed CICM model effectively reduces communication bandwidth in smart grid networks and collaborative agent tasks while improving task completion.
Contributions of this Paper: This paper makes the following original contributions to the literature: (1) This is the first study to apply a causal model to optimize communication selection in collaborative agent tasks in the smart grid. (2) A new graphical model is developed that connects the control problem and causal inference and decides the necessity of communication. (3) Innovative numerical evidence is provided that the proposed CICM model effectively reduces communication bandwidth.
Structure of the Paper: Section 2 presents the existing literature on the smart grid, communications in cooperative agents, and causal inference, along with their limitations. Section 3 discusses the concepts and structure of the graph model adopted in this study, while Section 4 presents the causal graphical model and the inference method, the Causal Inference Communication Model (CICM). Datasets, environments and computational experiments are reported in Section 5, and the results of these experiments are analyzed in Section 6. Research limitations and threats to validity are discussed in Section 7. The conclusion and further research are stated in Section 8.

Related Works: Smart Grids, Communications in Cooperative Agents, Causal Inference
This section discusses related works, including the smart grid, communications in cooperative agents, and causal inference. Firstly, recent research progress on smart grid distributors is introduced; it is a trend to model the distributors in a power system as reinforcement learning agents. Because of the communication bandwidth problem between power grid nodes, research progress on multi-agent communication is then discussed. Finally, causal inference algorithms for reinforcement learning are introduced.

Smart Grid
Smart grids are functionally able to provide new abilities such as self-healing [33], energy management [6], and real-time pricing [5]. The realization of these functions requires advanced technologies such as intelligent power distribution, distributed power generation, and distributed storage [34]. Among these, intelligent power distribution plays a key role in the smart grid, as it is the intermediate component connecting power generation and consumers [35]. Much work has been done on the optimization of power distribution systems in recent years. In order to reduce the number of repeaters in the power distribution system, Ahmed et al. [35] combine customers and distributors into one domain and improve network performance with a limited number of smart relays. In terms of power distribution, Ma et al. [6] utilize a collaborative learning multiswarm artificial bee colony algorithm to search for the optimal value of power distribution systems. Kilkki et al. [36] propose an agent-based model for simulating different smart grid frequency control schemes. Communication-based models also play a role in power regulation stability: Li [1] studies cooperative communication to improve the reliability of transmission in smart grid networks, and Yu et al. [37] propose a novel distributed control strategy based on a back-and-forth communication framework to optimally coordinate multiple energy storage systems for multi-time-step voltage regulation. Although many achievements have been made in distributed power regulation, and research on communication in distributed power has made great progress, there is still little research on how to reduce communication in cooperative grid systems. Therefore, our work focuses on reducing the communication between distributed grid nodes while ensuring system stability.

Communications in Cooperative Agents
Communication, which is the process of agents sharing observations and experiences [38], is the bridge of cooperation in multi-agent systems. In collaborative multi-agent reinforcement learning, much attention has been paid to how to learn more effective communication strategies. Foerster et al. [22] propose differentiable inter-agent learning, in which agents back-propagate error derivatives through communication channels, sharing gradients between agents to learn an effective communication mechanism. Hoshen [39] proposes attentional multi-agent predictive modeling, which uses an attention vector to select helpful agents to interact with, achieving better performance than the average pooling method [23]. ATOC [40] extends these methods through local interaction. Das et al. [25] propose targeted multi-agent communication, which distinguishes the importance of a message and transmits it to different agents. Recently, there have been attention-based methods that infer the relationships between agents and guide an agent to send messages to the most relevant agents [17,[41][42][43][44][45]. These algorithms have made many improvements in the way communication is utilized to improve the effect of collaboration. However, every time the agent system communicates, all agents must exchange information, which does not essentially reduce unnecessary communication bandwidth. Therefore, our method focuses on letting each agent selectively send important information to reduce communication bandwidth.

Causal Inference
Causal inference plays a significant role in improving the correctness and interpretability of deep learning systems [46][47][48]. Researchers have used causal inference in the reinforcement learning community to mitigate estimation errors of off-policy evaluation in partially observable environments [49][50][51]. Bibaut et al. [49] use the maximum likelihood estimation principle of causal inference objectives to study the problem of off-policy evaluation in reinforcement learning. Tennenholtz et al. [50] focus on the second level of causal theory, aiming to understand the effects of intervening in the world through a strategy and taking different actions. Oberst and Sontag [51] focus on counterfactuals and study what happens in a specific trajectory under different strategies with all other variables (including random noise variables) held the same. Besides, causal inference helps enhance the interpretability of machine learning models in online learning and bandit methods [52][53][54][55]. Jaber et al. [52] provide a method to learn policy values from off-policy data based on state and action using a latent variable model. Lu et al. [55] take the confounding variable into account, solve the complete reinforcement learning problem with the observed data, and put forward a general formulation of the reinforcement learning problem.
Causal inference in single-agent reinforcement learning has been extensively studied. However, research on the communication effects of reinforcement learning agents using causal inference is relatively scarce. Our research establishes a generalized reinforcement learning formulation with communication, similar to [55]. With causal inference, we focus on the different effects of communication interventions on agents.

A Graphical Model: Preliminaries
Transforming the distributor cooperation problem in smart grids into a multi-agent cooperation problem facilitates our modeling and better problem definition. In multi-agent and collaborative agent systems, not all communications are necessary, and some even degrade model performance [17,26]. Therefore, an appropriate communication selection mechanism can not only promote cooperation among agents but also reduce communication bandwidth. As a mechanism to reduce communication, causal inference can evaluate the impact of communication through the rewards of agents, so that an agent can determine whether sending a communication is necessary and beneficial in the current state. Therefore, in this section, we first introduce the definitions of the causal model and then elaborate on the use of graphical models to represent reinforcement learning.

Causal Models
In a smart grid system, the difference in effect between distributor communication and non-communication can be predicted by estimating the ITE. Estimating the ITE, defined as the expected difference between treatment and control outcomes, is one of the most challenging problems in causal inference [30,31]. An individual belongs to either the treatment group or the control group. The entire vector of potential outcomes can never be obtained exactly, because only factual outcomes are revealed by experimental results; counterfactual outcomes cannot be accessed directly. One way to tackle this issue is to estimate counterfactual outcomes from visible observational data. Let us first sort through some definitions for the identification of causal inference.

Definition 1. Graphs as Models of Interventions
A probability model is used to describe the relationships between variables in a smart grid system. Compared to other probabilistic models, causal models can be used to estimate the effect of interventions. This additional feature requires the joint distribution to be complemented with a causal diagram, i.e., a directed acyclic graph (DAG) G that identifies the causal connections between variables [56][57][58]; such a model is called a Bayesian network. Based on the Bayes theorem, we can decompose the joint distribution into a child-parent form following the DAG. Then, by leveraging the chain rule of probability calculus, we formulate the joint distribution as:

p(x_1, \dots, x_n) = \prod_j p(x_j \mid pa_j),

where x_j are the vertices (nodes) and pa_j are the parents of x_j.
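The child-parent factorization above can be sketched in a few lines of code. This is a minimal, hypothetical example (the DAG, variable names, and probability values are illustrative, not taken from the paper's grid model):

```python
# Sketch: evaluate a joint probability over a small DAG via the
# factorization p(x1, ..., xn) = prod_j p(xj | pa_j).
def joint_probability(assignment, cpts, parents):
    """assignment: {var: value}; parents: {var: [parent vars]};
    cpts: {var: f(value, parent_values) -> probability}."""
    p = 1.0
    for var, value in assignment.items():
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpts[var](value, pa_vals)
    return p

# Toy two-node DAG: H -> Y.
parents = {"H": [], "Y": ["H"]}
cpts = {
    "H": lambda v, pa: 0.5,                         # p(H = v) is uniform
    "Y": lambda v, pa: 0.9 if v == pa[0] else 0.1,  # Y tends to copy H
}
p = joint_probability({"H": 1, "Y": 1}, cpts, parents)  # 0.5 * 0.9 = 0.45
```

The same function applies to any DAG once its conditional probability tables are specified.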

Definition 2. Individual Treatment Effect (ITE)
As shown in Figure 1a, ITE estimation measures the effect of a treatment h on the outcome Y_h in the causal graph G. In the case of a binary action set h ∈ {0, 1}, where action 1 represents "treatment" and action 0 represents "control", the ITE is formulated as ITE = Y_1 − Y_0, where Y_0 is the potential outcome under treatment h = 0 and Y_1 is the potential outcome under treatment h = 1. In this work, we construct our objective using notation from probability theory: Z, H, Y are random variables, and z, h, y are instances of these random variables.
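The definition ITE = Y_1 − Y_0 is trivial to compute once both potential outcomes are available; the difficulty, as noted above, is that one of them is always counterfactual. A minimal sketch with illustrative outcome values:

```python
# ITE = Y1 - Y0: difference between the potential outcome under
# treatment (h = 1) and under control (h = 0). In practice one of the
# two must be estimated, since only one can ever be observed.
def individual_treatment_effect(y1, y0):
    return y1 - y0

ite = individual_treatment_effect(y1=0.8, y0=0.3)  # 0.5: treatment helps
```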

Definition 3. Back-door Criterion
The pre- and post-intervention processes are shown in Figure 1a. After intervention, the edges pointing to the intervention variable are removed, and the data distribution changes accordingly. Calculating the impact of interventions on outcomes requires the back-door criterion. A set of variables Z satisfies the back-door criterion relative to an ordered pair of variables (H, Y) in a DAG G if (1) no node in Z is a descendant of the intervention variable H; and (2) Z blocks every path between H and Y that contains an arrow into H. The back-door adjustment formula is written as:

p(y \mid do(h)) = \sum_z p(y \mid h, z) \, p(z),

where do(·) means applying an intervention on the variable.
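The back-door adjustment formula can be evaluated directly when the adjustment set Z is discrete. A hedged sketch with hypothetical probability tables (the numbers are illustrative only):

```python
# Back-door adjustment: p(y | do(h)) = sum_z p(y | h, z) * p(z),
# for a binary adjustment variable Z and outcome Y = 1.
p_z = {0: 0.6, 1: 0.4}          # marginal p(Z = z)
p_y_given_hz = {                # p(Y = 1 | h, z)
    (0, 0): 0.2, (0, 1): 0.5,
    (1, 0): 0.7, (1, 1): 0.9,
}

def p_y_do_h(h):
    return sum(p_y_given_hz[(h, z)] * p_z[z] for z in p_z)

# Average causal effect of the intervention:
effect = p_y_do_h(1) - p_y_do_h(0)  # 0.78 - 0.32 = 0.46
```

Note that this differs from naive conditioning p(y | h) whenever Z confounds H and Y.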

Reinforcement Learning as a Graphical Model
In smart grid systems, the strategy of distributors is obtained using reinforcement learning algorithms. Describing reinforcement learning as a graphical model allows us to relate it to causal inference. Therefore, this subsection integrates the notation of the causal model into the reinforcement learning (RL) formulation. At time step t, the state and action are x_t and a_t, respectively. States, actions, and next states are represented by nodes in the graph model and linked by relational edges. The state transition can be expressed as p(x_{t+1} | x_t, a_t), which can be viewed as the influence relationship between state and action. In the state transition process, the environment feeds back a reward r to the agent, which cannot be directly expressed in the graph model. A binary variable O_t is introduced to indicate whether an action is optimal: O_t = 1 means action a_t is optimal at time step t given state x_t, and O_t = 0 means it is not. The probability distribution of O_t is p(O_t = 1 | x_t, a_t), which has an exponential relationship with the reward r(x_t, a_t):

p(O_t = 1 \mid x_t, a_t) = \exp(r(x_t, a_t)).

In the following, for convenience, we drop "= 1" from the rest of the derivation and default to O_t = 1, so p(O_t | x_t, a_t) = exp(r(x_t, a_t)). The graph model after adding the node representing the random variable O_t is shown in Figure 1b. By optimizing the distribution p(O_t) via approximate inference, we obtain a maximum entropy objective function [59].
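The exponential relationship above yields a valid probability only when rewards are non-positive, a standard convention in the control-as-inference framework (rewards can always be shifted by a constant). A minimal sketch with illustrative reward values:

```python
import math

# Optimality likelihood of control-as-inference:
# p(O_t = 1 | x_t, a_t) = exp(r(x_t, a_t)), valid when r <= 0.
def optimality_prob(reward):
    assert reward <= 0, "shift rewards to be non-positive first"
    return math.exp(reward)

p_opt = optimality_prob(-0.1)  # ~0.905: near-optimal state-action pair
p_bad = optimality_prob(-3.0)  # ~0.050: poor state-action pair
```

Higher reward thus translates monotonically into higher probability of the optimality event.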

Method: The Causal Inference Communication Model (CICM)
The challenge of distributor cooperation in smart grids is a natural multi-agent collaboration problem. In order to reduce the communication frequency in smart grids and thereby the communication bandwidth, we propose the CICM. This section describes CICM in detail. We first discuss reinforcement learning with the causal model and establish a graphical model, which offers a flexible framework. Then, based on the graphical model, we introduce the ITE to determine whether agents need to send communications.

Connecting Causal Model and Reinforcement Learning
Strategies for distributors in a smart grid can be learned using reinforcement learning, and the communication strategies between distributors can be obtained using causal inference. To connect them, we integrate the graphical model of reinforcement learning and causal inference into a single graphical model. Reinforcement learning embedded in a graphical model is shown in Figure 1b; its objective function can be obtained via maximum entropy [59] through approximate inference. To integrate with the causal model, this paper extends the graphical model of reinforcement learning by introducing a latent variable z and an intervention variable h, as shown in Figure 2. In smart grid systems, the intervention variable h refers to whether the distributor node accepts external messages. The latent variable z is learned with a variational autoencoder (VAE) as a state representation vector for the control problem. Through the VAE, we obtain an informative supervision signal.
(Figure 2 caption fragment: the dashed red arrows denote the variational approximation q_φ(z_t | x_t, m_t, h_t); m_{−i,t} is the message that comes from other agents; the overall computational path of CICM is shown in (c).)

Figure 2 presents the probabilistic graphical model, containing the latent variable z_t, the agent's observation x_t, the intervention variable h_t, the outgoing communication m_t, the communication from other agents m_{−i,t}, and the action a_t. We first use the chain rule to express the relationship between the variables with the Bayesian formula (Equation (4)). The variational distribution of the graph model can be written as the product of a recognition term q(z_t | x_t, a_t, h_t, m_t) and a policy term π(a_t | x_t, h_t, m_{−i,t}) (Equation (5)). Optimizing the evidence lower bound (ELBO) maximizes the marginal likelihood log p(x_t, O_t, h_t, m_t) [59]. From the posterior of the variational distribution (Equation (5)) and the likelihood of the joint distribution (Equation (4)), the marginal likelihood can be bounded via Jensen's inequality:

\log p(x_t, O_t, h_t, m_t) \geq \mathbb{E}_{q}\left[ \log p(x_t, O_t, h_t, m_t, z_t, a_t) - \log q(z_t, a_t \mid x_t, h_t, m_t, m_{-i,t}) \right].

The gap in this bound (Equation (6)) is the KL-divergence of the approximate posterior from the true posterior. Since log p(x_t, O_t, h_t, m_t) is fixed and the KL-divergence is non-negative, we convert the problem into optimizing the ELBO [32]. We rewrite the ELBO as two terms (Equations (8) and (9)) and present the complete derivation of the ELBO in Appendix A.
For simplicity, we omit the constant term, i.e., the uniform action prior log p(a_t), in the ELBO. The first term (Equation (8)) corresponds to the latent variable model, and the second term (Equation (9)) is the maximum entropy objective function [59]. Figure 3 shows the architecture of the model and inference network, which includes the encoder and decoder of the VAE.
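To make the ELBO concrete, the following sketch evaluates a standard VAE-style bound with a diagonal Gaussian posterior, a standard normal prior, and a Bernoulli likelihood. This is a simplified, hypothetical instance: the paper's full ELBO additionally conditions on h_t and m_t, and all numbers below are illustrative.

```python
import math

# ELBO = E_q[log p(x | z)] - KL(q(z | x) || p(z)), evaluated at a single
# posterior sample's parameters for illustration.
def kl_diag_gaussian(mu, log_var):
    # Closed-form KL(N(mu, exp(log_var)) || N(0, 1)), summed over dims.
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def bernoulli_log_lik(x, probs):
    eps = 1e-7  # numerical safety for log
    return sum(xi * math.log(pi + eps) + (1 - xi) * math.log(1 - pi + eps)
               for xi, pi in zip(x, probs))

x = [1.0, 0.0, 1.0]                    # observed binary data
recon_probs = [0.9, 0.1, 0.8]          # decoder output p(x | z)
mu, log_var = [0.1, -0.2], [0.0, 0.1]  # encoder output q(z | x)
elbo = bernoulli_log_lik(x, recon_probs) - kl_diag_gaussian(mu, log_var)
```

Maximizing this quantity with respect to the encoder and decoder parameters tightens the bound on the marginal log-likelihood.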

Estimating Individual Treatment Effect
The ITE is used to compute the difference in outcomes caused by the intervention variable h. In actual cases, only the outcome of either treatment or control can be observed, and the counterfactual outcomes of unimplemented interventions are always missing. Similarly, during the training of RL agents, only the outcomes of a specific communication choice are observed. We cannot obtain the individual-level effect directly from the observed trajectories of an agent. The immediate challenge is to estimate the missing counterfactual outcomes based on the historical trajectories and then estimate the ITE.
According to Definition 2, the ITE C_ITE of an agent with respect to the intervention variable h is measured as the difference between the expected outcome when h = 1 (accept m_{−i} from the other agents) and when h = 0 (reject m_{−i} from the other agents):

C_ITE = E[O_t \mid do(h_t = 1)] - E[O_t \mid do(h_t = 0)].

The do(h_t = 1) in the above formula denotes the intervention condition h_t = 1. According to the back-door criterion in Definition 3, we apply the rules of the do-calculus to the causal graph in Figure 2:

p(O_t \mid do(h_t)) = \sum_{z_t} p(O_t \mid z_t, h_t) \, p(z_t),

where the derivation uses the rules of the do-calculus applied to Figure 2 and the fact that O_t and x_t are conditionally independent given z_t, i.e., O_t ⊥ x_t | z_t. Similarly, we can obtain C_ITE in probabilistic form, which can be calculated using the data distribution before the intervention. The value of the binary variable h_t, i.e., whether the agent needs to communicate, is then determined from the prediction of C_ITE.
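The resulting decision rule can be sketched as follows. In this hedged example, the two expected outcomes would come from the learned latent variable model; the values and the zero threshold are illustrative assumptions:

```python
# Communicate (h_t = 1) only when the estimated individual treatment
# effect of accepting messages is positive.
def decide_communication(outcome_h1, outcome_h0, threshold=0.0):
    c_ite = outcome_h1 - outcome_h0       # estimated C_ITE
    return 1 if c_ite > threshold else 0  # value of h_t

h_send = decide_communication(outcome_h1=0.72, outcome_h0=0.65)  # 1: helps
h_skip = decide_communication(outcome_h1=0.40, outcome_h0=0.41)  # 0: no gain
```

A positive threshold could further trade task performance against bandwidth by demanding a minimum benefit before sending.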
We add a supervised term to the latent variable model, where O*_t and h*_t are the actual observations. We use the relationship between the optimality variable O_t and the reward r to calculate the label value corresponding to the distribution.

Experiments, Datasets and the Environment
We first introduce the power grid problem, a distributed-distributor environment in the smart grid. In addition, in order to fully demonstrate the effectiveness of our method, we introduce StarCraft II and Habitat 3D environment experiments. Both have high-dimensional observation spaces and can validate the generalization ability of our model.

Datasets and Environment: Power Grid Problem
We test our algorithm in power grid experiments [12]. To facilitate the modeling and preliminary investigation of the problem, the simulation environment is not a common alternating current (AC) power grid but a direct current (DC) variant. Although the physical behavior of AC grids differs considerably from that of DC grids, the method serves as a general learning scheme for distributed control problems.
As shown in Figure 4, the regulation system involves three parts: voltage sources, cities, and distributors. The grey circles are the distributors, which we model as agents in the reinforcement learning algorithm. Each agent interacts with the environment: it receives observations, performs actions to adjust voltages based on those observations, and then receives a reward from the environment. We set the reward fed back to the agent as the degree to which the distributor satisfies the city's electricity consumption. If the voltage obtained by a city node is lower than the expected value, the environment feeds back a penalty value to the distributors connected to that city node. The cooperation of multiple distributors is required to divide the voltage reasonably among the cities to meet the urban electricity demand. A distributor that is not connected to a city receives a reward of 0. The simulated power grid problem is solved using the reinforcement learning algorithm, where the actions, states, and rewards are as follows: (1) Local actions: The power grid system controls the current by controlling variable resistances. Each distributor node makes a choice for each power line (variable resistance) connected to it. A distributor can perform three actions on the resistance: Same, Halved, and Doubled. If a line is connected to two distributors, it is affected by both simultaneously, and the final action is selected according to Table 1. (3) Local reward: When the city voltage connected to the distributor node is lower than the expected level, the environment feeds back a negative reward to the distributor, equal to the difference between the actual and expected city voltage; otherwise, the reward is 0.
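The local reward rule described above can be sketched directly (function and parameter names are illustrative, not from the paper):

```python
# Local reward for a distributor: the (negative) shortfall between actual
# and expected city voltage when under-supplied, 0 otherwise, and 0 for
# distributors not connected to any city.
def distributor_reward(actual_voltage, expected_voltage, connected_to_city=True):
    if not connected_to_city:
        return 0.0
    shortfall = actual_voltage - expected_voltage
    return shortfall if shortfall < 0 else 0.0

r_under = distributor_reward(210.0, 220.0)  # -10.0: voltage too low
r_ok = distributor_reward(225.0, 220.0)     # 0.0: demand satisfied
```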

Habitat
We use the multi-object navigation (multiON) dataset [60], which is adopted by the Habitat artificial intelligence simulator [61]. This dataset is designed for navigation tasks with the following essential elements: agent start position, direction, and target positions. Eight cylindrical target objects with different colours are placed. The agents' episodes are generated based on Matterport3D [62] scenes. The data is split according to the standard scene-based Matterport3D train/val/test split [60]. Based on the multiON dataset, a multi-target sequential discovery task is constructed: the agent needs to navigate to the designated target locations in sequence within an episode. The FOUND action is defined as finding the target and should be taken when the agent is within a distance threshold of the target. If the FOUND action is called beyond the threshold, the episode fails. If the target is not found within a specified step limit, the episode is also judged a failure. We use m-ON to denote an episode with m sequential objects. In the task, we define two heterogeneous agents. One is an oracle agent A_O with a god's-eye perspective, and the other is an embodied navigator A_N, which performs the specific navigation tasks and interacts with the environment. A_O's observations are the position and orientation of A_N, the global map information, and the target position. A_N only observes self-centered visual images with depth information. If obstacles block the target position, A_N cannot perceive it and needs additional information from A_O. There is limited communication bandwidth for guidance information between A_O and A_N. The two agents share the same reward, so they must learn to cooperate to optimize their mutual reward together.
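The episode success and failure rules above can be sketched as follows. The threshold and step-limit values are hypothetical placeholders; multiON's actual values may differ:

```python
# FOUND succeeds only within a distance threshold of the current target;
# an episode fails on a wrong FOUND call or when the step budget runs out.
def check_found(distance_to_target, threshold=1.5):
    return distance_to_target <= threshold

def episode_failed(called_found, distance_to_target, steps_used,
                   max_steps=500, threshold=1.5):
    if called_found and not check_found(distance_to_target, threshold):
        return True                 # FOUND called beyond the threshold
    return steps_used >= max_steps  # step budget exhausted

ok = check_found(1.0)                  # True: within threshold
fail = episode_failed(True, 3.0, 42)   # True: wrong FOUND call
```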

StarCraft II Environment
We design an experiment with two scenarios based on the StarCraft II environment [63], as shown in Figure 5. One is a maze task, shown in Figure 5a, and the other is a depth-search tree task, shown in Figure 5b. In the maze task, the navigator agent starts from the bottom centre point and aims to navigate to a target point whose position is randomly initialized on a road of the maze. Notably, the navigator agent does not know the target position. The oracle agent, by contrast, has a god's-eye perspective that captures the target position and can send the relative target position (i.e., the target's position relative to the navigator agent) to the navigator agent. In the tree environment, the target point is initialized at a random leaf node of the map. Similarly, the oracle agent can pass the relative position of the target point to the navigator. We set two different depths in this scenario, 3 and 4; an increase in depth increases the game's difficulty. For convenience, an enemy agent is used to represent the target point. This enemy agent is inactive and has very little health, so a single attack can kill it. The death of the enemy agent indicates that the navigator agent has successfully arrived at the target point, for which the navigator agent receives an additional reward.

Reward Structure and Evaluation Indicators
In the Habitat and StarCraft II environments, the reward is designed as:

r_t = 1[reached-goal] · r_goal + r_closer + r_time-penalty + 1[send-message] · r_comm-penalty,

where 1[reached-goal] is an indicator variable that takes the value 1 when the navigator agent finds the current goal at time t; if the target is found, the agent receives reward r_goal. r_closer is the decrease in distance to the target position between time steps t − 1 and t. r_time-penalty is the penalty received by the agent at each time step. r_comm-penalty is a communication penalty incurred for sending a message at step t. To compare our results with previous studies, the communication penalty is only used in training and is excluded from the total reward in testing. In the power grid problem, the reward is defined as the degree to which a distributor satisfies the city's electricity consumption. If the voltage obtained by the city node is lower than the expected value, the environment feeds back a penalty value r_power-penalty to the distributor connected to the city node. As in the previous two environments, the communication penalty r_comm-penalty is also applied in the power grid experiments. We use the evaluation metrics of [64] on navigation tasks; in multiON [60], these metrics are extended to navigation tasks with sequential targets. We adopt two metrics in our experiments: PROGRESS, the fraction of target objects found in an episode, and PPL, the progress weighted by path length.
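The shaped reward above can be sketched per step as follows. The penalty and bonus magnitudes are illustrative assumptions, since the paper does not report exact coefficients:

```python
# Per-step reward: goal bonus + progress shaping + time penalty
# + (training only) communication penalty.
def step_reward(reached_goal, dist_prev, dist_now, sent_message,
                r_goal=2.5, r_time_penalty=-0.01, r_comm_penalty=-0.1):
    r_closer = dist_prev - dist_now  # positive when moving toward the goal
    r = r_closer + r_time_penalty
    if reached_goal:
        r += r_goal
    if sent_message:
        r += r_comm_penalty  # applied in training, excluded at test time
    return r

r = step_reward(reached_goal=False, dist_prev=4.0, dist_now=3.5,
                sent_message=True)  # 0.5 - 0.01 - 0.1 = 0.39
```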

Network Architecture and Baseline Algorithm
Similar to CoMON [19], CICM adopts the network structure of TBONE [65,66]. In the Habitat and StarCraft II environments, A_O encodes its information into a vector containing the location of the navigation agent A_N, the map information, and the target location. During the encoding process, A_O in the Habitat environment crops and rotates the map to construct a self-centered map, implicitly encoding A_N's orientation into the cropped map. Then, the initial belief of A_O is obtained through CNN and linear layers. This belief is encoded as a communication vector and sent to A_N [65,66]. In StarCraft II, A_O encodes the terrain surrounding inaccessible areas and related information; this information contains the target position and is sent to A_N.
For the StarCraft II environment, we design the following algorithms. To satisfy the maximum entropy term in the ELBO, we use the SAC [67] algorithm. Inspired by the SLAC algorithm [68], the latent variable is used in the critic to calculate the Q-function Q(z_t, a_t), and the agent's state input x_t is used to calculate the policy π(a_{t+1} | x_t, a_t, h_t, m_{−i,t}) during execution. We design the following algorithms: SACwithoutComm, the model without communication; SACwithComm, the SAC algorithm with communication only; SACwithLatentAndComm, the SAC algorithm with communication and a VAE-based latent variable model; and CICM (our method), leveraging causal inference and the VAE model.
For the power grid environment, unlike the previous two environments, the power network involves communication between multiple agents: when an agent sends information, it is also a receiver of information. We encode the information sent by the distributors connected to the receiver and take the average as the received message. The algorithm design for the power grid network is otherwise the same as in the StarCraft II environment.

Analysis of Experiment Results
In this section, we first analyze the computational complexity of CICM, and then analyze the performance of the algorithm in the three experimental environments (power grid, StarCraft II, and Habitat).

Complexity Analysis
We theoretically analyze the complexity of our algorithm CICM. In the smart grid environment, we consider that all agents can communicate with each other, so that they form a complete graph. If there are N agents, the computational complexity is O(N^2). If each node has no more than ω neighbors, the computational complexity is less than O(ωN). Therefore, the computational complexity is acceptable.
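The counting argument can be checked with a small sketch (illustrative Python; `comm_pairs` simply counts directed sender–receiver pairs given each agent's neighbor list):

```python
def comm_pairs(neighbors):
    """Count directed communication pairs given each agent's neighbor list."""
    return sum(len(targets) for targets in neighbors.values())

# Complete graph on N agents: N*(N-1) directed pairs, i.e. O(N^2).
def complete_graph(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}

# Bounded degree: with at most omega neighbors per agent,
# the count is at most omega*N, i.e. O(omega*N).
def bounded_graph(n, omega):
    return {i: [(i + k + 1) % n for k in range(omega)] for i in range(n)}
```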
In the neural network, the computational complexity of our algorithm is related to updating parameters during training. Let U denote the total number of training episodes, each consisting of T steps. Let M be the computational complexity of the ITE module and W that of the reinforcement learning (SAC) module. During training, an update is made every C steps, so the overall computational complexity of our algorithm is O(UT(M + W)/C). We define D_x as the dimension of states, D_m the dimension of communication, D_n the dimension of the hidden layer, D_z the latent variable dimension, and D_a the action dimension; the binary variable dimensions D_h and D_O are 1. The ITE module consists of two parts, an encoder and a decoder; the computational complexity of the encoder is O((D_x + ωD_m + D_a + D_h)D_n D_z). The reinforcement learning (SAC) module also includes two parts: the computational complexity of the Critic is O(2D_z D_n D_a), which involves two Critics, and the computational complexity of the Actor is O((D_x + ωD_m + D_a + D_h)D_n D_a).

Power Grid Environment
Our algorithm has a communication penalty in the reward during training; for a fair comparison, we do not count the penalty during testing. Figure 6 shows the penalty values under two different grid structures. Table 2 shows the communication probability of our algorithm CICM, calculated by dividing the number of time steps with communication by the number of communication steps of the full-communication algorithm (which communicates at every time step). In Figure 6a,b, we can see that the algorithm without communication, SACwithoutComm, receives significantly more penalties than the other three algorithms with communication. Among the three algorithms with communication, SACwithComm, which uses communication directly, performs better than the algorithm without communication, but not as well as SACwithLatentAndComm, which combines the latent variable model when utilizing communication information. Our algorithm CICM, a communication model that combines the latent variable model with causal inference for communication judgment, shows the best performance. The communication judgment helps the agent filter unnecessary information, which reduces the penalty caused by the distribution voltage while also reducing the communication cost.

StarCraft II Environment
Figure 7 shows the reward in the StarCraft II environment. Our algorithm has a communication penalty in the reward during training, and we do not count the penalty generated by communication costs during testing for a fair comparison. All the graphs show that the reward learned by the algorithm without communication is significantly lower than that of the algorithms with communication. This is because the oracle agent provides critical information about navigation, including the target position. We can see that SACwithComm converges quickly, rising fast at the beginning in all three graphs.
In contrast, the models with latent variables (CICM and SACwithLatentAndComm) learn slowly at first, because a certain amount of data is required to train a stable VAE model. After obtaining a stable VAE, SACwithLatentAndComm rises rapidly, surpassing the performance of SACwithComm on the maze environment and slightly exceeding it on the tree environment. This reveals that the latent model improves the performance of the algorithm.
CICM integrates the latent model and causal inference for communication judgment. With the help of the latent model, even though our algorithm learns slowly at the beginning (Figure 7b), it achieves the highest and most stable performance among all the algorithms at the end. CICM's final performance is higher than the others because of the introduction of the communication strategy: it allows the agent to drop unnecessary communication and memory information in the RNN network, and thus obtains the smallest variance in all three experiments. Table 3 shows the communication probability of our algorithm and the test results under different single-step communication penalties in the maze environment. From the table, we can see that increasing the communication penalty decreases the communication frequency. The performance differences between the communication probabilities are not very large, and CICM achieves the best performance when the single-step communication penalty is −0.5; therefore, we adopt −0.5 as the default value in the experiments. Analyzing Table 3 further, we can see that the smaller the communication penalty, the smaller its impact on the communication probability. This is because the system regards the communication penalty as part of the feedback: when the communication penalty is small, it plays little role in the probability of obtaining the optimal feedback p_θ(O_t | z_t, a_t, m_{-i,t}, h_t). But when the communication penalty becomes r_comm-penalty = −0.5, the communication probability drops significantly to 38.0%. The reason is that the communication penalty affects the probability of the system obtaining the optimal feedback, and the communication probability needs to decrease to make the feedback optimal. At the same time, the communication penalty cannot be increased indefinitely.
Although a larger communication penalty makes the algorithm reduce its communication probability, the lack of communication also affects the cooperation between agents. As the table shows, when the communication penalty reaches −2, the communication volume is reduced to 27.4%, but the test result also drops to 1.75.
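The decision rule underlying this trade-off can be sketched as follows (a simplified sketch: `outcome_model` stands in for a learned estimator of the expected outcome under each intervention, and is an assumption of this illustration rather than the paper's exact network):

```python
def should_communicate(outcome_model, state, threshold=0.0):
    """Choose the binary intervention h (0 = silent, 1 = send) by estimating
    the individual treatment effect (ITE) of communicating in this state."""
    y_send = outcome_model(state, 1)    # estimated outcome with communication
    y_silent = outcome_model(state, 0)  # estimated counterfactual without it
    ite = y_send - y_silent
    # Send only when the estimated benefit of communicating exceeds the
    # threshold (e.g. the magnitude of the communication penalty).
    return ite > threshold
```

Raising the threshold plays the same qualitative role as raising the communication penalty: it suppresses messages whose estimated benefit is small, at the risk of cutting communication the cooperation actually needs.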

Habitat Experiment
Below we analyze the algorithm's performance in Habitat. Our algorithm is first trained on the 3-ON setting to obtain a stable model. We then merge the counterfactual communication model into the trained model to obtain our overall model. We test this model on 1-ON, 2-ON, and 3-ON, respectively, and the final results are presented in Table 4. We also test the effect of different hyperparameters r_comm-penalty on the communication probability of our algorithm S-Comm&CIC, as shown in Table 5.
NoCom provides a lower bound on what our algorithm can achieve without communication. Our algorithm adds the causal inference communication model to U-Comm and S-Comm (which we name U-Comm&CIC and S-Comm&CIC). U-Comm&CIC and S-Comm&CIC perform close to the original algorithms. On 3-ON, our algorithm slightly exceeds the original on the PPL metric. A higher PPL indicates that our algorithm navigates successfully over shorter distances.
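The two metrics can be sketched as follows (a sketch under the assumption that PPL applies the SPL-style weighting used in multiON, i.e. progress scaled by the ratio of the shortest-path length to the longer of the shortest and actual paths; the function name is illustrative):

```python
def progress_and_ppl(num_found, num_targets, shortest_path, agent_path):
    """PROGRESS (fraction of targets found in an episode) and PPL
    (progress weighted by path length, SPL-style)."""
    progress = num_found / num_targets
    # An agent that takes a longer path than necessary is penalized;
    # matching the shortest path gives PPL == PROGRESS.
    ppl = progress * shortest_path / max(shortest_path, agent_path)
    return progress, ppl
```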

Discussion
In this section, we discuss the research limitations and the threats to validity.

Research Limitations
The following is a summary of the limitations of our algorithm:
• We only conduct experimental tests in a DC power grid simulation environment, which differs from the real production environment. There is still a big gap between simulation and reality, and deploying algorithms trained in simulation to the real environment requires more effort.
• Our algorithm only considers the counterfactual of causal inference in agents' communication; it neither introduces cooperative mechanisms of multi-agent reinforcement learning in policy learning nor considers credit assignment among multiple agents. Introducing these mechanisms in the future would further improve performance.
• Our algorithm mainly prioritizes theoretical completeness, introducing latent variable models with the reinforcement learning objective function and causal inference. However, the model is relatively complex and its training stability is weak; the stability of the algorithm needs attention in follow-up research.

Threat to Validity
Internal validity: Our algorithm is tested in three experimental environments. In the power grid and StarCraft II environments, the reinforcement learning module is designed based on the SAC algorithm; we add the ITE module to SAC to form our algorithm CICM. The ITE module is not included in the baseline algorithms used for comparison. We strictly control the variables in the algorithms' parameters: the parameters used in all the reinforcement learning modules are consistent. In the Habitat environment, we also add our ITE module to the baseline reinforcement learning algorithms U-Comm and S-Comm, without changing the parameters of the baseline algorithms. In summary, our experiments are carried out under controlled variables, so the study is internally valid.
External validity: Our algorithm is shown to be feasible by experiments on the power grid system. To demonstrate its scalability, we further experiment on the high-dimensional game environments StarCraft II and Habitat, where it also performs well compared with the baseline algorithms. Our algorithm is thus extensible to different fields, so the study is externally valid.

Conclusions
As electricity demands increase significantly, it is necessary to develop a power grid system that can meet higher and more diversified market demands and offers safe, reliable, and stable performance. The emerging smart grid can meet the current challenges faced by the traditional power grid system, as it relies on more intelligent power distribution agents and methods to increase electricity generation and ensure safety, reliability, and stability. The distributed control of the smart grid system requires a large amount of bandwidth to meet the communication needs between distributor nodes. To ensure that the system performs well while reducing the communication bandwidth, we propose CICM, which adopts a causal inference model to help agents decide whether communication is necessary. The causal inference is constructed on a graphical model with the Bayesian formula; the graphical model connects optimal control and causal inference. Estimating the counterfactual outcomes of the intervention variable and then evaluating the ITE helps the agent choose the best communication strategy. We conduct experiments on smart grid environment tasks, which show that CICM can reduce the communication bandwidth while improving performance in the power grid environment. In addition, we conduct experiments on StarCraft II navigation tasks and the 3D Habitat environment, and the experimental results again prove the effectiveness of the algorithm. This study serves as a first attempt to combine optimal control and reinforcement learning with causal inference models. We believe that causal inference frameworks will play a more significant role in reinforcement learning problems.
Future research will focus on extending the current model in several directions, such as modeling multiple distributors, explicit game-theoretic modeling on graphs, distributed computing, causal modeling of other reinforcement learning agents, model-based multi-agent reinforcement learning, and variational inference for accelerating offline multi-agent reinforcement learning.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivation of the Latent Variable Model
Figure 2 presents the probabilistic graphical model, containing the latent variable z_t, the agent's observation data x, communication m, intervention variable h, and action a. We first use the chain rule to express the relationship between the variables with the Bayesian formula. The conditional distribution of the graphical model (Figure 2) can be written as:

p(x_t, O_t, h_t, m_t, z_t, a_t) = p(z_t) p(a_t) p(x_t|z_t) p(h_t|z_t) p(m_t|z_t) p(O_t|z_t, a_t, h_t, m_{-i,t})  (A1)

The variational distribution of the graphical model can be written as the product of the recognition term q(z_t|x_t, a_t, h_t, m_t) and the policy term π(a_t|x_t, h_t, m_{-i,t}):

q(z_t, a_t|x_t, h_t, m_t) = q(z_t|x_t, a_t, h_t, m_t) π(a_t|x_t, h_t, m_{-i,t})  (A3)

Optimizing the variational lower bound (ELBO) maximizes the marginal likelihood log p(x_t, O_t, h_t, m_t). With the posterior from the variational distribution (Equation (A3)) and the likelihood from the joint distribution (Equation (A1)), the marginal likelihood can be bounded by Jensen's inequality. The ELBO is:

log p(x_t, O_t, h_t, m_t) ≥ E_{z_t, a_t ~ q(z_t, a_t|x_t, h_t, m_t)}[log p(x_t, O_t, h_t, m_t, z_t, a_t) − log q(z_t, a_t|x_t, h_t, m_t)] = ELBO

Since log p(x_t, O_t, h_t, m_t) is fixed and the KL-divergence is non-negative, we convert the problem into optimizing the ELBO.

The first term in the ELBO is −D_KL(q(z_t, a_t|x_t, h_t, m_t) || p(z_t, a_t)):

−D_KL(q(z_t, a_t|x_t, h_t, m_t) || p(z_t, a_t))
= ∫∫ q(z_t, a_t|x_t, h_t, m_t) log [p(z_t, a_t) / q(z_t, a_t|x_t, h_t, m_t)] dz_t da_t
= E_{a_t ~ π(·|x_t, h_t, m_{-i,t})}[−D_KL(q(z_t|x_t, a_t, h_t, m_t) || p(z_t))]
  + E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[log p(a_t)] + H(π(a_t|x_t, h_t, m_{-i,t}))

where the prior over actions log p(a_t) is assumed to be a constant term, which can be ignored when optimizing the objective.
The second term in the ELBO is E_{z_t, a_t ~ q(z_t, a_t|x_t, h_t, m_t)}[log p(x_t, h_t, m_t, O_t|z_t, a_t)]:

E_{z_t, a_t ~ q(z_t, a_t|x_t, h_t, m_t)}[log p(x_t, h_t, m_t, O_t|z_t, a_t)]
= ∫∫ q(z_t, a_t|x_t, h_t, m_t) log p(x_t, h_t, m_t, O_t|z_t, a_t) dz_t da_t
= E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[log p_θ(x_t|z_t)]
  + E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[log p_θ(h_t|z_t)]
  + E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[log p_θ(m_t|z_t)]
  + E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[r(z_t, a_t, h_t, m_{-i,t})]

Adding the two terms together and taking the entropy of the policy out separately, the ELBO can be written as:

ELBO = E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[log p_θ(x_t|z_t) + log p_θ(h_t|z_t) + log p_θ(m_t|z_t) + r(z_t, a_t, h_t, m_{-i,t})] + E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[H(π(a_t|x_t, h_t, m_{-i,t}))]

The first part of the ELBO concerns the latent variable z:

J_M(θ, φ) = E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[log p_θ(x_t|z_t) + log p_θ(h_t|z_t) + log p_θ(m_t|z_t)] + E_{a_t ~ π(·|x_t, h_t, m_{-i,t})}[−D_KL(q_φ(z_t|x_t, a_t, h_t) || p_θ(z_t))]  (A10)

The other part is maximum entropy reinforcement learning:

J_π(ϕ) = E_{a_t ~ π(·|x_t, h_t, m_{-i,t})} E_{z_t ~ q(·|x_t, a_t, h_t, m_t)}[r(z_t, a_t, h_t, m_{-i,t}) + H(π(a_t|x_t, h_t, m_{-i,t}))]  (A11)

The reward for navigating to a goal is r_goal = 1, and the reward for moving closer is r_closer = ∆ · 0.005, where ∆ = −(dis_t − dis_{t−1}) and dis_t is the distance from the agent to the target point at time t. The single-step time penalty for the agent is r_time-penalty = −0.1. In the default setting, the penalty for communication is r_comm-penalty = −0.5.
For parameter selection, we used the parameters in Table A1. In the Habitat environment, we use the same hyperparameter settings as CoMON [19], where we set the communication penalty term in the reward to r_comm-penalty = −0.01. The learning rate of the variational network is 0.0001.
We designed the following agent models for Habitat:
• NoCom (i.e., "no communication"): the performance of the navigation agent without relying on the oracle agent, which serves as a model without communication.
• U-Comm (i.e., "unstructured communication"): the agent delivers real-valued vector messages. The oracle's information is directly encoded by a linear layer and sent to the navigation agent.
• S-Comm (i.e., "structured communication"): the agent has a vocabulary of K words, implemented as a learnable embedding. Message probabilities are obtained by passing information through linear and softmax layers. On the receiving end, the agent decodes these incoming message probabilities by linearly combining its word embeddings, using the probabilities as weights.
We designed the following agent models for the StarCraft II environment: