Distributed Learning Applications in Power Systems: A Review of Methods, Gaps, and Challenges

In recent years, machine learning methods have found numerous applications in power systems, including load forecasting, voltage control, power quality monitoring, and anomaly detection. Distributed learning is a subfield of machine learning and a descendant of the multi-agent systems field. It is a collaborative, decentralized machine learning approach designed to handle large datasets, solve complex learning problems, and increase privacy. Moreover, it can reduce the risk of a single point of failure compared to fully centralized approaches and lower the bandwidth and central storage requirements. This paper introduces three existing distributed learning frameworks and reviews the applications that have been proposed for them in power systems so far. It summarizes the methods, benefits, and challenges of distributed learning frameworks in power systems and identifies gaps in the literature for future studies.


Introduction
The large-scale integration of renewable energy resources, electric vehicles, demand-side management techniques, and dynamic electricity tariffs has dramatically increased power system complexity [1]. Moreover, the proliferation of advanced metering infrastructure (AMI) and digital assets has led to the production of significant amounts of data in power systems [2]. Hence, new data and energy management techniques are required to handle this complexity and data volume efficiently. Machine learning approaches are finding extensive applications in power systems in areas such as power quality disturbance detection, transient stability assessment, voltage stability assessment, and SCADA network vulnerability analysis [3]. The main advantage of machine learning-based methods is their ability to handle high volumes of data and their easy implementation without requiring specific system models and parameters [4]. Therefore, they can be a promising solution for fast and reliable power system operation.
Nowadays, the public is paying more attention to data security than ever [5]. Therefore, preventing data leakage is a paramount priority when it comes to data management. Various methodologies are used in the literature to protect power system data privacy. Keshk et al. [6] proposed a two-level data privacy framework in which the first level used an enhanced proof-of-work technique based on blockchain to authenticate data records and prevent cyber attacks aimed at altering data. The second level used a variational autoencoder to convert the data to an encoded format to prevent inference attacks. To secure information such as local energy consumption, power generation, and cost function parameters in the optimal power flow problem, a privacy-preserving alternating direction method of multipliers (ADMM) framework was proposed by Liu et al. [7]. An energy management method was developed for smart homes by Yang et al. [8], which used rechargeable batteries to hide the energy consumption patterns of electricity customers. Alsharif et al. [9] adapted the multidimension and multisubset (MDMS) scheme for data collection from AMIs, where bill computation was delegated to the AMI network's gateway for better scalability and privacy.
Other privacy-related studies include the use of masking approaches for data aggregation [10], consortium blockchain framework for electric vehicles' power trading data [11], inner product encryption (IPE) for data sharing [12], differential privacy (DP) technique for load data privacy [13], generative adversarial networks (GANs) for power generation data [14], decomposition algorithm-based decentralized transactive control for peak demand reduction and preserving data privacy [15], and Benders decomposition method for integrated power and natural gas distribution in networked energy hubs while maintaining data security [16].
Distributed learning is a subfield of machine learning used mainly to address the data island, privacy, bandwidth, and data storage problems [17]. In this method, multiple clients perform the local learning process using edge devices and then send the learned model to a central server. The server performs model updates and sends the final model back to the clients. This scheme eliminates the need for high volumes of data exchange and for central data storage, and it helps preserve data privacy [18]. Distributed learning is often used interchangeably with federated learning, and it has wide applications in the areas of communications [19], mobile Internet technology [17], natural disaster analysis [20], heavy haul railway control [21], and others.
Considering the distributed nature of power system components (smart meters, generators, etc.), distributed learning can be a promising solution for a wide variety of challenges present in power systems. To explore power system applications that have benefited from distributed learning architectures, this paper provides a literature review of research integrating distributed learning schemes in power systems. In addition, it identifies the benefits, challenges, and knowledge gaps in this area. Although some previous review studies explored the potential of distributed learning algorithms for energy trading [22], machine learning methods for power system security and stability [23], deep learning applications in power systems [24,25], and the intersection of deep learning and edge computing [26], no previous study has focused on the applications of distributed learning frameworks in power systems. The goal of this paper is to present a systematic overview of this area to attract the attention of power system researchers to the distributed learning framework as a highly promising research topic that has been rapidly expanding in other research areas such as communications, healthcare, and many others.
The remainder of this paper is organized as follows. Section 3 defines the distributed learning mechanism and introduces three major variants of distributed learning. Section 4 discusses distributed learning applications that have been implemented in power systems so far. Gaps and challenges associated with distributed learning in power systems are discussed in Section 5. Finally, the concluding remarks are drawn in Section 6.

Review Methodology
This section describes the review question that this paper aims to answer, the methodology used to search for references, and the criteria for including/excluding individual studies identified in the initial search.
The main question posed by this review paper is: How have distributed, federated, and assisted learning been implemented in power systems? This central question motivated several subsidiary lines of inquiry. Four main criteria were considered for including an article in this review.

1.
The article should have a learning-based structure. This could include any type of learning algorithm where the aim is to construct a mathematical representation for an unknown model.

2.
The article should focus on solving a power system-related problem.

3.
It should use a distributed structure where there is data exchange between multiple agents or between agents and a central server.

4.
Only research articles that have tested their algorithms on a case study and have presented the results should be included.
Articles that used centralized machine learning methods in power systems were not considered in the review. If a research paper used a distributed structure but did not involve any type of learning method, it was likewise excluded unless it provided other useful information. Distributed, federated, and assisted learning methods applied to domains other than power systems were excluded as well. The search for articles was performed in the following databases. In addition, search results from Google Scholar and ResearchGate were used during the review process. The search was conducted using combinations of the following keywords. Multiple searches were performed between February 2021 and April 2021; therefore, publications published or made available after April 2021 were not included in this review. During the screening process, papers were accepted or rejected based on their title, abstract, and keywords. In cases where these were not sufficiently specific about the content, the paper was judged based on the full text.

Distributed Learning Overview
The basic idea behind distributed learning is to perform the learning process in edge devices, communicate the learned model to the central server, and then receive model updates from the server, as illustrated by Figure 1. In this method, the privacy of the data is maintained since no raw data are exchanged between the edge devices and the central server. Moreover, the bandwidth and central data storage requirements of this method are significantly lower [27] due to the reduced amount of data exchange. Consider a graph G(J, ξ) with M nodes, where J = {1, 2, . . . , M} represents the nodes of the graph and ξ denotes the set of edges between the nodes. In this context, the edges are the communication links for data transfer between nodes. Assume that the local training set of each node j is S_j = {(x_jn, y_jn)}, n = 1, . . . , |S_j|, where x denotes the training features and y the labels. In distributed learning, it is assumed that the training samples are independent and identically distributed (i.i.d.) with an unknown distribution D. Moreover, the training sets are assumed to be of equal size.
In traditional machine learning, the datasets are collected in one central unit and centralized training is performed. A popular representation of this kind of learning is y = w^T x + b, where w denotes the training weights and b the biases of a network representing the model. To find the best value of w, a loss function needs to be minimized: minimize F_s(w, (x, y)) = L(w, (x, y)) + R(w), where L(w, (x, y)) is the loss function and R(w) is a regularizer that prevents overfitting. The function F_s(w, (x, y)) is called the statistical risk [28]. The new weight values in iteration k + 1 are calculated by w_{k+1} = w_k − γ∇F_s(w_k, (x, y)), where γ is the learning rate.
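As an illustration, a minimal gradient-descent loop of this form, using a squared loss with an L2 regularizer on toy data, might look as follows (all names, data, and hyperparameters are illustrative assumptions, not taken from the reviewed works):

```python
import numpy as np

def gradient_step(w, X, y, gamma=0.1, lam=0.01):
    """One gradient-descent step on the regularized squared loss
    F_s(w) = (1/N) * ||X w - y||^2 + lam * ||w||^2."""
    n = len(y)
    grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w
    return w - gamma * grad

# Toy data generated from a known linear model y = 2*x.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 2.0 * X[:, 0]

w = np.zeros(1)
for _ in range(200):
    w = gradient_step(w, X, y)
print(w)  # approaches 2 (slightly shrunk by the regularizer)
```

Here the regularizer R(w) is an L2 penalty, so its gradient contributes the 2*lam*w term to the update.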
In distributed learning, the distribution D is not known, so w cannot be computed by directly minimizing the statistical risk. One way is to minimize the empirical risk instead [29]: F_e(w) = (1/N) Σ_{n=1}^{N} F_s(w, (x_n, y_n)), where N = |S_j| is the size of each node's training set. It can be proven that minimization of the empirical risk converges to the optimal solution with high probability at the rate O(1/√(MN)) [30,31]. The main goal is to solve the optimization problem min_w (1/M) Σ_{j=1}^{M} F_e^j(w), where F_e^j is the empirical risk of node j. To update the weights, two methods can be used. In the first method, the gradient is computed in a decentralized manner: the nodes either broadcast their gradients and update their weights locally, or they send their gradients to the central server, which performs the gradient update [32]. In the second method, the nodes cooperate to solve the global problem as a single optimization problem. For this purpose, various methods such as ADMM [33,34], dual decomposition [35,36], and distributed consensus gradient [37,38] have been used.
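The first weight-update method, in which each node computes a gradient on its local data and a server (or a broadcast round among the nodes) averages them before the global step, can be sketched as follows; the linear model, shard sizes, and step size are illustrative assumptions:

```python
import numpy as np

def local_grad(w, X, y):
    # Gradient of one node's empirical risk (1/N) * ||X w - y||^2.
    return (2.0 / len(y)) * X.T @ (X @ w - y)

rng = np.random.default_rng(1)
# M = 4 nodes, each holding an i.i.d. local training set of size N = 30,
# all generated from the same model y = 3*x (the unknown distribution D).
shards = []
for _ in range(4):
    X = rng.normal(size=(30, 1))
    shards.append((X, 3.0 * X[:, 0]))

w = np.zeros(1)
for _ in range(300):
    # Each node computes a gradient on its own data; the average gradient
    # is then used for a single global update of the shared weights.
    avg = np.mean([local_grad(w, X, y) for X, y in shards], axis=0)
    w -= 0.1 * avg
print(w)  # close to 3
```

No raw samples leave a node in this loop; only the gradient vectors are exchanged, which is the bandwidth saving discussed above.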
Distributed learning is a general term. Federated learning, which was first introduced by Google [39] in 2017, is a variant of distributed learning [40]. In federated learning, the central server coordinates multiple clients to perform a single learning task in parallel [41]. In contrast, the usual approach in distributed learning is that the central server partitions the training data between multiple clients and then entrusts each client with a separate learning subtask that is part of the main learning task. Moreover, clients in distributed learning can communicate with each other, unlike in federated learning. Another key difference is that clients in federated learning are also data generation nodes (smartphones, sensors, etc.), while the clients in distributed learning are only processing units [42]. Figure 2 shows the federated learning scheme. Currently, federated averaging (FedAvg) is the best-known global update method in federated learning. In this approach, each agent takes one step of local gradient update and then sends the obtained weights to the central server. The server then takes the weighted average of the received weights to compute the new weights and sends them back to the agents. The FedAvg update rule can be written as w_{k+1} = w_k − γ Σ_c (n_c/n) ∇F_c(w_k, (x, y)), where n is the total number of data points used for training the global model, n_c is the number of data points used for training by a single agent c, and ∇F_c(w, (x, y)) is the average gradient on agent c's local data. This rule essentially weights each agent's contribution to the global model by the amount of data it holds. From the viewpoint of data partition, federated learning has three main categories: (1) horizontal federated learning; (2) vertical federated learning; and (3) federated transfer learning [43]. In horizontal federated learning, the features of the clients' datasets are the same while the samples are different [44].
Vertical federated learning is the case where the features are different and the samples partially overlap. Federated transfer learning is used when neither the features nor the samples overlap [45].
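The FedAvg aggregation described above, in the common variant where agents run a few local gradient steps and the server takes a sample-size-weighted average of the returned weights, can be sketched as follows (toy linear-regression clients; the data sizes, model, and round counts are illustrative assumptions):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server aggregation: w = sum_c (n_c / n) * w_c."""
    n = sum(client_sizes)
    return sum((nc / n) * wc for wc, nc in zip(client_weights, client_sizes))

def local_update(w, X, y, gamma=0.1, steps=5):
    # A few local gradient steps on the client's own data.
    for _ in range(steps):
        w = w - gamma * (2.0 / len(y)) * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(2)
# Clients holding different amounts of data drawn from y = 1.5*x.
clients = []
for n_c in (20, 50, 80):
    X = rng.normal(size=(n_c, 1))
    clients.append((X, 1.5 * X[:, 0]))

w = np.zeros(1)
for _ in range(50):  # communication rounds
    local_ws = [local_update(w.copy(), X, y) for X, y in clients]
    w = fedavg(local_ws, [len(y) for _, y in clients])
print(w)  # close to 1.5
```

The server never sees the clients' data, only their weight vectors, and larger clients contribute proportionally more to each round's global model.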
Another variant of distributed learning is assisted learning developed in 2020 [46]. In this method, the clients do not exchange any data with the central server, and they have a protocol to assist each other's private learning tasks by iteratively exchanging nonsensitive information such as fitted residuals.
In assisted learning, a client can seek assistance from other clients by sharing a few key statistics in an iterative communication process. Client A, who seeks assistance, sends a query to a selected list of other clients. If Client B agrees to help, it fits a model to the latest statistics received from Client A and sends the obtained residuals back. Based on the collected responses, Client A initializes the next round of assistance. After this iterative process converges, the training for Client A is complete. In the prediction stage, Client A can combine its own prediction with the prediction results from Client B for a new feature vector to form a final prediction. Different methods, such as weighted summation, can be used for combining the prediction results. Table 1 summarizes the differences between distributed, federated, and assisted learning.
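A minimal sketch of this residual-exchange protocol for two clients that hold vertically partitioned features of the same samples (a least-squares instance; the data, feature split, and round count are illustrative assumptions) could look like:

```python
import numpy as np

def fit_ls(X, r):
    # Least-squares fit of residuals r on features X; returns coefficients.
    return np.linalg.lstsq(X, r, rcond=None)[0]

rng = np.random.default_rng(3)
n = 200
# Vertically partitioned data: Client A holds feature x1, Client B holds x2.
x1 = rng.normal(size=(n, 1))
x2 = rng.normal(size=(n, 1))
y = 2.0 * x1[:, 0] + 5.0 * x2[:, 0]  # neither client can fit y alone

# Client A only ever shares its current residuals, never its raw features.
resid = y.copy()
coef_a = np.zeros(1)
coef_b = np.zeros(1)
for _ in range(10):  # assistance rounds
    ca = fit_ls(x1, resid)        # A fits its own features to the residual
    coef_a += ca
    resid = resid - x1 @ ca
    cb = fit_ls(x2, resid)        # B assists: fits ITS features to A's residual
    coef_b += cb
    resid = resid - x2 @ cb

# Prediction stage: A combines its own prediction with B's (summation here).
pred = x1 @ coef_a + x2 @ coef_b
print(np.max(np.abs(pred - y)))   # near zero
```

Each round shrinks the residual geometrically (the two clients effectively perform backfitting), while only the nonsensitive residual vector crosses the client boundary.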

Applications of Distributed Learning
It is estimated that the proliferation of advanced metering devices such as smart meters in power systems will produce more than 2 petabytes of data annually by 2022 [47]. To take advantage of this amount of data, various big data and machine learning methods are being developed and described in the literature. In power systems, distributed learning is an emerging approach that can be used to perform forecasting or control tasks without the need to transfer large amounts of data. This section provides an overview of the distributed learning applications that have been proposed in power systems so far.

Voltage Control
Tousi et al. [48] used distributed reinforcement learning to control the voltage magnitude of the IEEE 39-bus New England power system. In this scheme, four static compensators (STATCOMs) were considered as servicing agents that received voltage values from bus agents. Each time a voltage deviation occurred in the system, the servicing agents performed primary or secondary voltage control to restore the voltage magnitude. The servicing agents used Q-learning to decide on the most appropriate voltage control action. In this method, if an agent in state s_t chooses action a_t, it enters a new state s_{t+1} and receives a reward r. A table of expected aggregate future rewards was constructed and, based on the reward values, it was determined how good it is to take action a_t in state s_t. The actions were the STATCOMs' reactive power injections into, or absorptions from, the buses.
Distributed learning was preferred over centralized learning since the action space was very large [48]. Four variants of multiagent tabular Q-learning were studied: Markov decision process (MDP) learners, independent learners, coordinated reinforcement learning, and distributed value functions. The results showed that the MDP learners performed better than all other methods. However, since the joint action space grows exponentially with the number of agents, this method was not feasible for large problems. Coordinated reinforcement learning showed the next best performance. The independent learners method only had acceptable performance when the rewards of all agents were equal to the global system reward.
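A toy single-agent version of such tabular Q-learning for voltage regulation can be sketched as follows. The three-level voltage states, STATCOM-style actions, toy dynamics, and reward values are all hypothetical simplifications for illustration, not the actual 39-bus setup:

```python
import random

# Tabular Q-learning sketch for one servicing agent: states are coarse
# voltage conditions, actions are STATCOM-style reactive power commands.
STATES = ["low", "ok", "high"]
ACTIONS = ["inject", "none", "absorb"]

def step(state, action):
    # Hypothetical toy dynamics: injecting reactive power raises the voltage
    # one level, absorbing lowers it, doing nothing leaves it unchanged.
    idx = STATES.index(state)
    if action == "inject":
        idx = min(idx + 1, 2)
    elif action == "absorb":
        idx = max(idx - 1, 0)
    nxt = STATES[idx]
    reward = 1.0 if nxt == "ok" else -1.0
    return nxt, reward

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, discount, eps = 0.5, 0.9, 0.2
rng = random.Random(0)

state = "low"
for _ in range(2000):
    # epsilon-greedy action selection
    if rng.random() < eps:
        action = rng.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    nxt, r = step(state, action)
    # Q-learning update: Q(s,a) += alpha*(r + discount*max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (r + discount * best_next - Q[(state, action)])
    state = nxt

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)  # learned: inject when low, absorb when high, none when ok
```

The learned greedy policy drives the voltage back to the acceptable band, which mirrors the primary/secondary control decision the servicing agents make in the cited work.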
Similarly, an iterative distributed learning approach based on approximate dynamic programming was developed by Liu et al. [49] to perform secondary voltage control in a DC microgrid. Each of the sources inside the microgrid had a local primary and secondary voltage control unit. Each secondary voltage controller unit was an agent that exchanged power flow information with the environment and made decisions based on the actor-critic framework. The effectiveness of the proposed approach was verified using a DC microgrid with four sources.
Other studies include performing secondary voltage control via synchronous generators in an islanded microgrid [50], and via flexible AC transmission systems (FACTS) in the IEEE 39-bus New England power system [51]. Karim et al. [50] used K-means clustering to first classify the system state (as stable or unstable) after a sudden disturbance in the system. After that, a distributed neural network structure was used to predict the reference rotor speed and reference field voltage for individual synchronous generators. Tousi et al. [51] used distributed SARSA Q-learning algorithm to assist FACTS devices in making better secondary voltage control decisions.

Renewable Energy Forecast
Wind power forecasting up to a few hours ahead is paramount for participation in electricity markets and for maintaining the reliability of power systems [52]. Statistical learning approaches require sharing confidential data with third parties and are, therefore, unpopular. As a result, distributed learning algorithms have gained prominence in wind power forecasting. Online ADMM and mirror-descent-inspired algorithms were implemented by Sommer et al. [53] to forecast wind power generation in a distributed manner. Each wind operator site that wanted to perform the forecast acted as a central agent. The central agent could contact a number of other operator sites for information and subsequently enter a learning contract. The AR-X model expresses the power output at site j as a function of past power generation at site j and at the contracted agents. Therefore, the pieces of information exchanged between the central agent and the other agents were the partial power predictions, explanatory variables, model coefficients of sites, and the information encryption matrix. A similar approach was presented by Pinson [54]; however, the proposed method was not online, and only the learned model parameters were exchanged between agents. Moreover, Goncalves et al. [55] used the same approach for solar energy forecasting.
Probabilistic wind power forecasting was formulated as a distributed framework by Zhang and Wang [56,57] using the ADMM algorithm. In this schema, the agents were the wind farm operators, and they only exchanged partial power predictions with the central collector which was the power system operator. A quantile regression model was used to formulate the wind farm power output as a function of on-site and off-site input features.
In wind farms, wind turbines are placed close to each other due to limited land availability. This causes the power output of downstream wind turbines to decrease as a result of the operation of upstream wind turbines, a phenomenon called the wake effect [58]. To address this issue, Bui et al. [59] used double deep Q-learning to construct a state vector for each wind turbine generator based on state information including pitch angle, tip speed ratio, and wind speed. Each wind turbine generator in state s_k performed an action a_k (changing the pitch angle and tip speed ratio based on the wind speed) and then entered a new state s_{k+1} with a reward r_k. The purpose was to teach the wind turbine generators to choose the best actions in each state by maximizing the common rewards. It was found that the proposed approach increased the output power of the wind turbines by 1.99% to 4.11% compared to the maximum power point tracking method.

Demand Prediction
Simultaneous electric vehicle (EV) charging can create energy transfer congestion for electric utilities. Therefore, charging station providers predict EV demand in advance to reserve the energy needed by EVs in real time. Charging stations may not be willing to share their local data with the charging station provider for EV demand prediction purposes. Therefore, a method based on federated learning was developed by Saputra et al. [60] to predict EV demand without sharing private information between the charging stations and the charging station provider. The charging stations were the agents, and they only exchanged trained models (gradient information) with the central agent, i.e., the charging station provider. The provider aggregated all trained models, updated the global model using deep learning, and sent the new model back to the agents. To increase prediction accuracy, the charging stations were clustered into groups based on their location before performing the prediction. Similarly, Wang et al. [61] used weather conditions, geography, vehicle characteristics, and driving style as inputs to a federated learning model to predict EV charging demand while considering the charging stations as agents.
Distributed Q-learning was used by Ebell et al. [62] to facilitate energy sharing among households where the agents were only aware of their own actions but received a common reward. An edge computing architecture for energy sharing between smart houses was presented by Albataineh et al. [63] that also used the decision tree learning method to calculate the electricity usage by each edge.

Energy Management
Microgrids are combinations of distributed energy resources and loads. They can operate in both grid-connected and islanded modes and are often accompanied by a control or management system. Due to the distributed nature of the resources inside microgrids, distributed learning has recently gained prominence in the control and management of energy in microgrids. Kohn et al. [64] proposed an intelligent control and management system for the elements of a microgrid based on a distributed architecture. In this model, the microgrid management server was the central server that supervised the control actions of the element controllers. The dynamic behavior of the loads and resources was learned using localized Hamiltonians, with the variables being operational cost and voltage-current relationships. These relationships were constructed as rules in the Hamiltonians and used in control algorithms. To model the interactions between control elements, a virtual control element was proposed that interacted with all other elements. This interaction was formulated as an optimal control problem subject to constraints and was iterated by playing a two-person Pareto game until a Pareto equilibrium was reached.
Hu and Kwasinski [65] designed an energy management system for microgrids using a hierarchical game machine learning algorithm combined with reinforcement learning. Each microgrid consisted of several base stations that searched for their optimal load-ratio policies. Each base station chose an initial load-ratio and then conducted a two-player game with a virtual user to find the resulting system status (state of charge of batteries, peak signal-to-noise ratio, etc.) and compute the reward. Later, it updated the load-ratio policy according to a rule that maximized the reward. Gao et al. [66] proposed a similar study for energy management of wind-photovoltaic (PV) power systems using distributed reinforcement learning, where each wind turbine or PV system acted as an agent and decided its own action strategy while observing the action history of other agents.

Transient Stability Enhancement
Power system stability margins are decreasing due to the increasing utilization of tie line capacity, power exchange over long distances, environmental and economic restrictions on building new transmission lines, uncoordinated controls, growth in demand, and increasing power system complexity [67,68]. This has led to difficulties in implementing a central controller for power system automation and has therefore motivated the application of distributed mechanisms. To this end, Hadidi and Jeyasurya [69] proposed a multiagent control based on reinforcement learning to enhance the damping of interarea oscillations and increase power system stability margins. The agents in this model were the generator excitation systems and the power system stabilizers at the generator locations. The action considered in reinforcement learning was a discrete signal to the excitation reference of the generator, and the system state comprised the system oscillations (measured by the magnitude of the one-machine infinite-bus speed deviation) and the synchronism between the generators in case of severe incidents. The angular separation between the two groups of machines was used to determine the penalty, and the area under the speed deviation signal was used to determine the reward associated with each action.

Resilience Enhancement
Power system resilience is the capability of a power system to restore its original condition after a major disturbance, such as an extreme weather event [70]. According to the US Department of Energy, these events have the most significant role in causing blackouts [71]. Therefore, they are known as high-impact and low-probability (HILP) events. Distributed approaches for service restoration decrease the computational burden by dividing the tasks between independent agents. Karim et al. [72] developed a power system restoration method based on distributed machine learning. After a fault occurrence in the system, the power network was divided into several groups based on rotor speed data using the K-means clustering algorithm. Later, a corrective control using supervised machine learning was applied to restore the system. An ensemble of three algorithms including Random Forest and Random Subspace together with Bagging and Boosting was used. The selected features were terminal voltages, frequency values, and sensitivity of the active power generation to voltage at the faulted bus. The final decision, which was a combination of machine learning outputs from different groups (or agents), was the active power reference for the governor of the generators and the amount of load shedding in each group.
Ghorbani et al. [73-75] proposed a distributed restoration strategy that used Q-learning to select the actions that restore power to the largest number of loads. The presented structure consisted of multiple feeder agents that could perform learning and load prediction and could communicate with each other or with the substation agent for decision making. A similar approach was proposed by Hong [76] to find the best switching configurations for service restoration using Q-learning.

Economic Dispatch
Economic dispatch is the allocation of demand among power system generators in a way that minimizes power generation costs. As a result of the distributed nature of power supply and demand, distributed mechanisms form a promising solution for power dispatch in modern power systems. Kim [77] developed a formulation for the economic dispatch of distributed generators using a multiagent learning scheme. Lagrange multipliers were used in the primal update, in which each generator performed a local optimization (distributed learning) to reduce its generation cost. In the dual update stage, the central agent received the Lagrange multipliers and updated them to satisfy the demand-supply balance.
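This primal-dual idea can be sketched with a classical dual-decomposition loop: each generator locally minimizes its own cost minus the price-weighted output given the current Lagrange multiplier (price), and the central agent adjusts the multiplier by a subgradient step on the supply-demand mismatch. The quadratic cost coefficients, limits, and demand below are hypothetical, and this is a generic dual-decomposition sketch rather than the exact algorithm of [77]:

```python
# Each generator has a quadratic cost c_i(p) = a_i*p^2 + b_i*p with box limits.
gens = [  # (a_i, b_i, p_min, p_max) -- hypothetical generator data
    (0.10, 2.0, 0.0, 50.0),
    (0.05, 3.0, 0.0, 80.0),
    (0.20, 1.0, 0.0, 30.0),
]
demand = 100.0

def local_dispatch(a, b, p_min, p_max, lam):
    # Local primal update: argmin_p a*p^2 + b*p - lam*p
    # has the closed form p = (lam - b) / (2a), clipped to the limits.
    return min(max((lam - b) / (2 * a), p_min), p_max)

lam = 0.0
for _ in range(500):  # dual (price) updates by the central agent
    supply = sum(local_dispatch(a, b, lo, hi, lam) for a, b, lo, hi in gens)
    lam += 0.01 * (demand - supply)  # subgradient step toward balance

dispatch = [local_dispatch(a, b, lo, hi, lam) for a, b, lo, hi in gens]
print(round(sum(dispatch), 2), [round(p, 1) for p in dispatch])
```

At convergence, total supply matches the demand and every generator operates at the point where its marginal cost equals the common price lam, without any generator revealing its cost function to the others.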

Energy Storage Systems Control
The utilization of regenerative braking energy in urban railway systems is essential to prevent pantograph voltage rise. Zhu et al. [78] proposed a decentralized cooperative control for energy storage systems in urban railways using deep reinforcement learning. The training phase was carried out in a centralized manner, aiming to minimize the global loss function. In the execution phase, each agent performed ε-greedy action selection in a decentralized manner, based on local observations and a local Q-function. The ultimate common goal was to keep the line voltage at the charge voltage threshold during train braking and at the discharge voltage threshold during train powering. The value of the loss function was determined based on the increase in the substation's output energy and the braking resistor loss. Al-Saffar and Musilek [79] used Q-learning in the same manner to control energy storage units and mitigate overvoltage issues caused by high penetration of PV systems.

Other Applications
In addition to the applications discussed above, distributed learning has been used for wide-area monitoring [80], optimal allocation of DC/AC converters and synchronous machines in power systems [81], and technology deployment inside energy hubs [82]. A log-linear learning algorithm was used in a game-theoretic context by Jouini and Sun [81] to perform the DC/AC converter and synchronous machine allocation between generation units by minimizing the steady-state angle deviations of these units from their optimum values. For technology deployment inside energy hubs, both Q-learning and the continuous actor-critic learning automaton were tested by Bollinger and Evins [82] while treating each technology type (gas boiler, battery, combined heat and power (CHP) unit, etc.) as an agent.
Gusrialdi et al. [83] developed a distributed learning method for the estimation of the eigenvectors of the power system after a small signal disturbance in the system. The system was divided into several regions, and each region had its own local estimator to compute the average of the electro-mechanical states. This information was exchanged between local estimators to estimate the eigenvectors of the original system. Al-Saffar and Musilek [84,85] used deep reinforcement learning to solve the optimal power flow (OPF) problem while considering the microgrids as agents. Federated learning has also been used for increasing cybersecurity [86] and customer privacy [87] in non-intrusive load monitoring (NILM). A summary of the discussed distributed learning applications in power systems is given in Table 2.
Gaps and Challenges
Although distributed Q-learning and federated learning have already been used for energy sharing and load forecasting problems, no previous study has used distributed or federated learning for consumer behavior modeling. Numerous studies have used multilayer perceptrons to learn the behavior of heating, ventilation, and air conditioning (HVAC) systems in a centralized or individual manner. However, none of the distributed learning architectures have been tested on this problem yet. In such a model, consumers can serve as agents and the utility can be the central server. The aim is to design the best demand response program for each customer based on their energy usage behavior and network constraints, without directly sharing the data with the utility. Blockchain-based federated learning was used by Zhao et al. [115] to learn the behavior of customers based on their smart home system data. This allowed smart home system manufacturers to receive feedback from customers to better understand their needs and enhance their designs while preserving customer privacy. Federated learning can be used in a similar way to design customized demand response programs for consumers, and this can be recommended as a future research direction.
Fault localization in power systems using artificial intelligence is a complex problem. Most articles in the literature use small case studies to test their centralized machine learning algorithms for fault location detection and ignore the effect that the rest of the system might have on the test region. Evaluating the effectiveness of distributed learning architectures in fault location detection can be a focus of future research. In this problem, the network can be divided into different regions, where each region is an agent, and the final goal is to design a general fault detector for the network by minimizing the detection errors in smaller parts of the system. Blockchain-based federated learning was used by Zhang et al. [116] to detect device failure in the Industrial Internet of Things (IIoT). A similar approach can be adapted to fault location detection, equipment failure detection, and equipment failure prediction in power systems.
Similarly, the anomaly detection, grid topology identification, and state and optimal power flow estimation methods that are present in the literature have only been tested on small case studies while ignoring the effects of the rest of the system. Federated learning can be a good solution for these problems. Considering that it has not been applied to any of these areas yet, it can be a very interesting topic for future studies. In these cases, the network can be divided into different regions to minimize local training errors and reduce the bandwidth and central data storage requirements needed for transferring data to the utility. Nguyen et al. [117] used federated learning to detect anomalies and attacks in IoT systems. In this study, each security gateway was a local training point for anomaly detection and the IoT security service was the central agent. The detection was performed by analyzing the density of the network traffic. A similar method can be developed for anomaly detection in power systems.
It often happens in power systems that various organizations and operators do not want to share raw data with each other for privacy or competition reasons. An example is wind power generation forecasting, where wind turbine operators prefer not to share data with each other. Although this problem has already been addressed in the literature using ADMM and mirror-descent algorithms, it has not been investigated using federated or assisted learning methods. Federated learning is simpler to implement than the ADMM and mirror-descent algorithms. Therefore, comparing the accuracy and time requirements of these algorithms can be a very interesting topic for future studies.
Perhaps the greatest benefit of federated learning in power systems would be to collect heterogeneous data from various sources (AMIs, renewable energy sites, generators, protection equipment, tweets, weather forecasts, etc.) and incorporate these data into a federated framework for better training results or for the development of new software tools and a holistic approach for big data management in power systems. This approach has not been studied yet. No previous study has implemented assisted learning in power systems. Therefore, for future studies, it is recommended to explore the potential of assisted learning in power systems applications as well.
Although distributed/federated learning frameworks significantly increase user privacy by eliminating the need for raw data exchange, they are still susceptible to adversarial attacks [118], poisoning attacks [119], and privacy leakage through the exchanged gradients [120]. This issue can be addressed by using differential privacy and data obfuscation methods; however, this comes at the cost of a reduced convergence rate and accuracy [121,122]. Therefore, further research is needed in this area. Another associated challenge is that, in practice, the data in federated learning are non-i.i.d., so the locally stored data may not represent the population distribution. This can further lead to convergence problems when updates from some clients are missing [123,124]. Moreover, better model aggregation methods for optimizing the performance of distributed/federated learning need to be developed.
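The privacy/accuracy trade-off mentioned above can be made concrete with a minimal sketch of differentially private aggregation: each client update is L2-clipped to bound its influence, and Gaussian noise calibrated to the clipping norm is added to the average (the Gaussian mechanism). The function and parameter names here are illustrative assumptions, not a reference implementation; raising `noise_mult` strengthens privacy but degrades the aggregated model, which is exactly the accuracy cost reported in [121,122].

```python
import numpy as np

def dp_aggregate(updates, clip=1.0, noise_mult=0.5, rng=None):
    """Differentially private aggregation sketch:
    1) clip each client update to L2 norm <= clip,
    2) average the clipped updates,
    3) add Gaussian noise whose scale grows with clip and noise_mult
       and shrinks with the number of participating clients."""
    rng = rng or np.random.default_rng()
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip / (norm + 1e-12)))
    avg = np.mean(clipped, axis=0)
    sigma = noise_mult * clip / len(updates)  # Gaussian-mechanism scale
    return avg + rng.normal(scale=sigma, size=avg.shape)
```

Clipping also limits the damage a single poisoned update can do, which is why the same mechanism is often discussed alongside defenses against the poisoning attacks cited above.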

Conclusions
This paper provides an overview of distributed learning applications in power systems. It first defines three major variants of distributed learning and points out their differences. Then, the studies that have already implemented distributed and federated learning in power systems are discussed. Finally, the challenges, gaps, and potential research directions in this area are identified. We conclude that a major study area in power systems would be to incorporate heterogeneous data from AMIs, renewable energy sites, generators, protection equipment, tweets, weather forecasts, etc. into training models using federated learning. This is a promising path toward a holistic, data-driven approach to electric power utility operation.