Deep Neural Networks in Power Systems: A Review

Abstract: Identifying statistical trends for a wide range of practical power system applications, including sustainable energy forecasting, demand response, energy decomposition, and state estimation, is regarded as a significant task given the rapid expansion of power system measurements in terms of scale and complexity. In the last decade, deep learning has arisen as a new kind of artificial intelligence technique that expresses power grid datasets via an extensive hypothesis space, resulting in outstanding performance in comparison with the majority of recent algorithms. This paper investigates the theoretical benefits of deep data representation in the study of power networks. We examine deep learning techniques described and deployed in a variety of supervised, unsupervised, and reinforcement learning scenarios. We explore different scenarios in which discriminative deep frameworks, such as Stacked Autoencoder networks and Convolutional Neural Networks, and generative deep architectures, including Deep Belief Networks and Variational Autoencoders, solve problems. This study's empirical and theoretical evaluation of deep learning encourages long-term studies on improving this modern category of methods to accomplish substantial advancements in the future of power systems.


Introduction
The precision and dependability of data-driven approaches employed for the management and assessment of power systems are highly dependent on the choice of the data representation (i.e., properties derived from the source data) [1]. Consequently, the majority of issues regarding the use of traditional data-driven algorithms for power systems are centered on the design of preprocessing approaches employing unsupervised dimensionality reduction methods, such as principal component analysis (PCA) [2], linear discriminant analysis (LDA) [3], and t-distributed stochastic neighbor embedding (t-SNE) [4]. Such feature discovery methods substantially raise the computational and memory complexities of data-driven algorithms and lead to inadequate precision, as they are unable to detect extremely complex and highly fluctuating patterns within a dataset's general domain.
Shallow neural networks have several drawbacks compared to deep neural networks. One of the main limitations is their limited representational power. Shallow ANNs are merely capable of learning semi-linear decision boundaries, which restricts their ability to handle highly nonlinear and complex patterns present in many real-world problems [24].
Additionally, shallow networks lack the ability to learn hierarchical representations of features, making it challenging to capture abstract and high-level concepts. However, deep learning, enabled by deep neural networks with multiple hidden layers, has addressed these issues in recent studies [25]. Deep neural networks can learn highly nonlinear mappings and capture intricate relationships in the data. The multiple layers in deep networks enable the automatic extraction of hierarchical features, allowing them to capture both low-level and high-level abstractions [26]. In lieu of a formal preprocessing strategy, deep learning studies form a multi-layer ANN with more than one hidden layer composed of several complex computational activations [24]. The deep ANN variables (i.e., the weights and biases) are usually learned in a greedy unsupervised or semi-supervised step-by-step manner [27], in which each layer extracts dynamic features from the features calculated by the preceding layer.
On the basis of conceptual considerations, deep learning methods suggested for use in smart grid-related purposes are typically classified into three main categories: (1) Discriminative methods: The objective of discriminative ANNs is to directly acquire a highly complex decision boundary between distinct classes, as well as regression patches, in power grid measurement datasets [28]. The rectified linear unit (ReLU) ANN [29], proposed for immediate reliability management response, is highlighted in this group of models. Because of its high adaptation potential and low computational burden, the ReLU ANN is used for real-time small-signal stability evaluation [30], defective line location [31], and phasor measurement unit (PMU)-driven incident categorization [32]. In addition, the Stacked Autoencoder (SAE) has been proposed as an intensely nonlinear variant of PCA for unsupervised identification of statistical patterns and structures in wind energy forecasting [33], solar energy estimation [34], fault classification [35], and transient stability assessment [36,37]. In addition, the Long Short-Term Memory (LSTM) neural architecture is proposed as a supervised time-related feature extraction technique with a deep recursive structure to model the ordered pattern of time-varying power system observations [38-40]. LSTM-based ordered models have been put forward for wind and solar generation predictions [41], demand modeling utilizing universal statistics [42], immediate power fluctuation recognition [43], electricity consumption prediction [44], energy decomposition into electrical devices [45], sustainable energy prediction [46], as well as fault recognition [47]. Because of the nature of their convolutional and pooling procedures, Convolutional Neural Network (CNN) models are highly effective at capturing patterns of coherence in power system measurements [48]. Applying predictive convolutional filters, the CNN identifies significant spatial and temporal relationships among observations [49]. The combination of pooling and convolutional layers in this form of artificial neural network encompasses the location of measurements into their chronological attributes to achieve spatiotemporal objectives in the fields of sustainable energy prediction [50], transient stability evaluation [51], harmonic component assessment [52], fault identification [53], and short-term voltage stability inspection [54].
(2) Probabilistic deep artificial neural networks view feature modeling as a technique for discovering a minimal set of concealed factors that best characterizes the probability density function (PDF) of the information being studied [55]. The PDF is subsequently converted to the problem's target discrete class or real number. The Deep Belief Network (DBN) is a widely recognized stochastic graphical system within this group that discovers the PDF of the dataset considering its conditionally independent hidden variables [56]. In this framework, Gibbs sampling is usually employed to acquire the features that are necessary to offer an accurate estimate of the stochastic patterns of the provided data for uncertain systems that must account for massive uncertainty elements. Wind and solar energy forecasting [57], transient stability evaluation [58], long-term and mid-term demand forecasting [59], and stochastic power grid state estimation [60] are the primary applications of DBN-based techniques. In addition, in this class of frameworks, the Generative Adversarial Network (GAN) is presented, which takes samples from an estimated PDF and contrasts them with the actual information in the source data in order to improve the preciseness of the learned PDF. This approach has recently been applied to significant anomaly and fault identification problems for small-sample turbine datasets [61] and distributed energy system cybersecurity [62] due to its ability to effectively acquire the primary features of the PDF. Moreover, because GANs can generate datasets using samples derived from the estimated PDF, these models have recently been applied to model-free sustainable energy scenario synthesis challenges [63]. The Variational Autoencoder (VAE) is introduced as an innovative variant of deep generative artificial neural networks that discovers the PDF of the collected data through the discovery of a high-dimensional hidden variable that is linked to the raw data examples in the original dataset. It has been demonstrated that the VAE produces accurate artificial data for power grid simulation [64], unsupervised identification of anomalies in energy historical datasets [13], and electric vehicle demand generation [65]. (3) Deep Reinforcement Learning (DRL) methods are an important category of machine learning techniques that aim to discover the best policy for continuous/discrete action selection based on the environment's feedback (i.e., reward) calculated through a function that rewards intelligent agents' behavior. This function indicates, according to the present condition of the system, the degree to which the machine learning task's objective and constraints have been met. As opposed to traditional deep learning, which estimates a discrete objective function for classifiers and a real-valued objective function for regression, DRL seeks to optimize a generic objective function specified by the training scenario in a fully observable or partially observable setting. Consequently, this technique solves more comprehensive groups of problems than traditional deep learning. Due to its reward-based structure, DRL is extensively used for a variety of control challenges, such as voltage control [66], adaptable emergency control [67], and self-learning control for energy-efficient transportation [68]. DRL is also applied to optimization tasks for learning ideal pricing techniques in electricity markets [69], demand response plans for the management of energy [70-72], and determining the best wind and storage collaborative scheme to reduce the impact of unpredictability in sustainable energy production in electric networks [73]. Moreover, the detection and classification of cybersecurity threats [74], dynamic energy allocation [75], and power grid data security protection [73] are recent applications of this group of techniques.
In the area of power grid research, this review paper investigates three of the main types of deep neural networks. Section 2 introduces the deep discriminative architectures and their training procedure. Employing several practical applications and power system datasets, this paper explains and compares, analytically and empirically, various variants of this machine learning category of approaches. In Section 3, probabilistic deep neural architectures and techniques, including the traditional DBN and its Gaussian variant, in addition to the recently developed GANs and VAEs, are introduced. In this section, the practical and conceptual benefits of these methods are presented. The paper then examines DRL algorithms and their broad use in power network operation and management in Section 4. Section 5 presents the emerging topics and new challenges in the area of deep learning. Finally, Section 6 concludes with a discussion of the findings and future machine learning tasks in this domain.

Discriminative Deep Architectures
Discriminative machine learning is a powerful tool for solving supervised problems in which the goal is to learn a mapping between input features and output labels. In this approach, we aim to directly model the conditional probability distribution of the output given the input, also known as the posterior probability, using a discriminative function [76]. In mathematical terms, given a training set of input-output pairs (x_1, y_1), …, (x_n, y_n), the goal is to learn a function f(x) that predicts the output label y given an input x.
The discriminative function f(x) is typically learned by optimizing a loss function that measures the difference between the predicted output and the true output [77]. One common approach for discriminative learning is to use logistic regression, which models the posterior probability of the output label given the input features as a logistic function P(y|x) = σ(w^T x + b), where σ(z) is the sigmoid function, w is a vector of weights, b is a bias term, and x is a vector of input features [78]. The sigmoid function maps the output of the linear function w^T x + b to the range [0, 1], which can be interpreted as a probability. The logistic regression model can be trained using maximum likelihood estimation [79], which involves maximizing the log-likelihood of the training data with n training samples, L(w, b) = ∑_{i=1}^{n} log P(y_i | x_i; w, b), where P(y_i | x_i; w, b) is the posterior probability of the output label y_i given the input x_i corresponding to sample i in the dataset, parameterized by the weights w and bias b.
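As an illustrative sketch (not drawn from any of the cited works), the maximum likelihood training procedure described above can be implemented with plain gradient ascent on the log-likelihood; the toy dataset and hyperparameters below are assumptions chosen for demonstration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit P(y|x) = sigmoid(w^T x + b) by maximizing the mean log-likelihood
    (equivalently, minimizing the cross-entropy) with gradient ascent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)       # predicted P(y=1|x) for each sample
        grad_w = X.T @ (y - p) / n   # gradient of mean log-likelihood w.r.t. w
        grad_b = np.mean(y - p)      # gradient w.r.t. the bias
        w += lr * grad_w             # ascend the log-likelihood
        b += lr * grad_b
    return w, b

# Toy linearly separable data: label is 1 when the feature sum is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)
w, b = train_logistic_regression(X, y)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Because the data are separable along the direction (1, 1), the learned weight vector aligns with that direction and the training accuracy approaches 1.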
The multilayer perceptron (MLP) is a type of discriminative ANN that consists of multiple layers of neurons, each layer being fully connected to the previous and next layers. The MLP is a powerful tool for solving a wide range of supervised learning problems, including classification, regression, and pattern recognition [80]. Mathematically, an MLP can be represented as a function f(x) that maps an input vector x to an output vector y, where y is a function of the weighted sum of the activations of the neurons in the previous layer. Let us consider a feedforward MLP with L layers, where the input layer has d input neurons, the output layer has k output neurons, and each hidden layer has m neurons. The output of the j-th neuron in the l-th layer, denoted as a_j^(l), is given by a_j^(l) = g(∑_{i=1}^{m^(l−1)} w_ij^(l) a_i^(l−1) + b_j^(l)), where g(z) is an activation function, w_ij^(l) is the weight connecting the i-th neuron in layer l − 1 to the j-th neuron in layer l, b_j^(l) is the bias term for the j-th neuron in layer l, and m^(l−1) is the number of neurons in the previous layer. The activation function g(z) introduces nonlinearity to the network, allowing it to learn complex mappings between the input and output.
To train an MLP, we need to minimize a loss function that measures the difference between the predicted output and the true output. One common loss function is the mean squared error (MSE) [81], which is defined as L = (1/n) ∑_{i=1}^{n} (y_i − f(x_i))², where n is the number of training samples, y_i is the true output for the i-th sample, and f(x_i) is the predicted output for the i-th sample. To minimize the loss function, one can use stochastic gradient descent (SGD) [26], which updates the weights and biases of the network in the direction of the negative gradient of the loss function with respect to the weights and biases.
The weight update rule for the j-th neuron in the l-th layer is given by w_ij^(l) ← w_ij^(l) − η ∂L/∂w_ij^(l), where η is the learning rate, which controls the step size of the weight update, and ∂L/∂w_ij^(l) is the partial derivative of the loss function with respect to the weight w_ij^(l). The vanishing gradient problem is a common issue in training deep multilayer perceptrons, as the gradients of the loss function with respect to the weights in the lower layers tend to become very small as they propagate backward through the network [46]. This can cause the weights in the lower layers to be updated very slowly, or not at all, leading to poor performance. Deep learning methods address this issue by introducing specialized activation functions, weight initialization schemes, and optimization techniques that are specifically designed to prevent the gradients from vanishing or exploding [24]. For example, activation functions like ReLU and its variants have a non-zero derivative in most regions of their domain, which helps to mitigate the vanishing gradient problem. Additionally, weight initialization schemes such as He initialization [82] and optimization techniques such as adaptive learning rate methods (e.g., Adam) [83] have also been shown to improve training performance in deep neural networks. These techniques have enabled the training of very deep neural networks with many layers, allowing them to learn complex patterns and representations in the data.
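A minimal sketch of the gradient-based weight update described above, for a one-hidden-layer MLP trained on the MSE loss (the target function y = x², the tanh activation, and all hyperparameters are illustrative assumptions, not from the reviewed literature):

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_forward(x, params):
    """One-hidden-layer MLP: hidden activations a = tanh(W1 x + b1),
    linear output for regression."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2, h

def sgd_step(x, y, params, eta=0.05):
    """One full-batch step of w <- w - eta * dL/dw for the 0.5*MSE loss."""
    W1, b1, W2, b2 = params
    y_hat, h = mlp_forward(x, params)
    err = y_hat - y                    # dL/dy_hat
    gW2 = h.T @ err / len(x)           # output-layer gradients
    gb2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h**2)       # backprop through tanh: tanh'(z) = 1 - tanh(z)^2
    gW1 = x.T @ dh / len(x)
    gb1 = dh.mean(axis=0)
    return [W1 - eta * gW1, b1 - eta * gb1, W2 - eta * gW2, b2 - eta * gb2]

# Fit y = x^2 on [-1, 1] with a 16-unit hidden layer.
x = rng.uniform(-1, 1, size=(256, 1))
y = x**2
params = [rng.normal(scale=0.5, size=(1, 16)), np.zeros(16),
          rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)]
loss0 = np.mean((mlp_forward(x, params)[0] - y)**2)
for _ in range(2000):
    params = sgd_step(x, y, params)
loss1 = np.mean((mlp_forward(x, params)[0] - y)**2)
```

The loss after training (loss1) is substantially lower than the initial loss (loss0), illustrating that repeated application of the update rule drives the network toward the target mapping.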

ReLU Neural Networks
ReLU neural networks are a class of deep neural networks that use the ReLU activation function in their hidden layers. ReLU is a piecewise linear function that returns the input value if it is positive, and zero otherwise. It has become a popular choice for neural networks due to its simplicity, computational efficiency, and effectiveness in preventing the vanishing gradient problem. One of the main advantages of ReLU activation functions is that they have a non-zero derivative for all positive inputs, which helps to prevent the vanishing gradient problem. Additionally, the ReLU function is computationally efficient to evaluate, as it only involves a simple thresholding operation. This makes it well-suited for large-scale neural networks that require many activations to be computed during training and inference. Another advantage of ReLU neural networks is their ability to learn sparse representations of the input data [84]. Since the ReLU function sets negative values to zero, the activations of many neurons in the network will be zero for most input samples, resulting in a sparse representation. This can lead to faster training and improved generalization performance, as the network focuses on the most relevant features of the data. However, ReLU activation functions are not without their drawbacks. One issue is that they can suffer from the "dying ReLU" problem [85], where the gradient of the function is zero for all negative inputs, causing the neuron to stop learning. This can happen if the weights of the neuron are initialized such that it only receives negative inputs. To address this issue, various modifications to the ReLU function have been proposed, such as leaky ReLU [86], which adds a small slope to the negative part of the function.
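The two activations discussed above can be sketched in a few lines; the slope value alpha = 0.01 for leaky ReLU is a common illustrative choice, not a prescription from the cited works:

```python
import numpy as np

def relu(z):
    """ReLU: returns the input for positive values and zero otherwise."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: keeps a small slope alpha on the negative side,
    so the gradient never becomes exactly zero ("dying ReLU" mitigation)."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
r = relu(z)          # -> [0., 0., 0., 0.5, 2.]
lr = leaky_relu(z)   # -> [-0.02, -0.005, 0., 0.5, 2.]

# Sparsity: for zero-mean inputs, roughly half the ReLU activations are exactly zero.
frac_zero = np.mean(relu(np.random.default_rng(0).normal(size=10000)) == 0)
```

The last line illustrates the sparsity argument made above: about half of the activations of a ReLU unit fed zero-mean inputs are exactly zero.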
Table 1 shows the applications of discriminative deep neural networks for power systems operation, management, and planning. Due to their high generalization power, deep ReLU networks are widely applied in power systems classification and regression problems. For instance, as shown in Table 1, deep ReLU networks result in accurate voltage stability assessments with a root mean squared error (RMSE) of 0.083 and mean absolute percentage error (MAPE) equal to 0.095 on the New England 10-generator 39-bus system [29]. In addition, deep ReLU models achieve an RMSE of 0.045 and MAPE of 0.083 for power system reliability assessment of the IEEE-RTS-79 system [87]. Furthermore, this type of deep learning model is employed for load identification and estimation of complex load parameters. In this area, the deep ReLU network is applied to the loads of the 68-bus New England and New York Interconnect System, which yields an RMSE of 0.045 and a MAPE of 0.074 in real-time load modeling tasks [88]. Since the feedforward algorithm of these neural networks takes merely several milliseconds, these approaches can be easily tested on real-world power systems in a real-time fashion. The use of this approach for sustainable energy prediction results in high accuracy and reliability for both wind and solar energy prediction tasks [89,90]. The RMSE and MAPE of the deep ReLU network for hourly wind power prediction are 0.078 and 0.092, respectively. These results are reported for the Wind Integration National Dataset. Additionally, the RMSE and MAPE of this approach for hourly photovoltaic power prediction are 0.093 and 0.104 using the Solar Power Data for Integration Studies. Energy disaggregation is another application in which the deep ReLU network shows high performance. This technique results in an F-score of 68.72 with a precision equal to 80.54 on the Reference Energy Disaggregation Dataset [16]. Similar high classification accuracies are reported for fault identification [91]. For instance, the use of deep ReLU networks for fault detection in the New England 39-bus test system results in an F-score of 77.49, which is a reasonable accuracy for real-world scenarios.

Stacked Autoencoder
Stacked autoencoder (SAE) neural networks are a class of deep neural networks that use unsupervised learning to pretrain multiple layers of the network before fine-tuning the entire network with supervised learning [45]. Autoencoders are a type of neural network that learns to compress and reconstruct the input data. Stacked autoencoders use multiple layers of autoencoders to learn increasingly complex representations of the input data. The neural architecture of a stacked autoencoder consists of multiple autoencoders, where each autoencoding building block consists of an encoder network and a decoder network. The encoder network maps the input data to a compressed representation, while the decoder network maps the compressed representation back to the original input data. The objective of the autoencoder is to minimize the reconstruction error between the input and the output, which is typically measured using a loss function such as the mean squared error (MSE) [24,119].
The first layer of the stacked autoencoder is trained as a standard autoencoder, using the input data as both the input and the target output. The compressed representation learned by this layer is then used as the input to the next layer of the network. This process is repeated for each subsequent layer of the network, with each layer learning to compress and reconstruct the output of the previous layer. Once all the layers have been pretrained in an unsupervised manner, the entire network is fine-tuned using supervised learning to learn the final classification or regression task. This involves adding one or more output layers to the network, and updating all the weights of the network using backpropagation and a supervised loss function such as cross-entropy or mean squared error.
The pretraining phase of the stacked autoencoder allows the network to learn useful, highly nonlinear feature representations of the input data in an unsupervised manner. This can be particularly useful for tasks with limited labeled data [120], as the network can learn to extract relevant features from the unlabeled data to improve its performance on the labeled data. Additionally, the use of multiple layers in the stacked autoencoder allows for the learning of hierarchical representations of the data, with each layer learning to extract increasingly abstract and complex features [24].
The mathematical equations for the pretraining phase of the stacked autoencoder can be written as follows. Let x be the input data, and f_i and g_i be the encoder and decoder functions for the i-th layer of the network (i.e., the i-th autoencoder), respectively. The compressed representation h_i learned by the i-th layer is given by h_i = f_i(h_{i−1}), with h_0 = x. The objective of the i-th autoencoder is to minimize the reconstruction error between its input h_{i−1} and the reconstructed output ĥ_{i−1} = g_i(h_i), which is typically measured using the mean squared error L_i = ||h_{i−1} − ĥ_{i−1}||². The pretraining phase involves minimizing the reconstruction error for each layer of the network, using the compressed representation h_i learned by the previous layer as the input. The weights of each layer are updated using backpropagation through the decoder network, with the gradients of the reconstruction error with respect to the weights being backpropagated through the decoder network and then through the encoder network. Once all the layers have been pretrained in this manner, the entire network is fine-tuned using supervised learning, by adding one or more output layers and updating all the weights of the network using backpropagation and a supervised loss function.
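The greedy layer-wise pretraining procedure above can be sketched as follows; this is a minimal illustration with assumed shapes, a tanh encoder with a linear decoder, synthetic low-rank data, and hand-picked hyperparameters, not an implementation from the cited studies:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(H_prev, n_hidden, eta=0.1, epochs=1500):
    """Greedy pretraining of one autoencoder block: learn encoder f_i and
    decoder g_i so that g_i(f_i(h_{i-1})) reconstructs h_{i-1} with minimal MSE."""
    n, d = H_prev.shape
    We = rng.normal(scale=0.1, size=(d, n_hidden)); be = np.zeros(n_hidden)
    Wd = rng.normal(scale=0.1, size=(n_hidden, d)); bd = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(H_prev @ We + be)        # encoder f_i
        R = H @ Wd + bd                      # linear decoder g_i
        err = (R - H_prev) / n               # gradient of 0.5 * MSE
        gWd = H.T @ err; gbd = err.sum(axis=0)
        dH = err @ Wd.T * (1 - H**2)         # backprop through tanh
        gWe = H_prev.T @ dH; gbe = dH.sum(axis=0)
        We -= eta * gWe; be -= eta * gbe
        Wd -= eta * gWd; bd -= eta * gbd
    codes = np.tanh(H_prev @ We + be)
    mse = np.mean((codes @ Wd + bd - H_prev) ** 2)
    return codes, mse

# Stack two autoencoders: each block compresses the codes of the previous one.
X = 0.5 * (rng.normal(size=(300, 4)) @ rng.normal(size=(4, 8)))  # rank-4 inputs
h1, mse1 = train_autoencoder(X, 6)    # 8 -> 6 compression
h2, mse2 = train_autoencoder(h1, 4)   # 6 -> 4 compression
```

Because the synthetic inputs have rank 4, both compression stages can reconstruct their inputs with a reconstruction error well below the input variance; in a full SAE, the stacked encoders would then be fine-tuned with a supervised output layer.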
As shown in Table 1, SAEs are widely employed in power systems studies due to their simple implementation and the strong unsupervised features captured by these methods. Generally speaking, SAEs provide better classification and regression accuracies in comparison to the deep ReLU model, as these techniques seek to learn an unsupervised set of features from the data before mapping the input measurements to the supervised (desired) output. Table 1 shows that the SAE yields an RMSE of 0.071 and MAPE of 0.086 for voltage stability assessments of the New England system [92]. Moreover, as shown in the table, the SAE improves on the accuracy of the deep ReLU network in power system reliability evaluation and load modeling, as well as sustainable energy prediction. For instance, the SAE yields a 28.8% better RMSE and 14.5% better MAPE compared to the deep ReLU network in the reliability assessment of the IEEE-RTS-79 system [87]. Furthermore, this neural network yields an RMSE of 0.085 and MAPE of 0.093 for the hourly prediction of photovoltaic energy using the Solar Power Data for Integration Studies [107].

Long Short-Term Memory Network
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) that are designed to handle the vanishing and exploding gradient problems that can occur in traditional RNNs [121,122]. LSTMs use a combination of memory cells and gating mechanisms to selectively remember or forget information from previous time steps, allowing them to capture long-term dependencies in sequential data. At each time step t, an LSTM cell receives an input x_t and a hidden state h_{t−1} from the previous time step, and produces an output y_t and a new hidden state h_t. The LSTM cell consists of several gates that control the flow of information into and out of the cell. These gates include the input gate i_t, the forget gate f_t, and the output gate o_t. The input gate determines how much of the input x_t should be added to the cell state c_t. It takes as input the current input x_t and the previous hidden state h_{t−1} and produces an activation a_t^i that is passed through a sigmoid activation function to produce the gate output i_t: i_t = σ(a_t^i), with a_t^i = W_i x_t + U_i h_{t−1} + b_i, where W_i and U_i are weight matrices and b_i is a bias vector (and similarly for the other gates). The forget gate determines how much of the previous cell state c_{t−1} should be retained in the current cell state c_t. It takes as input the current input x_t and the previous hidden state h_{t−1}, and produces an activation a_t^f that is passed through a sigmoid activation function to produce the gate output f_t: f_t = σ(a_t^f), with a_t^f = W_f x_t + U_f h_{t−1} + b_f. The cell state c_t is updated by: c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c). The output gate determines how much of the cell state c_t should be used to compute the output y_t. It takes as input the current input x_t and the previous hidden state h_{t−1} and produces an activation a_t^o that is passed through a sigmoid activation function to produce the gate output o_t: o_t = σ(a_t^o), with a_t^o = W_o x_t + U_o h_{t−1} + b_o. The final output y_t is computed by multiplying the cell state c_t, after passing it through a tanh activation function, by the output gate o_t: y_t = o_t ⊙ tanh(c_t). LSTM networks can be stacked to create deeper architectures, with each layer consisting of multiple LSTM cells [123]. The hidden state of the previous layer is passed as input to the current layer, allowing the network to learn increasingly abstract and complex representations of the input data [24].
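A single LSTM time step following the gating mechanism described above can be sketched directly; the random weights, sizes, and sequence here are illustrative assumptions used only to exercise the cell:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step. p holds input weights W_*, recurrent weights U_*,
    and biases b_* for the input, forget, output, and candidate paths."""
    i_t = sigmoid(x_t @ p["Wi"] + h_prev @ p["Ui"] + p["bi"])   # input gate
    f_t = sigmoid(x_t @ p["Wf"] + h_prev @ p["Uf"] + p["bf"])   # forget gate
    o_t = sigmoid(x_t @ p["Wo"] + h_prev @ p["Uo"] + p["bo"])   # output gate
    g_t = np.tanh(x_t @ p["Wc"] + h_prev @ p["Uc"] + p["bc"])   # candidate state
    c_t = f_t * c_prev + i_t * g_t                              # cell state update
    h_t = o_t * np.tanh(c_t)                                    # output y_t = h_t
    return h_t, c_t

# Run a randomly initialized 2-unit LSTM over a short 5-step sequence.
rng = np.random.default_rng(0)
d, m = 3, 2   # input size, hidden size
p = {k: rng.normal(scale=0.5, size=((d if k[0] == "W" else m), m))
     for k in ["Wi", "Wf", "Wo", "Wc", "Ui", "Uf", "Uo", "Uc"]}
p.update({b: np.zeros(m) for b in ["bi", "bf", "bo", "bc"]})
h, c = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(x_t, h, c, p)
```

Note that since the output is o_t ⊙ tanh(c_t) with o_t in (0, 1), every component of the hidden state stays strictly inside (−1, 1), regardless of how large the cell state grows.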
Due to their high generalization power and great adaptation to new time-dependent patterns and scenarios, LSTMs are widely used in power system research, especially for temporal pattern recognition in time series datasets and measurements. For instance, as shown in Table 1, the LSTM shows high accuracy with an RMSE of 0.035 and MAPE of 0.046 for the voltage stability assessment in the New England 10-generator test system [93]. Furthermore, this model shows the best performance among discriminative neural architectures, with an RMSE of 0.026 and MAPE of 0.033, for power system reliability assessment in the IEEE-RTS-79 system [94]. The LSTM units are also employed for load modeling and have shown state-of-the-art performance with an RMSE of 0.029 and MAPE equal to 0.032 for load parameter identification in the 68-bus New England and New York Interconnect power grid [39]. In addition, time series pattern recognition tasks such as load forecasting, wind prediction, and solar power prediction are time-dependent tasks where the LSTM has shown state-of-the-art performance [99]. For instance, in the hourly load prediction of the Household Electric Power Consumption dataset, the LSTM network shows great performance with 0.075 RMSE and 0.084 MAPE. Various classification tasks can also be accurately solved through LSTM units. For example, in energy disaggregation, the LSTM network shows a precision of 89.83 and F-score of 75.93, which is close to the state-of-the-art performance [108]. Additionally, in fault identification, the LSTM network shows the best results, with a precision of 91.72 and recall of 77.18, which gives a classification F-score of 83.82 [115].
In addition to recurrent structures, the following non-recursive techniques can be applied to capture temporal dependencies in the data: (1) Feature Engineering: By creating relevant features that represent the temporal characteristics of the data, one can capture time-dependent patterns indirectly. For instance, we can compute statistical features such as rolling means, standard deviations, or exponential moving averages over different time windows. These features can provide insights into trends, seasonality, or other temporal patterns and have been applied to battery state-of-charge estimation [124] and load forecasting [125]. (2) Time Encoding: One can encode the timestamp or time information into a format that can be easily interpreted by machine learning algorithms. For example, we can convert timestamps into cyclical representations such as hour of the day, day of the week, or month of the year. This encoding can help algorithms capture periodic patterns and time dependencies and has been applied to energy disaggregation problems [45]. (3) Nonlinear Transformations: Sometimes nonlinear transformations of the data can reveal time-dependent structures that are not evident in the original form. For instance, we can apply mathematical functions such as logarithmic or exponential transformations to the data. These transformations can help normalize the data or highlight specific temporal patterns. This method has been applied to power system state estimation [126], fault detection [127], and power grid time-domain simulation [128]. (4) Dynamic Time Warping (DTW): DTW is a technique used to measure the similarity between two time series, even when they have different lengths or shapes. This method allows for finding the optimal alignment between two sequences, which can help capture time dependencies. DTW can be useful when we have similar patterns occurring at different time scales or with time shifts. This technique is widely applied to peak load prediction [129], fault diagnosis [130], and controlled islanding [131].
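The first two techniques in the list above (rolling-window features and cyclical time encoding) can be sketched in a few lines; the five-sample "load" series is a hypothetical placeholder, not data from any cited study:

```python
import numpy as np

def rolling_mean(x, w):
    """Rolling mean over a window of w samples via a cumulative sum;
    output has length len(x) - w + 1."""
    c = np.cumsum(np.insert(x, 0, 0.0))
    return (c[w:] - c[:-w]) / w

def encode_hour(hour):
    """Cyclical (sin, cos) encoding of hour-of-day, so hour 23 sits
    next to hour 0 in feature space."""
    angle = 2 * np.pi * hour / 24.0
    return np.sin(angle), np.cos(angle)

load = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical hourly load values
rm = rolling_mean(load, 3)                   # -> [2., 3., 4.]

s0, c0 = encode_hour(np.array([0]))
s23, c23 = encode_hour(np.array([23]))
# Hours 0 and 23 are close in the encoded space, unlike the raw values 0 and 23.
dist = np.hypot(s0 - s23, c0 - c23)
```

The small encoded distance between hours 0 and 23 is exactly the property that lets a non-recurrent model pick up daily periodicity from timestamp features.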

Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a type of neural network that is specifically designed for the classification and regression of vectors as well as two-dimensional data points. CNNs use convolutional layers to extract features from vectors and matrices, and pooling layers to reduce the spatial dimensions of the feature maps [95,96,132]. The final layers of the CNN consist of fully connected layers that perform the classification or regression task. The key components of a CNN are the convolutional layers, which perform the convolution operation on the input data using a set of learnable filters. The output of the convolution operation is a feature map that represents the activation of the filter at each location of the input data. The filter weights are learned during the training process via backpropagation [101].
The convolution operation can be represented mathematically by h_{i,j} = g(∑_{p=1}^{k} ∑_{q=1}^{k} w_{p,q} x_{i+p−1, j+q−1} + b), where x_{i,j} is the pixel value at location (i, j) of the input data, w_{p,q} is the weight of the filter at position (p, q), b is the bias term, and k is the size of the filter. The output feature map h_{i,j} is obtained by passing the result of the convolution operation through a nonlinear activation function g, such as the rectified linear unit (ReLU) function [96]. After each convolutional layer, a pooling layer is typically added to reduce the spatial dimensions of the feature maps. The most common type of pooling layer is the max pooling layer, which takes the maximum value within a local region of the feature map. The max pooling operation can be represented mathematically as h_{i,j} = max_{(p,q) ∈ R_{i,j}} x_{p,q}, where R_{i,j} is the local region of the feature map associated with output location (i, j). The final layers of the CNN consist of fully connected layers, which perform the supervised task. The output of the last convolutional layer is flattened into a vector and passed through a series of fully connected layers, each of which applies a linear transformation to the input followed by a nonlinear activation function [115]. The final output of the network is a probability distribution over the different classes in classification or a real-valued prediction in regression.
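A minimal numpy sketch of the convolution and max-pooling operations described above (the 4×4 input and the 2×2 filter are hypothetical values chosen for illustration; like most deep learning libraries, the code computes cross-correlation and calls it convolution):

```python
import numpy as np

def conv2d(x, w, b=0.0):
    """Valid 2-D convolution (cross-correlation) followed by ReLU,
    matching h_ij = g(sum_pq w_pq * x_{i+p-1, j+q-1} + b)."""
    k = w.shape[0]
    H, W = x.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+k, j:j+k] * w) + b
    return np.maximum(out, 0.0)          # ReLU activation g

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling: takes the maximum within
    each local region of the feature map."""
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.array([[0.0, 1.0], [1.0, 0.0]])   # hypothetical 2x2 filter
fmap = conv2d(x, w)                      # 3x3 feature map
pooled = max_pool(fmap)                  # 1x1 map after 2x2 pooling
```

Each feature map entry here sums the "upper-right" and "lower-left" neighbors of a location, and the pooling stage keeps only the largest activation in each 2×2 region, shrinking the spatial dimensions exactly as described above.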
As shown in Table 1, CNNs are widely applied for various applications in power systems due to their high generalization power, feature sparsity, and the spatial and temporal coherence assumption of this model, which usually holds in most real-world input measurements. For instance, a CNN combined with LSTM networks for the assessment of voltage stability of the New England 10-generator 39-bus system yields an RMSE of 0.021 and a MAPE of 0.038 [28]. This is the state-of-the-art performance compared with other deep neural architectures, including SAEs and LSTMs. The main reason for such accuracy improvements is that the CNN is a cutting-edge solution for capturing spatial data dependencies, while the LSTM is among the best models for modeling temporal dependencies in power system measurements [96]. CNNs also show great performance in the reliability assessment of power grids. These methods improve the results of the deep ReLU network by 31.1% RMSE and 18.1% MAPE on the IEEE-RTS-79 system [95]. Similar improvements are shown in load modeling and demand forecasting [98]. In sustainable energy prediction, the CNN provides higher accuracy than deep ReLU networks and SAEs, since the CNN can capture the spatial dependencies between the wind or solar power measurements of neighboring sites, while the classic deep ReLU and SAE-based techniques can merely handle one station at a time and do not have enough capacity to cover all stations in a wide region. Moreover, the sparsity of CNN features means this model needs fewer training samples to achieve high accuracy. The CNN also improves the energy disaggregation accuracy of the SAE by 5.9% precision and 3.76% recall on the Reference Energy Disaggregation Dataset [15]. This model also shows state-of-the-art classification performance for PMU event categorization on the IEEE 34-bus system, with a precision of 89.15 and recall of 73.36 when it is incorporated with LSTM units [112]. Similarly high accuracies can be seen in fault identification studies, where the CNN provides a high F-score of 82.10 for the New England 39-bus system [116].

Strengths and Shortcomings of Discriminative Deep Architectures
Figure 1 outlines the strengths and weaknesses of various deep discriminative models within the context of power system research. As detailed in the figure, the ReLU neural network offers a simple implementation characterized by low training time complexity and fast feedforward mechanisms. Nevertheless, this supervised model does not inherently account for temporal or spatial features within datasets. ReLU networks lack explicit mechanisms for capturing spatial information in data. They are primarily designed for processing vector inputs, where the spatial relationships between the input features are not considered. That is, the ReLU operations do not assume any similarity metrics between their input variables (i.e., vector entries). Therefore, ReLU networks may struggle to effectively handle data with inherent spatial coherence, such as images or sequential data. On the other hand, CNNs are specifically designed to capture spatial coherence in data, since they apply convolutional layers that define a set of learnable filters or kernels over local regions of the input data, scanning the entire input through the filters to extract relevant features [24]. The SAE shares a similar limitation, as it cannot directly capture spatial or temporal patterns. However, the SAE is well suited for unsupervised learning or learning with limited data. Unlike the SAE, the LSTM can process temporal data due to its recurrent structure, which enables the model to represent time-dependent relationships and variable-sized inputs. However, due to the larger number of parameters in the LSTM compared to conventional recurrent neural networks, the LSTM runs a higher risk of overfitting and shows significant sensitivity to observation noise. Finally, the CNN leverages filtering and pooling layers to extract informative sparse representations from spatial datasets using straightforward gradient-based methods. Given that the filters can be trained in a distributed fashion, the CNN emerges as a highly efficient methodology for pattern recognition in large-scale systems. However, due to the lack of a recursive structure in this supervised model, it struggles to accurately capture time-dependent structures in the data.

Probability-Based Deep Neural Architectures
Probability-based deep neural architectures, also known as Generative Neural Networks (GNNs), are a class of neural networks capable of generating new data samples that resemble the input data [132,133]. GNNs can also learn the probability distribution of the input data, meaning that a GNN can generate new samples that not only resemble the input data but also follow the same statistical patterns. Therefore, this class of neural networks is widely used for probabilistic estimation tasks such as probabilistic load modeling [134], uncertainty-aware sustainable energy prediction [135], and power grid synthesis [136].

Deep Belief Network
Deep Belief Networks (DBNs) are a class of neural networks composed of multiple layers of Restricted Boltzmann Machines (RBMs). RBMs are generative models that consist of two layers: a visible layer and a hidden layer. The visible layer represents the input data, and the hidden layer represents the latent variables that capture the underlying structure of the input data. RBMs are trained using the Contrastive Divergence algorithm [137], which maximizes the log-likelihood of the training data.
DBNs are trained using a greedy layer-wise approach, where each layer is trained independently using an unsupervised learning algorithm such as Contrastive Divergence. After each layer is trained, the output of the previous layer is used as the input to the current layer. Once all the layers are trained, the DBN can be fine-tuned using a supervised learning algorithm such as backpropagation.
The probability of a joint configuration of visible and hidden units in an RBM is defined through the energy function:

E(v, h) = -\sum_{i=1}^{V} a_i v_i - \sum_{j=1}^{H} b_j h_j - \sum_{i=1}^{V} \sum_{j=1}^{H} v_i w_{ij} h_j

where v is the visible layer, h is the hidden layer, V is the number of visible units, H is the number of hidden units, w_{ij} is the weight between visible unit i and hidden unit j, a_i is the bias of visible unit i, and b_j is the bias of hidden unit j.
The joint probability distribution of the visible and hidden layers is given by the Boltzmann distribution:

P(v, h) = \frac{1}{Z} e^{-E(v, h)}

where Z is the partition function, which normalizes the distribution. The conditional probability of the hidden layer given the visible layer is:

P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{V} v_i w_{ij}\right)

and the conditional probability of the visible layer given the hidden layer is:

P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{H} w_{ij} h_j\right)

where σ is the sigmoid function. As noted above, DBNs are trained using a greedy layer-wise approach based on Contrastive Divergence. This training process is composed of two phases: pretraining and fine-tuning.
In the pretraining phase, each layer is trained independently as an RBM. The input to each RBM is the output of the previous layer, or the raw input data in the case of the first layer. The Contrastive Divergence algorithm is used to maximize the log-likelihood of the training data by adjusting the weights and biases of the RBM. The goal of pretraining is to learn a set of feature detectors that capture the underlying structure of the input data. Once all the layers have been pretrained, the DBN is fine-tuned using a supervised learning algorithm such as backpropagation. During the fine-tuning phase, the weights and biases of the entire network are adjusted to minimize a cost function, typically the cross-entropy between the predicted labels and the true labels. The output of the DBN is obtained by passing the input data through each layer, and the output of the final layer is used as the prediction.
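The pretraining update can be sketched as a single CD-1 step for a small binary RBM in NumPy, using the conditional probabilities defined above. The layer sizes, learning rate, and the single training sample are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr, rng):
    """One Contrastive Divergence (CD-1) update for a binary RBM.
    v0: (V,) visible sample; W: (V, H) weights; a: (V,), b: (H,) biases."""
    # Positive phase: P(h = 1 | v0), then sample the hidden units
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visibles and back up
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Approximate log-likelihood gradient: <v h>_data - <v h>_model
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return W, a, b

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((4, 3))  # small Gaussian initialization
a, b = np.zeros(4), np.zeros(3)
v0 = np.array([1.0, 0.0, 1.0, 0.0])     # one toy binary training sample
W, a, b = cd1_step(v0, W, a, b, lr=0.1, rng=rng)
```

In DBN pretraining this update is repeated over the training set for each layer in turn, with the hidden activations of the trained layer becoming the visible data of the next.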
The advantage of the greedy layer-wise approach is that it allows the DBN to learn a hierarchy of features, where each layer captures increasingly complex and abstract features [24]. This makes the DBN more robust to variations in the input data and enables it to generalize well to new, unseen data [57]. However, the greedy layer-wise approach may not always lead to the optimal solution, and the pretraining phase may take a long time to converge. During pretraining, the convergence of each DBN layer may be slow due to the following challenges: (1) Initialization: RBMs are sensitive to weight initialization [57]. Initializing the weights inappropriately can result in slow convergence or even convergence to poor local optima. Proper initialization techniques, such as sampling from a small Gaussian distribution or using unsupervised pretraining, can help accelerate convergence. (2) Learning rate: The learning rate determines the step size of each update during training. If the learning rate is too high, it can cause unstable updates and hinder convergence; if it is too low, convergence may become excessively slow. Finding an appropriate learning rate, or using adaptive learning rate methods, can be crucial for efficient convergence [25]. (3) Sampling method: RBMs use sampling techniques, such as Gibbs sampling or Contrastive Divergence, to approximate the model's distribution and update the weights. The number of sampling steps can significantly impact convergence: a higher number of steps improves the approximation but increases the computational cost, potentially slowing down training [58,59]. (4) Local optima: As in many optimization problems, RBMs can get stuck in local optima, where the network fails to find the global minimum of the objective function and converges to a suboptimal solution, leading to slower convergence or reduced performance. Exploring different weight initializations, employing techniques such as momentum or simulated annealing, or using more advanced optimization algorithms can help overcome local optima and improve convergence [24].
DBNs have been widely applied to various power system problems in recent studies. Table 2 shows the applications of probabilistic deep neural architectures in power systems research. As shown in this table, the DBN is applied to the transient stability assessment problem of the Central China Regional Power Grid, yielding a precision of 80.24, recall of 78.92, and F-score of 79.57 [138]. Moreover, the DBN has been shown to accurately learn probabilistic features from time-dependent datasets [57]. For instance, the DBN provides highly accurate hourly demand predictions on the Texas Urbanized Area Dataset, with an RMSE and MAPE of 0.045 and 0.091, respectively [139]. Additionally, this model has shown great prediction accuracy for hourly sustainable power prediction tasks. For example, on the Solar Integration National Dataset, the DBN shows a prediction RMSE of 0.082 and a MAPE of 0.093 [140]. State estimation is another major application of DBNs in power engineering research. This deep learning framework yields an RMSE and MAPE of 0.092 and 0.156, respectively, on the US PG&E69 power system [11]. Moreover, the DBN is employed for classification tasks such as fault identification and cyberattack detection. In fault classification, the DBN provides a classification precision, recall, and F-score of 81.32, 76.59, and 78.88, respectively, when applied to the IEEE 33-bus distribution network [13]. In cyberattack categorization, the DBN yields an F-score of 79.44, which shows the great generalization capacity and highly nonlinear decision boundary of this type of deep learning technique in classification tasks [141]. Similarly high accuracies are reported for data synthesis problems. For instance, this model shows an RMSE of 0.085 and a MAPE of 0.125 in the synthesis of the Columbia University Synthetic Power Grid [121]. Furthermore, it offers an F-score of 74.75 for the disaggregation of residential loads in the Reference Energy Disaggregation Dataset [45].

Variational Autoencoder
Variational Autoencoders (VAEs) are a class of generative neural networks that use an encoder and a decoder to learn a low-dimensional representation of high-dimensional data. VAEs are commonly used for image and video generation but can also be applied to other types of data, such as text and audio. The VAE architecture consists of two neural networks: an encoder network, q_φ(z|x), which maps the input data, x, to a low-dimensional latent representation, z, and a decoder network, p_θ(x|z), which maps the latent representation, z, back to the original data space. The objective of the VAE is to learn the joint probability distribution of the input data and the latent representation, p(x, z), and use it to generate new samples. The encoder network maps the input data, x, to a latent representation, z, using a Gaussian distribution with mean, µ, and standard deviation, σ, which are functions of the input data, x. Hence, we have:

q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu, \sigma^2 I\big), \quad \mu = f_{\phi,\mu}(x), \quad \log \sigma = f_{\phi,\sigma}(x)

where f_{φ,µ} and f_{φ,σ} are the encoder network functions, and N(µ, σ²I) is the Gaussian distribution with mean µ and covariance matrix σ²I. The standard deviation is usually modeled through its logarithm to ensure positivity. The decoder network maps the latent representation, z, back to the original data space, x, using a similar Gaussian distribution with mean, µ′, and standard deviation, σ′:

p_\theta(x \mid z) = \mathcal{N}\big(x;\, \mu', \sigma'^2 I\big), \quad \mu' = g_{\theta,\mu}(z), \quad \log \sigma' = g_{\theta,\sigma}(z)

where g_{θ,µ} and g_{θ,σ} are the decoder network functions. During the training process, the VAE learns to maximize the evidence lower bound (ELBO), which is a lower bound on the log-likelihood of the data under the model [181]. The ELBO is computed by:

ELBO = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)

where D_KL is the Kullback-Leibler divergence [158] between the encoder distribution and the prior distribution, p(z), which is usually assumed to be a unit Gaussian. The first term in the ELBO is the reconstruction loss, which measures the similarity between the input data and the reconstructed data, while the second term is the regularization term, which encourages the latent representation to follow the prior distribution [64].
During the training process, the weights of the encoder and decoder networks are updated using backpropagation and stochastic gradient descent, where the gradient is backpropagated from the ELBO objective function to the network weights [50]. As VAEs learn the probability distribution of the input data, they can be utilized to generate synthetic samples that follow the distribution of the original data. To generate a data sample x ∼ p_θ(x|z), one can draw a random Gaussian sample z with the same dimension as the VAE's latent feature from the normal distribution, and feed z into the decoder network that implements p_θ(x|z) to generate a synthetic data sample at its output layer.
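Since both the encoder distribution and the prior are Gaussian, the KL term of the ELBO has a closed form, and sampling is done with the reparameterization trick so that gradients can flow through µ and log σ². A minimal NumPy sketch (the three-dimensional latent space and the example values are arbitrary choices):

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dims:
    0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), which keeps z differentiable
    with respect to the encoder outputs mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(3), np.zeros(3)     # encoder already matches the prior
kl_zero = kl_to_unit_gaussian(mu, log_var)            # exactly 0
kl_shift = kl_to_unit_gaussian(np.ones(3), log_var)   # 0.5 * 3 * (1 + 1 - 1) = 1.5
z = reparameterize(mu, log_var, rng)       # one latent sample for the decoder
```

The KL term vanishes exactly when the encoder posterior equals the unit-Gaussian prior and grows as µ or σ deviate from it, which is what regularizes the latent space during training.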
VAEs have been applied to a wide range of power systems applications, both as an unsupervised feature extraction model and as a supervised technique for the estimation of discrete labels (classification) or continuous variables (regression). As shown in Table 2, the VAE is applied to the transient stability assessment of the Central China Regional Power Grid, resulting in a precision, recall, and F-score of 85.64, 81.05, and 83.28, respectively [143]. These results show a significant improvement over the DBN method, since the VAE is better able to model the uncertainties in the data using its variational loss function, while the DBN is more data-hungry due to its Gibbs sampling-based training process. Additionally, the VAE is applied to a wide range of time series prediction tasks, such as the hourly prediction of electricity demand in residential units, wind energy, and photovoltaic power [182]. The VAE provides a slightly better demand prediction model than the DBN when applied to the Texas Urbanized Area Dataset for the hourly prediction of residential loads [145]. This model results in an RMSE and MAPE of 0.036 and 0.075, respectively, which are slightly lower than those of the DBN. In addition, the VAE's hourly wind power predictions on the Wind Integration Dataset show an RMSE and MAPE of 0.064 and 0.071, respectively [150]. The RMSE of the VAE in this application is 17.94% better and its MAPE 25.26% better than the DBN method, as it can provide a more reliable estimation of the probability densities of the underlying data. Similar improvements can also be seen in state estimation applications, where the VAE provides an RMSE of 0.074 and a MAPE of 0.093 on the US PG&E69 distribution network [11]. In terms of classification problems, the VAE outperforms the DBN as it is more data-efficient and robust to noise and uncertainties. For instance, the VAE yields an F-score of 83.39 on the IEEE 33-bus system, which is 5.72% higher than the DBN in fault identification [158]. Similar improvements can be seen in attack recognition problems, which are challenging classification tasks [141,170]. VAEs can also be employed for data augmentation and the synthesis of power networks. In this domain, they provide an RMSE and MAPE of 0.052 and 0.096, respectively, for the synthesis of the Columbia University Synthetic Power Grid [136]. Additionally, VAEs outperform DBNs in non-intrusive load monitoring and energy disaggregation tasks. For instance, on the Reference Energy Disaggregation Dataset, the VAE provides a 12.55% better load decomposition F-score compared with a similar technique using the DBN for feature extraction [177].

Generative Adversarial Network
Generative Adversarial Networks (GANs) are a class of generative neural networks that use two neural networks, a generator and a discriminator, to learn the underlying probability distribution of the data [133]. The GAN architecture consists of two neural networks: a generator network, G, and a discriminator network, D. The generator network takes a random noise vector, z, as input and produces a sample, x, from the target distribution, p(x). The discriminator network takes a sample, x, as input and produces a binary output, d, indicating whether the sample is real or fake. The objective of the generator network is to produce samples that are indistinguishable from real samples, while the objective of the discriminator network is to correctly distinguish between real and fake samples [161].
The training process of GANs can be formulated as a minimax game between the generator network and the discriminator network [160]. The generator network tries to minimize the difference between the distribution of its generated samples, G(z), and the distribution of real samples, p(x), while the discriminator network tries to maximize its ability to separate real samples from fake samples, D(G(z)). The objective function for the GAN can be written as:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]

where p_data(x) is the true data distribution and p_z(z) is the noise distribution. The first term in the objective function maximizes the log-likelihood of the discriminator network correctly classifying real samples, while the second term maximizes the log-likelihood of the discriminator network correctly rejecting fake samples [175]. During the training process, the generator network tries to generate samples that fool the discriminator network, while the discriminator network tries to correctly classify real and fake samples. The weights of the generator network and the discriminator network are updated using backpropagation and stochastic gradient descent, where the gradient is backpropagated from the discriminator network to the generator network.
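The minimax value above can be evaluated numerically. The sketch below checks the well-known equilibrium: when the generator matches the data distribution and the discriminator outputs D = 1/2 everywhere, V(D, G) = −log 4. The sample count is an arbitrary toy choice:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Empirical estimate of the GAN minimax objective:
    V = E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the theoretical optimum the discriminator cannot tell real from fake,
# so D(x) = D(G(z)) = 1/2 for every sample.
d_real = np.full(8, 0.5)
d_fake = np.full(8, 0.5)
v_star = gan_value(d_real, d_fake)  # log(1/2) + log(1/2) = -log 4
```

In actual training the discriminator ascends this value while the generator descends it, with each network's gradients computed through the shared discriminator outputs.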
The advantage of GANs is that they can generate realistic samples that capture the underlying distribution of the training data. However, the training of GANs can be difficult and unstable, and the quality of the generated samples may depend heavily on the architecture and hyperparameters of the network. In addition, GANs may suffer from mode collapse [155], where the generator network learns to generate only a few modes of the target distribution rather than capturing the entire distribution.
Table 2 shows the applications of GANs in power systems research and clear comparisons with VAEs. GANs offer several advantages over VAEs. Firstly, GANs produce more realistic and higher-quality synthetic samples than VAEs. GANs leverage a discriminator network that learns to distinguish between real and generated samples, leading to sharper and more convincing output. Secondly, GANs do not suffer from the blurry and noisy outputs commonly observed in VAEs. VAEs tend to average over the latent space, resulting in less distinct and fuzzy synthetic data samples. In contrast, GANs generate samples by directly modeling the data distribution, allowing for more diverse and detailed outputs. For instance, as shown in Table 2, the GAN shows a 2.89% better F-score than the VAE for the transient stability assessment of the Central China Regional Power Grid [144]. Additionally, the GAN improves the MAPE of demand forecasting on the Texas Urbanized Area Dataset by 12.0% compared to the VAE model [148]. Similar improvements can be seen in other time series forecasting tasks in the power systems area. For example, as shown in the table, the GAN provides a 14.63% lower MAPE than the VAE for the hourly prediction of solar energy using the Solar Integration National Dataset [154]. In state estimation of the US PG&E69 distribution power network, the GAN yields an RMSE of 0.065, which is 12.16% lower than that of the same model employing a VAE [12]. Furthermore, for fault identification, the GAN provides a 4.95% better F-score in comparison with the VAE on the IEEE 33-bus system [160,161]. Similar improvements can be seen in the cyberattack detection problem, a challenging classification task [170]. As discussed in this section, GANs are capable of generating realistic data points as they accurately learn data probability densities. Therefore, as shown in Table 2, these methods are applied to generate synthetic power networks. On the Columbia University Synthetic Power Grid dataset, synthesis by the GAN shows an RMSE of 0.033 and a MAPE of 0.071, which are significantly lower than those obtained by applying VAEs and DBNs to the same task [136]. Similar improvements are shown in the table for the energy disaggregation application, where the GAN improves on the VAE by 8.02% in F-score for the disaggregation of residential loads in the Reference Energy Disaggregation Dataset [179].

Strengths and Shortcomings of Probability-Based Deep Neural Architectures
Figure 2 highlights the benefits and limitations of deep generative modeling as applied to power system research. As the figure indicates, the DBN, GAN, and VAE are all proficient at managing measurement uncertainties while offering a robust unsupervised data representation. The DBN, in comparison with the GAN and VAE, possesses lower sample complexity, which reduces the amount of training data needed for effective feature extraction. However, the DBN's reliance on Gibbs sampling during training introduces considerable time complexity. Moreover, the DBN makes a strong independence assumption regarding its latent variables. By contrast, the GAN and VAE can learn data distributions directly without any prior assumptions, making them well suited for power system data synthesis. Due to its more complex architecture, the GAN demands a larger number of training examples than the DBN. Furthermore, the GAN's feature diversity is limited, and it does not guarantee parameter convergence. In comparison, while the VAE exhibits similar sample complexity to the GAN, it provides a more reliable estimation of the distribution. Yet the samples produced by the VAE are less sharp than those of the GAN, which makes the GAN the superior choice when sample realism is critical.

Deep Reinforcement Learning
Deep reinforcement learning (DRL) is a subfield of machine learning that deals with training agents to perform tasks in a given environment. In DRL, the agent learns to interact with the environment by taking actions and receiving rewards based on those actions. The goal is to learn a policy that maximizes the expected cumulative reward over time. The key components of a DRL system are the agent, the environment, and the reward function. The agent is typically implemented as a neural network that takes the state of the environment as input and outputs a probability distribution over possible actions. The environment is modeled as a Markov decision process (MDP), which consists of a set of states, actions, transition probabilities, and rewards. The reward function maps states and actions to scalar rewards [183]. The objective of the agent is to learn a policy that maximizes the expected cumulative reward over time. This can be formulated as the expected sum of discounted rewards:

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]

where θ represents the parameters of the agent's policy, τ represents a trajectory of states and actions, p_θ(τ) is the probability of generating a trajectory τ under the agent's policy, r_t is the reward at time step t, and γ is a discount factor that trades off immediate rewards against future rewards. The key challenge in DRL is to learn an effective policy that can generalize to unseen states and actions. One way to achieve this is to use a neural network to represent the policy [184]. The neural network takes the state of the environment as input and outputs a probability distribution over possible actions. The policy is trained using gradient descent to maximize the expected cumulative reward over time. The gradient of the expected cumulative reward with respect to the policy parameters can be computed using the policy gradient theorem:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G(t)\right]

where π_θ(a_t|s_t) is the probability of taking action a_t in state s_t under the policy parameterized by θ. The policy gradient can be estimated using Monte Carlo methods [66] by sampling trajectories from the current policy and computing the gradient of the expected cumulative reward with respect to the policy parameters. To estimate the policy gradient using Monte Carlo sampling, one needs to perform the following steps: (1) Sample multiple trajectories using the current policy π_θ(a|s). For each trajectory, compute the return

G(t) = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}

for each time step t using the observed rewards.
(2) Compute the policy gradient estimate by averaging the gradients of the log-policy weighted by the corresponding returns:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G^{(i)}(t)

where N is the number of sampled trajectories and the superscript i denotes the trajectory index. (3) Update the policy parameters θ using gradient ascent, written as θ ← θ + α ∇_θ J(θ), where α is the learning rate.
By repeatedly sampling trajectories, estimating the policy gradient, and updating the policy parameters, the agent can iteratively improve its policy toward maximizing the expected cumulative reward.
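Step (1) of the Monte Carlo estimator, computing the discounted returns G(t) in a single backward pass over a trajectory, can be sketched as follows; the reward sequence and discount factor are arbitrary toy values:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G(t) = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed by one
    backward sweep over a single trajectory's rewards."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# rewards [1, 0, 2] with gamma = 0.5:
# G(2) = 2, G(1) = 0 + 0.5 * 2 = 1, G(0) = 1 + 0.5 * 1 = 1.5
G = discounted_returns([1.0, 0.0, 2.0], 0.5)
```

In step (2) these returns weight the log-policy gradients of each visited state-action pair before averaging across the N sampled trajectories.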

Deep Q-Network (DQN)
Deep Q-Network (DQN) is a deep reinforcement learning algorithm that combines a deep neural network with Q-learning to learn an optimal policy [67,185]. The DQN algorithm has been successfully applied to various tasks, including playing Atari games and controlling robotic systems. The goal of the DQN algorithm is to learn a function Q(s, a) that estimates the expected return for taking action a in state s. The Q-function is updated iteratively using the Bellman equation:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where r is the reward received after taking action a in state s, s′ is the next state, α is the learning rate, and γ is the discount factor.
To deal with the curse of dimensionality and to enable the DQN algorithm to learn a function approximation of Q(s, a), a deep neural network is used to represent the Q-function. The network takes a state s as input and outputs Q-values for all possible actions. During training, the DQN algorithm uses experience replay to store the agent's experiences in a replay buffer, which is then used to sample batches of experiences to train the network [46]. The loss function for the DQN is the mean squared error between the Q-value estimated by the network and the target Q-value, which is computed using the Bellman equation:

L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q_{target}(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^2\right]

where θ represents the parameters of the network, Q_target(s′, a′; θ⁻) is the target Q-value computed using a separate target network with parameters θ⁻, and the expectation is taken over a batch of experiences.
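The DQN loss can be sketched directly from the Bellman target above; the Q-values, rewards, and discount factor below are arbitrary toy numbers:

```python
import numpy as np

def dqn_loss(q_pred, rewards, q_next_target, gamma, done):
    """Mean squared TD error over a batch. q_next_target holds the target
    network's Q-values at the next states; the max over actions gives the
    bootstrap term, which is suppressed on terminal transitions."""
    targets = rewards + gamma * (1.0 - done) * q_next_target.max(axis=1)
    return np.mean((q_pred - targets) ** 2)

q_pred = np.array([1.0, 2.0])                # Q(s, a; theta) for the taken actions
rewards = np.array([0.5, 1.0])
q_next = np.array([[0.0, 2.0], [1.0, 3.0]])  # target-network Q-values at s'
done = np.array([0.0, 1.0])                  # second transition is terminal
# targets: 0.5 + 0.9 * 2.0 = 2.3, and 1.0 (no bootstrap when done)
loss = dqn_loss(q_pred, rewards, q_next, 0.9, done)
```

In training, the gradient of this loss with respect to θ updates only the online Q-network; the target network's parameters θ⁻ are held fixed between periodic copies.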
To stabilize training, the DQN algorithm uses two key techniques: a target network [68] and ε-greedy exploration [69]. The target network is used to generate the target Q-values and is updated periodically by copying the parameters from the Q-network. The ε-greedy exploration strategy encourages the agent to explore the environment by selecting a random action with probability ε and the action with the highest Q-value with probability 1 − ε. Overall, the DQN algorithm is a powerful deep reinforcement learning technique that has been shown to achieve state-of-the-art performance on a wide range of power engineering tasks. Table 3 shows the applications of deep reinforcement learning techniques in power system research. As shown in this table, the DQN is a successful technique for the voltage control of the IEEE 123-bus system, with an average control reward of 153.46 [186,187]. This method also yields a normalized reward of 0.795 for the emergency control of the IEEE 39-bus system [188]. It has also been studied for transportation electrification management using the California Freeway Performance Measurement System dataset, showing a cost efficiency metric of 0.141 [189,190]. In demand-response scheme learning, the DQN provides an operational cost of $161.93 on the Steel Powder Manufacturing Dataset [191,192]. Moreover, the DQN provides a profit of £5.24 × 10^5 on the ISO New England Inc.
environment in electricity market problems [193]. This methodology has also been used to solve power scheduling problems, with an average income of $4268.17 using the Center for Renewable Energy Systems Technology Model [194,195]. Another major application of the DQN is the cyberattack detection and identification problem. In this area, the DQN results in a precision, recall, and F-score of 83.70, 79.08, and 81.32, respectively, when applied to the IEEE 39-bus system [74,196].

Double Deep Q-Network (DDQN)
The Double Deep Q-Network (DDQN) extends the DQN to mitigate its overestimation of Q-values [68]. Similar to DQN, the DDQN algorithm uses experience replay to store the agent's experiences in a replay buffer and uses a deep neural network to represent the Q-function. The Q-function is updated iteratively using the Bellman equation computed by:

Q(s, a; \theta) \leftarrow Q(s, a; \theta) + \alpha \left[ r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\big) - Q(s, a; \theta) \right]

where θ represents the parameters of the Q-network, θ⁻ represents the parameters of a separate target network, r is the reward received after taking action a in state s, s′ is the next state, α is the learning rate, and γ is the discount factor.
In the DDQN, the Q-values for selecting actions are computed using the Q-network, while the Q-values for evaluating the selected actions are computed using the target network. This decoupling of action selection and evaluation helps to reduce the overestimation of Q-values. The loss function for the DDQN is the mean squared error between the Q-value estimated by the Q-network and the target Q-value, which is computed using the target network as follows:

L(\theta) = \mathbb{E}\left[\left(r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\big) - Q(s, a; \theta)\right)^2\right]

where the expectation is taken over a batch of experiences.
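The difference between the two bootstrap targets can be made concrete with toy Q-values; the numbers below are chosen to exaggerate a case where the target network overestimates an action that the online network does not prefer:

```python
import numpy as np

def dqn_target(r, q_next_target, gamma):
    """Standard DQN: both action selection and evaluation use the target net."""
    return r + gamma * q_next_target.max()

def ddqn_target(r, q_next_online, q_next_target, gamma):
    """DDQN: select the action with the online network, evaluate it with the
    target network, which reduces the max-operator overestimation bias."""
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

q_online = np.array([2.0, 1.0])  # online network prefers action 0
q_target = np.array([0.5, 3.0])  # target network overestimates action 1
y_dqn = dqn_target(0.0, q_target, 0.9)               # bootstraps on 3.0
y_ddqn = ddqn_target(0.0, q_online, q_target, 0.9)   # bootstraps on 0.5
```

Here the plain DQN target inherits the target network's inflated estimate for action 1, while the DDQN target follows the online network's choice and bootstraps on the smaller value.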
Overall, the DDQN is a powerful deep reinforcement learning technique that has been shown to achieve state-of-the-art performance on a wide range of tasks while addressing the problem of Q-value overestimation. As shown in Table 3, the DDQN generally achieves better accuracies and objective values than the DQN method. For instance, as shown in the table, the DDQN provides a 5.39% better average control reward than the DQN for the voltage control of the IEEE 123-bus system [186,187]. Moreover, the DDQN shows an 8.68% higher normalized average reward than the DQN in the emergency control task of the IEEE 39-bus test case [201]. Similar results can be seen in the transportation electrification management tasks, where the DDQN results in a cost efficiency of 0.163 [186,189]. In demand-response problems, this method yields a 9.70% lower operational cost than the DQN on the Steel Powder Manufacturing model [191,192]. Moreover, in the ISO New England Inc. environment, the DDQN provides a 28.63% higher profit than the DQN for electricity market management [193,214]. Similar profit improvements are shown for the energy scheduling of the Center for Renewable Energy Systems Technology model [217,219]. Finally, in the cyberattack identification problem of the IEEE 39-bus system, this method shows a 4.29%, 15.35%, and 9.70% higher precision, recall, and F-score, respectively, compared with the DQN method [74], mainly because the DDQN handles the Q-value overestimation problem that exists in the DQN model.

Deep Deterministic Policy Gradient (DDPG)
Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm for continuous action spaces in reinforcement learning. It combines the ideas of DQN and deterministic policy gradient methods to learn a deterministic policy that maps states to actions.
The DDPG algorithm uses a deep neural network to represent both the actor (policy) and critic (action-value function) networks. The actor network takes the state as input and outputs the action to be taken. The critic network takes both the state and action as input and outputs the corresponding action-value function. The target Q-value for a state-action pair is defined as:

y_t = r_t + \gamma\, Q'\big(s_{t+1}, \mu'(s_{t+1}; \theta^{\mu'}); \theta^{Q'}\big),

where \mu' is the target actor network used to compute the target Q-value, and \theta^{Q'} and \theta^{\mu'} are the parameters of the target critic and target actor networks, respectively. The target Q-value is then used to update the critic network parameters \theta^{Q} via the Bellman equation:

L(\theta^{Q}) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - Q(s_i, a_i; \theta^{Q})\big)^2,

where L is the mean squared error between the predicted Q-value and the target Q-value.
To update the actor network, the deterministic policy gradient theorem is used. The objective is to maximize the expected return:

J(\theta^{\mu}) = \mathbb{E}_{s_t \sim \rho^{\beta}}\big[Q^{\mu}(s_t, \mu(s_t; \theta^{\mu}))\big],

where \rho^{\beta} is the state distribution induced by the behavior policy that fills the replay buffer and Q^{\mu}(s_t, a_t) is the Q-value estimated by the critic network for a given state-action pair. The gradient of the objective is approximated by:

\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{a} Q(s, a; \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s; \theta^{\mu})\big|_{s=s_i},

where N is the batch size and \nabla_a Q(s, a; \theta^{Q}) is the gradient of the Q-value with respect to the action. The actor network is updated by ascending along this gradient:

\theta^{\mu} \leftarrow \theta^{\mu} + \alpha_{\mu}\, \nabla_{\theta^{\mu}} J,

where \alpha_{\mu} is the learning rate for the actor network. DDPG has several advantages over deep Q-network algorithms. Firstly, DDPG is capable of handling continuous action spaces, whereas DQN is primarily designed for discrete action spaces. This makes DDPG well suited for real-world power engineering tasks such as continuous voltage or setpoint control. Secondly, DDPG is an off-policy algorithm, meaning it can learn from a batch of experiences collected independently of the current policy. This enables more efficient learning by reusing past experiences and reducing the sample complexity. Although DQN also replays past experiences, its reliance on maximizing over a discrete action set prevents it from exploiting this mechanism in continuous control problems. Lastly, DDPG utilizes an actor-critic architecture, allowing it to learn a deterministic policy (actor) and estimate the value function (critic) simultaneously. This improves the stability and convergence of the learning process compared to the separate target and behavior networks used in DQN.
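The actor update above can be illustrated with a one-dimensional toy problem in which the critic is known analytically (standing in for a learned critic network) and the policy is linear; all names, constants, and the choice of critic here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_a_q(s, a):
    """Analytic action-gradient of a toy critic Q(s, a) = -(a - 2s)^2,
    whose maximizing action is a* = 2s (stands in for a learned Q-network)."""
    return -2.0 * (a - 2.0 * s)

theta = 0.0      # actor parameter: deterministic policy mu(s) = theta * s
alpha_mu = 0.05  # actor learning rate

for _ in range(500):
    batch = rng.uniform(-1.0, 1.0, size=32)  # states sampled from a replay buffer
    actions = theta * batch                  # a = mu(s)
    # Deterministic policy gradient: grad_a Q * grad_theta mu, averaged over batch
    grad_theta = np.mean(grad_a_q(batch, actions) * batch)
    theta += alpha_mu * grad_theta           # gradient ASCENT on the expected return

print(round(theta, 3))  # converges toward 2.0, the optimal policy slope
```

Each update pushes the policy parameter in the direction that increases the critic's value at the actions the policy currently takes, which is exactly the chain rule expressed in the gradient formula above.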
As shown in Table 3, DDPG outperforms DQN and DDQN in various data-driven power engineering studies. For instance, the DDPG method provides a 2.58% higher average control reward compared to the DDQN for voltage control of the IEEE 123-bus system [67]. DDPG also shows a higher normalized reward in the emergency control problem of the IEEE 39-bus test case compared to the DDQN [203]: it provides a 6.48% better normalized reward value, mainly because DDPG can learn from a batch of experiences and exhibits better convergence in real-time applications. In addition, DDPG obtains a 15.95% higher cost efficiency in the transportation energy management problem [224] of the California Freeway Performance Measurement System compared to the DDQN method [68,205]. In the demand-response problem of the Steel Powder Manufacturing Dataset, the DDPG shows a 7.59% lower cost compared with the DDQN model [192,209]. A similar improvement can also be observed in the electricity market management case using the ISO New England Inc. dataset [214,215]. The average income obtained by the DDPG is 20.40% and 6.94% higher than DQN and DDQN, respectively, in the power scheduling problem solved for the Centre for Renewable Energy Systems Technology model [195]. DDPG also provides better classification accuracies compared with the DQN and DDQN, since DDPG shows better stability and convergence during training and produces a more reliable classification decision boundary. For instance, in cyberattack identification for the IEEE 39-bus test case, DDPG's classification precision, recall, and F-score are 7.76%, 18.08%, and 15.25% higher than DQN, respectively [222,223]. Additionally, DDPG has a 5.06% better F-score compared with the DDQN for the same classification problem.

Strengths and Shortcomings of Deep Reinforcement Learning
Figure 3 shows the strengths and limitations of deep reinforcement learning methodologies in the context of power system implementations. DQN and DDQN exhibit reliable and resilient training dynamics, making them highly applicable to datasets characterized by elevated uncertainty. Nevertheless, these techniques do not guarantee convergence of their parameters. By reducing the overestimation bias of state-action values, Double DQN enables better decision-making and more accurate value estimation, leading to enhanced performance and stability in reinforcement learning tasks. Empirical studies such as [68,225] have shown that Double DQN provides more stable and reliable learning compared with the original DQN algorithm. It exhibits improved convergence properties and can learn more efficiently, especially in domains with high-dimensional state spaces or complex action spaces.
DQN and DDQN primarily concentrate on refining deterministic strategies within discrete action spaces. This implies that, relative to the DDPG, they may not be well suited for practical situations that require continuous actions. The DDPG method has significant benefits for DRL. First, DDPG integrates the advantages of actor-critic methods and deep neural networks, making learning in high-dimensional action spaces efficient and scalable. Second, DDPG utilizes off-policy learning, which enables the agent to learn from a replay buffer of previous experiences, thereby enhancing sample efficiency and overall stability. In addition, DDPG includes target networks that are periodically updated to provide more consistent and reliable value estimations. Although DDPG can be susceptible to parameter instability during its training phase, in practice it converges quickly and reliably to a locally optimal deterministic policy.

Future Directions of Research
Over the past few years, several new and emerging topics have garnered considerable attention and sparked innovative research in the area of deep learning for power engineering. In this section, we discuss several emerging methodologies that can be incorporated into state-of-the-art, data-driven methods for power systems to improve the accuracy of existing approaches.

Attention Mechanism
Attention models have emerged as a groundbreaking topic within the realm of deep learning, revolutionizing the way models process and focus on relevant information. Inspired by the human cognitive process, attention mechanisms enable deep learning models to selectively attend to different parts of input data, assigning varying degrees of importance to each element. This attention-based approach allows models to efficiently capture and leverage crucial features, leading to improved performance across a wide range of tasks. At the heart of attention models lies the attention mechanism, which computes attention weights for different elements of the input based on their relevance to the task at hand. By assigning larger weights to important elements and lower weights to less relevant elements, attention models can dynamically adapt their focus and processing on a per-element basis. This dynamic allocation of attention enables models to effectively handle complex relationships, dependencies, and long-range interactions within the data, thus enhancing their ability to understand and generate meaningful representations. Moreover, attention models have been widely adopted in various deep learning architectures, including transformer models [226], which have achieved remarkable success in natural language processing tasks. Transformer models leverage self-attention mechanisms to attend to different positions in the input sequence, capturing contextual information and facilitating parallel processing. Beyond text, attention models have also found applications in computer vision [227], speech recognition [228], and reinforcement learning [229], among other areas. The introduction of attention models has significantly advanced the field of deep learning, enabling models to focus on relevant information and achieve state-of-the-art performance on numerous challenging tasks. Ongoing research into attention mechanisms continues to develop novel architectures and techniques that expand the potential of these models and drive innovation in the field of deep learning.
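As a concrete sketch, the scaled dot-product attention at the core of transformer models can be written in a few lines of NumPy; the toy query, key, and value matrices below are illustrative (one could think of the three rows as embeddings of three time steps of a load profile):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights; rows sum to 1
    return weights @ V, weights

# Three input elements and a query resembling the first element
K = V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Q = np.array([[1.0, 0.0]])

out, w = scaled_dot_product_attention(Q, K, V)
print(w.round(3))  # the first and third keys receive the largest weights
```

The attention weights are exactly the "varying degrees of importance" described above: elements whose keys align with the query dominate the weighted sum that forms the output.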

Transfer Learning and Domain Adaptation
Domain adaptation has emerged as a new topic in deep learning, addressing the challenges posed by the domain shift problem. The domain shift problem arises when a model trained on a specific set of data fails to generalize well to data from a different, but related, domain. In such cases, domain adaptation models aim to leverage the knowledge acquired from the source domain to adapt to the target domain, thereby mitigating the effects of domain shift. The core idea behind domain adaptation models is to learn a shared representation space between the source and target domains in which the domain-specific differences are minimized and the shared characteristics are maximized. This is achieved by optimizing a loss function that simultaneously maximizes the task performance on the source domain while minimizing the discrepancy between the source and target domains. Various techniques have been developed to address the domain adaptation problem, including adversarial training [133], where a domain discriminator is trained to distinguish between the source and target domains, while the model is trained to fool the discriminator by producing domain-invariant features. Another approach is based on discrepancy minimization [230], in which the model is trained to minimize the discrepancy between the distributions of the source and target domains in the feature space. Furthermore, recent research has explored the use of meta-learning and self-supervised learning techniques [231] for domain adaptation, where the model learns to adapt to new domains by leveraging knowledge acquired from previous adaptation tasks or by learning to predict transformations between the domains. Domain adaptation has been applied to various domains, including computer vision [232], natural language processing [233], and speech recognition [234], among others. The ongoing development of domain adaptation models continues to produce novel techniques and architectures that advance the field of deep learning by producing models that better generalize across diverse domains and achieve improved performance on challenging tasks.
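One minimal sketch of the discrepancy-minimization idea is CORAL-style second-order alignment, which recolors source features so that their mean and covariance match the target domain; the synthetic source and target samples below merely stand in for measurements from two related domains:

```python
import numpy as np

def coral_align(source, target, eps=1e-6):
    """CORAL-style alignment: recolor source features to match the target's
    first- and second-order statistics (mean and covariance)."""
    src = source - source.mean(axis=0)  # center both domains
    tgt = target - target.mean(axis=0)
    c_s = np.cov(src, rowvar=False) + eps * np.eye(src.shape[1])  # regularized covariances
    c_t = np.cov(tgt, rowvar=False) + eps * np.eye(tgt.shape[1])

    def mat_pow(c, p):
        # Matrix power of a symmetric positive-definite matrix via eigendecomposition
        vals, vecs = np.linalg.eigh(c)
        return vecs @ np.diag(vals ** p) @ vecs.T

    # Whiten the source, then re-color with the target covariance and shift to the target mean
    aligned = src @ mat_pow(c_s, -0.5) @ mat_pow(c_t, 0.5)
    return aligned + target.mean(axis=0)

rng = np.random.default_rng(1)
source = rng.normal(0.0, 1.0, (500, 2))  # e.g., features from one operating regime
target = rng.normal(3.0, 0.5, (500, 2))  # shifted/rescaled features from another

aligned = coral_align(source, target)
print(np.allclose(aligned.mean(axis=0), target.mean(axis=0), atol=1e-6))
```

After the transformation, a model trained on the aligned source features sees inputs whose low-order statistics match the target domain, which is precisely the "minimized discrepancy in feature space" described above.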

Interpretable Feature Learning
Interpretable models have emerged as a significant new topic in deep learning, addressing the inherent black-box nature of complex neural networks. As deep learning models become increasingly sophisticated, there is a growing need to understand the decision-making processes and inner workings of these models. Interpretable models aim to bridge this gap by providing explanations and insights into the factors influencing the model's predictions. One approach to interpretability is feature attribution [235], which assigns importance or relevance scores to input features that indicate their contribution to the model's output. Techniques such as gradient-based methods [236], saliency maps [237], and attention mechanisms [229] help highlight the regions or features that the model focuses on during its decision-making process. Another approach is to learn disentangled representations [238], where the model learns to separate factors of variation in the data, making it easier to understand the underlying relationships. In addition, rule-based models and decision trees [239] provide interpretability by explicitly capturing decision rules and conditions. These models can be trained to mimic the behavior of complex deep learning models, providing a more interpretable alternative. Furthermore, post-hoc interpretability methods aim to interpret black-box models by analyzing their behavior after they have been trained. Techniques such as LIME (Local Interpretable Model-agnostic Explanations) [240] and SHAP (Shapley Additive Explanations) [241] provide explanations at the instance level, highlighting the features that contribute the most to individual predictions. Interpretable models have significant implications, including ethical considerations [242], transparency [243], and trust in AI systems, especially in high-stakes applications such as healthcare [244] and finance [245]. Ongoing advancements in interpretable feature learning continue to produce more reliable and higher-performing models that not only provide accurate inferences but also offer understandable and explainable insights into how those inferences were made, thus fostering greater transparency in the decision-making process across various domains.
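Gradient-based attribution can be sketched with a toy logistic model, where the saliency of each input feature is the gradient of the model's output with respect to that feature; the weights below are hypothetical and only serve to illustrate the idea (for instance, a fault-detection model dominated by one feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency(w, x):
    """Gradient of a logistic model's output w.r.t. its input.
    For p = sigmoid(w . x), the gradient is dp/dx = p * (1 - p) * w."""
    p = sigmoid(w @ x)
    return p * (1.0 - p) * w

# Hypothetical model in which the third feature dominates the decision
w = np.array([0.1, -0.2, 2.0])
x = np.array([1.0, 1.0, 1.0])

scores = np.abs(saliency(w, x))
print(np.argmax(scores))  # feature index 2 receives the largest attribution
```

For deep networks, the same gradient is obtained by backpropagation through the whole model rather than from a closed-form derivative, but the interpretation of the resulting scores is identical.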

Physics-Guided Machine Learning
Physics-guided models have recently emerged as a compelling new topic in deep learning, combining the power of data-driven approaches with the foundational principles of physics. These models aim to incorporate prior knowledge and physical laws into the learning process, enabling the development of more accurate and reliable models, particularly in scenarios with limited or noisy data. By integrating physics-based constraints, such as conservation laws, symmetries, or known relationships, deep learning models can capture the underlying structure of the data and make predictions that align with the fundamental principles of the domain [246]. Physics-guided models offer several advantages, including improved generalization, better extrapolation capabilities, and the ability to handle data scarcity. They can also provide interpretability by explicitly incorporating known physics principles [247] into the model architecture, allowing for insights into the decision-making process. Various approaches have been developed to incorporate physics into deep learning models, including physics-informed neural networks [248], where the model is trained to satisfy the physical equations and constraints during the learning process. Another approach involves coupling traditional physics-based models with deep learning architectures, leveraging the strengths of both approaches to achieve more accurate and efficient predictions. Physics-guided models find applications in a wide range of fields, including fluid dynamics [249], materials science [250], medical imaging [251], and climate modeling [252], among others. Ongoing research in this area focuses on developing new techniques for effectively integrating physics knowledge into deep learning frameworks, enabling the development of robust and interpretable models that align with the laws of nature. By combining the predictive power of deep learning with the foundational understanding of physics, physics-guided models pave the way for advancements in scientific discovery, engineering, and decision-making processes across various domains.
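A minimal sketch of a physics-informed loss combines a data-fitting term with a physics-residual term; here the toy system du/dt = -k u and the parametric model u(t) = a exp(b t) are illustrative stand-ins for a physics-informed neural network and its governing equations:

```python
import numpy as np

def physics_informed_loss(a, b, t_data, u_data, t_phys, k=1.0):
    """PINN-style composite loss for the ODE du/dt = -k*u with model u(t) = a*exp(b*t).
    Data term: fit the observations. Physics term: penalize the ODE residual
    at collocation points (the derivative is analytic for this model)."""
    u_model = a * np.exp(b * t_data)
    data_loss = np.mean((u_model - u_data) ** 2)
    u_phys = a * np.exp(b * t_phys)
    residual = b * u_phys + k * u_phys  # du/dt + k*u, zero for the true solution
    physics_loss = np.mean(residual ** 2)
    return data_loss + physics_loss

t_data = np.linspace(0.0, 1.0, 5)
u_data = np.exp(-t_data)            # noiseless observations of the true solution
t_phys = np.linspace(0.0, 2.0, 20)  # collocation points beyond the data range

# The physics-consistent parameters (a=1, b=-1) give zero loss; an inconsistent
# guess is penalized by both terms.
print(physics_informed_loss(1.0, -1.0, t_data, u_data, t_phys))
print(physics_informed_loss(1.0, -0.5, t_data, u_data, t_phys) > 0.0)
```

The physics term constrains the model even where no data exist (here, for t > 1), which is the mechanism behind the improved extrapolation and data-scarcity handling mentioned above.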

Conclusions
With consistent increases in the time and storage complexity of problems associated with power system applications, the demand for sophisticated statistical pattern identification techniques has led to the adoption of deep learning-based approaches. This recently developed category of techniques can be classified primarily into discriminative, generative, and reinforcement learning techniques. This paper analyzes deep discriminative algorithms, which offer a precise approach for mapping the complicated inputs of power system problems to accurate solutions of supervised problems. Due to their excellent generalization ability, these models are extensively used for stability evaluation, fault detection, and wind and solar power generation forecasting. Deep generative methods, which offer a probabilistic estimate of the underlying data densities, are then discussed. These models can learn complicated probabilistic patterns for a broad range of electrical engineering purposes, such as state estimation, renewable scenario generation, and power grid synthesis. The article concludes with a discussion of deep reinforcement learning algorithms that attempt to optimize an objective using observed rewards captured from the problem's environment. The empirical and mathematical analysis of the adopted methods inspires future studies in the field of deep learning to expand the potential uses of this powerful class of frameworks in new areas of power engineering.

Figure 1. Strengths and Shortcomings of Discriminative Deep Architectures.

Figure 2. Strengths and Shortcomings of Deep Generative Models.

4.2. Double DQN
Double Deep Q-Network (DDQN) is an extension of the Deep Q-Network (DQN) algorithm that addresses the problem of overestimation of Q-values. The DDQN algorithm uses two separate deep neural networks to decouple the selection of actions from the evaluation of their Q-values, resulting in more accurate estimates of the Q-function.

Figure 3. Strengths and Shortcomings of Deep Reinforcement Learning Models.

Table 1. Discriminative deep neural networks in power system studies.

Table 2. Generative deep neural networks in power system studies.

Table 3. Deep reinforcement learning neural networks in power system studies.