Advances in the Application of Machine Learning Techniques for Power System Analytics: A Survey †

Abstract: Recent advances in computing technologies and the increasing availability of large amounts of data in smart grids and smart cities are generating new research opportunities for applying Machine Learning (ML) to improve the observability and efficiency of modern power grids. However, as the number and diversity of ML techniques increase, questions arise about their performance, their applicability, and the most suitable ML method for each specific application. To answer these questions, this manuscript presents a systematic review of state-of-the-art studies implementing ML techniques in the context of power systems, with a specific focus on the analysis of power flows, power quality, photovoltaic systems, intelligent transportation, and load forecasting. For each of the selected topics, the survey investigates the most recent and promising ML techniques proposed in the literature, highlighting their main characteristics and relevant results. The review reveals that, compared to traditional approaches, ML algorithms can handle massive quantities of high-dimensional data, allowing the identification of hidden characteristics of even complex systems. In particular, although very different techniques can be used for each application, hybrid models generally show better performance than single ML-based models.


Introduction
Power system management and development have been constantly changing due to the expanding complexity of distributed modern power networks [1]. In particular, the increasing penetration of Renewable Energy Sources (RESs) with intermittent energy generation, together with technological novelties in power system management and control, demands reliable power predictions and more precise monitoring models [2,3]. In recent years, researchers have developed advanced solutions based on Machine Learning (ML) algorithms to overcome the bottlenecks of conventional lumped-parameter simulations. In practice, traditional simulation techniques based on deterministic methods still dominate in power grids. However, the high performance of machine learning solutions in terms of accuracy, computational speed, and scalability brings novelties to power grid management and control. These techniques are therefore expected to be increasingly adopted for short- to medium-term forecasts of power grid operation, closing this gap while retaining the advantages of traditional approaches.
Deep learning, based on neural networks, is the most famous subset of machine learning. Thanks to the use of (typically) multi-layered Artificial Neural Networks (ANNs), deep learning can handle unstructured datasets and recognize complex patterns in input data. In deep learning, different architectures can be designed by arranging neural unit cells in various layers, whereas most other machine learning algorithms have a fixed structure. Figure 1 illustrates the concepts of artificial intelligence, machine learning, and deep learning schematically by means of a Venn diagram.
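As a minimal, purely illustrative sketch (not drawn from the surveyed works), the forward pass of a small multi-layer ANN can be written in a few lines of Python; the weights and biases below are arbitrary example values, since in practice they would be learned from data:

```python
import math

def neuron(inputs, weights, bias):
    """Single artificial neuron: weighted sum of the inputs plus a bias,
    followed by a sigmoid activation (the basic unit cell of an ANN)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

def forward(x, layers):
    """Forward pass through a small multi-layer network.
    `layers` is a list of (weights, biases) pairs, one per layer."""
    for weights, biases in layers:
        x = [neuron(x, w, b) for w, b in zip(weights, biases)]
    return x

# Two inputs -> hidden layer of 2 neurons -> 1 output neuron.
# All weight/bias values here are arbitrary illustrative numbers.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, 0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                  # output layer
]
y = forward([1.0, 0.5], layers)
print(y)  # a single activation in (0, 1)
```

Training such a network consists of adjusting the weights and biases so that the outputs match known targets, which is exactly the learning process discussed in the next section.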

Machine Learning Paradigms
In machine learning, training a model means learning the values of the parameters (or weights) and the bias from input data, whereas in traditional methods (i.e., with predefined algorithms) both the model and its parameters are given to a computer to perform a task. Labeled data are samples with a meaningful "tag", "label", or "class" that is informative or desirable to know, for example, whether an Alternating Current (AC) power signal contains harmonic distortion(s) or not. In contrast, unlabeled data are samples with no such explanation; in other words, they are raw data without any "tag" or "label" assigned, for instance, the voltage and current signals of an electric motor.
Machine learning tasks are principally arranged into three main classes: supervised, unsupervised, and semi-supervised learning.

Supervised learning algorithms work with labeled data, with the objective of mapping new input data to known target output values. Supervised learning is categorized into classification and regression problems: a classification problem predicts output variables as a category, such as "cat" or "dog", whereas in regression problems the output variables are numerical values [12].

Unsupervised learning models, on the contrary, process an unlabeled dataset, in which target values are unknown, to draw insights by autonomously learning hidden, complicated patterns and structures. Unsupervised learning algorithms are generally divided into clustering and dimensionality reduction (sometimes called embedding) methods [11]. For instance, in anomaly detection, a clustering algorithm can be applied to identify false data by scanning a dataset for outliers or noticing abnormal patterns.

Semi-supervised algorithms deal with datasets in which some samples are labeled and a much larger number are unlabeled; these algorithms are designed to combine the advantages of supervised and unsupervised methods [11]. Semi-supervised learning uses the mixture of labeled and unlabeled data as the training dataset, and semi-supervised models act as active learners [13]. There are two main semi-supervised learning paradigms, namely reinforcement learning and Generative Adversarial Networks (GANs). In reinforcement learning, a model receives a reward when it performs a task correctly; the objective is to build a model that maximizes rewards through an iterative process [14]. Reinforcement learning is suitable for interactive or dynamic environments, in which a model can improve itself based on policies defined by an expert, for example, playing a game or self-driving cars.
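As a purely illustrative sketch of the reward-maximization loop described above, the following tabular Q-learning example uses a hypothetical toy "chain" environment (states 0 to 4, with a reward only for reaching the last state); the environment, the learning rate, the discount factor, and the exploration rate are all arbitrary choices for the demonstration:

```python
import random

random.seed(0)

# Hypothetical toy environment: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields a reward of 1; every other step yields 0.
N_STATES, GOAL = 5, 4

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

# Tabular Q-learning: iteratively move Q(s, a) toward
# reward + discounted best future value.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.2   # learning rate, discount, exploration
for _ in range(200):                # 200 training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = random.randrange(2) if random.random() < eps \
            else max((0, 1), key=lambda a: Q[s][a])
        nxt, r, done = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[nxt]) - Q[s][a])
        s = nxt

# The greedy policy typically learns to move right from every state.
policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(policy)
```

The same update rule underlies the Q-learning-based load forecasting and transportation applications surveyed later, only with far larger state and action spaces.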
GANs build generative models based on deep neural networks to discover and learn the patterns of input data. The generative model can then be used to create new data examples that resemble the training dataset. For instance, GANs can create pictures that look like images of human faces, even though the faces do not correspond to any actual person.

Machine Learning Algorithms
Many different machine learning techniques have been proposed in recent years, notably including hybrid ML-based models that make use of two or more machine learning techniques, or even of other statistical or mathematical models. For example, ensemble learning models combine different weak learners such as decision trees, support vector machines, and linear or logistic regression. This section discusses the basic and most relevant machine learning techniques in each category.

Classification Algorithms
There are several classification algorithms; the most commonly used ones are presented as follows [15]:

Logistic Regression (LR): LR is widely used for binary classification tasks, where an output belongs to one of two classes (0 or 1). In this algorithm, a threshold on a logistic function (usually a sigmoid curve) determines into which class an example is labeled. The hypothesis estimates the likelihood of the events generating the data and fits them to the logistic function, which forms an S-shaped curve called the sigmoid; the fitted function is then used to predict the class of new inputs. Even though logistic regression performs best in binary classification tasks, it can also be used in multiclass classification problems by applying the one-versus-all strategy [16].

K-Nearest Neighbors (KNN): this algorithm is one of the most basic yet broadly used classifiers. It is generally used to find data with similar characteristics and group them in the same class, without making any assumptions on the data distribution. The groups are constructed by considering the attributes of the neighboring samples. It is applied to real-life problems in several domains, such as data mining, pattern recognition, and intrusion detection [17,18].

Naïve Bayes (NB): this technique is one of the most powerful classification algorithms; it is based on Bayes' theorem and assumes that each feature is independent when capturing input-output relationships. Bayes' theorem relates the probability of an event to what has already been observed, for example, the probability of having a fire (event A) given that the weather is hot (event B, which is present) [19]. The naïve Bayes algorithm is simple to implement and can easily predict the labels of new inputs. Additionally, when domain knowledge confirms the feature-independence assumption, it performs better with less data than other classification algorithms such as logistic regression. On the other hand, real-life data rarely have entirely independent features; moreover, when an input contains a feature value never observed in the training phase, the algorithm assigns it zero probability and cannot classify the input into any group. This technique is used in various applications such as text classification and spam filtering [20].

Support Vector Machine (SVM): this algorithm is widely used in classification tasks and is also applied to regression problems. The main idea of SVM is to map the data to a higher n-dimensional space in order to find an ideal hyperplane that separates the classes [21]; in simple words, the support vectors define the coordinates of this new n-dimensional coordinate system. This method is commonly used in binary classification, but it is computationally expensive and slow in the big data domain.

Decision Tree (DT): this algorithm is based on a hierarchy of steps that lead to certain decisions. It applies a tree-like structure to represent decision paths, with induction and pruning steps: in the induction step, the tree structure is built, while, in the pruning step, the complexity of the tree is reduced. Inputs are mapped to outputs by traversing a path through the different branches of the tree [22]. DT is a powerful classification tool, simple to structure and with good performance. However, even small variations in the data can make a DT unstable. Furthermore, it can easily become overfitted, especially in a thorny tree with many branches and conditions, and thus may not generalize well to new inputs. Regularization, bagging, and boosting techniques are usually used to avoid overfitting problems in DTs [23].

Random Forest (RF): this classifier is very similar to the decision tree but, compared to a DT, an RF uses several decision trees instead of only one. This technique can be applied to massive datasets to classify data or to measure the importance of each feature in the final decision. In many applications, the random forest is preferred over the decision tree because it can be more accurate and overcomes the overfitting issues of DTs. However, this technique is not easy to implement, since it has a complex structure, and it is not recommended for real-time prediction purposes because it is generally slower than other models [24].
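As a minimal sketch of how one of these classifiers operates, the following pure-Python K-Nearest Neighbors implementation classifies a query point by majority vote among its k nearest training samples; the two-cluster toy dataset is hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    samples, using Euclidean distance and no assumption on the data
    distribution. `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy dataset: two well-separated clusters of 2-D points.
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

print(knn_predict(train, (1.1, 1.0)))  # -> A
print(knn_predict(train, (4.1, 4.0)))  # -> B
```

The absence of a training phase is visible here: all computation happens at prediction time, which is one reason KNN scales poorly to very large datasets.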

Regression Algorithms
Several regression algorithms (for numerical or continuous value prediction) have been introduced in the scientific literature; the most commonly used ones are presented in the following:

Linear Regression (LR): this technique tries to find the straight hyperplane that best fits the data [25]. It is commonly used when there are linear relationships between variables, and overfitting can be avoided through regularization techniques such as LASSO, Elastic-Net, and Ridge [26]. However, it is not flexible enough to find good solutions for non-linear relationships between variables or for complex patterns.

Regression Tree (RT): this technique has the same hierarchical structure as the decision tree, but it takes numerical values as input. The branching procedure not only maximizes the learning gain but also captures non-linear relationships between variables. Even if this method is robust to outliers and easy to implement, it is prone to overfitting [27]. In addition to the regression tree, random forests and Gradient Boosted Trees (GBM), the most commonly used ensemble methods, are also applied to numerical predictions and behave better with respect to overfitting.

Deep Neural Network (DNN): the deep neural network, or multi-layer neural network, is widely used in several domains. Indeed, thanks to their ability to capture complex patterns, DNNs can be used both as regression algorithms and as classifiers. The non-linear relationships between features are learned through non-linear activation functions and hidden layers between the input and the output [28]. There are several techniques and methods to improve the performance of neural networks, as well as advanced neural-network-based models such as the Convolutional Neural Network (CNN) or the Recurrent Neural Network (RNN) [29,30]. Differently from other algorithms, developing a working DNN model requires deep knowledge of how to tune the parameters of the network. In addition, even though neural network models work well in the big data domain, they are usually computationally very expensive.

Extreme Learning Machine (ELM): ELM has a wide range of applications in data-driven approaches, having been used in regression, classification, clustering, sparse representation, feature extraction or learning, and compression. This feedforward neural network does not apply the backpropagation gradient-based mechanism to update the network weights; instead, it assigns random values to the weight and bias terms of the network [31]. The main advantages of this kind of algorithm are (i) a faster training phase and (ii) better interpolation results. On the contrary, the accuracy of ELM is not promising, even when compared to basic MLP models.

Support Vector Machine (SVM): SVM, or kernel SVM, can also be used for regression problems, even though it is mostly applied to classification problems.

XGBoost: finally, XGBoost is a recently introduced and widely used, rugged decision-tree- and ensemble-based algorithm, with a framework designed around a gradient boosting procedure [32].
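To make the simplest of these techniques concrete, the following sketch fits a one-variable linear regression using the closed-form least-squares solution; the toy data are hypothetical and noise-free, so the exact coefficients are recovered:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b, the simplest
    regression model discussed above, via the closed-form solution."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical noise-free toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)  # -> 2.0 1.0
```

With noisy or non-linear data the fitted line would only approximate the trend, which is exactly the limitation of linear regression noted above.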

Clustering Algorithms
Clustering techniques try to group instances with the same properties into the same cluster. These techniques are commonly used in fields other than machine learning, such as image analysis, pattern recognition, data compression, and statistical analysis. The most well-known algorithms are as follows:

K-means: this technique, one of the simplest and most intuitive machine learning algorithms, separates the instances into k clusters of equal variance around k centroids. After the number of clusters (k) has been selected, the algorithm finds the best k clusters by minimizing a criterion known as inertia, through an iterative procedure that changes the positions of the centroids [33]. As it is simple to interpret and scales well to big data, it has been applied across a wide range of applications in various domains.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), widely used in data mining and machine learning, finds core instances of high density and extends clusters within a specified radius (usually in terms of Euclidean distance) around them; low-density regions are marked as outliers. The main difficulties in DBSCAN are selecting the clustering attributes, detecting noise when densities vary, and handling significant differences in the number of boundary objects on opposite sides of the corresponding clusters [34]. The smallest number of instances that constitutes a dense region, and how close instances in the same region should be to each other, are defined by an expert. Even though this very popular clustering technique is widely used, it behaves badly with very sparse or high-dimensional datasets.

Spectral: this clustering algorithm, which is also an exploratory data analysis technique, performs dimensionality reduction through the eigenvalues (spectrum) of the similarity matrix of the data instances, and then groups similar instances, with reduced dimensions, into the same cluster [35]. This approach is typically applied when the centers of the clusters and their spreads do not appropriately describe the whole cluster (non-convex clusters), such as in image segmentation problems. The spectral technique is widely used because it responds quickly and outperforms other clustering techniques, especially on sparse datasets.
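A minimal sketch of the K-means procedure described above (an assignment step followed by a centroid-update step, repeated until convergence) might look as follows; the two groups of toy points are hypothetical:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points.
    Iterating these two steps decreases the inertia criterion."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.9, 1.0), (1.1, 1.2), (1.0, 0.8),
          (5.0, 5.1), (4.9, 4.8), (5.2, 5.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

Note that the result depends on the initial centroids; production implementations run several random restarts and keep the solution with the lowest inertia.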

Embedding Algorithms
In many cases, especially in the big data domain, the presence of a large number of variables or features in a dataset makes it difficult to interpret the relationships between them. Training a model on the whole dataset could easily make the model insufficiently generalized to new unseen data (the overfitting problem). Embedding Algorithms (EAs) can be applied to extract new features from the data, without losing essential information, before implementing sophisticated ML models. EA techniques can also be used directly for prediction purposes. Embedding algorithms can be subdivided as follows:

Principal Component Analysis (PCA): the main aim of PCA is to reduce a high-dimensional dataset to a smaller dimension. PCA projects each data instance onto the main (ranked) components while retaining as much of the data variation as possible. PCA techniques, such as the Singular Value Decomposition (SVD), use the eigenvectors of the covariance matrix of the data to reduce the dimension of the dataset or to make predictions [36].

Autoencoder: this is one of the current state-of-the-art techniques leveraging neural networks, widely used in different applications such as data compression. The autoencoder learns a representation (encoding) of the input dataset, ignoring noise through its embedding architecture, and reconstructs the input data as close as possible to its actual form (decoding). A typical autoencoder consists of three parts, namely an encoder, a bottleneck, and a decoder [37]. The encoder tries to compress the data to a lower-dimensional, maximally representative form, the decoder attempts to regenerate the input by eliminating the noise in the dataset, and the embedded data are stored in the bottleneck. It is possible to use the encoder part of a well-trained autoencoder for dimensionality reduction, or the whole model, for example, in anomaly detection [38].

Figure 2 summarizes the different machine learning paradigms and techniques used in power system analytics, providing examples for each category.
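As an illustrative sketch of PCA-based dimensionality reduction (using NumPy, and a hypothetical toy dataset whose points vary mostly along one direction, so a single principal component captures almost all the variance):

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA: center the data, take the top eigenvectors of the
    covariance matrix, and project the data onto them."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending order
    components = eigvecs[:, ::-1][:, :n_components]  # top components first
    return Xc @ components, components

# Hypothetical toy data: 2-D points that vary mostly along y = x.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, t + 0.05 * rng.normal(size=100)])
Z, comps = pca(X, n_components=1)
print(Z.shape)  # -> (100, 1)
```

Here the single retained component plays the role of the "bottleneck" in an autoencoder: the 2-D input is compressed to 1-D while preserving nearly all of the information.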

Model Performance Evaluation Metrics
The metrics used by the different machine learning algorithms differ from each other. Table 1 presents the most commonly used metrics for the discrete and continuous cases. In this table, True Positive (TP) and True Negative (TN) are samples that are correctly predicted as positive and negative, respectively; in contrast, False Positive (FP) and False Negative (FN) are samples that are incorrectly predicted as positive and negative, respectively. In the continuous metrics, y is the actual value, ŷ is the forecasted value, and n is the number of prediction samples.

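The discrete and continuous metrics of Table 1 can be computed directly from their definitions; the following sketch uses hypothetical confusion counts and forecast values purely for illustration:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from the confusion
    counts (TP, TN, FP, FN) defined in Table 1."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def regression_metrics(y, y_hat):
    """MAE and RMSE between actual values y and forecasts y_hat."""
    n = len(y)
    mae = sum(abs(a - p) for a, p in zip(y, y_hat)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / n)
    return mae, rmse

# Hypothetical confusion counts and forecast values.
acc, prec, rec, f1 = classification_metrics(tp=90, tn=5, fp=3, fn=2)
print(round(acc, 3), round(f1, 3))    # -> 0.95 0.973
mae, rmse = regression_metrics([3.0, 5.0, 2.0], [2.5, 5.5, 2.0])
print(round(mae, 3), round(rmse, 3))  # -> 0.333 0.408
```

Note how close accuracy and F1-score are here only because the positives dominate; with heavily imbalanced classes the two can diverge sharply, which is the point made later in the Discussion.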

Literature Review
Machine learning is widely applied to address various problems, either bringing novel solutions or improving the performance of existing applications. The main state-of-the-art machine-learning-based applications in power systems concern power flow, power quality, photovoltaic systems, intelligent transportation, and load forecasting.

Power Flow Applications
Compared to traditional algorithms, machine learning technologies make power flow problems easier to handle. For example, algorithms such as CNN, KNN, SVM, reinforcement learning, and decision trees have improved power flow optimization in terms of accuracy, computational speed, and response time. Table 2 elaborates in more detail on the recent advancements in machine learning applications in power flow.

Power Quality Applications
Power quality, one of the most critical topics in electrical systems, has also benefited from machine learning, which can be used to improve the speed and accuracy of disturbance detection, distortion classification, and estimation of future cycles. In addition, ML can also be applied to a wide set of PQ parameters related to load functioning, such as active power, reactive power, complex power, fundamental frequency, and power factor. Table 3 summarizes the most recent improvements and achievements in the use of ML techniques in power quality applications.

Photovoltaic System Applications
Machine learning algorithms have been widely used for different purposes in Photovoltaic (PV) systems, from forecasting the long-, medium-, and short-term energy generation to fault detection and classification. The most recent works in this field are summarized in Table 4.

Intelligent Transportation Applications
Artificial intelligence, and especially machine learning, is widely used in intelligent transportation to develop smart online traffic management systems, from safety applications (e.g., driving distraction detection) to optimized traffic scheduling. Self-driving cars, for instance, have been developed only recently thanks to the advancements in machine learning. Table 5 presents the most recent ML-based works in the field of intelligent transportation.

Load Forecasting Applications
Accurate load forecasting, both short- and long-term, is an essential task for the daily (economic) dispatching of electricity, both to prevent wasting energy production and to integrate renewable energy resources.
Energy companies monitor, control, and schedule load demands and power generation to enhance energy management systems. However, electrical load profiles are becoming more complicated, not only because of the stochastic behavior of customers, but also because of the introduction of new non-linear components in power systems, such as electric vehicles, buses, and bikes. Therefore, many researchers have been developing both deterministic and probabilistic load forecasting models to improve the precision and speed of prediction models. Table 6 presents recent advancements of machine learning studies in load forecasting.

DT and KNN: the highest classification accuracy is achieved with a DT; the accuracies obtained in the simulations are satisfactory for some classes, but performance heavily relies on the distribution of the target output and on the number of samples per class.

Another study proposed an ANN-based model based on dynamic programming (DP) which, compared to other methods, has better quality and a faster response time (27.15 s); this method reduced the daily and yearly electricity cost by more than 50% in four different scenarios considering PV output, electrical demand, electricity price, and battery SOC.

A further study proposed a novel hybrid model based on variational mode decomposition (VMD), a self-recurrent (SR) mechanism, support vector regression (SVR), a chaotic mapping mechanism, and cuckoo search (CBCS). The VMD-SR-SVRCBCS model outperformed the other medium-range prediction methods (240 half-hour window) in both cases, with MAPEs of 2.5 and 0.9 for New York and Queensland, respectively.

Another work proposed deterministic and probabilistic load prediction using two Q-learning agents to select the best model locally from deterministic load forecasting methods and ten state-of-the-art ML-based models; the results show 50-60% accuracy improvements compared to single-phase benchmark models.

Finally, one group of authors presented a hybrid algorithm based on supervised and unsupervised machine learning techniques: firstly, they applied empirical mode decomposition (EMD) and similar-days selection to extract the dominant features; then, they made predictions with an LSTM, considering a very rich dataset with 11 years of data for training, one year for validation, and one year (2016) for testing. The similarity between days was computed by XGBoost-based weighted k-means. The testing results for one-day- and one-week-ahead forecasts show that this hybrid method improved the average accuracy of the LSTM-based model from 5.43

Discussion
ML-based algorithms have shown remarkable results in power system analytics compared to traditional methods. However, even if the models proposed in the literature were shown to work well on real datasets, their performance in industrial applications has not been sufficiently demonstrated yet, due to cost or privacy issues. This suggests the need for further investigations at the industrial level, where the presence of input data with different distributions or with big data properties (e.g., volume, velocity, variety, and veracity) could decrease the performance of ML models.
Regarding the data used for system validation, the studies generally presented customized datasets. They typically provided information on the total number of samples, the sampling frequency, the recording time, and the percentage of data used for training and validation. As several datasets were synthetically generated using simulation software, only a few studies reported problems with imbalanced datasets and missing items in the data. In this regard, Hong et al. [45] analyzed the case in which data were missing from one of the buses, concluding that system performance decreased significantly. Karagiannopoulos et al. [46] extrapolated historical data and used information from the public domain or from neighboring systems to deal with missing or noisy data. In this sense, Hafeez et al. [95] replaced missing values with the average values of the preceding days, while El-Hendawi et al. [98] replaced missing data with the average values of the same day in previous years. Similarly, Ray et al. [75] used measurements from past hours to fill in missing data and performed data cleaning to exclude incorrect data from training. Jia [79], Ou et al. [84], and Alawad et al. [88] also highlighted the need to clean up missing data, while Li et al. [44] set the missing features to zero to keep the dimension of the matrix constant. Additionally, Gao et al. [73] presented an ML-based fault detection system for a photovoltaic array and quantified the impact of missing PV input data (irradiance, temperature, and different combinations of them) on system accuracy. On the other hand, Li et al. [83], Vantuch et al. [54], and Liao et al. [53] discussed the effect of imbalanced datasets on performance. In this sense, Wang et al. [57] solved the data imbalance problem using a data augmentation method that equalized the amount of data (random cropping of existing data to generate a new dataset, addition of random noise, signal reversing, etc.).
Similarly, Jia [79] applied a synthetic minority over-sampling technique that addressed the dataset imbalance problem without overfitting the classifier.
The lack of standard datasets for testing ML-based algorithms also emerged as a relevant issue. Indeed, the models presented in the literature are usually tested on non-standard datasets with very different characteristics and peculiarities, which makes comparing the performance of such methods almost impossible. When it comes to selecting the most suitable ML method for large-scale applications, this lack of information is therefore a relevant issue that could prevent the adoption of novel (and potentially better-performing) methods in favor of (probably less performing) traditional ones. This highlights the need to define application-specific standard datasets that allow a fair comparison between the very different ML methods proposed for each application. The standardized dataset should have the following properties:
- Size: from the industrial perspective, the dataset should be large and of high dimensionality. Although some weak learners, such as DT, have been shown to work well with small amounts of data, they do not generalize well to large-scale problems; neural network models, on the contrary, achieve better accuracy on large datasets.
- Quality: if the focus is only on the performance of the machine learning model, the different input datasets should have the same properties. For example, some models are very robust to null values or outliers while others are not; preparing a dataset before feeding it to a model relates to data engineering procedures rather than to model performance.
- Validity: the dataset should accurately represent the phenomena or events of interest. Its statistical properties should be as close as possible to real-life scenarios, to show how practical the models are.
- Uniqueness and completeness: the information should be unique and not duplicated across the dataset, to make sure a trained model generalizes well to actual cases. Moreover, it should cover all the possible occurrences or conditions; in power quality disturbance classification, for example, the dataset should include all the essential distortions.
- Train and test division: it is important to make sure that the performance of all models is evaluated with the same train/test split. Otherwise, a chosen test set may consist only of easy instances, or may not cover all the possibilities.
- Accuracy metrics: authors used different metrics to evaluate their models' performance; however, studies cannot be compared when the same accuracy metrics are not used. The metrics should be chosen taking into account the nature of the problem. For example, abnormal events are far rarer than normal ones in anomaly detection, so a model with 99% accuracy is not guaranteed to have correctly detected all abnormal events; for such studies, the F1-score or AUC should be taken into account.
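The point about accuracy versus F1-score on imbalanced data can be demonstrated with a small, self-contained example (the toy labels and the trivial "always normal" model are invented for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 1000 samples, 10 anomalies; a trivial model that always predicts "normal"
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
print(accuracy(y_true, y_pred))  # 0.99 -- looks excellent
print(f1_score(y_true, y_pred))  # 0.0  -- detects no anomaly at all
```

This is exactly the situation described above: 99% accuracy while every abnormal event is missed, which the F1-score (or AUC) exposes immediately.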
Researchers proposed different models based on one or more techniques. Overall, hybrid models showed better performance than single-technique models, particularly those that combined feature engineering techniques with prediction models. Reinforcement learning methods such as Q-learning have also enhanced accuracy in some applications, like intelligent transportation systems and load forecasting. In applications dealing with temporal datasets, such as PV prediction or load forecasting, sequential techniques such as GRU or LSTM are preferred.
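To make the Q-learning reference concrete, the tabular update rule Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) − Q(s,a)] can be sketched on a toy two-state problem. The environment, states, and rewards below are invented purely for illustration and have no connection to the surveyed power system applications:

```python
import random
from collections import defaultdict

def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a toy chain: from state 0, action 1 moves to
    terminal state 1 with reward 1; action 0 stays in state 0 with reward 0."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        while s != 1:                              # episode ends at state 1
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice([0, 1])
            else:
                a = max([0, 1], key=lambda act: Q[(s, act)])
            s_next, r = (1, 1.0) if a == 1 else (0, 0.0)
            # bootstrap from the greedy value of the next state (0 if terminal)
            best_next = 0.0 if s_next == 1 else max(Q[(s_next, 0)], Q[(s_next, 1)])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
print(Q[(0, 1)] > Q[(0, 0)])  # the learned values favor the rewarding action
```

The surveyed applications use the same update rule, only with much larger state and action spaces (e.g., traffic signal phases or load scheduling decisions).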

Conclusions
When facing the challenges related to the management of smart power systems, it becomes apparent that traditional techniques are no longer computationally adequate. One of the limitations of conventional algorithms is their inability to handle the large amounts of heterogeneous data collected from measurement devices such as phasor measurement units and smart meters. As a result, many researchers developed efficient and reliable solutions based on state-of-the-art learning algorithms, either providing innovative approaches or improving the overall performance of current models in various power system fields. In this context, the ML paradigm and modern ML algorithms were categorized and presented in this article. Furthermore, this study provided a systematic overview of the latest machine learning techniques and models employed to bring new solutions in power flows, power quality events, power quality parameters, photovoltaic systems, intelligent transportation systems, and load forecasting services. The authors also suggested the properties of a standard dataset for testing and reviewing ML-based models, to allow a fair comparison between the performances of the models proposed for each topic. The literature analysis shows that hybrid models based on supervised machine learning algorithms are applied far more extensively than unsupervised or semi-supervised techniques. Thus, it can be highlighted that supervised algorithms bring more benefits to the problems typically faced by electrical power engineers. Finally, it can also be concluded that the application of machine learning methods in electrical systems simplifies complex issues and ensures more reliable and accurate results. As numerous works proposed solutions based on ML techniques, the authors limited their research to well-known, recently published articles. Accordingly, in future work, the authors will review the articles related to each topic separately, to provide a more informative survey.