Using True RMS Current Measurements to Estimate Harmonic Impacts of Multiple Nonlinear Loads in Electric Distribution Grids

Currently, for analyzing harmonic impacts on voltage at a point of interest, due to multiple nonlinear loads, the literature recommends carrying out simultaneous and synchronized measurement campaigns in all suspicious points with the use of high cost energy quality analyzers that are usually not available at the customers’ facilities and very often also not at the electric utilities. To overcome this drawback this paper proposes a method of assessing the harmonic impact due to multiple nonlinear loads on the total voltage harmonic distortion using only the load current true RMS values which are already available in all customers’ installations. The proposed methodology is based on Regression Tree technique using the Permutation Importance indicator which is validated in two case studies using two different electrical systems. The first case study is to ratify the use of Permutation Importance to measure the impact factor of each nonlinear load in a controlled scenario, the IEEE-13 bus test system, using ATP simulation (Alternative Transient Program). The second is to apply the methodology to a real system, an Advanced Measurement Infrastructure System (AMI) implanted on a campus of a Brazilian University, using low cost meters with only true RMS current measurements. The results achieved demonstrated the feasibility of applying the proposed methodology in real electric systems without the need for additional investments in high-cost energy quality analyzers.


Introduction
With the advance of technology, a problem has become quite frequent in distribution systems: harmonic distortion. This is due to the increasing use of nonlinear loads, mainly in the form of power electronics or consumers' electronics equipment. The first records on harmonic distortion were in the 1920s, when the main effects were on rotating machines [1]. It is noticed a great influence of harmonic distortion in the voltage and current's waveforms, being responsible for harmful effects at various levels in distribution networks.
Heating in distribution transformers, malfunction of protection and overload of neutral conductors are some of the implications on equipment most sensitive to these variations, directly affecting the operation of the system and substantially increasing the likelihood of failure occurrences and interruptions in electricity supply [2]. The electric network itself is a source of harmonic distortions in voltage and current due to its operation characteristics but loads are the main responsible for this type of phenomenon [3,4].
However, the electric energy as a product needs to be delivered at certain quality levels, and the harmonic distortion, as well as other factors impacting the power quality need to be regulated, forcing distribution utilities to continuously improve their services. In Brazil, for example, ANEEL, the Brazilian Electricity Regulatory Agency, establishes the main power quality indicators to be met by utilities and the respective penalties for non-compliance with these limits [5].
Therefore, electric utilities need to continuously monitor their electrical grids to detect any suspicious loads that could be contributing to the voltage harmonic distortion above specified limits as recommended by power quality standards. Therefore, the development of procedures to identify which customers' loads are more impacting to the voltage harmonic distortion observed at specific locations in the grid are relevant and can help to implement a differenced treatment to these customers aiming at taking remedial actions to mitigate the possible harmonic distortion transgressions.
The point is that monitoring all customers who may be heavily contributing to the voltage harmonic distortion at a specific point in the electric network is not a trivial task and would involve a simultaneous measurement campaign in all these customers, resulting in a high financial investment in power quality analyzers.
In the current scenario of distribution systems, consumers rarely have power quality analyzers in their electrical installations. In this way the implementation of simultaneous measurement campaigns to assess harmonic distortion parameters becomes infeasible. However, all-consuming facilities have power measurement, and consequently, electric voltage and current rms measurements, which is typical in a current Advanced Metering Infrastructure system (AMI) [6].
In an AMI, Smart Meters (SMs) measure the major electrical quantities of the network at regular time intervals and make them available online through a central Meter Data Management System (MDMS). AMI can be used on any power system if coupled to a Power Line Communication (PLC) module, see for example [7][8][9].
Considering this current scenario, this paper proposes a low-cost methodology capable of identifying and quantifying the impact of multiple nonlinear loads on the voltage total harmonic distortion at a point of interest in the electric grid using only the True RMS current measured at the customers' electric installations.
This new perspective is quite attractive in terms of cost, as it dramatically reduces it by using more affordable meters for utility's reality, but still allows for even greater gain when coupled with an Advanced Metering Infrastructure, enabling real-time power quality monitoring. To achieve this, the Gradient Boosting Regression Tree (GBRT) technique was applied using the permutation importance, an inherent characteristic of regression tree techniques to determine the impact of multiple nonlinear loads in the voltage total harmonic distortion (THD u ) at the point of interest.
The proposed methodology will be tested and validated using the two steps procedure as described in the following: (1) Validation using the permutation importance as a metric to measure the impact factor for each nonlinear load by comparison with the result obtained by ATP software in a controlled simulation scheme using the IEEE-13 bus system [10]. (2) Applying the methodology to the AMI system installed at the Federal University of Pará campus located in the north of Brazil, so that it is possible to indicate the most representative nonlinear loads in impacting the THD u at the Common Coupling Point (CCP) with the utility electric grid. The result of this test case was also compared to the methodology presented by [10].

Related Works
Xu and Liu [11] presented a method for determining the consumer's and the electric utility's harmonic contribution at the common coupling point (CCP) based on a harmonic Norton equivalent circuit capable of separating harmonic current and voltage into two components: one due to consumer and the other due to the power grid. Following similar methods as the one used in [11], other studies based on harmonic Norton equivalent circuits emerged, as in [12] which has innovated by offering the possibility to evaluate the relative contribution of several customers at the CCP. The main disadvantage of such methodologies is the difficulty of application in systems where there are multiple nonlinear loads constantly changing, making it difficult to establish an adequate and accurate equivalent circuit.
Wang, Nino and Xu [13] presented a measurement method that can determine the source and harmonic impedance for residential and commercial systems supplied by a single-phase transformer, showing the utility's and consumer's contribution at the CCP through a short circuit controlled by a thyristor at the measuring point. The biggest problem with this technique is the need to insert a thyristor in the measurement arrangement which represents an unnecessary additional cost compared to measurement systems already available in the market.
Taking advantage of evolutionary computing, Srinivasan et al. [14] used trained neural networks to extract waveform patterns from the input current to identify harmonic sources and the various types of devices connected to the grid using their uniquely distinct harmonic "signature". The main disadvantage of this method is its inability to provide information about multiple nonlinear loads in the system, such as location, and the application in large power systems.
Mazumdar et al. also conducted studies with recurrent neural networks [15][16][17], seeking to minimize conflicts between utilities and customers by analyzing the responsibility for harmonic distortions at the CCP. However, the verification is not done in real time, but only when problems related to the responsibility for harmonic contribution in the electric network occur.
Kandev and Chénard [18] present a method capable of locating in real time the consumer responsible for the harmonic disturbance in the grid by vector analysis of harmonic currents measured at different points of the electrical network. Although it has the advantage of real-time analysis, only one harmonic current is analyzed for each experiment.
Manito in [10] presented an artificial neural network (ANN) model to estimate the impacts of nonlinear loads on the voltage harmonic distortion at a point of interest. His method relies on a sensitivity analysis to establish the percentage contribution (impact factor) of load currents on the observed individual voltage harmonic distortion. The study was conducted in an industrial test system with the alternative transient program (ATP). The proposed methodology was able to correctly classify the impact degree of nonlinear load currents on the 5th harmonic voltage distortion at the points of interest, however, only one specific harmonic at a time.
Utilizing a statistical vision, Yin et al. [19] explored statistics tools that provide a solution for identifying sources of harmonic distortions in electric networks with multiple nonlinear loads. Methods with multiple linear regression and partial correlation analysis were proposed. However, this article only determined their contributions to individual harmonic distortions.
Matos et al. [20] presented a method for finding the harmonic impact of multiple nonlinear loads using measurements of individual harmonic voltage and current distortions by linear and nonparametric regression models. Two electrical networks were used for tests and in both the nonparametric models obtained better results.
Following the premise of evaluating harmonic distortion impacts due to multiple loads comparing the performance of evolutionary computation techniques and statistical methods, [21] compares the techniques of multiple linear regression, artificial neural networks and regression tree. This work considered daily and weekly time windows, also analyzing the load curve for different loading levels. However, just like the work previously mentioned, this research evaluates only one harmonic order at a time.
Moradifar et al. [22] presented a new approach to the problem by applying fuzzy logic to determine the optimal allocation of harmonic meters for intelligent detection of multiple nonlinear loads in distribution systems. In addition, the fuzzy algorithm estimates the location and relative level of harmonic sources by investigating the power magnitude and signal considering the harmonic distortions and network reactance. Looking at the mentioned previous works the main contributions of this article can be summarized as follows: (1) The methodology uses as input the true RMS current, an electric variable that can be easily measured by low-cost meters, making viable for the reality of both distribution utilities and customers. The use of true RMS current in identifying and quantifying the impact of multiple nonlinear loads on voltage total harmonic distortion levels at the CCP as proposed in this article is a novelty. (2) The methodology can be applied with other input and output variables as long as they are representative to the scope of the problem. True RMS current was chosen because it is the most accessible magnitude with high correlation with voltage total harmonic distortion levels at the point of interest; (3) It presents an innovative and insightful methodology for solving the problem of identifying and quantifying the impact of multiple loads on the THDu level per phase at the CCP through a metric which is inherent to the regression tree technique. As it is already a step of the computational technique chosen, the estimation of input variables importance in the output variable does not add computational cost.
The methodology, when applied to an AMI system, offers further contributions: (1) Understanding that Advanced Metering Infrastructure is a relatively new definition, this article presents some procedures to ensure data reliability during its implementation; (2) Provides real-time power quality monitoring in all phases of the distribution grid, identifying meters with big contribution to total harmonic distortion; (3) Allows a wide range of experiments with varying time windows. According to the case to be studied by the grid managers, the combination of the proposed methodology with an AMI is able to investigate the meters contributions at different times per day, week or month, investigating the most impacting loads on the grid power quality seasonally.

Proposed Methodology
The problem to be solved is illustrated schematically as shown in Figure 1, representing an electric network having m nonlinear loads for which it is desired to calculate how these nonlinear loads are impacting the voltage total harmonic distortion at bus X, THDu x . To reach this objective a GBRT model will be used having the nonlinear loads electric currents as input variables to calculate the harmonic impacts on the total harmonic distortion at a point of interest. To describe the problem mathematically, consider the input matrix as X and the output vector as Y.
Matrix X contains the time series of measured true RMS currents, I RMS j (t), where j = 1, 2, 3, . . . , m, been m the number of nonlinear loads considered for a given time series t. And vector Y is composed of voltage total harmonic distortion values measured at the common coupling point X, THDu x simultaneously measured with nonlinear loads currents presented in matrix X. Matrix X and vector Y structures are shown in Equations (1) and (2), respectively.
The proposed methodology is based on the principle of supervised learning for the estimation of a time series, where the quality of this prediction is reflected in how the inputs to build the model are representative. This analysis is performed by estimating a THD X within a period t based on a set of input attributes, I RMS m , measured simultaneously according to a specified sampling rate. However, nothing prevents the use of this methodology with other input and output variables, if the high correlation between them is proven. Commonly, machine learning problems apply the attribute selection procedure to combine and identify the most significant variables in a dataset, acting to improve predictive performance and reducing its learning curve. There are techniques that already implement this procedure on its own structure, as is the case of Regression Tree. The regression tree comprises an input X and output Y, being formed by n nodes, subsets of X. The nodes are subjected to binary tests, dividing into new two subsets of X which in turn are labeled with a better value for the output variable, decreasing the degree of impurity with each step, adopting a voracious approach, top-down, recursive, under the divide-to-conquer strategy. Thus, the tree will continue to form nodes until they become pure in terms of Y or when all variables Xj are locally constant.
The importance of an input variable Xm is then evaluated in predicting the output by adding the weighted impurities decreases for all nodes where Xm is used. Thus, using the weighted impurity function to measure the Impact Factor (IF) of each harmonic generating source at a point of interest, is an innovative aspect introduced in this article, not found yet in other references in the technical literature.
Based on this problem formulation, this paper proposes the use of the Gradient Boosting Regression Tree (GBRT) technique [24][25][26], together with the impurity-based function proposed by [27,28], known as Mean Decrease Accuracy (MDA) or Permutation Importance, to get the impact factor. For the impurity measure i(n) it can be also used the Gini index, the Shannon entropy, the Y variation, among others [27,28]. Generally, (3) can be referred to as the Mean Decrease Impurity (MDI) importance. And Permutation Importance is an MDI that weighs the importance of a Xm variable in the output when its values are randomly exchanged in out-of-bag samples: Commonly, machine learning problems apply the attribute selection procedure to combine and identify the most significant variables in a dataset, acting to improve predictive performance and reducing its learning curve. There are techniques that already implement this procedure on its own structure, as is the case of Regression Tree. The regression tree comprises an input X and output Y, being formed by n nodes, subsets of X. The nodes are subjected to binary tests, dividing into new two subsets of X which in turn are labeled with a better value for the output variable, decreasing the degree of impurity with each step, adopting a voracious approach, top-down, recursive, under the divide-to-conquer strategy. Thus, the tree will continue to form nodes until they become pure in terms of Y or when all variables X j are locally constant.
The importance of an input variable X m is then evaluated in predicting the output by adding the weighted impurities decreases for all nodes where X m is used. Thus, using the weighted impurity function to measure the Impact Factor (IF) of each harmonic generating source at a point of interest, is an innovative aspect introduced in this article, not found yet in other references in the technical literature.
Based on this problem formulation, this paper proposes the use of the Gradient Boosting Regression Tree (GBRT) technique [24][25][26], together with the impurity-based function proposed by [27,28], known as Mean Decrease Accuracy (MDA) or Permutation Importance, to get the impact factor. For the impurity measure i(n) it can be also used the Gini index, the Shannon entropy, the Y variation, among others [27,28]. Generally, (3) can be referred to as the Mean Decrease Impurity (MDI) importance. And Permutation Importance is an MDI that weighs the importance of a X m variable in the output when its values are randomly exchanged in out-of-bag samples: In (3) N TR represents the number of trees built, p(n) is the proportion of samples reached on n nodes, s n is the binary test for separating samples in subsets, v(s n ) is the variable used as the separation parameter and ∆i(s n , n) is the variation of a measure of impurity.
Finally, the relative contribution of each nonlinear load to the voltage total harmonic distortion at the model output is measured by the impact factor, IF THDu m (%), according to (4): The impact factor of each j load can be interpreted as a relative percentage value that is calculated for each true RMS current individually for the set of nonlinear loads considered in the analysis, leading to (5): An overview of the methodology proposed in this article is presented in Figure 2. As a starting point, it is important to identify the data sources. The methodology points out two possible ways: the first corresponds to the application in an AMI system following procedures as will be indicated in topic A; the second not applying in an AMI system, that is, using traditional measurement campaigns. In (3) represents the number of trees built, ( ) is the proportion of samples reached on n nodes, is the binary test for separating samples in subsets, ( ) is the variable used as the separation parameter and ∆ ( , ) is the variation of a measure of impurity.
Finally, the relative contribution of each nonlinear load to the voltage total harmonic distortion at the model output is measured by the impact factor, (%), according to (4): The impact factor of each j load can be interpreted as a relative percentage value that is calculated for each true RMS current individually for the set of nonlinear loads considered in the analysis, leading to (5): An overview of the methodology proposed in this article is presented in Figure 2. As a starting point, it is important to identify the data sources. The methodology points out two possible ways: the first corresponds to the application in an AMI system following procedures as will be indicated in topic A; the second not applying in an AMI system, that is, using traditional measurement campaigns. If measurement campaigns are to be used without the resources of a Data Management System, a time window must be chosen having a sampling interval. For power quality applications a maximum sampling interval of 10 min is recommended by legislation for one-week campaigns, which correspond to 1008 samples [5]. The smaller the sampling rate the greater the possibility of applying data preprocessing techniques, which corresponds to the next step in the methodology.
After the data acquisition comes a data management step, which corresponds basically to the application of process of Knowledge Discovery in the Database (KDD), which is an essential step to obtain a good estimation model.
In the Data Extraction step, it is important to apply a filter to select which measurement points should be considered in the study, depending on how many null samples exist in the time series. Points with more than 30% of null samples must be discarded from the diagnosis. These null samples usually correspond to periods of meter shutdowns or failures. This same procedure is applied in the SM selection step for the interest area. If measurement campaigns are to be used without the resources of a Data Management System, a time window must be chosen having a sampling interval. For power quality applications a maximum sampling interval of 10 min is recommended by legislation for one-week campaigns, which correspond to 1008 samples [5]. The smaller the sampling rate the greater the possibility of applying data preprocessing techniques, which corresponds to the next step in the methodology.
After the data acquisition comes a data management step, which corresponds basically to the application of process of Knowledge Discovery in the Database (KDD), which is an essential step to obtain a good estimation model.
In the Data Extraction step, it is important to apply a filter to select which measurement points should be considered in the study, depending on how many null samples exist in the time series. Points with more than 30% of null samples must be discarded from the diagnosis. These null samples usually correspond to periods of meter shutdowns or failures. This same procedure is applied in the SM selection step for the interest area.
According to the proposed methodology the next step is the dataset creation and separation that will be used in the training step and later in the validation step of the configured and constructed estimation model. It is advisable to use the cross-validation technique to mitigate effects caused by seasonality which are inherent in electric loads modeling.
If the modelling error is not satisfactory, new parameters configuration must be selected to raise the desired performance. Otherwise, it is recommended to restart the methodology with a greater focus on the Data Manipulation step. When a satisfactory estimation model is obtained the impact factor for each nonlinear load is calculated accordingly.

Aprocedures for Implementing the Methodology in an Advanced Metering Infrastructure
As previously presented, this work also contributes to the application of the proposed methodology in a real scenario with an already implemented and functioning AMI system. In this section it will be outlined the main steps taken to ensure that data acquisition from the meters be reliable for usage in the methodology.
To achieve that, it begins by highlighting important technical factors for the basic elements that make up the infrastructure required for this application: • End user devices-Smart Meters; In general, smart meters need to be able to send stored data and receive operational commands. Thus, it is crucial a SM to have at least one network communication interface and an internal real-time clock to allow synchronism checking, and in the case different meters are used, it must be possible to standardize the time measurement.
In [6] several recommendations for the implementation of an AMI are presented, but as one of the main motivations of this work is cost reduction so the methodology can be accessible, the efforts were devoted to the development of an application with the function of mediation between meters, database and end users. In other words, routines and mechanisms for communication and data integrity checking were proposed to implement the infrastructure Data Management System.
The application was divided into three steps: data acquisition; data processing; and data persistence. Each step must be associated with an error and exception release. The first is related to the communication with the meters, being responsible for verifying possible changes in the physical network and data acquisition. The essential procedures for this implementation are: Scan network to check if meter is connected or disconnected to system; • Identify and flag meter configuration changes; • Request data and apply checksums CRC (Cyclic Redundancy Check).
Cyclic Redundancy Check is an error detection methodology that aims to identify changes in the data chain during information transmission [29]. It basically consists on the polynomial calculation of bytes check, usually called checksum, which are added to the original message for later transmission. The data acquired during the Data Acquisition step correspond to measurements provided almost instantaneously by the SMs, with requests in every 30 s. Since the purpose of this application is to measure steady state quantities and the meters preferentially work with RMS current values, the interval between data acquisition doesn't need to be smaller than one cycle of the input sinusoid. Thus, the request period can happen in seconds, allowing time for repetition of the data acquisition process in case of errors or inconsistencies.
Such errors need to be detected and corrected so that persisted data in the database are reliable for the estimation model construction step. It is also very important to store the history of faults and errors, as they will be used to filter meters that will participate in the diagnosis. Table 1 shows how the codes associated with each error detected by the AMI system of the test scenario are stored, which it will be presented in the next section. The second step of the proposed application is Data Processing. The measurement treatment process consists of obtaining a representative value of this set of samples for each electrical quantity, which are I RMS and THD u in this specific case. Thus, the first step is to remove measurements with an error, in an independent way for each variable. Then the remaining samples are filtered using the Chauvenet criterion presented in the computer code Algorithm 1. Finally, the mean value is calculated, and used to represent the respective variable. If in one of the steps all measurements are removed, the value representing the variable is taken as zero, and an equivalent error is defined as the smallest value among the errors obtained for the set of variables. the codes associated with each error detected by the AMI system of the test scenario are stored, which it will be presented in the next section. The second step of the proposed application is Data Processing. The measurement treatment process consists of obtaining a representative value of this set of samples for each electrical quantity, which are IRMS and THDu in this specific case. Thus, the first step is to remove measurements with an error, in an independent way for each variable. Then the remaining samples are filtered using the Chauvenet criterion presented in the computer code Algorithm 1. Finally, the mean value is calculated, and used to represent the respective variable. If in one of the steps all measurements are removed, the value representing the variable is taken as zero, and an equivalent error is defined as the smallest value among the errors obtained for the set of variables. The Chauvenet Criterion is widely accepted as a deviation detection technique. It defines an acceptable deviation around the mean for a sample with no measurements [30]. The base critical value can be obtained from Table 2.
The last step to address is Data Persistence, which is the step responsible for storing the handled data in the database. It is important to meet the acquisition and treatment rates, because it is an application which should be repeated periodically. These cycles start at different intervals and can be easily configured according to the scenario covered. In summary, the procedures presented in this section ensure the use of the proposed methodology in an AMI environment. It is important to note that all samples must respect a sampling rate and be in the same time window, ensuring synchronism.
The Chauvenet Criterion is widely accepted as a deviation detection technique. It defines an acceptable deviation around the mean for a sample with no measurements [30]. The base critical value can be obtained from Table 2. The last step to address is Data Persistence, which is the step responsible for storing the handled data in the database. It is important to meet the acquisition and treatment rates, because it is an application which should be repeated periodically. These cycles start at different intervals and can be easily configured according to the scenario covered. In summary, the procedures presented in this section ensure the use of the proposed methodology in an AMI environment. It is important to note that all samples must respect a sampling rate and be in the same time window, ensuring synchronism.

Results and Discussion
In order to validate the proposed methodology, two tests were executed. The first one, compares the impact factor performance calculated using the GBRT technique with Permutation Importance, and the impact factor calculated using Artificial Neural Network based procedure as presented in [10] for the IEEE-13 bus industrial system. The second test is accomplished using the AMI system installed at the Federal University of Pará campus.

Permutation Importance Validation
Although permutation importance is already used successfully to evaluate the impact of input variables on output variables in several applications [32][33][34], it is necessary to evaluate its performance in the problem addressed in this article. As previously mentioned, in the first test the GBRT technique combined with the permutation importance metric are used to calculate the impact factor in an ATP simulation study using the balanced IEEE-13 bus industrial distribution system, and the obtained results are compared with those obtained in [10] also using the same test system. Manito et al. (2018) in [10] used an Artificial Neural Network to estimate the impacts of fifth harmonic currents generated by the nonlinear loads on the fifth harmonic voltage at a point of interest, in a scenario simulated in ATP Draw software (Alternative Transient Program [35]).
The same scenario is used with the GBRT modeling technique with permutation importance to calculate the impact factor of fifth harmonic currents on the fifth harmonic voltage at the same point of interest as calculated in [10], comparing the results of the two techniques.

IEEE-13 Bus Industrial Electrical Distribution System
The IEEE-13 bus industrial distribution system [36,37], represented in Figure 3, is a three-phase system used by several authors as a test system for power quality analysis [38][39][40]. The same scenario of [10] was used in this study, where four nonlinear loads were specified at buses B11, B29, B49 and B51, modeled by the ATP Draw software as shown in Figure 3, configuring nonlinear loads as harmonic current sources HS1, HS2, HS3, and HS4 respectively. Importantly, ATP Draw software is widely used in power quality analysis, simulating the characteristics and the actual behavior of the network satisfactorily [41][42][43][44][45]. Energies 2019, 12, x FOR PEER REVIEW 10 of 22

Data Manipulation
As described in section III, it is necessary to evaluate the scenario database focusing mainly on data preprocessing. Applying this step to the first test data, outliers were found most frequently in harmonic sources 1 and 4 being identified and removed. At the end of this step the samples were reduced from 10800 to 7427, which is a 31.23% reduction. In the next topic it will be presented the importance of this step for the calculated harmonic impact factor values shown in Table 3.

. Comparison between Techniques and Metrics
The work developed by Manito et al. (2018) divided the dataset, so the first 80% samples were intended for training and testing the ANN via cross-validation and the 20% remaining for the ANN model validation. To apply the proposed methodology by using GBRT, the above proportion was respected, and the cross-validation technique was also used, but with 10 folds for training the GBRT model.
The methodology was implemented in Python with the aid of the Scikit-learn library [46] and the Gradient Boosted Regression Trees regression technique, as already mentioned. After training

Data Manipulation
As described in section III, it is necessary to evaluate the scenario database focusing mainly on data preprocessing. Applying this step to the first test data, outliers were found most frequently in harmonic sources 1 and 4 being identified and removed. At the end of this step the samples were reduced from 10800 to 7427, which is a 31.23% reduction. In the next topic it will be presented the importance of this step for the calculated harmonic impact factor values shown in Table 3.

Comparison between Techniques and Metrics
The work developed by Manito et al. (2018) divided the dataset, so the first 80% samples were intended for training and testing the ANN via cross-validation and the 20% remaining for the ANN model validation. To apply the proposed methodology by using GBRT, the above proportion was respected, and the cross-validation technique was also used, but with 10 folds for training the GBRT model.
The methodology was implemented in Python with the aid of the Scikit-learn library [46] and the Gradient Boosted Regression Trees regression technique, as already mentioned. After training and construction of the GBRT model, it had its performance evaluated by the Absolute Mean Error (MAE) obtained through the test set, as well as in [10].
The best parameter setting found for the GBRT model was: Max_depth = 10, Min_sample_split = 10, Min_sample_leaf = 3, N_estimator = 50, Learning rate = 0.1, Loss functions = huber, alpha = 0.85. With this configuration the GBRT technique obtained a MAE equal to 0.0065 while ANN obtained 0.0107 in this study, so the GBRT improved the estimation performance by 39.25%. In addition, GBRT presented a satisfactory result when compared to impact factors calculated by ATP simulation, which are considered reference values for comparison, as seen in Figure 4.
It is also observed in Figure 4 that the impact factors calculated by the GBRT model are systematically closer to the exact results obtained by ATP, for all harmonic sources HS1, HS2, HS3 and HS4. Other relevant results are summarized in Table 3, for the ANN and GBRT models. While ANN and GBRT without data preprocessing have erroneously ranked third and fourth most impacting harmonic sources as compared to ATP ranking, the GBRT model with data preprocessing as proposed in the article ranked correctly all harmonic sources. This fact makes evident the importance of the data preprocessing step included in the GBRT model. and construction of the GBRT model, it had its performance evaluated by the Absolute Mean Error (MAE) obtained through the test set, as well as in [10]. The best parameter setting found for the GBRT model was: Max_depth = 10, Min_sample_split = 10, Min_sample_leaf = 3, N_estimator = 50, Learning rate = 0.1, Loss functions = huber, alpha = 0.85. With this configuration the GBRT technique obtained a MAE equal to 0.0065 while ANN obtained 0.0107 in this study, so the GBRT improved the estimation performance by 39.25%. In addition, GBRT presented a satisfactory result when compared to impact factors calculated by ATP simulation, which are considered reference values for comparison, as seen in Figure 4.
It is also observed in Figure 4 that the impact factors calculated by the GBRT model are systematically closer to the exact results obtained by ATP, for all harmonic sources HS1, HS2, HS3 and HS4. Other relevant results are summarized in Table 3, for the ANN and GBRT models. While ANN and GBRT without data preprocessing have erroneously ranked third and fourth most impacting harmonic sources as compared to ATP ranking, the GBRT model with data preprocessing as proposed in the article ranked correctly all harmonic sources. This fact makes evident the importance of the data preprocessing step included in the GBRT model.

Test Case in an Advanced Metering Infrastructure Instalation
After validating the use of permutation importance as a metric to calculate the impact factor, it was decided to execute the proposed methodology in a real electric distribution grid having an AMI system already implanted. This is located at the Federal University of Pará (UFPA) Campus in Belém City-Brazil, as depicted in Figure 5. The AMI system is named SISGEE that is the short form for Electric Energy Management System.

Test Case in an Advanced Metering Infrastructure Instalation
After validating the use of permutation importance as a metric to calculate the impact factor, it was decided to execute the proposed methodology in a real electric distribution grid having an AMI system already implanted. This is located at the Federal University of Pará (UFPA) Campus in Belém City-Brazil, as depicted in Figure 5. The AMI system is named SISGEE that is the short form for Electric Energy Management System.

Brief Description of SISGEE (Electric Management System)
SISGEE's main objective is to monitor the steady-state electrical quantities in the University Campus electric grid and to arrange them in an intuitive online environment so that it is easily accessed by computers logged in the Campus available Wide Area Network (WAN). Figure 6 shows schematically the electric grid of the University Campus summarizing the main load blocks.
UFPA´s electric loads are distributed in four main sectors, namely: Basic 1, Basic 2, Professional, and Health as detailed in Figure 6. Basic sector is the largest, comprising most graduation courses, administration buildings, banks, convention center, among others, being supplied by two distribution feeders, namely AL1 and AL2. The Professional and Health sectors are supplied by

Brief Description of SISGEE (Electric Management System)
SISGEE's main objective is to monitor the steady-state electrical quantities in the University Campus electric grid and to arrange them in an intuitive online environment so that it is easily accessed by computers logged in the Campus available Wide Area Network (WAN). Figure 6 shows schematically the electric grid of the University Campus summarizing the main load blocks.
UFPA's electric loads are distributed in four main sectors, namely: Basic 1, Basic 2, Professional, and Health as detailed in Figure 6. Basic sector is the largest, comprising most graduation courses, administration buildings, banks, convention center, among others, being supplied by two distribution feeders, namely AL1 and AL2. The Professional and Health sectors are supplied by feeders AL3 and AL4 respectively. The common coupling point, where the THDu x is to be measured is the 13.8 kV substation connecting the internal UFPA electric distribution networks to the local utility distribution grid. feeders AL3 and AL4 respectively. The common coupling point, where the THDux is to be measured is the 13.8 kV substation connecting the internal UFPA electric distribution networks to the local utility distribution grid. SISGEE monitors this power grid through class S Smart Meters (SM) installed at the substations of the largest institutes and buildings. SMs use the MODBUS TCP / IP over Ethernet protocol to send the acquired data to a central server which is responsible for requesting, calculating, storing the acquired data and hosting an internet site for displaying online and real time results. The MS parameters according to factory specification are shown in Table 4.  Table 5 lists the installed SMs by feeder and building highlighting its main activity type and the installed power transformer capacity. In red are the SMs that were inactive in the period of the experiment.  SISGEE monitors this power grid through class S Smart Meters (SM) installed at the substations of the largest institutes and buildings. SMs use the MODBUS TCP/IP over Ethernet protocol to send the acquired data to a central server which is responsible for requesting, calculating, storing the acquired data and hosting an internet site for displaying online and real time results. The MS parameters according to factory specification are shown in Table 4.  Table 5 lists the installed SMs by feeder and building highlighting its main activity type and the installed power transformer capacity. In red are the SMs that were inactive in the period of the experiment.   Table 6 lists by feeder, the respective nominal power capacity, the effective measured power during the measurement campaign, and the percentage of nominal power capacity used by connected loads. Because, although SISGEE serves the institutes with the largest loads installed, there are some other buildings that are not yet monitored.

Test Scenario Description
This test scenario aims to apply the proposed methodology in an electric network with an implanted AMI system which facilitates the prompt access to the measurements acquired from the real time operation of the electric network. Accessing the UFPA AMI system make available, by phase, real time and historical data involving all electric quantities like voltage, current, power, energy, power factor and also discriminators of power quality like individual and total harmonic distortions indicators.
To validate the proposed methodology of obtaining impact factors in the THDu of a point of interest using only the electric current true rms values, the test was conducted from January 21 to January 27, 2019, because it was the week that had the highest number of THDu violations of the year. Initially, it was selected all meters with status active and that have not extrapolated 30% of null samples. This selection filter is made according to each phase and automatically by MDMS.
Before the final database is formed, it still undergoes a sample removal action that corresponds to the university's general power outages measured at the Main Power Cabin. At the end of this process, the database for constructing and validating the regression model had 897 samples, where the input attributes correspond to true rms current values measured at the points that have active and validated meters, with the THDu measured at the CCP (Main Power Cabin) representing the model output.
It is important to know that of all active meters listed in Table 5, only the meter installed in the Physiotherapy College building was excluded from the Phase B experiment because it did not meet the above prerequisites.

Parameters Definition
Following the steps of the proposed methodology (Figure 2), after selecting the SM that will represent the input and output attributes of the regression model, it is time to create the basis that will be used for training, testing and validation. For this, randomly following the ratio of 80% for the training set and 20% for the validation set. The model was built using the 10-pack cross-validation technique, which forms training and test subsets.
The methodology was implemented using the Python language with the aid of the Scikit-learn library [46] and the GBRT regression technique, through the impurity index obtained from the own GBRT construction process, and the harmonic impact factor is estimated for each SM.
Model performance was evaluated by the Mean Absolute Error (MAE), Mean Square Error (MSE), and Mean Absolute Percent Error (MAPE) metrics obtained by the test set, and the best ratio of configuration parameters for the GBRT model are: Max_depth = 8, Min_sample_split = 9, Min_sample_leaf = 9, N_estimator = 50, Learning rate = 0.09, Loss functions = huber, alpha = 0.85. Table 7 presents the errors obtained for phases A, B, C, by the configured estimator with the above parameters.   Table 7, it can be observed that MAPE in all phases remained below 5.0%, which is an acceptable threshold to guarantee a satisfactory performance of this prediction model for the next step, which is the impact factor calculation. Figure 7 compares the THDu values measured at the CCP with the values estimated by the model for phases A-B-C respectively, together with the 95% confidence interval. model output.
It is important to know that of all active meters listed in Table 5, only the meter installed in the Physiotherapy College building was excluded from the Phase B experiment because it did not meet the above prerequisites.

Parameters Definition
Following the steps of the proposed methodology (Figure 2), after selecting the SM that will represent the input and output attributes of the regression model, it is time to create the basis that will be used for training, testing and validation. For this, randomly following the ratio of 80% for the training set and 20% for the validation set. The model was built using the 10-pack cross-validation technique, which forms training and test subsets.
The methodology was implemented using the Python language with the aid of the Scikit-learn library [46] and the GBRT regression technique, through the impurity index obtained from the own GBRT construction process, and the harmonic impact factor is estimated for each SM.
Model performance was evaluated by the Mean Absolute Error (MAE), Mean Square Error (MSE), and Mean Absolute Percent Error (MAPE) metrics obtained by the test set, and the best ratio of configuration parameters for the GBRT model are: Max_depth = 8, Min_sample_split = 9, Min_sample_leaf = 9, N_estimator = 50, Learning rate = 0.09, Loss functions = huber, alpha = 0.85. Table 7 presents the errors obtained for phases A, B, C, by the configured estimator with the above parameters.   Table 7, it can be observed that MAPE in all phases remained below 5.0%, which is an acceptable threshold to guarantee a satisfactory performance of this prediction model for the next step, which is the impact factor calculation. Figure 7 compares the THDu values measured at the CCP with the values estimated by the model for phases A-B-C respectively, together with the 95% confidence interval. (a)

Impact Factor Calculation
In order to calculate the IF by the technique proposed in [10], it is necessary to vary, one by one, each electric current time series in the input vector of the trained model by a percent factor, for example 10%, 20%, etc and then resubmit them to obtain a new estimated output that will be compared with the actual measured values. This procedure was named in [10] as a sensitivity analysis. The difference between both time series is evaluated by the Mean Absolute Percent Error (MAPE), that indicates the IF of the respective load current on the THDu at the point of interest.
As the GBRT regression technique already calculates the importance factor as part of its process, it is evident the gain in processing time offered by the use of GBRT and importance factor. Figure 8 demonstrates how the sensitivity factor varies with varying percent values from 10% to 1000% for phases A-B-C.

Impact Factor Calculation
In order to calculate the IF by the technique proposed in [10], it is necessary to vary, one by one, each electric current time series in the input vector of the trained model by a percent factor, for example 10%, 20%, etc and then resubmit them to obtain a new estimated output that will be compared with the actual measured values. This procedure was named in [10] as a sensitivity analysis. The difference between both time series is evaluated by the Mean Absolute Percent Error (MAPE), that indicates the IF of the respective load current on the THDu at the point of interest.
As the GBRT regression technique already calculates the importance factor as part of its process, it is evident the gain in processing time offered by the use of GBRT and importance factor. Figure 8 demonstrates how the sensitivity factor varies with varying percent values from 10% to 1000% for phases A-B-C. To have a good resolution in the sensitivity analysis for the second test system, the input vector was varied 200%, resulting MAPE and Permutation Importance percent values as presented in Figure  9. For the results analysis, the IF calculated for each nonlinear load, represented by the measurements of each SM, were grouped by feeder as can be seen in Figure 9.
This type of representation enables simpler monitoring of the electric network power quality situation and allows the manager to first perform a macro analysis and then focus his actions on specific points. According to Figure 9, it is highlighted that feeder 2 phases A and B are the most impacting on the THDu at the CCP, by both metrics, that is, MAPE with 41.90% and Permutation Importance with 44.18%. in phase A, and 40.55% and 42.76% in phase B, respectively. It is also important to note that feeder 2 has the largest length and has also the largest installed load. Still looking at feeder 2, and analyzing the detailed data in Table 8, there was a considerable contribution of the Geoscience Institute with 21.25% of the total IF in phase A at the CCP, the highest percentage of impact factor found in all phases. To have a good resolution in the sensitivity analysis for the second test system, the input vector was varied 200%, resulting MAPE and Permutation Importance percent values as presented in Figure 9. For the results analysis, the IF calculated for each nonlinear load, represented by the measurements of each SM, were grouped by feeder as can be seen in Figure 9.
This type of representation enables simpler monitoring of the electric network power quality situation and allows the manager to first perform a macro analysis and then focus his actions on specific points. According to Figure 9, it is highlighted that feeder 2 phases A and B are the most impacting on the THDu at the CCP, by both metrics, that is, MAPE with 41.90% and Permutation Importance with 44.18%. in phase A, and 40.55% and 42.76% in phase B, respectively. To have a good resolution in the sensitivity analysis for the second test system, the input vector was varied 200%, resulting MAPE and Permutation Importance percent values as presented in Figure  9. For the results analysis, the IF calculated for each nonlinear load, represented by the measurements of each SM, were grouped by feeder as can be seen in Figure 9.
This type of representation enables simpler monitoring of the electric network power quality situation and allows the manager to first perform a macro analysis and then focus his actions on specific points. According to Figure 9, it is highlighted that feeder 2 phases A and B are the most impacting on the THDu at the CCP, by both metrics, that is, MAPE with 41.90% and Permutation Importance with 44.18%. in phase A, and 40.55% and 42.76% in phase B, respectively. It is also important to note that feeder 2 has the largest length and has also the largest installed load. Still looking at feeder 2, and analyzing the detailed data in Table 8, there was a considerable contribution of the Geoscience Institute with 21.25% of the total IF in phase A at the CCP, the highest percentage of impact factor found in all phases.
In phase C, feeder 2 presented similar results as feeder 3 using both MAPE and Permutation Importance, with approximately 33% of contribution. Summarizing the results in Table 8 it is shown It is also important to note that feeder 2 has the largest length and has also the largest installed load. Still looking at feeder 2, and analyzing the detailed data in Table 8, there was a considerable contribution of the Geoscience Institute with 21.25% of the total IF in phase A at the CCP, the highest percentage of impact factor found in all phases.
In phase C, feeder 2 presented similar results as feeder 3 using both MAPE and Permutation Importance, with approximately 33% of contribution. Summarizing the results in Table 8 it is shown the top three most impacting loads are Geoscience Institute, Administration Building 2 and Legal Sciences Institute.

Conclusions
Voltage total harmonic distortion is one of the internationally regulated power quality indicators. The THDu level is a concern for both electric power utilities and larger customers as the responsibility for the negative impact on the grid, may involve legal aspects to be considered. Therefore, it is crucial to find out which loads contribute most to the THD level at the common coupling point, which is the point where the customer and utility are connected.
To contribute to the solution of this problem, this paper presented a new methodology for the identification and quantification of the impacts of multiple nonlinear loads on the electric grid voltage total harmonic distortion at a point of interest by using only true RMS current values employing a concept inherent to the Gradient Boosted Regression Trees technique, which is the impurity index with permutation importance, to find this Percent Impact Factor for each point of interest.
In the first experiment, the use of permutation importance as a metric for the Impact Factor calculation was validated compared to the work developed in [10], finding values very close to the reference scenario obtained from ATP simulations, and reducing the error by 39%. The result obtained by GBRT without preprocessing showed that the technique is very sensitive to bad data, highlighting the need for a data preprocessing step for a good performance of the technique.
Then, the proposed methodology was applied in a real system with an AMI system implanted, obtaining very similar results for both metrics used, that is, MAPE and Permutation Importance. These results make evident that the Permutation Importance can be used as a metric for the calculation of harmonic impacts due to multiple nonlinear loads at a point of interest, which reduces considerably the cost of computer processing in relation to the technique using MAPE as introduced in [10].
Finally, the proposed methodology obtained satisfactory results in both test scenarios through reducing the costs of these evaluations, allowing also the use of this procedure in an integrated way with an AMI system.