Innovative Methodology to Identify Errors in Electric Energy Measurement Systems in Power Utilities

Abstract: Many electric utilities currently have a low level of smart meter deployment on traditional distribution grids. These utilities commonly face a problem associated with non-technical energy losses (NTLs): unidentified energy flows consumed, but not billed, in power distribution grids. NTLs are usually due either to electricity theft carried out by the utilities' own customers or to failures in the utilities' energy measurement systems. Non-technical energy losses lead to significant economic losses for electric utilities around the world. For instance, in Latin America and the Caribbean, NTLs represented around 15% of the total energy generated in 2018, varying between 5% and 30% depending on the country because of the strong correlation with social, economic, political, and technical variables. Consequently, electric utilities have a strong interest in finding new techniques and methods to mitigate this problem as much as possible. This research presents the results of determining the precision of the existing data-oriented methods for detecting NTLs through a methodology based on data analytics, machine learning, and artificial intelligence (multivariate data analysis methods, classification and grouping algorithms, i.e., k-means, and neural networks). The proposed methodology was implemented using the MATLAB computational tool and demonstrated improvements in the probability of identifying the suspect customers' measurement systems with errors in their records, which should be revised to reduce the NTLs in the distribution system. It uses the information from the utilities' databases associated with customer information (customer information system), the distribution grid (geographic information system), and socio-economic data. The proposed methodology was tested and validated in a real situation as part of a recent Ecuadorian electric project.


Introduction
Most power utilities in Latin America and the Caribbean (LAC) make investments to reduce non-technical losses of energy (NTLs), with scant success, as they do not properly consider all the external macroeconomic variables, such as the local employment rate and the level of income per family. These variables are difficult to mitigate in the region's countries because of the lack of policies, laws, and regulations for power distribution systems. This social and cultural inequality becomes a severe issue for power utilities because consumers who cannot pay for the electricity service due to a lack of financial liquidity are encouraged to cheat the measurement systems to reduce the electricity bill.
The research is based on applying the concepts and algorithms of data analytics, machine learning, and neural networks to build a systematic methodology that detects changes in consumption patterns and efficiently locates energy theft, mitigating losses for energy distribution companies.
The review of the state of the art shows, in general, that the techniques used in the analysis of NTLs consider a reduced amount of data and report theoretical results; that is, they do not use a combination of techniques to minimize the error in data processing [3,4].
Five different algorithms for NTL detection using Pearson's coefficient, Bayesian networks, and decision trees were developed and tested in [5]. They used a real database provided by Endesa to test the models.
Nizar, A.H. et al. [6] presented a method to determine what type of data provides the highest precision in NTL analysis in the electricity distribution sector. The method applies two popular classification algorithms, naive Bayes and decision trees, to detect any significant abnormality in energy consumption behavior.
Leite, D. et al. [7] took the case of Brazil and defined an efficient frontier model, SFA (stochastic frontier analysis), built from stochastic economic, social, and political variables of electric power distribution utilities. The model provides tolerable limits on the percentage of non-technical losses to mitigate the total cost of the transmission and distribution infrastructure associated with these utilities, as an alternative to the econometric approach used in the rate review cycle.
Arthur, D. et al. [8] used only k-means to perform tests in different scenarios, looking for the comparable asymptote or the best result in the evaluation.
Sun, S. et al. [9] transformed and adapted the traditional k-nearest neighbor algorithm (kNN) into an adaptive kNN (AdaNN). The value of k has a crucial influence on the performance of the proposed algorithm, and the optimal k detects the correct class label; the experimental results indicate that the algorithm performs better than traditional kNN.
On the other hand, Ramos, C.C.O. et al. [10] approached NTL detection using artificial intelligence techniques. However, their use can result in a high computational load in the training and parameter optimization procedures. They showed that the pattern recognition technique called optimum path forest (OPF) is superior to the latest artificial intelligence techniques. Comparisons with neural networks and other methods demonstrated the robustness of the OPF concerning the automatic identification of commercial losses.
Nagi, J. et al. [11] presented the inclusion of human knowledge and experience in a fraud detection model based on SVM, introducing a fuzzy inference system in the form of fuzzy IF-THEN rules. It acts as a post-processing scheme that ranks suspects by their probability of fraud; the detection rate was between 60% and 72%.
Likewise, León, C. et al. [12] used an integrated expert system to analyze useful customer information to identify NTLs and their type. It included text mining modules, data mining modules, and a rule-based expert system module. It was applied to real data from the Endesa power utility, with human experts validating the testing phase, providing a tool for the inspectors to make the best decision.
Additionally, Galván, Elices, Muñoz, Czernichow, and Sanz-Bobi [12] proposed a general methodology based on the use of radial basis function networks, with the following steps: (1) selection of variables, (2) data filtering, (3) model fit, (4) model analysis, and (5) model evaluation. The third step takes variables from the monthly periods of each pattern of annual active consumption. The methodology was applied to two sectors: the low-voltage residential sector and the high-voltage irrigation sector.
Similarly, Reference [4] presented a set of rules with a high rate of correct NTL identification based on the most relevant customer attributes available in the distribution companies' database. It allowed a reduction of the number of inspected clients with a fraud identification rate between 7% and 20%.
The research presented in this paper focuses on NTLs in Ecuador since it is one of the countries in the region with success in reducing and mitigating NTLs, with an investment above 50 million dollars in related projects [13,14].
Electric energy supply from generation to final users implies losses in different processes, with the main component in the distribution stage [15][16][17]. The losses are the difference between the energy delivered by the generator and the energy measured and billed by the company, and they are classified into technical and non-technical losses [18,19].
Non-technical losses, also known in the specialized literature as "black losses" or "commercial losses", are produced by administrative errors generated in the CIS, incorrect readings, errors in the computation of consumption, incorrect end-use energy records, and theft or manipulation of the metering system, among others. Generally, their forecast is uncertain (stochastic nature), since it is not known where, how, and when they occur. They are computed as the difference between the total losses and the technical losses of the distribution system [18,20,21], as expressed below.
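Following the two definitions above, this balance can be written as a simple identity (the symbols are ours, introduced for illustration, not the paper's notation):

```latex
% Energy balance implied by the definitions above: total losses are the
% energy delivered minus the energy billed; NTLs are the remainder once
% technical losses are subtracted.
E_{\mathrm{NTL}} \;=\; \underbrace{\left(E_{\mathrm{delivered}} - E_{\mathrm{billed}}\right)}_{\text{total losses}} \;-\; E_{\mathrm{technical}}
```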
NTLs are classified according to their cause [16,18]:
• Theft of energy: Any type of illegal connection made before the energy meter so that the connected load's consumption is not recorded by the measurement equipment.
• Handling of the measuring equipment: Voluntary alterations to the measuring equipment resulting in the registration of less consumption than the real one.
• Measurement errors: Involuntary technical failures of measurement devices that produce wrong recordings, such as:
- Damage to the components of the measurement system in direct and indirect connection: the meter, current transformers, potential transformers, terminal blocks, and connection cables.
- Human error in taking the reading or failure of the telemetry equipment.
- Incorrect configuration of the energy meter.
- Unintentional errors in the connection of the measurement system installation.
• Billing errors: These occur when the energy consumed is not recorded in the billing system of the distribution company due to damage to the components of the metering system.
The traditional method used by distribution companies to mitigate this problem is periodic random inspections on-site, a method that requires a high amount of financial and technical resources [21][22][23].
The research presented here aims to ease and reduce the costs associated with this procedure, with the following contributions:
• Development of a methodological study based on suitable indicators that integrate and take advantage of the different technologies for data analytics, machine learning, and neural networks. The results of the study were tested with real utility data related to customers' consumption patterns. The study yields a list of potentially manipulated measurement equipment to be reviewed under the planning of the power utility.
• Identification of which technique gives the best result, denoting the precision of each of them, with the support of data science and algorithms built with the computational tool MATLAB®, in such a way that it contributes to the objective of reducing non-technical losses and maximizing the utility's income.
This document is structured as follows: Section 2 considers the current state of the art, providing a deep insight into the theoretical concepts of the methods evaluated for the determination of NTLs. Section 3 shows the process of data analytics and the application of the algorithms. Section 4 presents the results of the methodology for the proposed analytic methods' evaluation and comparison. Section 5 shows the results of the implementation of the methodology in a real system. Finally, in Section 6, the technical and economic effects are discussed and concluded.

Techniques Applied in Data Mining
The research is based on the concept of maximizing the probability of locating measurement systems with errors in their recorded data, in such a way that the result of the execution of the algorithm allows reviewing only suspected cases, applying various methodologies from the specialized literature, as summarized in Table 2. This literature addresses the current issues of supervised and unsupervised data analytics techniques applied to electricity consumption variables. These data-driven concepts guide methods whose algorithms yield responses with error metrics that support decision-making in the proposed approach.
The methodologies combined in the research are:
1. Theoretical study: This focuses on analyzing aspects related to energy theft through the use of statistical techniques with socio-demographic and socio-economic variables to build lists of measurement systems suspected of infractions to be reviewed. The disadvantage of theoretical studies is that they do not present specific cases of theft or failure of the measuring equipment [20].
2. Data-oriented methods: These methods focus on data analytics, for example of the patterns of energy consumption and demand. By applying data mining techniques, the consumers with a high error probability are identified [20,22]. Learning with data mining techniques is classified into:
(a) Supervised learning: These are algorithms that learn by example; they require input data and provide output data with the variables that the data scientist needs; that is, he/she must provide instances of properly labeled data (positive/fraud and negative/no fraud). This method requires a large amount of quality information to apply the model; the electricity distribution company must have data labeled with the fraud and no-fraud variables [20,22].
(b) Unsupervised learning: The function of these algorithms is to determine patterns and acquire training according to the available variables; generally, these algorithms are used on databases whose variables do not have labels or when the sample does not have a sufficient amount of data [20][21][22].
3. Network-oriented methods: These methods are based on the acquisition of data through proprietary software and hardware installed in the electrical network, in such a way as to facilitate the identification or estimation of non-technical losses after a data analytics process through an algorithm that minimizes error and loss of information.
4. Hybrid method: This is a combination of the two classifications mentioned above to maximize the precision in detecting NTLs [20,22].
Table 2. Literature review of the methods used for NTL detection.

Unsupervised Techniques
K-means is a clustering technique whose purpose is to divide n samples into K groups; it is based on the entry of n instances, each one defined by a vector (a group of variables), and an integer K that indicates the number of groups to be formed [8,50,51]. The technique groups samples according to their proximity to K "centroid" points. Its advantage is efficiency when handling large data sets; as a disadvantage, it is essential to know the number of groups to form. Another disadvantage is sensitivity to noise when calculating the groups around a center: any atypical data point could shift a centroid; therefore, poor group formation can impair the response [21].
The operation of this technique is shown in Table 3 [8,50]. Table 3. Algorithm for K-means.
1. Randomly select K points, these being the initial centroids of each group.
2. Form K clusters, assigning each data point to the closest centroid.
3. Readjust the K centroids, each being the average of the group established in Step 2.
4. Repeat Steps 2 and 3 until there is no readjustment of the centroids.
This drawback can be overcome by estimating the number of groups to be formed. There are several methods for this, such as the "elbow method", which analyzes the percentage of variance as a function of the number of clusters [50]. Another technique is the gap method (GAP), similar to the elbow method [21,50]. However, no method determines the exact number of clusters; the number of groups is generally chosen by trial and error, always at the discretion of the researcher [21].
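As a minimal illustration of this grouping step and of the elbow heuristic, the following Python sketch (an assumption of ours, not the authors' MATLAB code; the synthetic data and the choice of two features are hypothetical) clusters a small consumption-like data set:

```python
# Illustrative sketch (not the authors' MATLAB code): K-means with the
# "elbow" heuristic, run on synthetic two-feature consumption data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical base matrix: rows = consumers, columns = normalized features.
X = np.vstack([rng.normal(0.2, 0.05, (100, 2)),
               rng.normal(0.8, 0.05, (100, 2))])

# Elbow method: within-cluster sum of squares (inertia) versus K;
# the sharp flattening (the "elbow") suggests a K, here K = 2.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))

# Final grouping with the chosen K.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```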

k-Nearest Neighbors
This is a supervised algorithm, one of the oldest and most straightforward for classifying samples [9,22]. It classifies new samples based on their similarity with known cases: it places a sample in the feature space and assigns it the class that is most common among its closest neighbors. It uses a single parameter called "K", which indicates the number of nearest neighbors to test [9,28,52,53].
The algorithm is simple to apply: it calculates the distance between the new element and the training set and, depending on the K value, gives a label to the new element. For example, if the K value is five when calculating the closest neighbors to the original sample, and four belong to one group and the rest to another, then it can be concluded that the original sample belongs to the first group [9,22].
The algorithm is presented in Table 4. Table 4. K-nearest neighbor algorithm.
1. Enter class data C = {(X1, Y1), ..., (Xn, Yn)}.
2. Enter the data to classify N = (X1, ..., Xn).
3. Enter the value of K neighbors to consider.
4. For every classified object, calculate the distance to the data to be classified.
5. Keep the K training data closest to the data to be classified.
6. Assign X the most frequent class.
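The steps of Table 4 can be sketched in Python with scikit-learn as follows; the training points, labels, and the K = 5 choice are illustrative assumptions:

```python
# Illustrative sketch of the K-nearest-neighbor rule from Table 4;
# the data and labels are hypothetical.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Step 1: labeled training data C = {(X1, y1), ..., (Xn, yn)}.
X_train = np.array([[0.10, 0.20], [0.15, 0.22], [0.80, 0.90],
                    [0.82, 0.88], [0.79, 0.91], [0.12, 0.18]])
y_train = np.array([0, 0, 1, 1, 1, 0])  # 1 = fraud, 0 = no fraud

# Steps 3-6: with K = 5, the new sample takes the majority class
# among its five nearest training points (Euclidean distance).
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.predict([[0.78, 0.85]]))  # -> [1], three of five neighbors are fraud
```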

Decision Tree
This is a process flow that shows the probable results of a series of connected decisions: a hierarchical decision model that starts with a single node and branches through a series of rules into possible results [6,22,54].
A suitable task for the decision tree is classification [22]. It is a supervised method, since the classification uses a predefined data set with classes according to the variables [6]. First, it obtains the useful information of each attribute (variable), known as the information gain. The attribute with the highest information gain becomes the initial or root node, which divides into different branches based on the values of the node [6,28,54].
Several algorithms are used for creating decision trees, including ID3, C4.5, CART, and CHAID, each distinguished by its partitioning criterion; for example, CART is characterized by generating binary trees and uses the twoing criterion for the division of its nodes; ID3 uses the information gain as the division criterion; and C4.5 uses the gain ratio. The division stops when the number of instances to divide falls below a certain threshold [54].
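A minimal sketch of a CART-style binary tree follows; note that scikit-learn's DecisionTreeClassifier implements an optimized CART variant that splits on the Gini impurity rather than the twoing rule, and the features and labels below are hypothetical:

```python
# Illustrative sketch of a binary (CART-style) decision tree;
# scikit-learn's DecisionTreeClassifier splits on Gini impurity.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [monthly consumption in kWh, months of debt].
X = np.array([[120.0, 0.0], [150.0, 1.0], [10.0, 0.0],
              [5.0, 2.0], [140.0, 0.0], [8.0, 3.0]])
y = np.array([0, 0, 1, 1, 0, 1])  # 1 = suspected NTL, 0 = normal

# min_samples_split plays the role of the stopping threshold above.
tree = DecisionTreeClassifier(criterion="gini", min_samples_split=2,
                              random_state=0).fit(X, y)
print(export_text(tree, feature_names=["consumption", "months_due"]))
```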

Artificial Neural Network
Inspired by the neurons in the human brain, this model consists of linked layers of neuron-like units that relate input data to output data, learning from the data, looking for patterns, classifying data, and predicting future events [22,36,55].
It is a supervised method that receives training through examples. There are many types of neural networks, but for classification cases, the multilayer perceptron is usually used, which uses a supervised technique called backpropagation [36]. Figure 1 shows the basic structure of this neural network; as can be seen, it consists of three layers: the input layer, hidden layer, and output layer. The connections between neurons transmit the signals. The input layer receives the signals and distributes the information to the next (hidden) layer. The number of neurons in the input layer is equal to the size of the input vector (number of attributes). The hidden and output layers process the signals by amplifying, attenuating, or inhibiting them. The number of neurons in the output layer is equal to the number of classes in the investigation; the number of neurons in the hidden layer depends on the application for which the neural network is established [32,34,36]. Except for the input nodes, each node in the hidden and output layers is a neuron that uses an activation function [34,36].
The establishment of a neural network consists of three stages [36]:
1. Training stage: This is the learning stage, where the input attributes (network input) are presented and compared with the target set (label or target).
2. Validation stage: This stage is executed in conjunction with the training stage and is carried out to avoid over-training the network.
3. Testing stage: This stage is carried out after the training stage and consists of using a set of data other than those of the training and validation stages to investigate how well the network learned at the end of the process.
The exposed techniques were used in the comparative analysis carried out in this research to determine the consumption patterns and energy losses.

Methodological Construction of the Matrix and Data Analysis
The methodology's objective is to establish a comparison of the different data analytics techniques in a systematic way to evaluate the non-technical energy losses of a distribution company through the recognition of consumption patterns and to ascertain potential energy theft.
The input variables used in the model come from the distribution companies' different databases, both the CIS and the GIS. Additionally, external information comes from the National Institute of Statistics and Census of Ecuador.

Data Collection and Integration
The information required for integration came from the company's reports; these correspond to energy losses over a period of 18 months, financial indicators, final energy use profiles, the date of the last review of the measurement systems, the year of manufacture of the energy meter, outstanding debt, outage status, and consumption range. A large part of these variables came from the system application product (SAP). Furthermore, related information was taken from the GIS, such as location, load density per square meter, consumption stratum, type of electrical networks, and social stratum, among other variables. Figure 2 describes how the data matrix called the "base matrix" was obtained from the variables used in this research. The variables are classified as:
• Information: Variables that provide consumer information, such as: "contracted account", "account", "name", and "ID".
• Geographic: Variables that indicate the geographic location of the customer's meters, such as: "Codparr", "province", and "canton".
• Economic: Variables that show the economic relationship between the customer and the distribution company, such as: "date last paid", "months due", and "debt".
• Social: Variables that indicate a social aspect concerning the client, such as "population".
• Technical: Technical variables, such as: "type consumption", "voltage", and "consumption kWh/month".
With the variables classified, the next step corresponds to the careful review of each variable to determine those that provide relevant information to the NTL detection and control algorithm. Subsequently, with the correlation analysis of the variables and the expert's criteria, each variable is meticulously analyzed to establish the number and magnitude of the variables that will provide information to this research methodology.

Data Pre-Processing
This step is essential for applying any data mining technique. It allows eliminating or separating anomalous data so that the matrix remains in optimal conditions for training by any method, whether supervised or not. For the pre-processing of the data, Variables 1 to 26 are omitted, since these variables only provide customer identification (name, contract, telephone). The analysis is carried out from Variable 27 onwards, because these technical data refer to the consumer's behavior (consumption, demand, invoiced values).

Recognition of Data
The CIS tries to minimize the entry of wrong information; however, having some of it in the data matrix is inevitable, causing the variables to move away from the mean and lose their nearness to reality, distorting the analysis context. The research found null and blank values in the data matrix; these are considered outliers in the analysis and are discarded in the execution of the data analytics techniques. Additionally, the recognition of technical variables is carried out through exploratory analysis to identify patterns that allow decision-making on future actions.
The statistical indices of the different variables are given in Table 5. The null data column presents 255 records with errors observed in a universe of 2462 consumers; this effect originates from the migration of information from the previous AS400 system (server storage systems) to the new SAP CIS-CRM. As indicated in the last paragraph, outliers remain in the analysis of this investigation, as they could be false positives.
From the analysis using descriptive statistics, the following are determined:
• All variables presented blank or null data.
• There exist large differences between the maximum and minimum values, and even high percentages for the variation coefficients; this generally occurs because the analyzed base matrix contains measurement systems with information on residential, commercial, and industrial consumers, whose consumption varies considerably. The data must be linearized and normalized to reduce these differences in values and avoid possible errors in training and executing the algorithms; this procedure is given in Section 3.2.3.
• Some variables have negative values; the distribution company states that they correspond to re-invoicing of the consumer due to reading errors or low application rates.
• The zero value of the mode in the consumption variables indicates that there are measurement systems with zero consumption; it is essential to physically review this in field planning.
• There is a high difference between the maximum and minimum values; this must be considered when applying data mining techniques.
Figure 3a shows the data dispersion of the variable "consumption", and Figure 3b that of the variable "debt". The negative values (enclosed in red) are due to the dispersion of the variables of the database considered in this investigation.

Data Cleaning
One of the main points of this research is the cleaning of the data, since the information comes from different bases and may suffer alterations in the handling and transfer from the source; it is therefore suggested to maintain great care or, failing that, to use tools like business intelligence (BI) for information management. The tools used for cleaning the data were Microsoft EXCEL and MATLAB®, according to the following process (an illustrative sketch of the same steps is given after this list):
• Null or non-existent data are verified:
- EXCEL recognizes missing data as N/A.
- MATLAB® recognizes non-existent data as NaN (not a number).
Those consumers that have null data in the technical variables are eliminated from the list.
• Atypical data: Through exploratory data analysis, it is determined that the data to be considered inconsistent are the negative values in the technical variables; therefore, any consumer that has a negative value is eliminated from the list.
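The two cleaning rules can be sketched as follows; this is an illustrative pandas version, not the EXCEL/MATLAB® procedure actually used, and the column names are hypothetical:

```python
# Illustrative pandas version of the two cleaning rules; the paper
# used EXCEL and MATLAB, and the column names here are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"account": ["A1", "A2", "A3", "A4"],
                   "consumption_kwh": [120.0, np.nan, -35.0, 210.0],
                   "debt_usd": [0.0, 15.0, 40.0, 5.0]})
technical_cols = ["consumption_kwh", "debt_usd"]

# Rule 1: drop consumers with null (NaN) technical variables.
df = df.dropna(subset=technical_cols)
# Rule 2: drop consumers with negative (inconsistent) technical values.
df = df[(df[technical_cols] >= 0).all(axis=1)]
print(df)  # A1 and A4 remain; A2 (NaN) and A3 (negative) are removed
```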

Data Normalization
The variables studied in the research present coefficients of variation with high ranges, so it is necessary to center, scale, or linearize the data to be in the same range.
The normalizations used are:
• Maximum-minimum normalization: This is done by Equation (1):

v' = (v - min) / (max - min)    (1)

where:
v' is the new value,
v is the value to normalize,
max is the maximum value of the data,
min is the minimum value of the data.
• Z-score normalization: This is done by Equation (2):

v' = (v - mean) / std    (2)

where:
v' is the new value,
v is the value to normalize,
mean is the data average,
std is the standard deviation of the data.
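Both normalizations can be sketched numerically as follows; this is an illustrative NumPy version (equivalent transformers exist in scikit-learn as MinMaxScaler and StandardScaler), and the sample values are hypothetical:

```python
# Minimal numerical sketch of Equations (1) and (2).
import numpy as np

v = np.array([120.0, 450.0, 30.0, 980.0, 210.0])  # e.g., kWh/month

# Equation (1): maximum-minimum normalization -> values in [0, 1].
v_minmax = (v - v.min()) / (v.max() - v.min())

# Equation (2): Z-score normalization -> zero mean, unit std.
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)
print(v_zscore)
```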

Data Processing
In this section, the data mining techniques are applied depending on the information available in the matrix created from the utility database. The research objective is to identify which techniques respond best to the data analysis, for which supervised and unsupervised learning techniques are used.

Supervised Learning
For the application of supervised methods, training examples are required; for this, a database of 2462 samples that had been reviewed in situ was obtained, including 1231 proven instances of fraud and 1231 of non-fraud. Based on the above, the variables used for training are those shown in Table 6. Table 6. Description of the variables.

# | Variable | Description
x1 | Average | 13 month average energy consumption
x2 | Standard deviation | Standard deviation corresponding to the monthly energy data

The supervised methods implemented are:
1. Nearest neighbor (K-NN): The algorithm is executed with the MATLAB® tool; Figure 4 shows the response of the algorithm's execution. The training data are represented in circular form and the new data in grid form; data classified as "fraud" are shown in red and "no fraud" in blue. The K value is five, and the operation of this algorithm is simple: it calculates the distance to the nearest neighbors (in this case, five) and chooses the most frequent class among them. Before training this algorithm, the data are normalized with Equation (1).

2. Decision tree:
The algorithm is executed with MATLAB®, generating CART-type decision trees; that is, each node is divided into two. The data are normalized using Equation (1) before training, and the decision tree generated is given in Figure 5.
Figure 5. Decision tree.

3. Neural network (ANN):
The artificial neural network is created and trained using the MATLAB® toolbox, in which the multilayer perceptron neural network is used. The implemented neural network in Figure 6 shows an input layer with six variables, a hidden layer made up of 10 neurons, and an output layer with one neuron for classification. The training algorithm is Levenberg-Marquardt backpropagation, and the activation function is the sigmoid. The data are normalized with Equation (1) and randomly divided into three parts: 70% for training, 15% for validation, and 15% for testing.
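An approximate open-source counterpart of this network is sketched below. Note that scikit-learn's MLPClassifier does not offer Levenberg-Marquardt training (it uses gradient methods such as Adam or L-BFGS), so this is only an architectural analogue, and the data and labels are hypothetical:

```python
# Architectural analogue of the described network: 6 inputs, one
# hidden layer of 10 neurons, 1 output, sigmoid activation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((2462, 6))                  # 6 normalized input variables
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # hypothetical fraud labels

# 70% training, 15% validation (via early stopping), 15% testing.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15,
                                                random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    early_stopping=True,
                    validation_fraction=0.15 / 0.85,  # ~15% of the total
                    max_iter=2000, random_state=0).fit(X_tmp, y_tmp)
print("test accuracy:", mlp.score(X_test, y_test))
```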

Unsupervised Learning
Unlike previous techniques, "unsupervised methods" do not need examples for training. The technique applied is the following:

K-means:
The algorithm does not require following the traceability of previous occurrences; the variables of the base matrix are used; however, compared with the other techniques, only the variables mentioned in Table 6 are used. The K-means technique is based on grouping by similarities. The algorithm performs a pre-grouping by rate before performing the K-means groupings to avoid bad group formation, since the magnitudes of consumption between these rates vary significantly. The data are normalized with Equation (2). Figure 7 gives an example of the algorithm's execution; the value of K is two, representing the formation of two groups within the residential rate: the group of fraudulent consumption and the group of consumers that reflect consumption patterns without alterations. In this sense, it is necessary to plan the on-site review of the suspect measurement systems, since something is happening with them. An example is presented in Figure 7b, where Group Number 2 (blue color) is selected as fraudulent.
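A sketch of this grouping step follows; the rule for flagging the suspect cluster (taking the smaller group) is our assumption, since the paper leaves that choice to the expert:

```python
# Sketch of the pre-grouping + K-means step; flagging the smaller
# cluster as "suspect" is an illustrative assumption.
import numpy as np
from sklearn.cluster import KMeans

def suspect_mask(X_rate):
    """X_rate: z-score-normalized features of one tariff (rate) group."""
    labels = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(X_rate)
    smaller = np.argmin(np.bincount(labels))  # an expert would review this
    return labels == smaller

rng = np.random.default_rng(0)
residential = rng.normal(0.0, 1.0, (200, 3))  # hypothetical rate group
print(suspect_mask(residential).sum(), "suspects flagged for review")
```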

Results of the Application of the Data Analytics Techniques
The performance of the data analytics techniques is analyzed with a data matrix of 400 examples of proven fraud and non-fraud measurement systems, including 200 labeled as 1 (fraud) and 200 as 0 (no fraud). From this information, the metrics from the confusion matrix shown in Table 7 are used [5]. From the confusion matrix presented in Table 7, the concepts of specificity and reliability are derived [21,22]:

• Specificity or true positive ratio (TPR): This indicates whether a classification technique performs correctly, stating the proportion of samples cataloged as non-technical energy losses out of the total number of actual non-technical losses within a data group, as shown in Equation (3):

TPR = TP / (TP + FN)    (3)
• Reliability or false positive ratio (FPR): This indicates the relationship between false alarms (consumers falsely classified as committing fraud) and the total number of actual negatives, as shown in Equation (4):

FPR = FP / (FP + TN)    (4)
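Equations (3) and (4) can be computed directly from a confusion matrix, as in the following sketch with hypothetical labels:

```python
# Sketch of Equations (3) and (4) from a confusion matrix;
# the true and predicted labels below are hypothetical.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # Equation (3)
fpr = fp / (fp + tn)   # Equation (4)
print(f"TPR = {tpr:.0%}, FPR = {fpr:.0%}")  # TPR = 75%, FPR = 25%
```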
The performance of the data analytics techniques is compared based on the specificity and reliability metrics. Table 8 shows the result of the evaluation of the K-means technique. Intuitively, two groups should exist, that is, K = 2 (fraud and not fraud); however, the results may not be right; that is why the technique is evaluated with different values of K.
Good results are obtained when forming 2, 3, 5, and 7 groups, with high numbers of TP and TN and low numbers of FP and FN; with this, a high percentage of TPR and a low percentage of FPR are obtained, while with nine groups, the result was in the middle. Although the results were right here, in other cases fewer or more groups could give good or bad results; that is, there is no precise method that determines the right number of groups and which of them to choose as fraudulent; these depend on the amount of data and the number of groups available. In this case, the expert, based on his/her experience, must locate the best group.

Table 8. Evaluation with metrics of the K-means method.
K | TPR (%) | FPR (%)
2 | 80 | 17
3 | 80 | 24
5 | 79 | 24
7 | 79 | 24
9 | 49 | 24

Table 9 presents the results of evaluating the K-nearest neighbors technique with different K values; the good results during the application of this technique start from K = 10, and this value is applied in the corresponding analysis.

Table 9. Evaluation with metrics of the K-nearest neighbors method.

Table 10 presents the results of the evaluation with the metrics of the supervised techniques. It shows that the technique that presented the best results among the three methods was the neural network. The neural network obtained considerable percentages of TPR; however, it presented high values of FPR (43%), indicating a high number of false positives. Comparing the results of the data analytics techniques, K-means grouping is the one that delivered the best results; however, it must be taken into account that the training of supervised techniques requires having a database with at least 30% labeled examples.
An evaluation was performed by applying in the same analysis process an unsupervised technique (K-means) and a supervised technique to determine the measurement systems considered fraudulent. The result of the evaluation is presented in Table 11.

Table 11. Results of combining K-means with a supervised method.
Method | TPR (%) | FPR (%)
K-means + K-neighbors (K = 10) | 53 | 34
K-means + decision tree | 55 | 39
K-means + neural network | 87 | 16

For the evaluation of the different combined techniques, two groups were used for K-means, as shown in Table 11; when combining the techniques, a better result was obtained, with the TPR percentages increasing and the FPR decreasing. Figure 8 shows the AUC of all the methods implemented in the analysis; of the individual techniques, the K-means method was the one that gave the best results, and of the combinations performed, K-means with the neural network turned out to be the most efficient, presenting the highest AUC value among the classification methods. Once the information is transformed into data through supervised and unsupervised techniques, the advantages arise; these allow the distribution companies to make well-founded decisions regarding the measurement systems, always focused on economic recovery.
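The combined scheme can be sketched as a two-stage pipeline; the data, labels, and suspect-cluster rule below are illustrative assumptions, not the exact configuration used in the research:

```python
# Two-stage sketch: K-means pre-selection followed by a supervised
# classifier (MLP); data, labels, and the suspect rule are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.random((400, 6))                   # 400 normalized samples
y = (X[:, 0] > 0.5).astype(int)            # hypothetical fraud labels

# Stage 1: unsupervised pre-selection with K = 2 (smaller cluster flagged).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
suspect = clusters == np.argmin(np.bincount(clusters))

# Stage 2: supervised refinement applied to the pre-selected group.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                    random_state=0).fit(X, y)
flagged = np.where(suspect & (mlp.predict(X) == 1))[0]
print(len(flagged), "measurement systems selected for on-site review")
```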
The combination of the k-means and neural network algorithms gave 87% true positives; this value depends on the type of variables used in the analysis, the quality of the information, and the percentage of NTLs that the distributor maintains in its indicators. In the case of distribution companies with high loss rates, the results will be better in practice. The algorithms cited in this research are the most suitable for this analysis.

Control of Measurement Systems in Utilities
Generally, utilities have a specialized department for the control and review of the measurement systems; this work is carried out under planning to organize and optimize the inspection of the measurement systems, in such a way as to establish precisely their operation and integrity, guaranteeing correct billing to consumers.
The application of data analytics techniques in distribution companies is almost nil. Therefore, this research proposes applying this new concept to detect fraud or damage to measurement systems; accordingly, the developed algorithms were applied and tested with data from the CENTROSUR utility.

Management in the Recovery of Energy Consumed and Not Invoiced
The methodology's results are evaluated based on on-site reviews to determine potential consumers with energy theft during the trial period. In a universe of 15,000 measurement systems, the first results obtained were 1816 reviews; of these, we detected 78 measurement systems with damage and alterations. Under these test conditions, the efficiency of the algorithm was 4.29%, a relatively optimistic value, since the non-technical losses in the distribution company did not exceed 0.85% of the total losses (6.25%).
Meanwhile, the economic incorporation in the monthly billing of the energy consumed and not invoiced represented a total value of USD 80,869; this value is due to re-billing processes duly protected by the Organic Law of the Public Electricity Service. In this way, the economic flow of the specialized department can be covered by the management carried out through the recovery of the energy consumed and not invoiced.

Examples of the Application of the Methodology for the Reduction of Non-Technical Losses
The technique's success goes hand-in-hand with the timely revision executed on the measurement system; some application cases resulting from the execution of the algorithm are explained below.
With the support of the geographic information system tool, the analysis area was determined using the polygon method to obtain the model's input variables, as indicated in Figure 9. The measurement systems that presented anomalies in the monthly records are shown in Figure 10, which presents the patterns of electric energy consumption before and after the anomaly detection resulting from this investigation to recover the energy consumed and not billed. Figure 10a represents the regularization of an indirect measurement system, CL 20, FM 4S, installed at medium voltage (22 kV) with a private transformation station of 25 kVA, which maintained a consumption pattern close to zero since the current transformer transformation ratio had not been appropriately configured in the energy meter.
The polygon of Figure 9 contains 1809 measurement systems among the residential, commercial, and industrial rates. In this database, it was found that 11% correspond to atypical data (erroneous and NaN), resulting in a list of 1610. The K-means grouping algorithm was executed on this database, and then, the classification was performed using the neural network, determining that 27 measurement systems qualified for an on-site review. Figure 10b refers to the installation of a connection without metering from the public medium-voltage network to a private transformer station of 50 kVA. In this case, the meter had been removed for non-payment in previous years; when a new project was started, the consumers connected directly, so that once the algorithm was executed, it detected the violation, a measurement system was installed, and the energy consumption was re-invoiced. The utility recovered around 750 kWh/month on average.
The example shown in Figure 10c represents the decrease in a consumer's consumption in a time window (March 2017 until December 2017); the evidence is the drop in consumption to zero due to the installation of direct lines; this clandestine connection prevents the energy meter from correctly registering consumption. It was detected that around 750 kWh/month were lost for 10 months.
In the execution and testing of the algorithms, through the k-means cluster, we obtained different groups of consumers; one of these is the industrial consumers, significant for the utilities due to their consumption. For this example, Figure 10d shows an industrial consumer who had a fault in one of the three voltage transformers existing in the measurement system. This decompensation of the voltage magnitude in the transformer's delta connection caused the computed error to be more than 47%. Therefore, the time window to re-bill was wide, more than seven years; however, the law allows computing only one year. In this manner, around 1950 MWh/year were recovered, which in economic terms corresponded to USD 156,000; these are important amounts for a utility.
The algorithm not only identifies the fault or the meter being altered; it also recognizes the variation of consumption as a function of an in-depth analysis of different variables. Moreover, the algorithm's operation allows separating the clusters not identified as altered. The next example shows a load profile with a decrease in consumption, but without faults or alterations; the event is produced by the season in the service zone; generally, the commercial consumer does not use the air conditioning during these seasons, as shown in Figure 10e.
Finally, Figure 10f shows another form of consumption variation: a residential consumer's sporadic consumption; generally, his/her home is on the beach or far from the city, and he/she visits it occasionally.
In summary, with the application of the algorithm, the recovery of energy consumed but not invoiced was 2,021,800 kWh/year, a USD 161,744 recovery. This information was taken from the marketing system of the energy distribution company.
Distribution companies in LAC do not have remote, real-time measurement systems or advanced metering infrastructure. Generally, the readings, review, and control of the measurement systems are managed by humans, requiring considerable investments and prolonged times for periodic reviews, and for some, even the change from conventional conductors to pre-assembled ones, with low success rates.
The percentages of non-technical losses may vary from one country to another, and even between regions; for this reason, it is important to treat the information of each of the distributors in a personalized way. This document contributes significantly to the little-explored topic of the automatic analysis of available customer information for NTL detection. The proposed approach presents advantages in its methodology: it uses 68 technical, economic, and social variables, with linearized and correlated data, so the information is homogeneous, to later apply supervised and unsupervised methods for grouping according to the similarity of the data. The different techniques applied were evaluated through metrics to obtain the highest probability of potential energy theft events.
On the other hand, it does not require significant investments; since the data are stored and available, the post-analysis management will use the same infrastructure resources and existing distributor personnel. The methodology carries out continuous learning each time the algorithm is executed; it learns from the real expert data and stores them for future runs of the algorithm to increase certainty in detecting anomalies. Moreover, a set of rules executed one-by-one is not required; the expert's criteria are internalized in the algorithm's learning, separating consumers with NTLs from those with true zero or false intakes (typical cases). The advantage of applying this methodology also holds for distributors with very low percentages of losses (1% or 2%); the detection will considerably reduce operating costs, achieving technical efficiency.

Conclusions
This research provides a data processing methodology that improves the detection and identification of fraud in electricity consumption through a comprehensive analysis of consumption patterns using data analytics techniques and artificial neural networks. Combining the k-means clustering method with the neural network gives the smallest error in the algorithm, 14% for the true positive data. The second-best method, the k-means grouping method with two groups (k = 2), presents an error of 18%; the third method that best adjusts to the detection of true positives is the combination of the k-means algorithm and k-nearest neighbors, with 40%.
Data mining techniques, accompanied by algorithms with supervised and unsupervised methods and artificial intelligence models, have gained particular interest in the electricity sector, since their application improves the efficiency and effectiveness of the processes.
The computation time needed to run the methodology was around 25 min for approximately 15,225 clients and 64 variables, delivering 1816 reviews. This time can be decreased considerably with the use of supercomputers. On the other hand, it is important to stratify the planning of potential revisions by zones to keep the revision records updated.
The analysis window in this methodology is monthly; however, it can be narrower, even real time, with advanced metering infrastructure; the amount of data will then grow exponentially, requiring the use of servers and big data.
The methodology uses 30% of the data as labeled knowledge to forecast the remaining 70% with the unsupervised methods; the function of these algorithms is to determine patterns and acquire training according to the variables available for a label in the analysis. However, when the sample does not have sufficient data, numerous errors are generated in the forecast.
The methodology used to reduce NTLs is beneficial for the energy distribution sector. It can be extended to many utilities in LAC and the rest of the world that present similar situations; moreover, from the social point of view, a culture of efficient use of electricity can be developed.
Using the information from the CIS, the GIS, and socio-economic data, multivariate data analysis methods, classification and grouping algorithms (k-means), and neural networks can be applied to obtain a list of possible revisions of the measurement systems, to optimize the revisions and their routes, and to recover the most unbilled energy.
The projects that originate through this methodology will allow obtaining an economic return in the short term. The rapid pace of daily technological advances enables various investigations, such that more in-depth studies can be performed on distribution systems, especially for mitigating non-technical energy losses.
This research recommends investing in electrical projects that consider applying these techniques since financial indicators will always be positive and recovery will be obtained in the short term.

Abbreviations
The following abbreviations are used in this manuscript:

NTL Non-technical losses
GIS Geographic information system
CIS Customer information system
SAP System application products
CENTROSUR Empresa Eléctrica Regional Centro Sur C.