A Hybrid Algorithm for Missing Data Imputation and Its Application to Electrical Data Loggers

The storage of data is a key process in the study of electrical power networks, related to the search for harmonics and the detection of imbalance among phases. The presence of missing data in any of the main electrical variables (phase-to-neutral voltage, phase-to-phase voltage, current in each phase and power factor) negatively affects any time series study and has to be addressed. When this occurs, missing data imputation algorithms are required. These algorithms substitute the missing data with estimated values. This research presents a new missing data imputation algorithm based on Self-Organized Maps Neural Networks and Mahalanobis distances, and compares it not only with the well-known technique called Multivariate Imputation by Chained Equations (MICE) but also with an algorithm previously proposed by the authors, called the Adaptive Assignation Algorithm (AAA). The results obtained demonstrate that the proposed method outperforms both algorithms.


Introduction
Currently, the importance of problems due to harmonics in electric networks is growing. This fact is due to the increase in the amount of non-linear loads. The two main problems related to harmonics are the overheating of conductors due to the skin effect and the activation of automatic breakers, which produce problems for supply continuity. Additionally, distortion of the voltage waveform may cause the malfunction of some devices. The monitoring of harmonics in real time is required to control them.
Another common problem in electrical networks is the imbalance between phases. This is usually caused by a bad load distribution between phases and provokes a high return current through the neutral conductor, which has to compensate for the gap existing at the centre of the phasor diagram.
Electricity quality is an important issue that involves variables such as voltage, current and frequency anomalies. Quality affects all the devices connected to the power network, causing system failures or malfunction [1]. Currently, an electric system is also analyzed in terms of efficiency. [Table: frequency measurement accuracy, in Hz, of the devices compared.]

The Measurement Devices
All the mentioned measurements can be performed by the four devices used in the present work. The four devices also have additional features. The Shark 100, Shark 200 and Shark MP200 incorporate V-Switch technology, which allows the operator to add new functions to a device using programming commands at any time after its installation. In the case of the Nexus 1252 device, it is possible to add isolated input/output modules and software options for additional functions. All of them have communication capabilities (some optional), such as the Modbus or DNP 3.0 (Distributed Network Protocol) protocols over an RS485 port, 10/100BaseT Ethernet or an IrDA port. A deep analysis of the features of each device is made in [10].

The Data Description
In this paper, a dataset that includes measurements of variables from the electrical power supply of a building has been used. The building, called Severo Ochoa in honour of the Nobel Prize winner, belongs to the University of Oviedo (Spain); it has a total area of 8150 m², distributed over two basement levels and five floors, and a total of 78 employees work in it. The ITS (Information Technology Services) of the University of Oviedo are also located in this building. The equipment of the ITS is distributed across server rooms and scientific laboratories. This equipment has to be supplied by a good-quality power network at all times. The laboratory equipment mentioned above includes electron microscopes, NMR spectrometers, X-ray diffractometers, etc. The energy consumption is 190.572 kWh per day, on average. The data set detailed here was already employed by the authors in previous research [9].
The equipment mentioned above, and the building services, incorporate devices such as UPS (Uninterruptible Power Supply), VSD (Variable Speed Drive) and inductive and capacitive loads in switching mode. These electronic circuits are nonlinear loads, and all of them can create harmonic distortion in the power line. The harmonic distortion in the distribution system is caused by the harmonic currents flowing in the electronic loads.

Methodology
The data set employed in this research has a total of 17,763 samples that correspond to the period of time referred to in the description of the data. A process of random data deletion was performed using this data set.
The new algorithm presented in this paper hybridizes the Self-Organized Maps Neural Networks methodology with Mahalanobis distances. The resulting hybrid method is combined with an algorithm already presented in this journal by the authors, called AAA [9], which is based on Multivariate Adaptive Regression Splines. The proposed methodology is new, and its performance is even better than that of the previously referenced method when applied to the same database. This method is considered hybrid because it combines well-known pattern recognition and machine learning methodologies in a single model that is able to impute missing data [11,12].
The performance of the proposed new methodology, in comparison with AAA and MICE, has been evaluated using the mean absolute error (MAE) and the root mean square error (RMSE). These are very common metrics in forecasting research [13,14]. Both are employed in the present research because they are complementary: the MAE measures the average magnitude of the error in a set of forecasts without considering its direction, while the RMSE is employed for its ability to describe uniformly distributed errors [13]. A more detailed explanation, including the formulas employed, can be found in [9]. Let us assume that we have a dataset formed by c different variables v1, v2, . . . , vc that are the columns of a data matrix whose total number of rows is r. The algorithm is applied via the following steps.
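The two evaluation metrics can be sketched as follows. This is a minimal illustration of the standard MAE and RMSE formulas, not the authors' code; the example values are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors, ignoring sign."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean square error: penalises large deviations more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Hypothetical example: imputed values compared against the deleted originals
original = [230.1, 229.8, 231.0, 230.5]   # e.g. phase-to-neutral voltages (V)
imputed  = [230.0, 230.0, 230.6, 230.4]
print(mae(original, imputed))    # -> 0.2 (average absolute deviation)
print(rmse(original, imputed))
```

Because the RMSE squares each error before averaging, a single large imputation error raises it far more than it raises the MAE, which is why the two metrics complement each other.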

Creation of a New Matrix with Missing Values from the Original Data Set
This step of the algorithm is not required when it is applied to a data set in which missing data are going to be imputed, but it is mandatory in the present research to validate the algorithm by using a complete data set.
Let A be the original matrix of r rows and c columns. As a first step, and to obtain a matrix with a certain amount of missing data, a proportion p of the elements of the matrix is removed. Let B be the new r × c matrix with a proportion p of missing elements. The removal is performed completely at random; therefore, the type of imputation tested to determine the performance of the algorithm is the one known as missing completely at random (MCAR).
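The MCAR deletion step described above can be sketched as follows. The matrix size, the proportion p and the random seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def delete_mcar(A, p):
    """Return a copy of A with a proportion p of its entries set to NaN,
    with the positions chosen completely at random (MCAR)."""
    B = A.astype(float).copy()
    r, c = B.shape
    n_missing = int(round(p * r * c))
    # choose n_missing distinct positions in the flattened matrix
    idx = rng.choice(r * c, size=n_missing, replace=False)
    B.flat[idx] = np.nan
    return B

A = rng.normal(size=(1000, 9))   # stand-in for the measured electrical variables
B = delete_mcar(A, p=0.10)
print(np.isnan(B).mean())        # -> 0.1 (exactly 900 of 9000 cells removed)
```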

Creation of the Reduced Matrix
A new matrix in which all the rows with missing data are removed is created. This new matrix is called B red . The number of rows u (u ≤ r) of this matrix will change depending on the matrix that is going to be imputed; however, in cases like the one presented in this algorithm, in which the removal of data has been performed completely at random in a proportion p, the expected number of remaining rows u is given by the following formula: u = r·(1 − p)^c, where: p: proportion of missing data considered; r: number of rows of the original matrix; c: number of columns of the data matrix. Afterwards, the B red matrix is normalized.
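Since each cell is deleted independently with probability p, a complete row survives with probability (1 − p)^c, so roughly r·(1 − p)^c rows remain after the filtering. A quick numerical check (the sizes and seed are arbitrary assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

r, c, p = 20000, 9, 0.10
A = rng.normal(size=(r, c))
B = A.copy()
B[rng.random(size=(r, c)) < p] = np.nan   # independent MCAR deletion per cell

# B_red keeps only the rows without any missing value
B_red = B[~np.isnan(B).any(axis=1)]

expected_u = r * (1 - p) ** c             # u = r * (1 - p)^c, about 7748 here
print(len(B_red), expected_u)             # the two counts should be close
```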

Determination of the Director Vectors by Means of Self-Organized Maps Neural Networks
The Self-Organized Maps (SOM) Neural Network is a type of unsupervised neural-network algorithm whose main application is the visualization and interpretation of large, high-dimensional data sets [15].
These types of maps are used to represent all the available observations (data vectors), with an optimized accuracy, by means of a reduced set of models. This is the reason why this technique has been chosen in the present research.
Let N be the dimension of the n sample vectors X(t) ∈ R^N, t = 1, 2, . . . , n, where each sample vector is identified by a label. The two-dimensional output layer of the SOM map contains a rectangular mesh of k = 1, . . . , x dim × y dim nodes. Each of these nodes holds a codebook vector W k of dimension N. The calculation of the weight vectors is performed using the following algorithm [16].
For a certain number of iterations, repeat the steps detailed below:
1. Choose one sample vector X(t) at random;
2. Update the weights W i by means of the following rule: W i (t + 1) = W i (t) + α(t)·h ci (t)·[X(t) − W i (t)], where h ci (t) is the neighbourhood function which, in the present research and as is very common in the literature [15], is of the Gaussian type: h ci (t) = exp(−‖r c − r i ‖² / (2σ(t)²)), with r c and r i the positions on the mesh of the winning neuron c and of neuron i.
The weights of the neurons lying in the neighbourhood h ci (t) of the winning neuron are moved closer to X(t). The learning rate α(t) ∈ [0, 1] decreases monotonically as the number of iterations increases, and σ(t), which determines the radius of the neighbourhood, also decreases monotonically. After many iterations and the slow reduction of α(t) and σ(t), the neighbourhood covers only a single node and the map is formed. Please note that those neurons whose weights are closer in the parameter space W are also closer on the mesh. After this process, the director vectors obtained are denormalized. The number of director vectors chosen to create the Self-Organized Map in the case of the present algorithm is related to the number of rows of the B red matrix. Let u be the number of rows of B red ; the total number of director vectors is taken in the range d = e·u, with e ∈ [0.05, 0.8]; the reason for this empirically found range of values will be explained in the results section.
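The SOM training loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 4 × 4 grid, the linear decay schedules for α(t) and σ(t), the iteration count and the toy data are all choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def train_som(X, x_dim, y_dim, n_iter=2000, alpha0=0.5, sigma0=None):
    """Minimal SOM training loop: Kohonen rule with a Gaussian neighbourhood."""
    n_samples, n_features = X.shape
    if sigma0 is None:
        sigma0 = max(x_dim, y_dim) / 2.0
    # 2-D mesh coordinates of the k = x_dim * y_dim nodes
    grid = np.array([(i, j) for i in range(x_dim) for j in range(y_dim)], float)
    W = rng.normal(size=(x_dim * y_dim, n_features))   # codebook vectors
    for t in range(n_iter):
        frac = t / n_iter
        alpha = alpha0 * (1.0 - frac)          # learning rate decays monotonically
        sigma = sigma0 * (1.0 - frac) + 1e-3   # neighbourhood radius shrinks too
        x = X[rng.integers(n_samples)]         # 1. pick one sample at random
        c = np.argmin(((W - x) ** 2).sum(axis=1))      # winning node
        # 2. Gaussian neighbourhood around the winner, measured on the mesh
        d2 = ((grid - grid[c]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        W += alpha * h[:, None] * (x - W)      # move weights towards the sample
    return W

# Toy data: two well-separated clusters; the codebook should cover both
X = np.vstack([rng.normal(0.0, 0.1, size=(200, 3)),
               rng.normal(5.0, 0.1, size=(200, 3))])
W = train_som(X, x_dim=4, y_dim=4)
# quantization error: mean distance from each sample to its closest codebook vector
qe = float(np.mean([np.min(np.linalg.norm(W - x, axis=1)) for x in X]))
print(qe)   # should be small once the map has adapted to both clusters
```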

Finding the Closest Director Vectors by Means of Mahalanobis Distances
The Mahalanobis distance is a well-known non-Euclidean distance measure based on the correlations between variables [17]. These correlations allow for the identification and analysis of different patterns. This measure is a useful way of determining the similarity of an unknown sample set to a known one and, in the present research, it is used to compare each row of the data matrix with missing data against all the director vectors. It can be defined by the following formula: d(x 1 , x 2 ) = √((x 1 − x 2 )^T · A · (x 1 − x 2 )), where x 1 and x 2 represent the sets of variables of two different rows of the data matrix, and A ∈ R^{n×n} is a positive semi-definite matrix that represents the inverse of the covariance matrix of the class. By means of eigenvalue decomposition, A can be decomposed into A = W·W^T.
In the case of the present algorithm, the Mahalanobis distance from each row vector with two or more missing data points to all the director vectors is calculated. Please note that, in order to make this operation possible, the variables that are missing in the row of the data matrix are also removed from the director vectors. The director vector with the lowest Mahalanobis distance is selected, and the missing variables of that row of the data matrix are filled in with the values present in the corresponding positions of the director vector.
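This matching step can be sketched as follows. One straightforward reading, assumed here, is that the inverse covariance matrix is simply restricted to the observed coordinates (whether the paper instead re-inverts the restricted covariance is not stated); the director vectors, the identity covariance and the function name are hypothetical.

```python
import numpy as np

def mahalanobis_impute_row(row, directors, cov_inv):
    """Fill the missing entries of `row` with the values of the director
    vector that is closest in Mahalanobis distance, computed over the
    observed variables only (the missing coordinates are dropped)."""
    obs = ~np.isnan(row)                      # observed variables in this row
    A = cov_inv[np.ix_(obs, obs)]             # restrict the inverse covariance
    diffs = directors[:, obs] - row[obs]
    # squared Mahalanobis distance to every director vector
    d2 = np.einsum('ij,jk,ik->i', diffs, A, diffs)
    best = directors[np.argmin(d2)]
    filled = row.copy()
    filled[~obs] = best[~obs]                 # impute from the closest director
    return filled

# Hypothetical example: three director vectors, one row missing its last value
directors = np.array([[1.0, 1.0, 1.0],
                      [5.0, 5.0, 5.0],
                      [9.0, 9.0, 9.0]])
cov_inv = np.eye(3)                  # identity -> reduces to Euclidean distance
row = np.array([5.1, 4.9, np.nan])
print(mahalanobis_impute_row(row, directors, cov_inv))  # NaN replaced by 5.0
```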
Finally, the original matrix is reconstructed, and the values of the missing data in those rows with only one or two missing data points are imputed by means of the AAA algorithm. As already stated, this algorithm was presented in a previous work [9] published in this journal. The referenced algorithm is based on a multivariate non-parametric technique called Multivariate Adaptive Regression Splines (MARS) [18-21].

Results and Discussion
In this section, the results of the Hybrid Adaptive Assignation Algorithm (HAAA) are presented and compared with those of AAA and MICE. The tests were performed using the MCAR methodology, deleting 10%, 15% and 20% of the information. This process was repeated five times. The performance of the three algorithms was compared based on the MAE and RMSE metrics, and the results of all the iterations performed are presented. To simplify the comparisons, results that use the same original MCAR subsets are presented in the same table. The results are presented in the same way as in the previous research in which the performance of the AAA algorithm was analysed [9]. Each table also contains the average values of the five replications. Table 2 contains the RMSE values of the MICE, AAA and HAAA algorithms when applied to a database with 10% of the data missing. As can be observed in this table, for the voltage, current and power factor variables employed in this research, the RMSE values obtained by the new algorithm are considerably lower than those obtained using the AAA and MICE methods. In the case of 10% missing data (Table 2), the variable with the smallest improvement still shows a 15% reduction of the RMSE, while the average reduction over all variables is 62%. For the case of 15% missing data (Table 3), the results are very similar, with a reduction of the RMSE of at least 12% and an average reduction of 46%. Additionally, for the case of 20% missing data (Table 4), the results are equivalent, with a minimum reduction of the RMSE of 18% and an average of 48%. The results obtained when the MAE metric is applied to the three algorithms are equivalent. Table 5 shows the results obtained using the MAE metric for 10% missing data, while Table 6 does the same for 15% and Table 7 for 20%.
When the proposed algorithm is compared with AAA in the case of 10% missing data, the average improvement in the MAE metric is 35%, with a minimum value of 10%. For the case of 15% missing data, the average improvement of the MAE is 29%, with a minimum improvement of 8% in one of the variables. When the amount of missing data is 20%, the average improvement of the referenced metric is 42%, with a minimum of 13%. Although the overall performance of the new algorithm has already been evaluated using MCAR data, from the point of view of the authors, there are two situations in which the information is not missing completely at random that are of great interest for electrical measurements. These are as follows:

• The case in which there is correlation in the missingness of the data. One possible situation when working with electrical data occurs when all the missing information corresponds to the same phase. In order to simulate this kind of failure, five new data sets with 20% missing data were created. Each phase is represented by four different variables: one phase current, two phase-to-phase voltages and one phase-to-neutral voltage. This means that each row with incomplete information has four missing variables or, in other words, that only 5% of the total of rows have missing data. In these rows, selected at random, the information of the variables of one of the phases was removed; for example, when the information for variable V an is missing, the information for variables V ab , V ca and I a is also missing. The results obtained are presented in Tables 9 and 10. As can be observed, the performance of the HAAA algorithm is worse than in the MCAR case, but it still outperforms both MICE and AAA.

• The case in which most of the missing data correspond to a certain subset of variables. In order to simulate this kind of failure, five new datasets with 90% missing data in a single variable were created. In each dataset, 90% of the elements of one single column were removed, leaving the rest of the variables with their original values. As can be seen in Table 8, the imputation accuracy of all the algorithms decreased significantly. This was expected in such an unfavourable situation; however, both HAAA and AAA still considerably outperform the reference algorithm, MICE, with HAAA obtaining the best results.

Conclusions
The improvement of power quality has become a necessity, as the presence of power electronics in today's grids has been increasing over the last decades. For this reason, network monitoring with the help of real-time data collection devices is helpful and, in this context, missing data imputation techniques are required.
This research presents a new algorithm and compares it with an algorithm proposed in a previous paper by the authors, as well as with a well-known missing data imputation algorithm. Although the algorithm presented in this paper outperforms the others, like the methods to which it is compared, it has some limitations that must be taken into account. Like those proposed before, our algorithm would have imputation problems in cases in which most of the missing data belonged to the same variable, or were concentrated in a certain subset of variables, instead of being distributed among all the variables of the data set. Currently, the authors continue to develop hybrid algorithms to improve the results of existing algorithms when addressing this type of issue. Finally, missing data imputation in the time-frequency domain will also be explored in future work.