Next Article in Journal
Disposable Non-Enzymatic Glucose Sensors Using Screen-Printed Nickel/Carbon Composites on Indium Tin Oxide Electrodes
Previous Article in Journal
Compressive Sensing Based Design of Sparse Tripole Arrays
Article

A New Missing Data Imputation Algorithm Applied to Electrical Data Loggers

1
Maintenance Department, University of Oviedo, San Francisco 3, Oviedo 33007, Spain
2
Department of Construction and Manufacturing Engineering, University of Oviedo, Campus de Viesques, Gijón 33204, Spain
3
Departamento de Ingeniería Industrial, University of A Coruña, A Coruña 15405, Spain
4
Prospecting and Exploitation of Mines Department, University of Oviedo, Oviedo 33004, Spain
*
Author to whom correspondence should be addressed.
Academic Editor: Vittorio M.N. Passaro
Sensors 2015, 15(12), 31069-31082; https://doi.org/10.3390/s151229842
Received: 7 October 2015 / Revised: 7 December 2015 / Accepted: 7 December 2015 / Published: 10 December 2015
(This article belongs to the Section Physical Sensors)

Abstract

Nowadays, data collection is a key process in the study of electrical power networks when searching for harmonics and a lack of balance among phases. In this context, the lack of data of any of the main electrical variables (phase-to-neutral voltage, phase-to-phase voltage, and current in each phase and power factor) adversely affects any time series study performed. When this occurs, a data imputation process must be accomplished in order to substitute the data that is missing for estimated values. This paper presents a novel missing data imputation method based on multivariate adaptive regression splines (MARS) and compares it with the well-known technique called multivariate imputation by chained equations (MICE). The results obtained demonstrate how the proposed method outperforms the MICE algorithm.
Keywords: missing data imputation; multivariate imputation by chained equations (MICE); Multivariate adaptive regression splines (MARS); quality of electric supply; voltage; current; power factor missing data imputation; multivariate imputation by chained equations (MICE); Multivariate adaptive regression splines (MARS); quality of electric supply; voltage; current; power factor

1. Introduction

The presence of harmonics in an electrical system is associated with many problems in its performance. The main problems are overheating in conductors, especially in the neutral ones, due to the skin effect, and activating automatic breakers producing problems with supply continuity. Finally, the deterioration of the waveform of the voltage harmonic distortion associated would cause malfunctions of some devices.
As the existence of harmonics cannot be avoided, monitoring in real-time is necessary in order to control them within certain limits. Additionally, sometimes they can be transferred by acting on the installation in order to avoid its effects by means of filters either active or passive. In these cases, the use of isolation transformers, super-immunized differential breakers, etc., must be studied.
Another problem frequently encountered in an electrical installation is the imbalance between phases. Although it is well known that balance is achieved by working at the highest levels of the installed capacity in order to take full advantage of the installation, sometimes this is not possible. An imbalance is usually caused by a bad load distribution between phases and provokes a high current return displayed by the neutral, as it has to compensate for the gap being at the center of the scheme vectors. These problems will increase if these charges are also producing linear and harmonic distortion. In addition, imbalances may also cause the performance of the protection of the low voltage at the output of the transformer arise above its caliber in the overloaded phase currents.
In this context, the quality of electricity is a problem represented in all of its parameters: voltage, current, frequency anomalies, etc., that cause failures or disability of electrical or electronic devices [1]. Nowadays, the quality of electricity is a challenge in terms of efficiency, optimization, stability, fault prevention, and so on [2]. Science and technology have advanced, and continue to do so, significantly, with the aim of mitigating some of the consequent typical problems, which disturb electrical quality and, thus, the above mentioned challenges would be impossible to overcome, at least satisfactorily [3].
There are several different contributions with the above-mentioned aim. For example, in [4] a new power quality deviation index based on principal curves is proposed. [5] shows a complete review of signal processing and intelligent methods used for self-classification of power quality events and an influence of noise on recognition and classification of perturbations. A smart instrument used for recognition, labeling, and quantitation of power and energy quality disturbance is described in [6]. In the same way, [7] presents an intelligent instrument for instantaneous high-resolution frequency measurement in accordance with typical indicator values for the quality of electrical power control and monitoring, while [8] describes a communication infrastructure developed to obtain reliable data delivery with low cost, in order to avoid the problems in the provision of the power quality monitoring service.
In some buildings it would be of interest to monitor the main electrical parameters. This real-time monitoring and control is required in order to balance new loads and reduce the general consumption of the building by means of the assessment of the residual consumption (or consumption out of working hours). Such information is also useful to optimize the rates to be contracted. Additionally, this monitoring would be useful for studying supply problems due to lack of balance or harmonics, for analyzing the quality of the energy, and also for preventing incidents with the machinery due to poor signal quality. Finally, it would also be of interest to study the operation of the building and analyze its efficiency depending on parameters such us the number of people who use it, power installed, square meters in use, etc.
During the data collection process it is possible, due to different circumstances, for a small amount of the information retrieved to be lost. For these situations it is important to have missing data imputation. A process of missing data imputation consist of filling missing values in data series with estimated ones.
The quality of the electric supply of buildings is not only limited to the continuity of the supply, as concepts such as reliability, safety, and maintenance are also important indicators. It is also necessary that the available information be complete. The lack of information in some records, generally translated as zeros, distort the results.
There is also an important economic component of the data record, and that is the optimization of supply contracts, in other words, knowing the consumption of a building distributed over time. It is possible to associate the activities in the building so that we obtain a balanced installation, performing the most demanding activities at the most convenient hours of the day. To this end, it is necessary to collect information both before the decision and after the implementation of measures, in order to compare similar periods of expenditure. The latter would also serve in the event that energy-saving measures of another kind, such as replacing lighting by low consumption, placing detectors in corridors, placing inverters in circulation pumps, etc., were implemented.
This paper evaluates a new imputation method, which allows the system to fill in the missing data of any of the sensor devices that are used in this research for the recording of voltages, currents and power factors. The proposed algorithm is based on multivariate adaptive regression splines and outperforms the results obtained by a benchmark method, as it is the multivariate imputation by chained equations (MICE) [9].
Nowadays, the two major methods for missing data imputation are multiple imputation and maximum likelihood. The maximum likelihood chooses as parameter estimates those values which, if true, would maximize the probability that have in fact been observed. The multiple imputation is based on different methodologies but all follow these steps: some random variation is introduced into the data set and several imputed data sets are generated. After that, those data sets are used for problem analysis and finally the combination of the results into a single set of parameter estimates, standard errors and test statistics is made. Since the missing at random (MAR) assumption cannot be checked from the data at hand, it is important to take into account if missing data can be considered as MAR. In those cases that cannot be considered, they called not missing at random (NMAR). Several models of NMAR data have been developed and its detailed analysis is beyond the scope of the present research.
The rest of the paper is organized as follows: Section 2 includes information about the measurement equipment employed and a description of the data recorded. Section 3 describes the new proposed algorithm and the benchmark technique employed detailing also the two metrics used for comparing their performance. A comparison of the results achieved with each method for different levels of missing data is presented in Section 4. Finally, conclusions are drawn in Section 5.

2. Experimental Section

2.1. Measurement Equipment

In the present research, the devices employed are specific to the measurement of power quality variables, which are described in this section. They have some common measurement features in common, namely: Voltage Line/Neutral (V. L/N), Voltage Line/Line (V. L/L), Current by line (Current), Power Input/Output (+/− Watts), Energy Input/Output (+/− Wh), Reactive Power (+/− VARs), Reactive Power Input/Output (+/− VARh), Apparent Power (VA), Apparent Energy (VAh), Power Factor (PF), and Frequency (Frequency). Table 1 shows the accuracy for each device during the different electrical measurements. It should be noted that the values shown in percentage corresponds to the reading percentage.
Table 1. The variables accuracy for each device.
Table 1. The variables accuracy for each device.
VariableUnitsS-100S200NEXUS 1252MP200
200 ms1 s
V. L/NV, KV0.1%0.1%0.1%0.05%0.3%
V. L/LV, KV0.1%0.2%0.1%0.05%0.5%
CurrentA, KA0.1%0.1%0.1%0.025%0.3%
+/− wattsW0.2%0.2%0.1%0.06%0.5%
+/− whWh0.2%0.2%N/A0.04%0.5%
+/− VARsVARs0.2%0.2%0.1%0.08%1.0%
+/− VARhVARh0.2%0.2%N/A0.08%1.0%
VAVA0.2%0.2%0.1%0.1%1.0%
VAhVAh0.2%0.2%N/A0.08%1.0%
FP+/−0.5 to 10.2%0.2%0.1%0.08%1.0%
FrequencyHertz1.10-2+/−3.10-23.10-21.10-2+/−1.10-2
The four devices used in the present study can perform all the mentioned measurements [10]; also, each device has additional capabilities that are discussed below in the following subsections.

2.2. Shark 100 (S-100)

One of the options included for this equipment is the optical IrDA port, which allows the programing of the device from a laptop or personal digital assistant (PDA). Additionally, it incorporates V-Switch technology. This tool lets the users update and include the required functions using programing commands, even after the installation of the device installation. The offered VSwitches (VSw) offered are:
• VSw 1—Volts and Amperes Meter—Default.
• VSw 2—Volts, Amperes, kW, kVA, kVAR, Frequency, PF.
• VSw 3—Volts, Amperes, kW, kVA, kVAR, Frequency, PF, kVAh, kVARh, kWh and Distributed Network Protocol (DNP) v.3.0.
• VSw 4—Volts, Amperes, kW, kVA, kVAR, Frequency, PF, kVAh, kVARh, kWh, %THD (total harmonic distortion), Boundary Alerts and Distributed Network Protocol (DNP) v.3.0.
A RS485 Port can be added as an option. With it, communication is feasible by using Modbus or DNP 3.0 Protocols. In addition to the RS485, the device also incorporates a KYZ pulse, which is used to send instantaneous information regarding energy consumption to other devices. It is possible to add an Ethernet option with the INP10 module, which is a 10/100BaseT Ethernet with the Modbus TCP protocol.

2.3. Shark 200 (S-200)

The Shark 200 system is a small-size device used for power and energy measurements. It provides an invoicing measuring feature, in conjunction with an advanced data recording system, measurement of the electrical power quality, communication, and I/O capabilities. This equipment also includes V-Switch technology. The V-Switches in this case incorporate the features shown in the Table 2.
Table 2. Features of the V-Switches technology.
Table 2. Features of the V-Switches technology.
FeatureVs1Vs2Vs3Vs4Vs5Vs6
Input/Output Expansion and Multifunction Measurement
2 MB (Megabytes) datalogging (dl)
3 MB -dl
4 MB -dl
Harmonic Study
TLC (transformers line compensation) and CT (Current transformers) / PT (Power Current) Compensation
Functions for Control and Limits Configuration
64 SPC (samples per cycle) Waves Datalogger
512 SPC Waves Datalogger
The Shark 200 device from feature V2 to V6, offers the possibility of data recording by using historic tendencies, limit alerts, input/output deviations, and events categorization. For the V5 and V6 models, the waveform can be recorded.
It is possible to make an independent CBEMA (Computer and Business Equipment Manufacturers Association’s) log plotter: The system records an independent CBEMA and it makes an autonomous CBEMA record for size, as well as potential event times.
The S-200 model offers an on-line harmonic analysis from to the 40th up to the 255th order for current and voltage inputs.
Regarding communication, this model includes the following features:
One port RS485 port allows communication using Distributed Network Protocol (DNP) v.3.0 or Modbus protocols.
KYZ Pulse—this device incorporates Pulse Outputs mapped to total energy.
Furthermore, it has an optical IrDA port with the same functions as the previously-explained model.

2.4. Nexus 1252

In general terms, this device has advanced features that offer a global view of power and energy usage and, of course, visualization of the quality of electrical power within a power network. The device is able to capture a maximum of 512 samples per working cycle by event. Additionally, this device performs events analysis by 16 bits A/D converter, for electric voltage and electric current, which offers high-resolution. Furthermore, it is possible to activate a waveform datalogging by triggers that enable power quality surveys, fault detection, and the like, to be performed.
In terms of harmonic measurements, the device is capable of measuring up to the 255th order, in the case of current and voltage. If necessary, it can measure the harmonics in real time up to the 128th order. The device provides the THD percentage and the K-Factor with the harmonics. Additionally, it is possible to monitor switching noise from several elements of an installation. Like the previous device, the Nexus 1252 is able to make an independent CBEMA, and it makes an autonomous CBEMA log for size and time of potential events, which gives the consequent advantages mentioned previously.
In terms of communications, the device has four ports, and each one is able to communicate in several common protocols, with the aim of reading purposes and control simultaneously. Several peripherals are available for displaying or for external I/O options.

2.5. Shark MP200

The MP200 model measures and provides information of power usage from eight three-phase WYE circuits or from twenty-four single-phase systems. The MP200 system can create precise reports of power usage, analyze peak demand, and provide control signals to limit peak demand and billing based on usage and demand.
The MP200 offers communication possibilities like the previous models. One typical USB port and two standard RS485 ports, with optional RJ45 wired, or 802.11 WiFi, are provided. These ports support standard protocols as Modbus ASCII, RTU, and TCP/IP. By V-Switch options, the MP200 can be configured for basic sensors with real-time data (V1) to Advanced Logger up to 2400 Days (V3).

2.6. Description of the Data

The data set employed for the present research corresponds to measurements of the voltage phase to neutrum, (three variables) phase-to-phase voltage (three variables), current in each phase (three variables) and the average power factor (one variable) of a three-phase electrical supply of a building. The records were taken each 15 min from 27 November 2014 at 18:45 to 31 May 2015 at 23:45.
The building under study in the present research belongs to the University of Oviedo (Spain). This building is called Severo Ochoa after the Nobel Prize-winning scientist and has five floors and two basement levels that sum a total of 8150 m2. This building holds the Information Technology Services of the University including their server rooms and some scientific laboratories that include equipment such as nuclear magnetic resonance spectrometers, electron microscopes, X-ray diffractometers, and the like. For all these kind of facilities it is essential to guarantee a good quality standard of electrical supply 24 h a day, every day of the week. A total of 78 employees work in this building, which has an average daily energy consumption of 190,572 kWh.

2.7. Harmonics and Harmonic Distortion

The large number of heterogeneous receivers in the building, such as computers, uninterruptible power supply devices, ballasts of fluorescent lighting systems, variable speed drives, induction ovens, and capacitors all create harmonic distortions in the net. All of these non-linear loads cause the flow of harmonic currents in the distribution system.
According to Fourier’s theorem, a periodic continuous function f ( x ) with a period of 2 · L may be expressed as the sum of a series of sine or cosine terms each of which has a specific amplitude and phase coefficients known as Fourier coefficients. This theorem can be expressed with the following formula [11]:
f ( x ) = 1 2 · a 0 + n = 1 [ a n · cos ( n · π · x L ) + b n · sin ( n · π · x L ) ]
where:
a n = 1 L L L f ( x ) · cos ( n · π · x L ) · d x
b n = 1 L L L f ( x ) · sin ( n · π · x L ) · d x
Harmonic frequencies are multiples of the waveform’s fundamental frequency. The harmonic distortion may be defined as the degree to which a waveform deviates from a pure sinusoidal wave. In the case of an ideal sine wave, its harmonic component is equal to zero. The total harmonic distortion (THD) is defined as the sum of all harmonic components of the voltage or current waveform compared to the fundamental component of the voltage or current wave. For the case of the current, the THD formula can be expressed as follows:
T H D = i = 1 I i 2 I 1
where I i represents the amplitude of the different harmonics.

3. Methodology

The data set is made up of a total of 17,763 samples that correspond to the period of time referred in the description of the data. It is used to test two different algorithms: multiple imputation by chained equations (MICE) and the proposed algorithm AAA (Adaptive Assignation Algorithm). The dataset is submitted to a process of random data deletion. This process consisted of supposing that the probability of an observation being missing does not depend on observed or unobserved measurements. It is called missing-completely-at-random (MCAR). The process of random data deletion was repeated five times for three different levels of missing data: 10%, 15%, and 20% of the total. After each deletion process, both algorithms were applied to the resulting data subset and the performance of the two methods compared.

3.1. Multivariate Adaptive Regression Splines (MARS)

The algorithm proposed in the present research is based on the computation of multivariate adaptive regression splines (MARS) models, for the prediction of the missing values. MARS is a multivariate nonparametric technique [12]. Its main purpose is to predict the values of a continuous dependent variable, y   ( n × 1 ) , from a set of independent exploratory variables, X   ( n × p ) . This model can be represented by the following Equation [13,14]:
y = f ( X ) + e
where f is a balanced sum of basis functions that depend on X and e is the error vector. One of the main advantages of MARS models is that they do not require any a priori assumptions about the functional relationships between dependent and independent variables [15,16,17]. The reason is that this relation is driven by the basis function determined by the regression data ( X , y ) .
MARS is a generalization of classification and regression trees [18] and is able to overcome some of the limitation of this method. The MARS regression model is constructed by means of basis functions called splines. These splines are defined as follows:
[ ( x t ) ] + q = { ( t x ) q                 i f   x < t 0                                             o t h e r w i s e
[ ( x t ) ] + q = { ( t x ) q                 i f   x < t 0                                             o t h e r w i s e

3.2. The Proposed Algorithm AAA

In order to introduce the new algorithm, let us assume that we have a dataset formed by n different variables v 1 ,   v 2 ,   ,   v n . In order to calculate the missing values of the i-th column, all the rows with no missing value in the said column are employed. Then, a certain number of MARS models are calculated. It is possible to find rows with very different amounts of missing data from zero (no missing data) to n (all values are missing). Those columns with all values missing will be removed and will be neither used for the model calculation nor imputed. Therefore, any amount of missing data from 0 to n 2 is feasible (all variables but one with missing values).
In other words, if the dataset is formed by variables v 1 ,   v 2 ,   ,   v n . and we want to estimate the missing values in column v i , then the maximum number of different MARS models that would be computed for this variable (and in general for each column) is as follows: k = 1 n 1 ( n 1 k ) . For the case of the data under study in this research, with 10 different variables, a maximum of 5110 distinct MARS models would be trained (511 for each variable).
Table 3 represents the 25 first rows of the dataset in which the algorithm will be applied. When the algorithm is applied to the third column of these datasets (variable v 3 ), all those rows with missing data (represented by means of the symbol ‘o’) in the third column are not employed for the calculus of the models (rows in red). If those rows were removed, different models would be trained for the prediction of v 3 using different subsets of variables. Continuing with the example of variable v 3 and taking into account the data missing in the 25 first rows, it would be possible to train the following models:
Model 1: a model that uses as output variable v 3 and the other nine as input variables ( v 1 ,   v 2 ,   v 4 ,   v 5 ,   v 6 ,   v 7 ,   v 8 ,   v 9 ,   v 10 ).
Model 2: a model that uses as output variable v 3 and as input variables v 2 ,   v 4 ,   v 5 ,   v 6 ,   v 7 ,   v 8 ,   v 9 ,   v 10 .
Model 3: a model that uses as output variable v 3 and as input variables v 1 ,     v 4 ,   v 5 ,   v 6 ,   v 7 ,   v 8 ,   v 9 ,   v 10 .
Model 4: a model that uses as output variable v 3 and as input variables v 1 ,   v 2 ,   v 4 ,   v 6 ,   v 7 ,   v 8 ,   v 9 ,   v 10 .
Model 5: a model that uses as output variable v 3 and as input variables v 1 ,   v 2 ,   v 4 ,   v 5 ,   v 7 ,   v 8 ,   v 9 ,   v 10 .
Model 6: a model that uses as output variable v 3 and as input variables v 4 ,   v 5 ,   v 6 ,     v 7 ,   v 8 ,   v 9 ,   v 10 .
Model 7: a model that uses as output variable v 3 and as input variables v 1 ,   v 5 ,   v 6 ,   v 7 ,   v 8 ,   v 9 ,   v 10 .
Model 8: a model that uses as output variable v 3 and as input variables v 1 ,   v 4 ,   v 5 ,   v 6 ,   v 8 ,   v 9 ,   v 10 .
After the calculation of all the available models, the missing data of each row will be calculated using those models that employ all the available non-missing variables of the row. In those cases in which no model was calculated, the missing data will be replaced by the median of the column. Please note in that the case of large data sets with a not-too-high percentage of missing data, these will be an infrequent case. In the case of missing completely at random data, the probability, represented by letter Q, of not having at least two non-missing values in a certain row can be expressed by the following formula:
Q = p n + ( 1 p ) · ( n 1 ) · p n 1
where: N is the number of variables; P is the rate of missing data in a MCAR case.
In the case of our example, none of the rows was in this situation for the 10 and 15% of missing data, while in the case of 20% of missing data it happened only in one line (less than 0.006% of the total amount of lines). These results are in line with those expected by the formula.
As a general rule for the algorithm, it has been decided that when certain value can be estimated using more than one MARS model, it must be estimated using the MARS model with the largest number of input variables; the value would be estimated by any of those models chosen at random. Finally, in those exceptional cases in which no model is available for estimation, the median value of the variable will be used for the imputation.
Table 3. Example of the dataset (25 first rows).
Table 3. Example of the dataset (25 first rows).
Row #v1v2v3v4v5v6v7v8v9v10Model 1Model 2Model 3Model 4Model 5Model 6Model 7Model 8
1XXXXXXXXXXYesyesyesyesyesyesyesyes
2XoXoXXXXXXNonoyesnonoyesyesno
3XXXXoXXXXXNononoyesnononono
4XXooXXXXXXNononononononono
5XXXXXXXXXXYesyesyesyesyesyesyesyes
6XoXXXXXXXXNonoyesnonoyesyesyes
7oXXXXXXXXXNoyesnononoyesnono
8XXXXXXXXXXYesyesyesyesyesyesyesyes
9oooXXXXXXXNononononononono
10XXoXXXXXXXNononononononono
11XXoXXXXXXXNononononononono
12XooXXXXXXXNononononononono
13XXXXXXXXXXYesyesyesyesyesyesyesyes
14ooXXXXXXXXNononononoyesnono
15oXXXXXXXXXNoyesnononoyesnono
16XXXXXXXXXXYesyesyesyesyesyesyesyes
17ooooooXXXXNononononononono
18XXXXXXXXXXYesyesyesyesyesyesyesyes
19XoXXXXoXXXNonoyesnonononoyes
20XXXXXoXXXXNonononoyesnonono
21XooXoXXXXXNononononononono
22XXXXXXXXXXYesyesyesyesyesyesyesyes
23XXooXXXXXXNononononononono
24XXXXXoXXXXNonononoyesnonono
25XXoXXXXXoXNononononononono

3.3. The Benchmark Rechnique: The MICE Algorithm

The algorithm called multiple imputation by chained equations (MICE) algorithm was developed by van Buuren and Groothuis-Oudshoorn [19]. This referred algorithm is a Markov Chain Monte Carlo Method in which the state space is the collection of all imputed values [9]. As with any other Markov Chain, the MICE algorithm has to accomplish three properties [20,21,22,23] in order to converge. The referred properties are as follows:
The chain must be able to reach all parts of the state space. This means that it is irreducible.
The chain should not oscillate between different states. In other words, the Markov Chain must be aperiodic.
Finally, the chain must be recurrent. This means, as in any other Markov Chain, that the probability of the chain of starting from i and returning to i will be equal to one.
According to the experience of the algorithm creator [19], and also from our own previous experience [9], the convergence of the MICE algorithm is achieved after a relatively low number of iterations, usually somewhere between five and 20 [23]. In the case of the present research, up to 20 iterations were considered but as not statistically significant improvements with respect to five iterations were achieved, the results for five iterations are presented.
The MICE algorithm [23] for the imputation of multivariate missing data consists of the steps that are listed in Algorithm 1. In this algorithm Y represents a n × p matrix of partially-observed sample data, R is a n × p matrix, 0 1 response indicators of Y , and represents the parameters space. This methodology was already explained by the authors in previous research published in this journal [9]. For a more detailed explanation of the algorithm we recommend another look at the original research by van Buuren and Groothuis-Oudshoorn [23].
Algorithm 1. MICE algorithm for imputation of multivariate missing data [19].
  • Specify an imputation model P ( Y j m i s | Y j o b s , Y j , R ) for variable Y j with j = 1 , , p .
  • For each j , fill in starting imputations Y j 0 by random draws from Y j o b s .
  • Repeat for t = 1 , , T (iterations).
  • Repeat for j = 1 , , p (variables).
  • Define Y j t = ( Y 1 t , , Y j 1 t , Y j + 1 t 1 , , Y p t 1 ) as the currently complete data except Y j .
  • Draw j t ~ P ( j t | Y j o b s , Y j t , R ) .
  • Draw imputations Y j t ~ P ( Y j m i s | Y j o b s , Y j t , R , j t ) .
  • End repeat j .
  • End repeat t .

3.4. Performance of the Algorithms

The performance of the proposed algorithm in comparison with MICE has been evaluated using the mean absolute error (MAE) and the root mean square error (RMSE). MAE measures the average magnitude of the error in a set of forecasts without considering their direction. It is a linear score, which weights all the individual differences equally, while RMSE is a quadratic scoring rule, which measures the average magnitude of the error. In the case of the RMSE, as errors are squared before they are averaged, it gives a relatively higher weight to large errors. When results are analyzed using both variables, it should be noted that the greater the difference between them, the greater the variance in the individual errors in the sample, taking into account that the lower their values, the better the model.
The formulae for both kind of errors are as follows:
M A E = 1 n · i = 1 n | e i |
R M S E = 1 n · i = 1 n e i 2
where: n is the number of samples; e i is the error of the i-th sample calculated as the difference of predicted value versus real value.
The present article uses both RMSE and MAE. The underlying assumption when presenting RMSE [24] is that the errors are unbiased and follow a normal distribution. The MAE is suitable to describe uniformly distributed errors. As model errors are likely to have a normal distribution, the RMSE is a better metric to present than the MAE for such kind of data. Although in the case of errors following a normal distribution RMSE is more appropriate to use than MAE, it is the preferred metric for the indication of the model average error.

4. Results and Discussion

In this section the results of the MICE algorithm and the proposed one AAA package are presented and their performances compared. As was already stated in the section describing the data, due to the random component of both algorithms, a process of MCAR data deletion of 10%, 15%, and 20% of the information was performed five times. The performance of both algorithms was compared by means of RMSE and MAE metrics. In order to verify that the results obtained with the proposed AAA package for the five different iterations were better than those achieved by other methods, the results of the five iterations are presented. Those tables also contain the average values of the five replications the iterations with the same number use the same database. Table 4 shows the RMSE values of the MICE and the proposed AAA package when applied to a database with 10% of the data missing. As can be observed for the ten variables considered, the RMSE values obtained by the new algorithm are considerably lower than those obtained using the MICE method. On average they are 15 times lower, and in all cases the RMSE values of the proposed algorithm are considerably lower. The results obtained for missing data rates of 15% (Table 5) and 20% (Table 6) are similar to those obtained for the missing rates of 10%.
Table 4. RMSE obtained with a 10% of missing data MICE and new proposed AAA package.
Table 4. RMSE obtained with a 10% of missing data MICE and new proposed AAA package.
RMSE MICE 10% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
119.680433.203820.805837.704239.621248.50600.22940.25011.83280.0031
217.590130.471222.972141.632428.166749.41470.30480.34191.70360.0032
317.723829.271722.871932.035237.952849.51320.26120.33221.81130.0030
416.166530.916420.128941.804028.457747.64960.27100.23441.83490.0033
518.873933.306520.984332.609640.973045.33240.24920.27681.65020.0031
Average18.006931.433921.552637.157135.034348.08320.26310.28711.76660.0032
RMSE NEW ALGORITHM 10% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
11.55561.60470.97581.56212.18201.80540.13200.14460.12330.0020
21.14171.08471.03341.56232.00751.89900.14410.13970.12060.0020
31.01861.00770.83252.67581.66841.85500.13660.14580.11700.0020
41.07501.12471.14101.45691.75981.69580.13490.15580.12780.0017
51.10561.09920.96801.67831.94871.82090.13310.13170.11280.0021
Average1.17931.18420.99011.78711.91331.81520.13610.14350.12030.0020
Table 5. RMSE obtained with a 15% of missing data MICE and new proposed AAA package.
Table 5. RMSE obtained with a 15% of missing data MICE and new proposed AAA package.
RMSE MICE 15% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
117.083728.897523.126938.230226.725945.77810.38160.23521.85780.0031
219.783131.629221.640644.417631.186750.59780.27330.42891.65150.0030
316.888732.157323.256534.870936.440449.87712.00800.37680.41980.0032
418.943230.806521.165543.272932.055843.27230.34580.33261.74070.0028
519.064730.026223.586132.440228.960944.97380.53760.25171.84020.0034
Average18.352730.703322.555138.646331.073946.89980.70920.32511.50200.0031
RMSE NEW ALGORITHM 15% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
11.09161.06250.99371.65621.78511.78740.13550.13140.11840.0021
21.14171.00610.99901.67991.92291.78430.12850.14460.12350.0019
31.11781.13111.08161.71742.35601.71110.13450.14600.12490.0021
41.51091.00431.16891.51171.97861.80570.12500.13940.12620.0018
51.11511.01091.03511.63812.55431.86370.12900.13640.13240.0019
Average1.19541.04301.05561.64062.11941.79040.13050.13960.12500.0020
Something similar to the RMSE occurs with the values obtained for the MAE metric. In this case, also, the values obtained with the new AAA package are significantly lower than those obtained for the MICE algorithm in the three cases: 10% (Table 7), 15% (Table 8), and 20% (Table 9). Please also note that the average improvement of the MAE values in the case of the proposed algorithm is 15 times greater than the MAE values obtained by means of the MICE algorithm.
Table 6. RMSE obtained with a 20% of missing data MICE and new proposed AAA package.
Table 6. RMSE obtained with a 20% of missing data MICE and new proposed AAA package.
RMSE MICE 20% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
116.553631.898622.510940.558432.803645.91280.27210.32501.87910.0032
218.588633.854123.396537.186028.711545.37111.92420.38570.42180.0031
318.000627.025722.417245.945129.749946.90850.46400.46571.84500.0031
418.473433.645522.639034.845542.467448.57270.26750.29511.68940.0032
519.018531.573922.986736.893630.823748.18720.28830.36181.82280.0029
Average18.126931.599622.790039.085732.911246.99050.64320.36671.53160.0031
RMSE NEW ALGORITHM 20% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
11.03031.00311.00811.55671.72952.19890.12930.15350.11680.0018
21.33000.96451.44211.69011.92251.99700.13860.14300.11970.0019
31.40281.07511.02091.54441.69052.12160.13840.14620.12560.0018
41.14100.94420.98361.72621.85541.75540.13070.13910.12790.0019
51.07601.02851.04641.69811.80162.16960.13960.13510.12330.0019
Average1.19601.00311.10031.64311.79992.04850.13530.14340.12260.0019
Table 7. MAE obtained with a 10% of missing data MICE and new proposed AAA package.
Table 7. MAE obtained with a 10% of missing data MICE and new proposed AAA package.
MAE MICE 10% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
115.366126.301516.478729.917830.643738.74310.17690.19101.38840.0026
213.463523.037319.093232.489123.015539.65580.21500.22341.28210.0026
313.981822.836018.253925.137529.544440.27280.20010.21071.36840.0024
412.908424.446916.214733.717022.321436.20610.22360.18921.39930.0026
514.913626.401116.714726.253532.138234.58870.19640.20521.25520.0025
Average14.126624.604517.351029.503027.532637.89330.20240.20391.33870.0025
MAE NEW ALGORITHM 10% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
10.89540.89440.74441.19661.70991.42100.10530.11340.09710.0016
20.90170.85680.83581.22131.55611.51290.11200.10930.09600.0016
30.82360.79700.65331.45661.27411.44500.11230.11400.08810.0016
40.86170.87080.92401.14011.40341.35990.10450.12200.09750.0013
50.91350.88860.79261.33321.59111.46100.10570.10440.09200.0016
Average0.87920.86150.79001.26961.50691.44000.10800.11260.09410.0016
For each of the ten variables involved in the present study, two-way ANOVA tests were performed in order to examine the influence of the kind of algorithm employed for the imputation (MICE versus proposed algorithm), the level of missing data (10%, 15% and 20%) and the interaction of both factors. These studies were carried out for the RMSE and MAE metrics. The influence of the model employed was found in all the variables for both metrics. Neither the percentage of missing data nor its interaction with the model employed for imputation were found to be significant in any of the variables. For the RMSE parameter the p-value was of p < 0.001 for the kind of model (MICE vs proposed algorithm) in the variables V a n , V b n , V c n , V a b , V b c , V c a , I c and P F , for I a was p = 0.044 and for I b p = 0.001 . For the RMSE, when considering the variable percentage of missing data, there were no statistically significant differences between percentages (10, 15 and 20%) and the following p-values were obtained: 0.980 for V a n , 0.885 for V b n , 0.106 for V c n , 0.921 for V a b , 0.591 for V b c , 0.770 for V c a , 0.523 for I a , 0.168 for I b , 0.800 for I c , and 0.784 for P F .
Table 8. MAE obtained with a 15% of missing data MICE and new proposed AAA package.
Table 8. MAE obtained with a 15% of missing data MICE and new proposed AAA package.
MAE MICE 15% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
112.781921.911818.684230.792921.532335.77180.24210.18451.40060.0024
215.683724.379617.132734.655224.397740.50200.19590.24891.30480.0023
312.983825.111018.394428.717628.771938.07551.51480.21570.22400.0025
414.373523.898617.465134.675225.777033.35850.21640.21171.30230.0022
514.916223.671718.203325.813123.063435.01130.26750.18771.42500.0026
Average14.147823.794517.975930.930824.708436.54380.48730.20971.13140.0024
MAE NEW ALGORITHM 15% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
10.88820.81960.78661.32801.44231.38840.10910.10530.09730.0017
20.91520.81580.76461.30881.47491.39370.10350.11170.09860.0016
30.87100.90640.82551.35771.52121.34800.10670.11210.09610.0016
40.99510.77570.91291.20651.56481.43410.10020.11140.09830.0014
50.86250.79920.79591.27261.77871.48670.10290.10430.10210.0015
Average0.90640.82340.81711.29471.55641.41020.10450.10890.09850.0015
Table 9. MAE obtained with a 20% of missing data MICE and new proposed AAA package.
Table 9. MAE obtained with a 20% of missing data MICE and new proposed AAA package.
MAE MICE 20% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
112.625225.200418.456033.259125.832835.24140.20070.21331.44310.0025
215.034926.672518.277928.868122.468534.94931.46750.24650.22720.0024
314.188720.776418.109336.614323.257736.05190.24380.24991.40360.0025
414.148225.918118.743528.144033.764637.74830.18930.20331.24350.0025
514.727025.337618.520328.829424.896738.15300.19710.21051.32530.0023
Average14.144824.781018.421431.143026.044136.42880.45970.22471.12850.0024
MAE NEW ALGORITHM 20% MISSING DATA
IterationVanVbnVcnVabVbcVcaIaIbIcPF
10.78140.81200.80391.23981.37211.45790.10120.11550.09260.0014
20.87450.76640.84751.33721.52381.53220.10960.11050.09500.0015
30.91770.87720.81671.21341.30971.45250.10830.11450.09710.0014
40.86240.74060.77191.36911.45221.38760.10460.10860.10280.0015
50.83500.80220.80321.36661.38941.49250.10960.10620.09670.0015
Average0.85420.79970.80861.30521.40941.46460.10670.11100.09690.0015
In the case of the metric MAE, the p-value was of p < 0.001 for the kind of model in the variables V an , V bn , V cn , V ab , V bc , V ca , I b , I c , and PF ; for I a the p-value was of 0.038. Additionally, for the MAE metric, in the case of the variable percentage of missing data, there are no statistically significant differences between percentages (10%, 15% and 20%) obtaining the following p-values: 0.990 for V an , 0.786 for V bn , 0.113 for V cn , 0.887 for V ab , 0.655 for V bc , 0.643 for V ca , 0.686 for I a , 0.315 for I b , 0.796 for I c , and 0.424 for PF .
Finally, all the calculi of both the MICE and the AAA algorithm was performed with a computer equipped with an Intel Xeon E5-1650 processor and 16 GB RAM. The average time of the MICE algorithm runs was of 123.54 s. The AAA algorithm average completion time was of 74.36 s with a standard deviation of 8.32 s. In both case the dataset was formed by 17,763 samples each on them with 10 variables.

5. Conclusions

The existence of harmonics in electrical installations is an unavoidable issue nowadays. The use of real-time data collection devices is indispensable. During the process collection, it is possible for some data to be missing and, in this context, the use of missing data imputation techniques is essential.
The algorithm proposed in this research greatly improves the results obtained by means of one of the most renowned and common techniques used today. From the point of view of the authors, this new algorithm is of great interest for applications like the one proposed in the present paper. In spite of the good performance of the proposed algorithm, it must be also be taken into account that the proposed algorithm, like many others, would have imputation problems in those cases in which most of the missing data belonged to the same column or to a reduced subset of columns. In future research the use of support vector machines (SVM) [23,25] and hybrid methods [26,27,28] will be explored by the authors in order to find a new algorithm with even higher performance. Furthermore, authors will try to study the nonlinear time varying systems and other power quality features, taken into account proposals like [29,30]. Finally, another research line that will be explored is the missing data imputation in the time-frequency domain. It consists on estimating missing regions of the time-frequency representation of signals [31]. In this kind of researches, the imputation methods also make use of harmonics for the imputation, considering for instance, that in a certain moment there is missing information but that not all the information of all the frequencies is necessary lost at the same time. The algorithms developed would be of interest for any kind of signals.
The estimation of missing data is required in many different applications, such as time series analysis. The use of missing data imputation techniques allows the creation of prediction models using incomplete datasets.

Acknowledgments

Francisco Javier de Cos Juez and Fernando Sánchez Lasheras appreciate support from the Spanish Economics and Competitiveness Ministry, through grant AYA2014-57648-P and the Government of the Principality of Asturias (Consejería de Economía y Empleo), through grant FC-15-GRUPIN14-017. We would also like to thank the linguistic expert Anthony Ashworth for his revision of the English grammar and spelling of the manuscript.

Author Contributions

Francisco Javier de Cos Juez, José Luis Calvo Rolle, Concepción Crespo Turrado and Fernando Sánchez Lasheras conceived the study. Andrés José Piñón Pazos and Francisco Javier de Cos Juez programmed the required algorithms. Fernando Sánchez Lasheras and Francisco Javier de Cos Juez interpreted the results and drafted the manuscript; Concepción Crespo Turrado, Andrés José Piñón Pazos and José Luis Calvo Rolle supervised the experimental data analysis; they also contributed to the critical revision and improvement of the paper. All of the authors have approved the final version of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chattopadhyay, S.; Mitra, M.; Sengupta, S. Electric Power Quality. In Electric Power Quality; Springer: Dordrecht, The Netherlands, 2011; pp. 5–12. [Google Scholar]
  2. Dixit, J.B.; Yadav, A. Electrical Power Quality; University Science Press: New Delhi, India, 2010. [Google Scholar]
  3. Stones, J.; Collinson, A. Power quality. Power Eng. J. 2001, 15, 58–64. [Google Scholar] [CrossRef]
  4. Ferreira, D.D.; de Seixas, J.M.; Cerqueira, A.S.; Duque, C.A.; Bollen, M.H.J.; Ribeiro, P.F. A new power quality deviation index based on principal curves. Electr. Power Syst. Res. 2015. [Google Scholar] [CrossRef]
  5. Mahela, O.P.; Shaik, A.G.; Gupta, N. A critical review of detection and classification of power quality events. Renew. Sustain. Energy Rev. 2015. [Google Scholar] [CrossRef]
  6. Granados-Lieberman, D.; Valtierra-Rodriguez, M.; Morales-Hernandez, L.; Romero-Troncoso, R.; Osornio-Rios, R. A Hilbert Transform-Based Smart Sensor for Detection, Classification, and Quantification of Power Quality Disturbances. Sensors 2013, 13, 5507–5527. [Google Scholar] [CrossRef] [PubMed]
  7. Granados-Lieberman, D.; Romero-Troncoso, R.J.; Cabal-Yepez, E.; Osornio-Rios, R.A.; Franco-Gasca, L.A. A Real-Time Smart Sensor for High-Resolution Frequency Estimation in Power Systems. Sensors 2009, 9, 7412–7429. [Google Scholar] [CrossRef] [PubMed]
  8. Lim, Y.; Kim, H.-M.; Kang, S. A design of wireless sensor networks for a power quality monitoring system. Sensors 2010, 10, 9712–9725. [Google Scholar] [CrossRef] [PubMed]
  9. Turrado, C.; López, M.; Lasheras, F.; Gómez, B.; Rollé, J.; Juez, F. Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions. Sensors 2014, 14, 20382–20389. [Google Scholar] [CrossRef] [PubMed]
  10. www.electroind.com. Available online: http://www.electroind.com/products/ (accessed on 8 December 2015).
  11. Kammler, D.W. A First Course in Fourier Analysis; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
  12. Friedman, J.H. Multivariate Adaptive Regression Splines. Ann. Stat. 1991, 19, 1–67. [Google Scholar] [CrossRef]
  13. García Nieto, P.J.; Martínez Torres, J.; de Cos Juez, F.J.; Sánchez Lasheras, F. Using multivariate adaptive regression splines and multilayer perceptron networks to evaluate paper manufactured using Eucalyptus globulus. Appl. Math. Comput. 2012, 219, 755–763. [Google Scholar] [CrossRef]
  14. Guzmán, D.; Juez, F.J.C.; Myers, R.; Guesalaga, A.; Lasheras, F.S. Modeling a MEMS deformable mirror using non-parametric estimation techniques. Opt. Expr. 2010, 18, 21356–21369. [Google Scholar] [CrossRef] [PubMed]
  15. García Nieto, P.J.; Alonso Fernández, J.R.; Sánchez Lasheras, F.; de Cos Juez, F.J.; Díaz Muñiz, C. A new improved study of cyanotoxins presence from experimental cyanobacteria concentrations in the Trasona reservoir (Northern Spain) using the MARS technique. Sci. Total Environ. 2012, 430, 88–92. [Google Scholar] [CrossRef] [PubMed]
  16. De Cos Juez, F.J.; Lasheras, F.S.; García Nieto, P.J.; Suarez, M.A.S. A new data mining methodology applied to the modelling of the influence of diet and lifestyle on the value of bone mineral density in post-menopausal women. Int. J. Comput. Math. 2009, 86, 1878–1887. [Google Scholar] [CrossRef]
  17. Machon-Gonzalez, I.; Lopez-Garcia, H.; Calvo-Rolle, J.L. A hybrid batch SOM-NG algorithm. In Proceedings of the 2010 International Joint Conference on Neural Networks, Barcelona, Spain, 18–23 July 2010.
  18. De Andrés, J.; Lorca, P.; de Cos Juez, F.J.; Sánchez-Lasheras, F. Bankruptcy forecasting: A hybrid approach using Fuzzy c-means clustering and Multivariate Adaptive Regression Splines (MARS). Expert Syst. Appl. 2011, 38, 1866–1875. [Google Scholar] [CrossRef]
  19. Van Buuren, S.; Groothuis-Oudshoorn, K. Mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 1, 1–67. [Google Scholar] [CrossRef]
  20. Roberts, G.O. Markov chain concepts related to sampling algorithms. In Markov Chain Monte Carlo in Practice; Gilks, W.R., Richardson, S., Spiegelhalter, D.J., Eds.; Chapman and Hall: London, UK, 1996; pp. 45–47. [Google Scholar]
  21. Tierney, L. Introduction to general state-space Markov chain theory. In Markov Chain Monte Carlo in Practice; Gilks, W.R., Richardson, S., Spiegelhalter, D.J., Eds.; Chapman and Hall: London, UK, 1996; pp. 59–71. [Google Scholar]
  22. Van Buuren, S. Flexible Imputation of Missing Data; Chapman & Hall/CRC Press: London, UK, 2012. [Google Scholar]
  23. Liu, Y.; Brown, S.D. Comparison of five iterative imputation methods for multivariate classification. Chemom. Intell. Lab. Syst. 2013, 120, 106–115. [Google Scholar] [CrossRef]
  24. Chai, T.; Draxler, R.R. Root mean square error ( RMSE ) or mean absolute error (MAE)? —Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  25. Álvarez Antón, J.C.; García Nieto, P.J.; de Cos Juez, F.J.; Sánchez Lasheras, F.; González Vega, M.; Roqueñí Gutiérrez, M.N. Battery state-of-charge estimator using the SVM technique. Appl. Math. Model. 2013, 37, 6244–6253. [Google Scholar] [CrossRef]
  26. García Nieto, P.J.; Alonso Fernández, J.R.; de Cos Juez, F.J.; Sánchez Lasheras, F.; Díaz Muñiz, C. Hybrid modelling based on support vector regression with genetic algorithms in forecasting the cyanotoxins presence in the Trasona reservoir (Northern Spain). Environ. Res. 2013, 122, 1–10. [Google Scholar] [CrossRef] [PubMed]
  27. Quintian, H.; Calvo-Rolle, J.L.; Corchado, E. A hybrid regression system based on local models for solar energy prediction. Informatica 2014, 25, 265–282. [Google Scholar] [CrossRef]
  28. Manuel Vilar-Martinez, X.; Montero-Sousa, J.A.; Calvo-Rolle, J.L.; Casteleiro-Roca, J.L. Expert system development to assist on the verification of “TACAN” system performance. Dyna 2014, 89, 112–121. (In Spanish) [Google Scholar]
  29. Viveros, R.A.; Yuz, J.I.; Perez-Ibacache, R.R. Simultaneous State and Parameter Estimation for a Nonlinear Time-Varying System. Rev. Iberoam. Autom. Inform. Ind. 2014, 11, 263–274. [Google Scholar] [CrossRef]
  30. Farias, G.; Dormido-Canto, S.; Vega, J.; Santos, M.; Pastor, I.; Fingerhuth, S.; Ascencio, J. Iterative noise removal from temperature and density profiles in the TJ-II Thomson scattering. Fusion Eng. Des. 2014, 89, 761–765. [Google Scholar] [CrossRef]
  31. Smaragdis, P.; Raj, B.; Shashanka, M. Missing Data Imputation for Time-Frequency Representations of Audio Signals. J. Signal Process. Syst. 2011, 65, 361–370. [Google Scholar] [CrossRef]
Back to TopTop