Article

Generative Adversarial Networks Based Heterogeneous Data Integration and Its Application for Intelligent Power Distribution and Utilization

1
Beijing Key Laboratory of Distribution Transformer Energy-Saving Technology, China Electric Power Research Institute, Beijing 100192, China
2
Department of Engineering Physics, Tsinghua University, Beijing 100084, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2018, 8(1), 93; https://doi.org/10.3390/app8010093
Submission received: 31 October 2017 / Revised: 5 January 2018 / Accepted: 7 January 2018 / Published: 11 January 2018


Abstract

The heterogeneous characteristics of big data systems for intelligent power distribution and utilization have become increasingly prominent, which poses new challenges for traditional data analysis technologies and restricts the comprehensive management of distribution network assets. To address the problem that heterogeneous data resources of power distribution systems are difficult to utilize effectively, a novel heterogeneous data integration method based on generative adversarial networks (GANs) is proposed for intelligent power distribution and utilization. In the proposed method, GANs theory is introduced to expand the distribution of the complete data samples. A so-called peak clustering algorithm is then proposed to construct a finite open coverage of the expanded sample space and to repair the incomplete samples, thereby eliminating the heterogeneous characteristics. Finally, the well-trained discriminator of the GANs is employed to check the restored data samples and thus realize the integration of heterogeneous data for intelligent power distribution and utilization. Simulation experiments verified the validity and stability of the proposed heterogeneous data integration method, which provides a novel perspective for further data quality management of power distribution systems.

1. Introduction

With the rapid development of smart grids and sensing technology, user-side data in China's power systems have shown high complexity and redundancy. Since 2011, the user-side data volume of China's power distribution systems has grown rapidly from the GB to the TB and even the PB level, gradually forming a big data system. Facing the era of big data, power companies have not only improved traditional relational database systems such as MySQL and Oracle, but also deployed many new big data systems such as HBase and GBase. Together, these database systems form a multi-source heterogeneous big data system for intelligent power distribution and utilization (IPDU) [1,2,3,4]. On the other hand, affected by local economic levels, the monitoring and testing conditions for distribution network equipment differ considerably among local power companies and manufacturers, and part of the complex monitoring and testing equipment cannot reasonably be purchased repeatedly, which further increases the heterogeneity of IPDU big data.
IPDU big data contains the operation data and external scene information of distribution networks, and supports further decision analysis for the planning, construction, operation, maintenance, and other business sectors of power distribution systems. Through deep mining of multi-source data, the current situation and future trends of distribution networks can be accurately analyzed and intelligent distribution networks can be controlled effectively [5,6]. However, traditional data analysis and decision technologies for distribution networks require all data samples to share the same monitoring and testing indexes, which makes the heterogeneous IPDU big data difficult to utilize effectively and results in a huge waste of data resources. Because of the structural heterogeneity among existing database systems and the limitations of data quality caused by both objective and subjective factors, the IPDU big data system cannot directly meet the requirements of traditional data analysis and decision technologies. At the same time, parts of the samples in IPDU big data fall into a small-sample environment, which poses great challenges for the data quality management of distribution networks and makes it difficult to guarantee the accuracy of analysis and decisions [7,8]. Therefore, the study of heterogeneous data integration technology has a significant influence on future intelligent power distribution and utilization, and can also greatly improve the accuracy and efficiency of distribution network operation and decision-making.
So far, data integration research in intelligent power distribution and utilization can be broadly divided into two categories, i.e., time-series data integration and index data integration. The research subjects of the former mainly include load data, distributed power output, wind speed of wind turbines, etc. These data integration technologies are relatively mature and already form a fairly complete research system, in which grey relational analysis, collaborative filtering, Markov chains, support vector machines, and neural networks are widely used as data analysis tools [9,10,11,12]. The research subjects of index data integration mainly include structured and semi-structured data in addition to time-series data. Many experts and scholars have carried out inspiring works in this field. For example, Liu et al. introduced low-rank and sparsity theory into data integration to detect false data injection in power grids [13]; Xu et al. proposed an XLPE power cable lifetime evaluation method employing low-rank matrix completion [14]; and Mateos et al. used robust nonparametric regression via sparsity control to perform data cleaning and repair tasks [15].
Although the techniques mentioned above can meet the accuracy requirements of some engineering applications, they still cannot satisfy the efficiency requirements. To deal with IPDU big data problems, some experts and scholars have turned to data integration based on machine learning. Yu et al. proposed an extreme learning machine based missing data completion method [16]. Li et al. and Socher et al. introduced deep learning theory to fulfill incomplete data restoration and integration tasks, respectively [17,18]. However, these studies do not take the small-sample environment of IPDU data into account, so they are difficult to apply directly in actual projects, and the integration of heterogeneous data in distribution networks remains unsatisfactory.
To solve the problem that heterogeneous data resources for intelligent power distribution and utilization are difficult to utilize effectively in a small-sample environment, a novel heterogeneous data integration technology based on generative adversarial networks (GANs-HDI) is proposed. In the proposed GANs-HDI method, the sample space is expanded by employing the generator of the GANs of Goodfellow et al. [19,20], based on the targeted samples whose measurement indexes are all complete. To eliminate the heterogeneous characteristics, a so-called peak clustering algorithm is proposed to construct a finite open coverage of the expanded sample space and to repair the incomplete samples. Finally, the repaired samples are checked by the well-trained discriminator of the GANs. In this way, GANs learning and clustering theory form a closed loop that greatly improves heterogeneous data integration performance. The proposed method helps realize efficient integration of heterogeneous data and also provides a novel perspective for further data quality management in power companies.

2. Generative Adversarial Networks Based Sample Space Expansion

Facing the big data for intelligent power distribution and utilization (IPDU), power companies have not only improved traditional relational database systems such as MySQL and Oracle, but also deployed many new big data systems such as HBase and GBase. Together, these database systems provide a large number of multi-source heterogeneous samples, as shown in Figure 1. The targeted samples form a real space in which, according to the Heine-Borel theorem, a limited number of open intervals can be chosen to form a finite open coverage of the targeted sample set. Within each open interval, the samples hold the same data characteristics and can support other samples with missing indexes. However, in some small-sample environments, there are not always enough samples for the data completion and integration tasks in all of the open intervals. Therefore, to obtain satisfying data integration results, this paper introduces generative adversarial theory to enrich the sample space.
Generative adversarial networks (GANs) are generative models derived from the two-player zero-sum game, in which a generator model and a discriminator model participate. The generator is designed to learn the distribution of the training data, while the discriminator is designed to estimate the probability that a targeted data sample comes from the training data rather than from the generator. Through mutual confrontation and iterative optimization, both models improve their performance, extend the targeted sample set, improve the discrimination ability, and eventually approach the Nash equilibrium [19]. As one of the most exciting ideas in machine learning over the last decade, GANs theory has been widely used in image and graphics processing, natural language processing, computer virus monitoring, chess game programming, etc.
Inspired by the works of Goodfellow and Springenberg [20,21,22], GANs theory is employed in this paper to expand the targeted sample space. First of all, a targeted data set $D = \{d_i\}_{i=1}^{N}$ with all measurement indexes complete is constructed, where $N$ is the number of samples in the data set. The GANs algorithm is used to train the generator $G$ and the discriminator $D$ on the TensorFlow platform. Taking $D = \{d_i\}_{i=1}^{N}$ as inputs and zeros as outputs, the discriminator $D$ can be initialized in TensorFlow according to the following equation:
$$D(d_i) = \sum_{j=1}^{L}\sum_{i=1}^{N} \beta_j\, g\left(\omega_i^{T} d_i + b_j\right)$$
where $L$ is the number of hidden neural nodes, $\omega_i \in \mathbb{R}^K$ is the input weight vector of the $i$-th hidden neural node, $\beta_j \in \mathbb{R}$ and $b_j \in \mathbb{R}$ represent the output weight and threshold value of the $j$-th hidden neural node, respectively, and $g(\cdot): \mathbb{R} \to \mathbb{R}$ is the activation function of the neural network.
Furthermore, the generator $G$ and the discriminator $D$ are trained simultaneously: the parameters of $G$ are adjusted to minimize $\log(1 - D(d))$ for generated samples, while the parameters of $D$ are adjusted to maximize $\log D(d)$, as if the two models were playing the two-player min-max game with value function $v(G, D)$ [21]:

$$G = \sum_{j=1}^{L}\sum_{i=1}^{N} \beta_j\, f\left(\langle \varpi_i, z \rangle + b_j\right) = \arg\min_{G}\max_{D} v(G, D)$$

$$\mathrm{s.t.}\quad v\left(\theta^{(G)}, \theta^{(D)}\right) = \mathbb{E}_{d \sim D}\left[\log D(d)\right] + \mathbb{E}_{d \sim \tilde{D}}\left[\log\left(1 - D(d)\right)\right]$$
where $\tilde{D} = \{d_i\}_{i=1}^{M}$ is the data set consisting of the new samples generated by the generator $G$; $L$ is the number of hidden neural nodes; $\varpi_i \in \mathbb{R}^P$ is the input weight vector of the $i$-th hidden neural node; $\beta_j \in \mathbb{R}^K$ and $b_j \in \mathbb{R}$ represent the output weights and threshold value of the $j$-th hidden neural node, respectively; and $f(\cdot): \mathbb{R} \to \mathbb{R}$ is the activation function of the neural network. Using the well-trained generator $G$, $M$ new data samples can be generated by taking a random vector set $Z = \{z_i \in \mathbb{R}^P\}_{i=1}^{M}$ as the input. Then, taking $D = \{d_i\}_{i=1}^{N}$ and $\tilde{D} = \{d_i\}_{i=1}^{M}$ as inputs, with zeros and ones as the corresponding outputs, the discriminator $D$ is trained and renewed.
Finally, the discriminator $D$ is used to determine whether the discrimination probability of the newly generated samples falls within the interval $[0.5 - c, 0.5 + c]$. If this condition is satisfied, the generator $G$ has converged well; the newly generated sample set $\tilde{D}$ and the original data set $D$ are then combined and denoted as $D_{GANs} = \{d_i\}_{i=1}^{N+M}$ for the subsequent data restoration. Otherwise, the discriminative error of $D$ is back propagated to retrain the generator $G$. The calculation process of the GANs is illustrated in Figure 2.
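To make this expansion step concrete, the following minimal sketch (not the authors' implementation) trains a small GAN with tf.keras and applies the acceptance band described above. The sample dimension, noise dimension, layer sizes, learning rates, and the random complete_samples matrix are placeholder assumptions, and the standard convention of labeling real samples as ones and generated samples as zeros is used.

```python
import numpy as np
import tensorflow as tf

K_DIM, P_DIM, L_NODES, c = 8, 4, 10, 0.05       # sample dim, noise dim, hidden nodes, margin (assumed)
complete_samples = np.random.rand(200, K_DIM).astype("float32")   # stand-in for the complete IPDU samples

# Single-hidden-layer generator G and discriminator D, mirroring the forms above.
G = tf.keras.Sequential([tf.keras.layers.Dense(L_NODES, activation="sigmoid"),
                         tf.keras.layers.Dense(K_DIM)])
D = tf.keras.Sequential([tf.keras.layers.Dense(L_NODES, activation="sigmoid"),
                         tf.keras.layers.Dense(1, activation="sigmoid")])
bce = tf.keras.losses.BinaryCrossentropy()
g_opt, d_opt = tf.keras.optimizers.Adam(1e-3), tf.keras.optimizers.Adam(1e-3)

for step in range(2000):
    z = tf.random.normal((complete_samples.shape[0], P_DIM))
    # Discriminator step: push D(real) toward 1 and D(generated) toward 0, i.e., maximize v(G, D).
    with tf.GradientTape() as tape:
        fake = G(z)
        d_loss = bce(tf.ones_like(D(complete_samples)), D(complete_samples)) + \
                 bce(tf.zeros_like(D(fake)), D(fake))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables), D.trainable_variables))
    # Generator step: minimize log(1 - D(G(z))) via the equivalent cross-entropy form.
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones_like(D(G(z))), D(G(z)))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables), G.trainable_variables))

# Acceptance check: keep generated samples whose discrimination probability lies in [0.5 - c, 0.5 + c].
candidates = G(tf.random.normal((100, P_DIM))).numpy()
scores = D(candidates).numpy().ravel()
accepted = candidates[np.abs(scores - 0.5) <= c]
expanded_set = np.vstack([complete_samples, accepted])   # expanded sample space for the restoration step
print(expanded_set.shape)
```

In this sketch, the accepted samples play the role of $\tilde{D}$ and the stacked matrix plays the role of $D_{GANs}$ in the notation above.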

3. Peak Clustering Based Data Restoration

Based on the proliferation of data samples by the generator discussed above, a so-called peak clustering based incomplete data restoration method is proposed in this section. To overcome the restoration failures of traditional algorithms on linearly inseparable data, the proposed method constructs as few open intervals as possible with a fixed neighborhood radius for all data samples. The set of open intervals then forms a finite open coverage, which avoids the interference of linearly inseparable data samples with the clustering results, as shown in Figure 3.
Inspired by Rodriguez's work in Reference [23], a peak clustering algorithm is proposed for incomplete data restoration to improve the calculation efficiency while sustaining the restoration precision. Supposing that a finite open interval $Coverage_i(d')$ contains $n_i$ data samples $\{d_j\}_{j=1}^{n_i}$, the peak distance of each sample (the distance between the data sample and the density peak point) $Dist(d_j, Temp\_Peaks_i(d'))$ is calculated first. Then, $n_i$ clusters are constructed around the density peak point according to the phase angle, each initially containing only one sample [24].
If the absolute value of the peak distance difference between clusters with similar phase angles is no larger than the threshold value $k$, the two clusters are combined, and the density peak point and peak distance of the new cluster are calculated. These operations are repeated until the absolute value of the peak distance difference between clusters with similar phase angles becomes larger than the threshold value $k$, or the total number of clusters becomes 1. The iteration then ends, and the clustering result of the last iteration is output.
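A minimal sketch of this coverage-and-merge procedure is given below. The neighborhood radius R1, the centre-removal radius R2, the phase-angle ordering, and the merge threshold k follow the description above, while the helper names and the two-dimensional toy data are assumptions made purely for illustration.

```python
import numpy as np

def finite_open_coverage(data, R1, R2):
    """Greedily pick centres and collect the open interval (radius R1) around each;
    points within R2 of a chosen centre are removed from the candidate centre set."""
    centres, coverage, peaks = list(range(len(data))), [], []
    while centres:
        c = centres[0]
        members = [j for j in range(len(data)) if np.linalg.norm(data[j] - data[c]) <= R1]
        coverage.append(members)
        peaks.append(data[members].mean(axis=0))          # density peak point of the interval
        centres = [j for j in centres if np.linalg.norm(data[j] - data[c]) > R2]
    return coverage, np.array(peaks)

def peak_clustering(data, peak, k):
    """Merge singleton clusters (ordered by phase angle around the peak) whenever the
    difference of their peak distances is no larger than the threshold k."""
    angles = np.arctan2(data[:, 1] - peak[1], data[:, 0] - peak[0])
    order = np.argsort(angles)                            # ordering by phase angle
    dists = np.linalg.norm(data - peak, axis=1)
    clusters = [[i] for i in order]
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for i in range(len(clusters) - 1):
            d_i = min(dists[j] for j in clusters[i])
            d_next = min(dists[j] for j in clusters[i + 1])
            if abs(d_i - d_next) <= k:
                clusters[i] += clusters.pop(i + 1)        # combine neighbouring clusters
                merged = True
                break
    return clusters

# Toy usage on 2-D samples (illustration only).
rng = np.random.default_rng(1)
pts = rng.random((30, 2))
cov, pks = finite_open_coverage(pts, R1=0.4, R2=0.25)
first = pts[cov[0]]
print(len(cov), "open intervals;", len(peak_clustering(first, pks[0], k=0.1)), "clusters in the first interval")
```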
Finally, after the peak clustering of the targeted data samples, the weighted averages of the corresponding values in the complete samples are used as the predictive values of the missing data. The concept of entropy in information theory is introduced, and the weighting coefficients are determined by the similarity between data samples. The overall process of peak clustering based data restoration is summarized in Table 1.
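The entropy-based weighting of Step 3 in Table 1 can be sketched as follows, mirroring $p_j = s_j / \sum_j s_j$, $h_j = -p_j \ln p_j$, $w_j = (1 - h_j)/(n_i - \sum_j h_j)$, and $f = \sum_j w_j x_j$. Taking the similarity as the inverse Euclidean distance is an assumption made here; the paper only states that the Euclidean distance is used as the similarity measure.

```python
import numpy as np

def entropy_weighted_restore(incomplete, complete_group, missing_index):
    """Predict one missing attribute of `incomplete` from the complete samples in its cluster.

    Similarities are taken as inverse Euclidean distances over the observed attributes (assumption),
    normalized to p_j, turned into entropies h_j, and converted to weights w_j.
    """
    observed = [i for i in range(len(incomplete)) if i != missing_index and not np.isnan(incomplete[i])]
    dist = np.linalg.norm(complete_group[:, observed] - incomplete[observed], axis=1)
    s = 1.0 / (dist + 1e-12)                         # similarity derived from Euclidean distance
    p = s / s.sum()                                  # normalized similarity
    h = -p * np.log(p + 1e-12)                       # entropy of each complete sample
    w = (1.0 - h) / (len(complete_group) - h.sum())  # entropy-based weights
    return float(np.dot(w, complete_group[:, missing_index]))

# Toy usage: restore attribute 2 of a sample from three complete neighbours (illustration only).
group = np.array([[0.2, 0.5, 0.9], [0.3, 0.4, 0.8], [0.1, 0.6, 1.0]])
broken = np.array([0.25, 0.45, np.nan])
print(entropy_weighted_restore(broken, group, missing_index=2))
```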

4. Realization of GANs Based Heterogeneous Data Integration

To solve the problem that IPDU heterogeneous data resources are difficult to utilize effectively in a small-sample environment, a novel generative adversarial networks based heterogeneous data integration technology (GANs-HDI) is proposed in this paper. In the GANs-HDI method, the sample space is first expanded by the GANs, based on the targeted samples whose measurement indexes are all complete. On the sample set expanded by the generative model, a peak clustering model is constructed to realize the finite open coverage of the expanded sample space, and the incomplete samples are repaired with the entropy-based weighting function. Finally, all repaired samples are checked by the well-trained discriminator of the GANs to guarantee the heterogeneous data integration performance. The overall process of GANs based heterogeneous data integration is shown in Figure 4.
In this paper, all data samples whose measurement indexes are complete are selected from the heterogeneous database $D = \{d_i\}_{i=1}^{N}$ and denoted as the dataset $\bar{D} = \{d_i\}_{i=1}^{\bar{N}}$ for training the generator $G$ and discriminator $D$ of the GANs. Then, taking $\bar{D} = \{d_i\}_{i=1}^{\bar{N}}$ and the set $\tilde{D} = \{d_i\}_{i=1}^{M}$ generated by $G$ as inputs, with zeros and ones as the corresponding outputs, the discriminator $D$ is trained and renewed. Finally, if the data samples generated by $G$ meet the iterative termination condition, the newly generated sample set $\tilde{D}$ is combined with the original dataset $\bar{D}$, the combination is denoted as $D_{GANs} = \bar{D} \cup \tilde{D} = \{d_i\}_{i=1}^{\bar{N}+M}$, and the peak clustering based data restoration task is performed; otherwise, the discriminant error of $D$ is back propagated to retrain the generator $G$, as shown in Table 2.
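Putting the pieces together, the closed loop of Table 2 can be sketched as follows. The helper functions are deliberately crude stand-ins for the GAN training, generator-based expansion, peak clustering restoration, and discriminator check described above, not routines defined by the paper; only the control flow (expand, restore, verify, reduce c by α and retry) reflects Table 2.

```python
import numpy as np

# --- Minimal stand-ins for the routines sketched in Sections 2 and 3 (not the paper's code) ---
def train_gans(complete):
    return None, None                                          # placeholder generator/discriminator

def expand_with_generator(G, D, complete, c):
    return complete + 0.01 * np.random.randn(*complete.shape)  # pretend-generated samples

def peak_cluster_restore(database, expanded):
    restored = database.copy()
    col_means = np.nanmean(np.vstack([database, expanded]), axis=0)
    idx = np.where(np.isnan(restored))
    restored[idx] = np.take(col_means, idx[1])                 # crude stand-in for Table 1
    return restored

def discriminator_accepts(D, restored, c):
    return not np.isnan(restored).any()                        # accept once nothing is missing

def gans_hdi(database, c=0.05, alpha=0.0005, max_rounds=20):
    """Closed-loop GANs-HDI integration (sketch of Table 2): expand, restore, verify, adjust c."""
    complete = database[~np.isnan(database).any(axis=1)]       # D_bar: samples with all indexes complete
    restored = database
    for _ in range(max_rounds):
        G, D = train_gans(complete)                            # Step 2: train generator/discriminator
        expanded = np.vstack([complete,
                              expand_with_generator(G, D, complete, c)])   # D_GANs
        restored = peak_cluster_restore(database, expanded)    # Step 3: repair incomplete samples
        if discriminator_accepts(D, restored, c):              # Step 4: verify the repaired samples
            return restored                                    # integrated dataset D_hat
        c -= alpha                                             # reduce the threshold c by the pace alpha
    return restored

# Toy usage: a 6 x 3 database with two missing entries (illustration only).
db = np.array([[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, np.nan, 3.1],
               [1.2, 1.9, np.nan], [1.0, 2.2, 3.0], [1.1, 2.0, 3.2]])
print(gans_hdi(db))
```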

5. Simulation Experiments and Result Analysis

In this section, the simulation experiments are divided into two parts, i.e., data restoration on University of California Irvine (UCI) standard datasets and heterogeneous data integration on intelligent power distribution and utilization datasets. The former verifies the validity and stability of the proposed GANs-HDI algorithm, while the latter tests its actual effect on IPDU heterogeneous data on the TensorFlow platform. All of the following simulation experiments were performed in the Matlab 2012a and JetBrains PyCharm 2017.2 environments, respectively, with a Core-TM [email protected] processor and an NVIDIA GeForce 840M.

5.1. Simulation Experiments on UCI Standard Datasets

The simulation experiment introduced three UCI standard datasets, i.e., 'Abalone', 'Heart Disease', and 'Bank Marketing', for performance comparison of data restoration in the Matlab 2012a environment. In this simulation experiment, the proportion of incomplete samples among the total samples was set to 20%, and the information loss rate was 25%. Incomplete data samples and missing indexes were randomly selected. Detailed information on the three UCI standard datasets is shown in Table 3. Taking the 'Abalone' dataset as an example, 60 samples were randomly selected from the total of 4177 data samples as the incomplete samples. In each of these 60 samples, two indexes were randomly picked out and their corresponding information deleted, forming the set of data samples to be repaired.
To verify the data restoration performance of the proposed GANs-HDI algorithm on the UCI standard datasets, k-nearest neighbors (k-NN), error back propagation (BP), matrix completion [14], deep learning [18], and the peak clustering proposed in Section 3 were chosen as control groups, with parts of the model parameters selected by experience. Specifically, in the k-NN algorithm the cluster number was equal to the number of sample classes. In the BP algorithm, the numbers of hidden neural nodes were set to 10/25/12 and the number of layers to 8 for the three UCI standard datasets, with the Sigmoid function as the activation function. In the deep learning algorithm, the number of layers was set to 8, and the numbers of hidden neural nodes were set to [15, 12, 12, 10, 10, 8, 8, 8]/[35, 24, 24, 17, 17, 15, 15, 15]/[20, 18, 18, 15, 15, 12, 12, 12] for the three UCI standard datasets, with the Sigmoid function as the activation function. In the proposed peak clustering and GANs-HDI algorithms, the threshold of the discrimination rate was set to c = 0.05, the reducing pace to α = 0.0015, the numbers of hidden neural nodes to L = 10/25/12, the initialized threshold values to R1 = 0.85 and R2 = 0.6, and the clustering threshold to k = 2; the Sigmoid function was selected as the activation function, and the new generation proportion was set to N̄/N = 0.5.
The incomplete data samples were repaired with the k-NN, peak clustering, BP, matrix completion [14], deep learning [18], and GANs-HDI algorithms, respectively. Then, a support vector machine (SVM) was used to determine whether the categories of the restored samples were correct, and the accuracy values were calculated. Ten trials were repeated independently, and the averages and root mean squared errors (RMSE) of the accuracy values of the data restoration results were calculated, as shown in Table 4 (more details in Table A1).
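A minimal sketch of this repair-then-classify evaluation protocol, assuming scikit-learn and a generic restore(...) routine standing in for each compared algorithm, could look as follows; the RMSE here is taken about the trial mean, which is an assumption about the paper's definition.

```python
import numpy as np
from sklearn.svm import SVC

def evaluate_restoration(restore, X, y, trials=10, incomplete_ratio=0.2, loss_rate=0.25, seed=0):
    """Repeat the repair-then-classify protocol and report mean accuracy and RMSE.

    `restore` is a stand-in for one of the compared algorithms: it receives a copy of X
    with np.nan holes and must return a fully filled matrix of the same shape.
    """
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(trials):
        X_holed = X.copy().astype(float)
        rows = rng.choice(len(X), size=int(incomplete_ratio * len(X)), replace=False)
        n_missing = max(1, int(loss_rate * X.shape[1]))
        for r in rows:                                    # randomly delete indexes in the chosen samples
            cols = rng.choice(X.shape[1], size=n_missing, replace=False)
            X_holed[r, cols] = np.nan
        X_restored = restore(X_holed)
        clf = SVC().fit(np.delete(X_restored, rows, axis=0), np.delete(y, rows))
        accuracies.append(clf.score(X_restored[rows], y[rows]))   # check categories of restored samples
    acc = np.array(accuracies)
    return acc.mean(), np.sqrt(np.mean((acc - acc.mean()) ** 2))  # average accuracy and RMSE (about the mean)

# Toy usage with column-mean imputation as the stand-in restoration algorithm.
def mean_impute(Xh):
    filled = Xh.copy()
    means = np.nanmean(filled, axis=0)
    idx = np.where(np.isnan(filled))
    filled[idx] = np.take(means, idx[1])
    return filled

X_demo = np.random.default_rng(1).random((100, 8))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
print(evaluate_restoration(mean_impute, X_demo, y_demo))
```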
According to the data shown in Table 4, the time consumption of the k-NN, matrix completion, and peak clustering algorithms is of almost the same order, while peak clustering performed much better than the traditional k-NN and matrix completion algorithms in the accuracy of data restoration, and showed a particularly prominent repair effect for linearly inseparable data samples. The data restoration performance of the BP and deep learning [18] algorithms exceeded that of the peak clustering algorithm on the UCI datasets. However, the RMSE of BP was far from the requirement, so the BP algorithm is not stable enough for direct engineering application. It is worth noting that the GANs-HDI algorithm is far superior to the other control groups in both accuracy and RMSE, with a lead of 20–35 percentage points in accuracy. However, the algorithm takes longer to run, and needs to rely on regularization constraints and distributed computing technologies to improve its convergence efficiency.
In summary, the data restoration performance of GANs-HDI on the UCI standard datasets was outstanding compared with the k-NN, BP, matrix completion, peak clustering, and deep learning algorithms. The validity and stability of the proposed GANs-HDI algorithm are thus verified through the simulation comparison experiments. Furthermore, the experimental results on the UCI standard data also showed that the sample number may have a great influence on the final performance of the GANs-HDI algorithm.

5.2. Simulation Experiments on Intelligent Power Distribution and Utilization Dataset (I)

In this section, the simulation experiment took the power cable test data of sixty 22 kV XLPE power cable samples for performance comparison of heterogeneous data integration in the JetBrains PyCharm 2017.2 environment. The power cable tests include the accelerated thermal aging tensile fracture test, the accelerated thermal extension test, differential scanning calorimetry, the breakdown test, and the DC leakage current test. In this simulation experiment, the proportion of incomplete samples among the total samples was set to 20%, and the information loss rate was set to 15%. Incomplete data samples and missing indexes were randomly selected. According to the incomplete sample proportion, 12 samples were randomly selected from the total of 60 data samples as the incomplete samples. Then, in each of these chosen samples, two indexes were randomly picked out and their corresponding information deleted according to the information loss rate, forming the set of data samples to be repaired. The insulating state test indicators of the 22 kV XLPE power cables are shown in Table 5. After the data restoration, a support vector machine (SVM) was employed to predict the relative aging times of the targeted power cable samples, where 15 samples were treated as the test group and the other 45 samples as the training group.
To verify the data integration performance of the proposed GANs-HDI algorithm, k-nearest neighbors (k-NN) and error back propagation (BP) were chosen as control groups, with parts of the model parameters selected by experience. In the k-NN algorithm, the cluster number was equal to the number of sample classes. In the BP algorithm, the number of hidden neural nodes was set to 10, with the Sigmoid function as the activation function. In the proposed GANs-HDI algorithm, the threshold of the discrimination rate was set to c = 0.05, the reducing pace to α = 0.0005, the number of hidden neural nodes to L = 10, the initialized threshold values to R1 = 0.85 and R2 = 0.6, and the clustering threshold to k = 2; the Sigmoid function was chosen as the activation function, and the new generation proportion N̄/N was set to 0.5.
The incomplete data samples were repaired with the k-NN, BP, and GANs-HDI algorithms, respectively, and the deviation rates with respect to the real values were calculated. After the data restoration, an SVM was employed to perform the relative aging time prediction tasks. Ten trials were repeated independently, and the averages and RMSE of the accuracy values of the data restoration results were calculated, as shown in Table 6.
According to the data shown in Table 6, the life prediction results can be greatly improved by restoring the missing information in this case. It also demonstrates that the newly proposed GANs-HDI algorithm can effectively deal with small-sample life prediction problems caused by the disunity of cable test categories among different manufacturers, which cannot be handled by combinations of traditional algorithms.

5.3. Simulation Experiments on Intelligent Power Distribution and Utilization Dataset (II)

In this section, the simulation experiment took the medium voltage basic data of 171 towns in a power quality on-line monitoring system, from 2015 to 2016, for performance comparison of heterogeneous data integration in the JetBrains PyCharm 2017.2 environment. In this simulation experiment, the proportion of incomplete samples among the total samples was set to 20%, and the information loss rates were set to 5%, 15%, and 30%, respectively. Incomplete data samples and missing indexes were randomly selected. According to the incomplete sample proportion, part of the samples were randomly selected from the total of 342 data samples as the incomplete samples. Then, in these chosen samples, part of the indexes were randomly picked out and their corresponding information deleted according to the information loss rate, forming the set of data samples to be repaired. The normalized data of typical samples are shown in Table 7 (original data in Table A2).
To verify the data integration performance of the proposed GANs-HDI algorithm on the IPDU heterogeneous dataset, k-nearest neighbors (k-NN) and error back propagation (BP) were chosen as control groups, with parts of the model parameters selected by experience. In the k-NN algorithm, the cluster number was equal to the number of sample classes. In the BP algorithm, the number of hidden neural nodes was set to 10, with the Sigmoid function as the activation function. In the proposed GANs-HDI algorithm, the threshold of the discrimination rate was set to c = 0.05, the reducing pace to α = 0.0005, the number of hidden neural nodes to L = 10, the initialized threshold values to R1 = 0.85 and R2 = 0.6, and the clustering threshold to k = 2; the Sigmoid function was chosen as the activation function.
The incomplete data samples were repaired with the k-NN, BP, and GANs-HDI algorithms, respectively, and the deviation rates with respect to the real values were calculated. Ten trials were repeated independently, and the averages and RMSE of the deviation rates of the data restoration results were calculated, as shown in Table 8.
According to the data shown in Table 8, the information loss rate and the deviation rate show a clear proportional relationship. Since there is no strong causal link between the indexes in the IPDU heterogeneous datasets, the performance of the traditional BP algorithm was not satisfactory in the experiments. On the other hand, the GANs-HDI algorithm outperformed k-NN and BP in deviation rate by about 15 percentage points. Moreover, when the information loss rate reached 30%, the deviation rate of the k-NN algorithm increased sharply, and the integration results were far away from the real sample space. In contrast, the newly proposed GANs-HDI algorithm holds good resistance to changes in the information loss rate, and showed outstanding stability.
Considering the influence of the sample number on algorithm performance observed in Section 5.1, it is necessary to study the relationship between the data integration performance and the parameters of GANs-HDI. To further examine the influences of the incomplete sample proportion and the information loss rate on the heterogeneous data integration performance of GANs-HDI, deviation rates were calculated for different incomplete sample proportions and information loss rates on the IPDU heterogeneous datasets, as shown in Figure 5.
In Figure 5, the color of each block indicates the reciprocal of the mean of the deviation rates from 10 independent repeated experiments with the same incomplete sample proportion and information loss rate; the brighter the color, the better the algorithm works. As the incomplete sample proportion and the information loss rate decrease, the data integration performance of the GANs-HDI algorithm gradually improves. In Figure 5, the boundaries of the 20% and 50% deviation rates are marked. It is obvious that when the incomplete sample proportion is less than 30% and the information loss rate is less than 20%, the confidence of IPDU heterogeneous data integration is considerably higher. Generally speaking, the larger the volume of the dataset, the higher the accuracy of data integration, and the confidence level of the heterogeneous data integration results for the distribution network also shows an overall upward trend.
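For illustration only, the grid behind such a figure could be assembled as sketched below; run_integration_trial is a hypothetical stand-in that returns the deviation rate of one experiment, and the reciprocal transform and the 20%/50% boundaries mirror the description of Figure 5.

```python
import numpy as np
import matplotlib.pyplot as plt

def run_integration_trial(incomplete_ratio, loss_rate, rng):
    """Hypothetical stand-in: return the deviation rate (%) of one GANs-HDI experiment."""
    return 5.0 + 200.0 * incomplete_ratio * loss_rate + rng.normal(0.0, 1.0)

rng = np.random.default_rng(0)
ratios = np.linspace(0.05, 0.5, 10)          # incomplete sample proportions
losses = np.linspace(0.05, 0.5, 10)          # information loss rates
mean_dev = np.array([[np.mean([run_integration_trial(r, l, rng) for _ in range(10)])
                      for l in losses] for r in ratios])

plt.imshow(1.0 / mean_dev, origin="lower",   # brighter = smaller mean deviation rate
           extent=[losses[0], losses[-1], ratios[0], ratios[-1]], aspect="auto")
plt.contour(losses, ratios, mean_dev, levels=[20, 50], colors="white")  # 20% and 50% boundaries
plt.xlabel("Information loss rate")
plt.ylabel("Incomplete sample proportion")
plt.colorbar(label="1 / mean deviation rate")
plt.show()
```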

6. Summary

Aiming at the low utilization efficiency of heterogeneous data resources for intelligent power distribution and utilization in the small-sample environment, this paper proposed a GANs based heterogeneous data integration technology. In the proposed method, the sample space is expanded by introducing GANs theory, based on the targeted samples whose measurement indexes are all complete. Then, a novel peak clustering model is constructed to realize the finite open coverage of the expanded sample space and to repair the incomplete samples. Finally, the repaired samples are checked by the well-trained discriminator of the GANs. Generally speaking, by creatively establishing the finite open coverage of the targeted sample space, this paper combines GANs learning with clustering theory and provides a novel heterogeneous data integration method, which cannot be realized by either theory alone.
It is worth noting that, as an important aspect of this work, the convergence of generative adversarial network models has not yet been rigorously proved in theory, and the convergence rate still needs further improvement. In the next stage of our work, we plan to study improved convergence schemes of GANs for vector data samples, as well as distributed learning schemes of GANs on heterogeneous hardware.

Author Contributions

Yuanpeng Tan and Xiaojing Bai developed the theory and carried out the experiment. Yuanpeng Tan wrote the manuscript with support from Wei Liu and Jian Su.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Comparison detail information of four algorithms on UCI datasets.
Algorithm | Dataset | Accuracy/% (10 independent trials)
k-NN | Abalone | 42.87, 42.87, 42.87, 42.87, 42.87, 42.87, 42.87, 42.87, 42.87, 42.87
k-NN | Heart Disease | 33.33, 33.33, 33.33, 33.33, 33.33, 33.33, 33.33, 33.33, 33.33, 33.33
k-NN | Bank Marketing | 60.02, 60.02, 60.02, 60.02, 60.02, 60.02, 60.02, 60.02, 60.02, 60.02
Peak Clustering | Abalone | 59.28, 60.12, 61.08, 58.32, 61.08, 58.20, 56.53, 58.20, 57.60, 59.28
Peak Clustering | Heart Disease | 43.33, 39.93, 43.33, 48.33, 43.33, 41.67, 41.67, 41.67, 44.88, 44.88
Peak Clustering | Bank Marketing | 69.54, 67.63, 69.48, 70.02, 69.48, 63.64, 67.96, 63.64, 67.41, 69.59
BP | Abalone | 60.72, 60.00, 67.78, 63.83, 67.78, 59.40, 63.95, 59.40, 55.33, 68.86
BP | Heart Disease | 28.33, 34.98, 36.67, 65.02, 36.67, 60.07, 63.33, 60.07, 43.33, 65.02
BP | Bank Marketing | 66.47, 59.32, 67.72, 67.49, 67.72, 60.01, 61.81, 60.01, 67.74, 59.80
GANs-HDI | Abalone | 92.22, 91.26, 94.01, 95.21, 94.01, 96.05, 95.21, 96.05, 93.65, 94.49
GANs-HDI | Heart Disease | 68.33, 66.67, 66.67, 71.67, 66.67, 71.67, 68.33, 71.67, 69.97, 68.33
GANs-HDI | Bank Marketing | 89.69, 89.66, 89.79, 89.77, 89.79, 89.96, 89.59, 89.96, 89.71, 90.65

Appendix B

Table A2. Original data of typical samples.
Index | Unit | Sample 1 | Sample 2 | Sample 3 | Sample 4
User number of public users | / | 66 | 74 | 8053 | 18,782
Transformer number of public users | / | 66 | 74 | 8062 | 18,869
Transformer capacity of public users | kVA | 63,626 | 45,600 | 843,967 | 1,454,859
User number of specialized users | / | 359 | 1006 | 2409 | 6559
Transformer number of specialized users | / | 510 | 1147 | 2683 | 7202
Transformer capacity of specialized users | kVA | 461,078 | 632,421 | 842,045 | 1,241,475
Total number of transformers | / | 576 | 1221 | 10,745 | 26,071
Total capacity of transformers | kVA | 524,704 | 678,021 | 1,686,012 | 2,696,334
Total number of electricity users | / | 425 | 1080 | 10,462 | 25,341
Total capacity of electricity users | kVA | 524,704 | 678,021 | 1,688,102 | 2,717,469
Length of power cable line | km | 117.01 | 89.53 | 212.06 | 103.98
Total length of power line | km | 270.83 | 473.35 | 6644.55 | 8789.97
Number of switching equipment | / | 150 | 206 | 400 | 4237
Average segment number | km/per segment | 2.48 | 2.33 | 9.66 | 7.17

References

  1. Chen, M.; Mao, S.; Liu, Y. Big data: A survey. Mob. Netw. Appl. 2014, 19, 171–209. [Google Scholar] [CrossRef]
  2. Song, Y.; Zhou, G.; Zhu, Y. Present status and challenges of big data processing in smart grid. Power Syst. Technol. 2013, 37, 927–935. [Google Scholar]
  3. Diamantoulakis, P.D.; Kapinas, V.M.; Karagiannidis, G.K. Big data analytics for dynamic energy management in smart grids. Big Data Res. 2015, 2, 94–101. [Google Scholar] [CrossRef]
  4. Kezunovic, M.; Xie, L.; Grijalva, S. The role of big data in improving power system operation and protection. In Proceedings of the 2013 IREP Symposium Bulk Power System Dynamics and Control-IX Optimization, Security and Control of the Emerging Power Grid (IREP), Rethymno, Greece, 25–30 August 2013; pp. 1–9. [Google Scholar]
  5. Gungor, V.C.; Sahin, D.; Kocak, T.; Ergut, S.; Buccella, C.; Cecati, C.; Hancke, G.P. Smart grid technologies: Communication technologies and standards. IEEE Trans. Ind. Inform. 2011, 7, 529–539. [Google Scholar] [CrossRef]
  6. Kim, Y.J.; Thottan, M.; Kolesnikov, V.; Lee, W. A secure decentralized data-centric information infrastructure for smart grid. IEEE Commun. Mag. 2010, 48, 58–65. [Google Scholar] [CrossRef]
  7. Fluhr, J.; Ahlert, K.H.; Weinhardt, C. A stochastic model for simulating the availability of electric vehicles for services to the power grid. In Proceedings of the 2010 43rd Hawaii IEEE International Conference on System Sciences (HICSS), Honolulu, HI, USA, 5–8 January 2010; pp. 1–10. [Google Scholar]
  8. Watts, D.J.; Strogatz, S.H. Collective dynamics of ‘small-world’ networks. Nature 1998, 393, 440–442. [Google Scholar] [CrossRef] [PubMed]
  9. Ma, C.; Chen, Y.; Wilkins, D. Ranking of cancer genes in Markov chain model through integration of heterogeneous sources of data. In Proceedings of the 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Belfast, UK, 2–5 November 2014; pp. 248–253. [Google Scholar]
  10. Chen, X.Y.; Kang, C.Q.; Tong, X.; Xia, Q.; Yang, J. Improving the accuracy of bus load forecasting by a two-stage bad data identification method. IEEE Trans. Power Syst. 2014, 29, 1634–1641. [Google Scholar] [CrossRef]
  11. Dong, X.L.; Srivastava, D. Big data integration. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, Australia, 8–12 April 2013; pp. 1245–1248. [Google Scholar]
  12. Lenzerini, M. Data integration: A theoretical perspective. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Madison, Wisconsin, 3–5 June 2002; pp. 233–246. [Google Scholar]
  13. Liu, L.; Esmalifalak, M.; Han, Z. Detection of false data injection in power grid exploiting low rank and sparsity. In Proceedings of the 2013 IEEE International Conference on Communications (ICC), Budapest, Hungary, 9–13 June 2013; pp. 4461–4465. [Google Scholar]
  14. Xu, G.; Tan, Y.; Huang, L. Low-rank matrix completion based lifetime evaluation of XLPE power cable. Trans. China Electrotech. Soc. 2014, 29, 268–276. [Google Scholar]
  15. Mateos, G.; Giannakis, G.B. Robust nonparametric regression via sparsity control with application to load curve data cleansing. IEEE Trans. Signal Process. 2012, 60, 1571–1584. [Google Scholar] [CrossRef]
  16. Yu, Q.; Miche, Y.; Eirola, E.; Van Heeswijk, M.; Séverin, E.; Lendasse, A. Regularized extreme learning machine for regression with missing data. Neurocomputing 2013, 102, 45–51. [Google Scholar] [CrossRef]
  17. Li, R.; Zhang, W.; Suk, H.I.; Wang, L.; Li, J.; Shen, D.; Ji, S. Deep learning based imaging data completion for improved brain disease diagnosis. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Nagoya, Japan, 22–26 September 2014; pp. 305–312. [Google Scholar]
  18. Socher, R.; Chen, D.; Manning, C.D.; Ng, A. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 926–934. [Google Scholar]
  19. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 2015; arXiv:1511.06434. [Google Scholar]
  20. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  21. Springenberg, J.T. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv, 2015; arXiv:1511.06390. [Google Scholar]
  22. Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. arXiv, 2016; arXiv:1701.00160. [Google Scholar]
  23. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef] [PubMed]
  24. Du, M.; Ding, S.; Xu, X.; Xue, Y. Density peaks clustering using geodesic distances. Int. J. Mach. Learn. Cybern. 2017, 8, 1–15. [Google Scholar] [CrossRef]
Figure 1. Multi-source heterogeneous big data system for intelligent power distribution and utilization (IPDU).
Figure 2. Diagram of generative adversarial networks (GANs) calculation processing.
Figure 3. Diagram of comparative performance of data restoration.
Figure 4. Diagram of GANs based heterogeneous data integration for intelligent power distribution and utilization.
Figure 5. Deviation rate of generative adversarial networks based heterogeneous data integration (GANs-HDI) algorithm on heterogeneous datasets for intelligent power distribution and utilization.
Table 1. Peak clustering based data restoration algorithm.
Inputs: the combined dataset $D_{GANs} = \{d_i\}_{i=1}^{N+M}$; initialized threshold values $R_1$, $R_2$ and clustering threshold value $k$.
Step 1: Establish the finite open coverage of the targeted combined dataset.
Step 1.1: If $Centre \neq \varnothing$, randomly select $d' \in Centre$ from the central point set; otherwise, go to Step 1.4.
Step 1.2: Calculate the $i$-th open interval $Coverage_i(d') = \{d \in D_{GANs} \mid Dist(d, d') \leq R_1\}$, set $Coverage = \{Coverage_i(d')\}$, renew the temporary peak point set $Temp\_Peaks \leftarrow \{Temp\_Peaks, d'\}$, and set $i \leftarrow i + 1$.
Step 1.3: Renew the central point set $Centre \leftarrow Centre \setminus \{d \mid Dist(d, d') \leq R_2\}$, and calculate the peak point set $Temp\_Peaks_i(d') = \sum_{d \in Coverage_i(d')} d \,/\, |Coverage_i(d')|$, $Peaks = \{Temp\_Peaks_i(d')\}$, where $|\cdot|$ denotes the number of elements. Then, return to Step 1.1.
Step 1.4: Return the finite open coverage set $Coverage$ and the peak point set $Peaks$.
Step 2: Based on the finite open coverage set and the peak point set, perform the peak clustering task.
Step 2.1: Establish subsets according to the phase angle, clockwise: $ClusterSet = \{temp\_set_i \mid temp\_set_i \in Coverage, |temp\_set_i| = 1, temp\_set_i \neq temp\_set_j, i \neq j\}$.
Step 2.2: Calculate $D_i = \min\{Dist(d_j, Peaks) \mid d_j \in temp\_set_i\}$.
Step 2.3: If $|ClusterSet| > 1$ and $\max(|D_i - D_{i+1}|) \leq k$, combine $temp\_set_i$ and $temp\_set_{i+1}$ and return to Step 2.2; otherwise, return $ClusterSet$.
Step 3: Based on information entropy theory, implement the incomplete data restoration task.
Step 3.1: Calculate the Euclidean distance as the similarity $\{s_j\}_{j=1}^{n_i}$, and normalize the similarity set: $p_j = s_j / \sum_{j=1}^{n_i} s_j$.
Step 3.2: Calculate the entropy value of each complete data sample $h_j = -p_j \ln p_j$, the weight of each complete data sample $w_j = (1 - h_j) / (n_i - \sum_{j=1}^{n_i} h_j)$, and the missing attribute value $f = \sum_{j=1}^{n_i} w_j x_j$, where $x_j$ represents the corresponding attribute values of the data samples in the group.
Outputs: the restored dataset $\hat{D} = \{\hat{d}_i\}_{i=1}^{N+M}$.
Table 2. GANs based heterogeneous data integration.
Inputs: the original dataset $D = \{d_i \in \mathbb{R}^K\}_{i=1}^{N}$; initialized discrimination rate threshold $c$, reducing pace $\alpha$, sample number $\bar{N}$, activation functions $g$ and $f$, number of hidden neural nodes $L$, threshold values $R_1$, $R_2$, and clustering threshold value $k$.
Step 1: Initialize the central point set $Centre = D$ and the temporary peak point set $Temp\_Peaks = \{\}$.
Step 2: Select all data samples with all measurement indexes complete from the heterogeneous database $D = \{d_i\}_{i=1}^{N}$, and denote them as the dataset $\bar{D} = \{d_i\}_{i=1}^{\bar{N}}$. Train the generative adversarial networks, and obtain the generator and the discriminator. Determine whether the discrimination rate of the newly generated samples falls within the interval $[0.5 - c, 0.5 + c]$ by using the discriminator $D$. If this condition is satisfied, combine the newly generated samples with the original dataset, denote the combination as $D_{GANs} = \{d_i\}_{i=1}^{\bar{N}+M}$, and go to Step 3. Otherwise, the discriminative error of $D$ is back propagated to retrain the generator $G$.
Step 3: Employ the peak clustering algorithm to repair the samples with incomplete information, and obtain the restoration of the original dataset $\hat{D} = \{\hat{d}_i\}_{i=1}^{N}$.
Step 4: Determine whether the repaired samples can be verified by the discriminator. If the verification fails, the threshold $c$ is reduced to $c - \alpha$ according to the reducing pace $\alpha$, and the algorithm returns to Step 2; otherwise, return the integrated dataset $\hat{D} = \{\hat{d}_i\}_{i=1}^{N}$.
Outputs: the integrated dataset $\hat{D} = \{\hat{d}_i\}_{i=1}^{N}$.
Table 3. Detail information of UCI standard datasets.
Dataset | Incomplete Sample Number | Total Sample Number | Missing Dimensional Number | Total Dimensional Number
Abalone | 835 | 4177 | 2 | 8
Heart Disease | 60 | 303 | 19 | 75
Bank Marketing | 9042 | 45,211 | 4 | 17
Table 4. Comparison of six algorithms on UCI datasets.
Algorithm | Index | Abalone | Heart Disease | Bank Marketing
k-NN | Accuracy/% | 42.87 | 33.33 | 60.02
k-NN | RMSE | / | / | /
k-NN | Time-consuming/s | 7.2130 | 2.1621 | 50.1315
Peak Clustering | Accuracy/% | 58.56 | 43.33 | 67.92
Peak Clustering | RMSE | 0.013379 | 0.022361 | 0.018203
Peak Clustering | Time-consuming/s | 9.1655 | 2.3097 | 76.1617
Matrix Completion [14] | Accuracy/% | 53.27 | 40.21 | 61.23
Matrix Completion [14] | RMSE | 0.075152 | 0.065465 | 0.116122
Matrix Completion [14] | Time-consuming/s | 17.5660 | 10.6516 | 226.6646
BP | Accuracy/% | 62.46 | 50.17 | 64.26
BP | RMSE | 0.039062 | 0.135082 | 0.037608
BP | Time-consuming/s | 17.2132 | 8.1261 | 180.2661
Deep Learning [18] | Accuracy/% | 71.26 | 66.02 | 69.95
Deep Learning [18] | RMSE | 0.036261 | 0.0461610 | 0.091664
Deep Learning [18] | Time-consuming/s | 140.5594 | 82.6167 | 362.9500
GANs-HDI | Accuracy/% | 94.24 | 69.17 | 89.72
GANs-HDI | RMSE | 0.014235 | 0.018634 | 0.001142
GANs-HDI | Time-consuming/s | 154.1288 | 89.2661 | 1374.1626
Table 5. Insulating state test indicators of 22 kV XLPE power cable [14].
Index | Sample 1 | Sample 2 | Sample 3
Real operation time/year | 4 | 7 | 11
Relative aging time | 0.15 | 0.27 | 0.52
Elongation at break (%) | 240 | 225 | 190
Load elongation (%) | 23.7 | 47.6 | 97.1
Permanent elongation (%) | 2.1 | 3.50 | 6.4
DSC peak temperature | 263 | 258 | 243
Breakdown test pressure level diff. 1 | 17 | 17 | 17
Breakdown test pressure level diff. 2 | 14 | 11 | 7
Breakdown test | 0.13 | 0.21 | 0.38
Insulation resistance per unit length/GΩ | 41 | 29 | 19
Operating ambient temperature/°C | 90 | 90 | 90
Table 6. Performance comparison on heterogeneous datasets for intelligent power distribution and utilization.
Algorithm | Restoration Deviation Rate/% | Life Prediction Deviation Rate/% | RMSE
SVM | / | 24.61 | /
k-NN + SVM | 40.26 | 58.65 | /
BP + SVM | 28.58 | 65.61 | 0.6216
GANs-HDI + SVM | 22.70 | 86.62 | 0.2646
Table 7. Normalized data of typical samples.
Index | Sample 1 | Sample 2 | Sample 3 | Sample 4
User number of public users | 0.0000 | 0.0002 | 0.2496 | 0.5849
Transformer number of public users | 0.0000 | 0.0002 | 0.2499 | 0.5876
Transformer capacity of public users | 0.0255 | 0.0182 | 0.3376 | 0.5819
User number of specialized users | 0.0390 | 0.1093 | 0.2618 | 0.7129
Transformer number of specialized users | 0.0340 | 0.0765 | 0.1789 | 0.4801
Transformer capacity of specialized users | 0.3074 | 0.4216 | 0.5614 | 0.8277
Total number of transformers | 0.0144 | 0.0305 | 0.2686 | 0.6518
Total capacity of transformers | 0.1946 | 0.2515 | 0.6253 | 1.0000
Total number of electricity users | 0.0106 | 0.0270 | 0.2616 | 0.6335
Total capacity of electricity users | 0.1931 | 0.2495 | 0.6212 | 1.0000
Length of power cable line | 0.4209 | 0.3221 | 0.7628 | 0.3740
Total length of power line | 0.0287 | 0.0501 | 0.7030 | 0.9300
Number of switching equipment | 0.0188 | 0.0258 | 0.0500 | 0.5296
Average segment number | 0.1187 | 0.1115 | 0.4622 | 0.3431
Table 8. Performance comparison on heterogeneous datasets for intelligent power distribution and utilization.
Algorithm | Information Loss Rate/% | Deviation Rate/% | RMSE
k-NN | 5 | 29.26 | /
k-NN | 15 | 38.72 | /
k-NN | 30 | 102.16 | /
BP | 5 | 45.56 | 0.7980
BP | 15 | 63.22 | 0.8384
BP | 30 | 88.36 | 1.0916
GANs-HDI | 5 | 14.59 | 0.1460
GANs-HDI | 15 | 19.73 | 0.1975
GANs-HDI | 30 | 26.01 | 0.2603
