Generative Adversarial Networks Based Heterogeneous Data Integration and Its Application for Intelligent Power Distribution and Utilization

The heterogeneous characteristics of big data systems for intelligent power distribution and utilization have become increasingly prominent, which brings new challenges for traditional data analysis technologies and restricts the comprehensive management of distribution network assets. In order to solve the problem that the heterogeneous data resources of power distribution systems are difficult to utilize effectively, a novel generative adversarial networks (GANs) based heterogeneous data integration method for intelligent power distribution and utilization is proposed. In the proposed method, GANs theory is introduced to expand the distribution of the complete data samples. Then, a so-called peak clustering algorithm is proposed to realize a finite open coverage of the expanded sample space and to repair the incomplete samples, thus eliminating the heterogeneous characteristics. Finally, in order to realize the integration of the heterogeneous data for intelligent power distribution and utilization, the well-trained discriminator model of the GANs is employed to check the restored data samples. Simulation experiments verified the validity and stability of the proposed heterogeneous data integration method, which provides a novel perspective for the further data quality management of power distribution systems.


Introduction
With the rapid development of smart grids and sensing technology, data from China's power user side have shown high complexity and redundancy. Since 2011, the user-side data volume of power distribution systems in China has been booming, from the GB to the TB and even the PB level, gradually forming a big data system. Facing the era of big data, power companies have not only improved traditional MySQL, Oracle, and other relational database systems, but have also deployed many new big data systems, such as HBase and GBase. All of the database systems mentioned above have formed a multi-source heterogeneous big data system for intelligent power distribution and utilization (IPDU) [1][2][3][4]. On the other hand, affected by local economic levels, the monitoring and testing conditions of local power companies and manufacturers for distribution network equipment differ considerably, while part of the complex monitoring and testing equipment cannot reasonably be purchased repeatedly, leading to further heterogeneity of the IPDU big data.


Generative Adversarial Networks Based Sample Space Expansion
Facing the big data for intelligent power distribution and utilization (IPDU), power companies have not only improved traditional MySQL, Oracle, and other relational database systems, but have also deployed many new big data systems, such as HBase and GBase. All of these database systems provide a large number of multi-source heterogeneous samples, as shown in Figure 1. These targeted samples together form a real space in which, according to the Heine-Borel theorem, a limited number of open intervals can be chosen to form a finite open coverage of the targeted sample set. Within each open interval, the samples hold the same data characteristics and are able to support other samples with missing indexes. However, in some small-sample environments, the samples are not always sufficient for the data completion and integration tasks in all of the open intervals. Therefore, in order to obtain satisfying data integration results, this paper introduces generative adversarial theory to enrich the sample space. Generative adversarial networks (GANs) are a generative model derived from the Nash zero-sum game, in which a generator model and a discriminator model participate. The generator model is designed to learn the distribution of the training data, while the discriminator is designed to estimate the probability that a targeted data sample comes from the training data rather than from the generator. Both models improve their performance through mutual confrontation and iterative optimization, extend the targeted sample set, improve the discrimination ability, and eventually approach the Nash equilibrium [19]. As one of the most exciting ideas in machine learning over the last decade, the theory of GANs has been widely used in image and graphic processing, natural language processing, computer virus monitoring, chess game programming, etc.
Inspired by Goodfellow and Springenberg's works [20][21][22], GANs theory is employed to expand the targeted sample space in this paper. First of all, a targeted data set D = {d_i}_{i=1}^{N} with all the measurement indexes complete is constructed, where N stands for the number of samples in the data set. The GANs algorithm is used to train the generator G and discriminator D on the TensorFlow platform. Taking D = {d_i}_{i=1}^{N} as inputs and zeros as outputs, the discriminator D can be initialized in TensorFlow as the single-hidden-layer network D(d) = ∑_{j=1}^{L} β_j g(ω_j · d + b_j), where L is the number of hidden neural nodes, ω_j ∈ R^K is the input weight vector of the j-th hidden neural node, β_j ∈ R and b_j ∈ R represent the output weight and threshold value of the j-th hidden neural node, respectively, and g(·) : R → R stands for the activation function of the neural network.
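The single-hidden-layer discriminator described above can be sketched as follows (a minimal NumPy illustration; the random weight initialization, the sigmoid choice for g, and the outer sigmoid that squashes the output into a probability are assumptions, since the paper does not fully specify its TensorFlow configuration):

```python
import numpy as np

def init_discriminator(K, L, rng):
    """Randomly initialize a single-hidden-layer discriminator of the form
    D(d) = sum_j beta_j * g(omega_j . d + b_j)."""
    return {
        "omega": rng.normal(size=(L, K)),  # input weights, one row per hidden node
        "b": rng.normal(size=L),           # hidden-node threshold values
        "beta": rng.normal(size=L),        # output weights
    }

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminate(params, d):
    """Probability-like score that sample d comes from the real data
    (outer sigmoid added here so the score lies in (0, 1))."""
    hidden = sigmoid(params["omega"] @ d + params["b"])  # g(omega_j . d + b_j)
    return sigmoid(params["beta"] @ hidden)

rng = np.random.default_rng(0)
D_params = init_discriminator(K=5, L=10, rng=rng)
score = discriminate(D_params, np.ones(5))
```

In practice this network would be trained by gradient descent on the adversarial objective; the sketch only shows the forward pass implied by the formula above.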
Furthermore, train the generator G and discriminator D simultaneously: adjust the parameters of G to minimize log(1 − D(G(z))) and those of D to maximize log D(d), as in the two-player min-max game with value function v(G, D) [21]: min_G max_D v(G, D) = E_{d∼p_data}[log D(d)] + E_{z∼p_z}[log(1 − D(G(z)))]. Here, D̃ = {d̃_i}_{i=1}^{M} stands for the data set consisting of the new samples generated by the generator G; L is the number of hidden neural nodes; ω_i ∈ R^P is the input weight vector of the i-th hidden neural node; β_j ∈ R^K and b_j ∈ R represent the output weights and threshold value of the j-th hidden neural node, respectively; and f(·) : R → R stands for the activation function of the neural network. By using the well-trained generator G, M new data samples can be generated with a random vector set Z = {z_i ∈ R^P}_{i=1}^{M} as the inputs. Taking D = {d_i}_{i=1}^{N} and D̃ = {d̃_i}_{i=1}^{M} as inputs and zeros and ones as outputs, respectively, train and renew the discriminator D.
Finally, determine whether the discrimination probability of the newly generated samples falls within the interval [0.5 − c, 0.5 + c] by using the discriminator D. If this condition is satisfied, the generator G has converged well. Combine the newly generated sample set with the original data set D, and denote the combination as D_GANs = {d_i}_{i=1}^{N+M} for the subsequent data restoration. Otherwise, the discriminative error of D is back propagated to retrain the generator G. The overall calculation process of the GANs is shown in Figure 2.
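The convergence check and the merging of the two sample sets can be sketched as follows (a hedged NumPy illustration; the toy data values are placeholders, and applying the interval test to the mean discriminator probability is an assumption about how the check is aggregated):

```python
import numpy as np

def expand_sample_space(D, D_gen, disc_probs, c=0.05):
    """If the discriminator's mean probability on the generated samples falls
    within [0.5 - c, 0.5 + c], the generator is considered converged and the
    generated set is merged with the original one; otherwise the caller should
    back-propagate the error and retrain the generator G."""
    mean_p = float(np.mean(disc_probs))
    if 0.5 - c <= mean_p <= 0.5 + c:
        return np.vstack([D, D_gen])   # D_GANs with N + M samples
    return None                        # signal: retrain generator G

D = np.zeros((8, 3))        # N = 8 original complete samples (toy values)
D_gen = np.ones((4, 3))     # M = 4 generated samples (toy values)
D_GANs = expand_sample_space(D, D_gen, disc_probs=np.full(4, 0.52))
```

A mean probability near 0.5 means the discriminator can no longer tell the generated samples apart from the real ones, which is exactly the convergence criterion stated above.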


Peak Clustering Based Data Restoration
Based on the proliferation of data samples by the generator discussed above, a so-called peak clustering based incomplete data restoration method is proposed in this section. In order to overcome the restoration failures of traditional algorithms on linearly inseparable data, the proposed method constructs as few open intervals as possible with a fixed neighborhood radius for all of the data samples. The set of open intervals then forms a finite open coverage, avoiding the interference of linearly inseparable data samples on the clustering results, as shown in Figure 3.
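The construction of the finite open coverage with a fixed neighborhood radius can be sketched as a greedy ball-covering loop (a minimal NumPy illustration; the Euclidean metric, the deterministic choice of the first remaining center instead of a random one, and the use of R1 for the open interval and R2 for removing covered centers are assumptions consistent with the thresholds introduced later in Table 1):

```python
import numpy as np

def finite_open_coverage(samples, R1, R2):
    """Greedily pick centers and cover the sample set with open balls of
    radius R1; candidate centers within R2 of a chosen center are removed,
    so only finitely many intervals are created (a finite open coverage)."""
    centres = list(range(len(samples)))
    coverage = []
    while centres:
        c = centres[0]                                   # pick a remaining center
        dists = np.linalg.norm(samples - samples[c], axis=1)
        coverage.append(np.where(dists <= R1)[0])        # i-th open interval
        centres = [i for i in centres
                   if np.linalg.norm(samples[i] - samples[c]) > R2]
    return coverage

pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0]])
cover = finite_open_coverage(pts, R1=0.85, R2=0.6)
```

For this toy point set, the two tight pairs end up in two open intervals, and every sample is covered by at least one interval.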
Inspired by Rodriguez's work in Reference [23], a peak clustering algorithm is proposed for incomplete data restoration to improve the calculation efficiency while sustaining the restoration precision. Supposing the finite open coverage Coverage_i(d) contains n_i data samples {d_j}_{j=1}^{n_i}, calculate the peak distance Dist(d_j, Temp_Peaks_i(d)), i.e., the distance between each data sample and the density peak point. Then, n_i clusters are constructed according to the phase angle, with the density peak point as the center, and each cluster initially contains only one sample [24]. If the absolute value of the peak distance difference of two clusters with similar phase angles is no larger than the threshold value k, combine the two clusters and calculate the new density peak points and peak distances. Repeat the operations above until the absolute value of the peak distance difference of the clusters with similar phase angles becomes larger than the threshold value k, or the total cluster number becomes 1. Then, end the iteration and output the clustering result of the last iteration.
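The iterative merging rule can be sketched as follows (a minimal Python illustration; representing each cluster only by its scalar peak distances, ordering clusters by pre-sorted phase angle, and taking the mean as the merged cluster's new peak distance are simplifying assumptions):

```python
def merge_clusters(peak_dists, k):
    """Merge adjacent clusters (assumed already sorted by phase angle) while
    the absolute difference of their peak distances is <= k; stop when no
    adjacent pair qualifies or a single cluster remains."""
    clusters = [[d] for d in peak_dists]   # initially one sample per cluster
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for i in range(len(clusters) - 1):
            a = sum(clusters[i]) / len(clusters[i])          # mean peak distance
            b = sum(clusters[i + 1]) / len(clusters[i + 1])
            if abs(a - b) <= k:
                clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
                merged = True
                break
    return clusters

result = merge_clusters([1.0, 1.5, 5.0, 5.4], k=2)
```

On this toy input the two near pairs are merged and the iteration stops with two clusters, since their peak distance difference exceeds k.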
Finally, after the peak clustering of the targeted data samples, the weighted averages of the corresponding values in the complete samples can be used as the predictive values of the missing data. The concept of entropy in information theory is introduced, and the weighting coefficients are determined by the similarity between data samples. Generally speaking, the process of peak clustering based data restoration is summarized in Table 1.

Inputs

Establish the combined dataset D_GANs = {d_i}_{i=1}^{N+M}; initialize the thresholds R_1 and R_2 and the clustering threshold k.

Step 1

Establish the finite open coverage of the targeted combined dataset:
Step 1.1: if Centre ≠ ∅, randomly select d from the central point set Centre; otherwise, go to Step 1.4.
Step 1.2: calculate the i-th open interval Coverage_i(d) = {d′ | Dist(d, d′) ≤ R_1}, and renew the temporary peak point set Temp_Peaks.
Step 1.3: renew the central point set Centre ← Centre \ {d′ | Dist(d, d′) ≤ R_2}, and calculate the peak point set Peaks, where |·| represents the number of elements of a vector. Then, return to Step 1.1.
Step 1.4: return the finite open coverage set Coverage and the peak point set Peaks.

Step 2

Based on the finite open coverage set and peak point set, perform the peak clustering task:
Step 2.1: establish subsets temp_set_i according to the phase angle, clockwise.
Step 2.2: calculate D_i = min{Dist(d_j, Peaks) | d_j ∈ temp_set_i}.
Step 2.3: if the conditions |ClusterSet| > 1 and max(|D_i − D_{i+1}|) ≤ k are satisfied, combine temp_set_i and temp_set_{i+1} and return to Step 2.2; otherwise, return ClusterSet.

Step 3

Based on information entropy theory, implement the incomplete data restoration task:
Step 3.1: calculate the Euclidean distances as similarities {s_j}_{j=1}^{n_i}, and normalize the similarity set: p_j = s_j / ∑_{j=1}^{n_i} s_j.
Step 3.2: calculate the entropy value of each complete data sample, h_j = −p_j ln p_j; calculate the weight of each complete data sample, w_j = (1 − h_j)/(n_i − ∑_{j=1}^{n_i} h_j); and calculate the missing attribute values f = ∑_{j=1}^{n_i} w_j x_j, where x_j represents the corresponding attribute value of the data samples in the group.
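The entropy-weighted imputation of Step 3 can be sketched directly (a minimal Python illustration; the similarity and attribute values are toy inputs):

```python
import math

def entropy_impute(similarities, values):
    """Impute a missing attribute as the entropy-weighted average of the
    corresponding attribute values of the complete samples in the cluster:
    p_j = s_j / sum(s), h_j = -p_j ln p_j,
    w_j = (1 - h_j) / (n - sum(h)), f = sum(w_j * x_j)."""
    n = len(similarities)
    total = sum(similarities)
    p = [s / total for s in similarities]      # normalized similarities
    h = [-pj * math.log(pj) for pj in p]       # entropy of each sample
    denom = n - sum(h)
    w = [(1 - hj) / denom for hj in h]         # entropy-based weights (sum to 1)
    return sum(wj * xj for wj, xj in zip(w, values))

f = entropy_impute([0.5, 0.3, 0.2], [10.0, 20.0, 30.0])
```

By construction the weights sum to one, so the imputed value always lies within the range of the contributing attribute values.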

Realization of GANs Based Heterogeneous Data Integration
In order to solve the problem that IPDU heterogeneous data resources are difficult to utilize effectively in a small-sample environment, a novel generative adversarial networks based heterogeneous data integration technology (GANs-HDI) is proposed in this paper. In the GANs-HDI method, the sample space is expanded by introducing GANs, based on the targeted samples with all of the measurement indexes complete. Based on the new sample set expanded by the generative model of the GANs, the method constructs a peak clustering model to realize a finite open coverage of the expanded sample space, and repairs the incomplete samples using peak clustering and an information entropy function. Finally, all of the repaired samples are checked by the well-trained discriminator of the GANs to guarantee the heterogeneous data integration performance. Generally speaking, the process of GANs based heterogeneous data integration is presented in Figure 4.

If the discrimination rate of the newly generated samples falls within the interval [0.5 − c, 0.5 + c], perform the peak clustering based data restoration task; otherwise, the discriminant error of the discriminator D is back propagated to retrain the generator G, as shown in Table 2.

Inputs
Establish the original dataset D = {d_i ∈ R^K}_{i=1}^{N}; initialize the discrimination rate threshold c, reducing pace α, sample number N, activation functions g and f, hidden neural node number L, thresholds R_1 and R_2, and clustering threshold k.

Step 1
Initialize central point set Centre = D, and initialize peak point set Temp_Peaks = {}.
Step 2 Select all data samples with all measurement indexes complete from the heterogeneous database, and denote them as the dataset D = {d_i}_{i=1}^{N}. Train the generative adversarial networks, and obtain the generator and discriminator. Determine whether the discrimination rate of the newly generated samples falls within the interval [0.5 − c, 0.5 + c] by using the discriminator D. If this condition is satisfied, combine the newly generated samples with the original dataset, denote the combination as D_GANs = {d_i}_{i=1}^{N+M}, and go to Step 3. Otherwise, the discriminative error of D is back propagated to retrain the generator G.

Step 3
Employ the peak clustering algorithm to repair the samples with incomplete information, and obtain the restored samples.

Step 4

Determine whether the repaired samples can be verified by the discriminator. If verification fails, the threshold c is reduced to c − α based on the reducing pace α, and the procedure returns to Step 2; otherwise, return the integrated dataset.
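The verify-and-retry loop of Steps 2-4 can be sketched as follows (a hedged Python illustration; `verify` is a toy stand-in for the discriminator check on the repaired samples, and the retraining of G on each failure is omitted):

```python
def integrate_with_retries(c, alpha, verify, max_iter=100):
    """Retry the verification of the repaired samples; each failure reduces
    the discrimination-rate threshold c by the pace alpha (and would trigger
    retraining of the generator G), as in Table 2 Step 4."""
    for _ in range(max_iter):
        if verify(c):
            return c           # integrated dataset accepted at this threshold
        c -= alpha             # relax: c <- c - alpha, then retrain G
    raise RuntimeError("no acceptable integration within max_iter iterations")

attempts = []

def verify(c):
    attempts.append(c)
    return len(attempts) >= 3   # toy rule: repaired samples pass on the third try

final_c = integrate_with_retries(c=0.05, alpha=0.0015, verify=verify)
```

With the toy verification rule above, the loop fails twice, reducing c from 0.05 to 0.047, and accepts on the third attempt.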

Simulation Experiments and Result Analysis
In this section, the simulation experiments are divided into two parts, i.e., data restoration on University of California Irvine (UCI) standard datasets and heterogeneous data integration on intelligent power distribution and utilization datasets. The former is performed to verify the validity and stability of the proposed GANs-HDI algorithm, while the latter is performed to test the actual effect of the proposed GANs-HDI algorithm on intelligent power distribution and utilization heterogeneous data in the TensorFlow platform. All of the following simulation experiments were performed in Matlab 2012a and the JetBrains PyCharm 2017.2 environment with Core™ i3-M330 @ 2.13 GHz and NVIDIA GeForce 840M processors, respectively.

Simulation Experiments on UCI Standard Datasets
The simulation experiment introduced three UCI standard datasets, i.e., 'Abalone', 'Heart Disease', and 'Bank Marketing', for the performance comparison of data restoration in the Matlab 2012a environment. In this simulation experiment, the incomplete sample proportion in the total samples was set to 20%, and the information loss rate was 25%. Incomplete data samples and missing indexes were randomly selected. The detailed information of the three UCI standard datasets is shown in Table 3. Taking the 'Abalone' dataset as an example, 60 samples were randomly selected from a total of 4177 data samples as the incomplete samples. In these 60 samples, two indexes were randomly picked out and their corresponding information deleted, forming a data sample set to be repaired. In order to verify the data restoration performance of the proposed GANs-HDI algorithm on UCI standard datasets, k-nearest neighbors (k-NN), error-back propagation (BP), matrix completion [14], Deep Learning [18], and the Peak Clustering algorithm proposed in Section 2 were chosen as control groups, with parts of the model parameters selected by experience. Specifically, the cluster number was equal to the sample class number in the k-NN algorithm. The numbers of hidden neural nodes were set to 10/25/12 and the layer number to 8 for the three UCI standard datasets in the BP algorithm, with the Sigmoidal function as the activation function. The layer number was set to 8, and the numbers of hidden neural nodes were set to [15, 12, 12, 10, 10, 8, 8, 8]/[35, 24, 24, 17, 17, 15, 15, 15]/[20, 18, 18, 15, 15, 12, 12, 12] for the three UCI standard datasets in the Deep Learning algorithm, with the Sigmoidal function as the activation function. In the proposed Peak Clustering and GANs-HDI algorithms, the threshold of the discrimination rate was set to c = 0.05, the reducing pace to α = 0.0015, the numbers of hidden neural nodes to L = 10/25/12, the initialized threshold values to R_1 = 0.85 and R_2 = 0.6, and the clustering threshold to k = 2; the Sigmoidal function was selected as the activation function, and the new generation proportion M/N was set to 0.5.
The incomplete data samples were repaired with the k-NN, Peak Clustering, BP, matrix completion [14], Deep Learning [18], and GANs-HDI algorithms, respectively. Then, whether the categories of the restored samples were correct was determined by using a support vector machine (SVM), and the accuracy values were calculated. Ten trials were repeated independently, and the averages and root mean squared errors (RMSE) of the accuracy values of the data restoration results were calculated, as shown in Table 4 (more details in Table A1). According to the data shown in Table 4, the time consumption of the k-NN, matrix completion, and Peak Clustering algorithms was of almost the same order of magnitude, while Peak Clustering performed much better than the traditional k-NN and matrix completion algorithms on the accuracy of data restoration, with an especially prominent repair effect for linearly inseparable data samples. The data restoration performances of the BP and Deep Learning [18] algorithms surpassed the Peak Clustering algorithm on the UCI datasets. However, their RMSE was far from the requirement, so the BP algorithm is not stable enough for direct engineering application. It is worth noting that the GANs-HDI algorithm is far superior to the other control groups in both accuracy and RMSE, leading by 20-35 percentage points. However, the algorithm takes a longer time to run, and needs to rely on regularization constraints and distributed computing technologies to improve its convergence efficiency.
In summary, the data restoration performances of GANs-HDI on the UCI standard datasets were outstanding when compared with the k-NN, BP, matrix completion, Peak Clustering, and deep learning algorithms. The validity and stability of the proposed GANs-HDI algorithm are verified through the simulation comparison experiments. Furthermore, the experimental results on the UCI standard data also showed that the sample number might have a great influence on the final performance of the GANs-HDI algorithm.

Simulation Experiments on Intelligent Power Distribution and Utilization Dataset (I)
In this section, the simulation experiment took the power cable test data of sixty 22 kV XLPE power cable samples for the performance comparison of heterogeneous data integration in the JetBrains PyCharm 2017.2 environment. The power cable tests include the accelerated thermal aging tensile fracture test, accelerated thermal extension test, differential scanning calorimetry test, breakdown test, and DC leakage current test. In this simulation experiment, the incomplete sample proportion in the total samples was set to 20%, and the information loss rate was set to 15%. Incomplete data samples and missing indexes were randomly selected. According to the incomplete sample proportion, 12 samples were randomly selected from the total of 60 data samples as the incomplete samples. Then, in these chosen samples, two indexes were randomly picked out and their corresponding information deleted, forming a set of data samples to be repaired, according to the information loss rate. The insulating state test indicators of the 22 kV XLPE power cables are shown in Table 5. After the data restoration, this section employed a support vector machine (SVM) to predict the targeted power cable samples' relative aging times, where 15 samples were treated as the test group and the other 45 samples as the training group. In order to verify the data integration performance of the proposed GANs-HDI algorithm, k-nearest neighbors (k-NN) and error-back propagation (BP) were chosen as control groups, with parts of the model parameters selected by experience. The cluster number was equal to the sample class number in the k-NN algorithm. The number of hidden neural nodes was set to 10 in the BP algorithm, with the Sigmoidal function as the activation function. In the proposed GANs-HDI algorithm, the threshold of the discrimination rate was set to c = 0.05, the reducing pace to α = 0.0005, the number of hidden neural nodes to L = 10, the initialized threshold values to R_1 = 0.85 and R_2 = 0.6, and the clustering threshold to k = 2; the Sigmoidal function was chosen as the activation function, and the new generation proportion M/N was set to 0.5.
The incomplete data samples were repaired with the k-NN, BP, and GANs-HDI algorithms, respectively, and the deviation rates from the real values were calculated. After the data restoration, an SVM was employed to perform the relative aging time prediction tasks. Ten trials were repeated independently, and the averages and RMSE of the accuracy values of the data restoration results were calculated, as shown in Table 6. According to the data shown in Table 6, the life prediction results could be greatly improved by the data restoration of missing information in this case. It also demonstrated that the newly proposed GANs-HDI algorithm can effectively deal with small-sample-sized life prediction problems, which cannot be handled by combinations of traditional algorithms, owing to the disunity of the cable test categories of different manufacturers.

Simulation Experiments on Intelligent Power Distribution and Utilization Dataset (II)
In this section, the simulation experiment took the medium voltage basic data of 171 towns in a power quality on-line monitoring system, from 2015 to 2016, for the performance comparison of heterogeneous data integration in the JetBrains PyCharm 2017.2 environment. In this simulation experiment, the incomplete sample proportion in the total samples was set to 20%, and the information loss rates were set to 5%, 15%, and 30%, respectively. Incomplete data samples and missing indexes were randomly selected. According to the incomplete sample proportion, part of the samples were randomly selected from the total of 342 data samples as the incomplete samples. Then, in these chosen samples, part of the indexes were randomly picked out and their corresponding information deleted, forming a set of data samples to be repaired, according to the information loss rate. The normalized data of typical samples are shown in Table 7 (original data in Table A2). In order to verify the data integration performance of the proposed GANs-HDI algorithm on the IPDU heterogeneous dataset, k-nearest neighbors (k-NN) and error-back propagation (BP) were chosen as control groups, with parts of the model parameters selected by experience. The cluster number was equal to the sample class number in the k-NN algorithm. The number of hidden neural nodes was set to 10 in the BP algorithm, with the Sigmoidal function as the activation function. In the proposed GANs-HDI algorithm, the threshold of the discrimination rate was set to c = 0.05, the reducing pace to α = 0.0005, the number of hidden neural nodes to L = 10, the initialized threshold values to R_1 = 0.85 and R_2 = 0.6, and the clustering threshold to k = 2; the Sigmoidal function was chosen as the activation function.
The incomplete data samples were repaired with the k-NN, BP, and GANs-HDI algorithms, respectively, and the deviation rates from the real values were calculated. Ten trials were repeated independently, and the averages and RMSE of the accuracy values of the data restoration results were calculated, as shown in Table 8.
According to the data shown in Table 6, the information loss rate and the deviation rate show a clearly proportional relationship. Since there is no strong causal link between the indexes in the IPDU heterogeneous datasets, the performance of the traditional BP algorithm was not satisfactory in the experiments. On the other hand, the GANs-HDI algorithm outperformed k-NN and BP on deviation rate by about 15 percentage points. Moreover, when the information loss rate reached 30%, the deviation rate of the k-NN algorithm soared, and its integration results fell far from the real sample space. In contrast, the proposed GANs-HDI algorithm resisted changes in the information loss rate well and showed outstanding stability.
Considering the influence of sample number on algorithm performance shown in Section 4.1, it is necessary to study the relationship between data integration performance and the parameters of GANs-HDI. To further probe the influences of the incomplete sample proportion and the information loss rate on the heterogeneous data integration performance of GANs-HDI, deviation rates were calculated under different incomplete sample proportions and information loss rates on the IPDU heterogeneous datasets, as shown in Figure 5. In Figure 5, the color of each block indicates the reciprocal of the mean deviation rate over 10 independent repeated experiments with the same incomplete sample proportion and information loss rate; the brighter the color, the better the algorithm performs. As the incomplete sample proportion and the information loss rate decrease, the data integration performance of the GANs-HDI algorithm gradually improves. In Figure 5, the boundaries for deviation rates of 20% and 50% are marked. Clearly, when the incomplete sample proportion is below 30% and the information loss rate is below 20%, the confidence of IPDU heterogeneous data integration is considerably higher. Generally speaking, the larger the dataset, the higher the accuracy of data integration, and the confidence level of the heterogeneous data integration results of the distribution network also shows an overall upward trend.
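The color-block grid of Figure 5 can be assembled as in the following sketch. Here `run_trial` is a placeholder producing synthetic deviation rates, not the experimental values, and the grid spacings are chosen for illustration only:

```python
import numpy as np

# grid of experimental conditions (values chosen for illustration)
proportions = np.arange(0.1, 0.6, 0.1)    # incomplete sample proportion
loss_rates = np.arange(0.05, 0.35, 0.05)  # information loss rate
trials = 10

rng = np.random.default_rng(1)

def run_trial(p, r):
    # placeholder deviation rate: grows with both factors, plus trial noise
    return 0.2 * p + 0.8 * r + 0.01 * rng.random()

# mean deviation rate per cell over repeated trials; the plotted
# brightness is its reciprocal, so brighter cells mean lower deviation
mean_dev = np.array([[np.mean([run_trial(p, r) for _ in range(trials)])
                      for r in loss_rates] for p in proportions])
brightness = 1.0 / mean_dev
```

Passing `brightness` to any heatmap routine reproduces the qualitative behavior described above: the cell at the lowest proportion and loss rate is the brightest.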

Summary
Aiming at the low utilization efficiency of heterogeneous data resources for intelligent power distribution and utilization in small-sample environments, this paper proposed a GANs-based heterogeneous data integration method. In the proposed method, the sample space is first expanded by introducing GANs theory, based on the targeted samples whose measurement indexes are all complete. Then, a novel peak clustering model is constructed to realize the finite open coverage of the expanded sample space and to repair the incomplete samples. Finally, the repaired samples are checked using the well-trained discriminator of the GANs. Generally speaking, by creatively establishing the finite open coverage of the targeted sample space, this paper succeeded in combining GANs learning with clustering theory, and provided a novel heterogeneous data integration method that cannot be realized by either theory alone.
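The three-step pipeline summarized above can be sketched as follows. The `train_gan`, `peak_cluster`, and `restore` functions here are illustrative stand-ins for the paper's components, not the actual models:

```python
import random
random.seed(0)

# --- stand-in components (illustrative stubs, not the paper's models) ---
def train_gan(samples):
    # "generator": a complete sample plus small noise; "discriminator":
    # accepts samples whose normalized indexes all lie in [0, 1]
    gen = lambda: [x + random.gauss(0, 0.01) for x in random.choice(samples)]
    disc = lambda s: 1.0 if all(0.0 <= x <= 1.0 for x in s) else 0.0
    return gen, disc

def peak_cluster(samples):
    # one open ball (centre, radius) per sample; every sample is a "peak"
    return [(s, 0.1) for s in samples], samples

def restore(sample, coverage, peaks):
    # fill missing entries (None) from the nearest peak point
    known = [(i, v) for i, v in enumerate(sample) if v is not None]
    nearest = min(peaks, key=lambda p: sum((p[i] - v) ** 2 for i, v in known))
    return [v if v is not None else nearest[i] for i, v in enumerate(sample)]

# --- GANs-HDI pipeline: expand, cover and repair, then discriminator check ---
def gans_hdi(complete, incomplete, threshold=0.5):
    gen, disc = train_gan(complete)                       # step 1: expand
    expanded = complete + [gen() for _ in range(len(complete))]
    coverage, peaks = peak_cluster(expanded)              # step 2: coverage
    repaired = [restore(s, coverage, peaks) for s in incomplete]
    return [s for s in repaired if disc(s) >= threshold]  # step 3: check

complete = [[0.2, 0.4, 0.6], [0.3, 0.5, 0.7]]
incomplete = [[0.2, None, 0.6]]
result = gans_hdi(complete, incomplete)
```

The discriminator check in the last step is what distinguishes this pipeline from plain nearest-neighbor imputation: a repaired sample is kept only if it is judged consistent with the learned sample distribution.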
It is worth noting that, as an important limitation of this work, the convergence of generative adversarial network models has not yet been rigorously proved in theory, and their convergence rate still needs further improvement. In the next stage of our work, we plan to study improved convergence schemes of GANs for vector data samples, as well as distributed learning schemes of GANs on heterogeneous hardware.

Figure 1. Multi-source heterogeneous big data system for intelligent power distribution and utilization (IPDU).


Check the newly generated samples using discriminator D. If this condition is satisfied, it demonstrates that generator G has converged well. Combine the newly generated sample set D′ with the original dataset D, and denote the combination as D_GANs = {d_i}, i = 1, …, N + M.

Figure 3. Diagram of comparative performance of data restoration.


Step 1.1.4: …(d)|, Peaks = {Temp_Peaks_i(d)}, where |·| represents the number of elements of a vector. Then, return to Step 1. Step 2: return the finite open coverage set Coverage and the peak point set Peaks.

Figure 4. Diagram of GANs based heterogeneous data integration for intelligent power distribution and utilization.


Figure 5. Deviation rate of the generative adversarial networks based heterogeneous data integration (GANs-HDI) algorithm on heterogeneous datasets for intelligent power distribution and utilization.


Table 1. Peak clustering based data restoration algorithm.
Inputs: the combined dataset D_GANs = {d_i}, i = 1, …, N + M; threshold values R_1, R_2; clustering threshold value k.
Step 1: establish the finite open coverage of the targeted combined dataset.
Step 1.1: if Centre ≠ ∅, randomly select d ∈ Centre from the central point set; otherwise, go to Step 1.4.
Step 1.2: calculate the i-th open interval: Coverage
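The coverage-building loop of Step 1 can be sketched as follows, assuming each open interval is a ball of radius R_1 around a selected centre (an illustrative reading of the fragment above, not the paper's exact rule):

```python
import math

def finite_open_coverage(samples, R1, max_iter=1000):
    """Greedy sketch of Step 1: pick centres until every sample lies
    inside at least one open ball of radius R1."""
    centres, coverage = [], []
    uncovered = list(samples)
    while uncovered and len(centres) < max_iter:
        d = uncovered[0]                 # Step 1.1: select a centre
        centres.append(d)
        # Step 1.2: the i-th open interval, a ball of radius R1 around d
        ball = [s for s in samples if math.dist(s, d) < R1]
        coverage.append((d, R1, ball))
        uncovered = [s for s in uncovered if math.dist(s, d) >= R1]
    return centres, coverage

pts = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]
centres, coverage = finite_open_coverage(pts, R1=0.2)
```

Because the dataset is finite, the loop terminates with a finite set of open balls whose union covers every sample, which is the "finite open coverage" the algorithm requires.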

Table 2. GANs based heterogeneous data integration.

Table 3. Detail information of the UCI standard datasets.

Table 4. Comparison of four algorithms on UCI datasets.

Table 6. Performance comparison on heterogeneous datasets for intelligent power distribution and utilization.

Table 7. Normalized data of typical samples.

Table 8. Performance comparison on heterogeneous datasets for intelligent power distribution and utilization.
