# An Imbalanced Data Handling Framework for Industrial Big Data Using a Gaussian Process Regression-Based Generative Adversarial Network


## Abstract


## 1. Introduction

## 2. Background Knowledge and Literature Reviews

### 2.1. Gaussian Process Regression

### 2.2. Generative Adversarial Network

## 3. Generative Adversarial Network-Based Missing Value Estimation Framework

## 4. Data Issues in Air Pressure System and Numerical Analysis

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks (accessed on 8 December 2017).
2. Paul, D.A. Missing Data; Sage Publications Inc.: Thousand Oaks, CA, USA, 2002; pp. 27–74.
3. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. **1977**, 39, 1–38.
4. Moon, T.K. The expectation-maximization algorithm. IEEE Signal Process. Mag. **1996**, 13, 47–60.
5. Hastie, T.; Tibshirani, R.; Sherlock, G.; Eisen, M.; Brown, P.; Botstein, D. Imputing Missing Data for Gene Expression Arrays; Technical Report; Stanford University Press: Stanford, CA, USA, 1999.
6. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R. Missing value estimation methods for DNA microarrays. Bioinformatics **2001**, 17, 520–525.
7. Zhang, Z. Missing data imputation: Focusing on single imputation. Ann. Transl. Med. **2016**, 4, 9.
8. Gondara, L.; Wang, K. Multiple imputation using deep denoising autoencoders. arXiv **2017**, arXiv:1705.02737.
9. Gemmeke, J.F.; Hamme, H.V.; Cranen, B.; Boves, L. Compressive Sensing for Missing Data Imputation in Noise Robust Speech Recognition. IEEE J. Sel. Top. Signal Process. **2010**, 4, 272–287.
10. Oba, S.; Sato, M.; Takemasa, I.; Monden, M.; Matsubara, K.; Ishii, S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics **2003**, 19, 2088–2096.
11. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; John Wiley & Sons Inc.: Hoboken, NJ, USA, 2020; pp. 29–42.
12. Gondek, C.; Hafner, D.; Sampson, O.R. Prediction of Failures in the Air Pressure System of Scania Trucks using a Random Forest and Feature Engineering. Adv. Intell. Data Anal. **2016**, 9897, 398–402.
13. Perepu, S.K.; Tangirala, A.K. Reconstruction of missing data using compressed sensing techniques with adaptive dictionary. J. Process Control **2016**, 47, 175–190.
14. Chodosh, N.; Wang, C.; Lucey, S. Deep Convolutional Compressed Sensing for LiDAR Depth Completion. Comput. Vis. ACCV **2018**, 11361, 499–513.
15. Williams, C.K.I.; Rasmussen, C.E. Gaussian Processes for Machine Learning; The MIT Press: London, UK, 2006; pp. 7–128.
16. Williams, C.K.I.; Rasmussen, C.E. Gaussian Processes for Regression. Adv. Neural Inf. Process. Syst. **1996**, 8, 514–520.
17. Rasmussen, C.E. Gaussian Processes in Machine Learning. Adv. Lect. Mach. Learn. **2004**, 3176, 63–71.
18. Chu, W.; Ghahramani, Z. Gaussian Processes for Ordinal Regression. J. Mach. Learn. Res. **2005**, 6, 1019–1041.
19. Schulz, E.; Speekenbrink, M.; Krause, A. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. J. Math. Psychol. **2017**, 85, 1–16.
20. Jochem, V.; Juan, P.R.; Anatoly, G.; Jesus, D.; Jose, M.; Gustau, C. Spectral band selection for vegetation properties retrieval using Gaussian processes regression. Int. J. Appl. Earth Obs. Geoinf. **2016**, 52, 554–567.
21. Ak, C.; Ergonul, O.; Sencan, I.; Torunoglu, M.A.; Gonen, M. Spatiotemporal prediction of infectious diseases using structured Gaussian processes with application to Crimean-Congo hemorrhagic fever. PLoS Negl. Trop. Dis. **2018**, 12, e0006737.
22. Luttinen, J.; Ilin, A. Efficient Gaussian process inference for short-scale spatio-temporal modeling. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, La Palma, Canary Islands, 21–23 April 2012; pp. 741–750.
23. Nguyen, D.; Peters, J. Learning Robot Dynamics for Computed Torque Control using Local Gaussian Processes Regression. In Proceedings of the ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems, Edinburgh, UK, 6–8 August 2008; pp. 59–64.
24. Nguyen, L.; Hu, G.; Spanos, C.J. Spatio-temporal environmental monitoring for smart buildings. In Proceedings of the 13th IEEE International Conference on Control and Automation, Ohrid, Macedonia, 3–6 July 2017; pp. 277–282.
25. Chen, N.; Qian, Z.; Meng, X.; Nabney, I.T. Short-term wind power forecasting using Gaussian processes. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013; pp. 2790–2796.
26. Oh, E.; Lee, H. Development of a Convolution-Based Multi-Directional and Parallel Ant Colony Algorithm Considering a Network with Dynamic Topology Changes. Appl. Sci. **2019**, 9, 3646.
27. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the NIPS 2014, Montreal, QC, Canada, 8–13 December 2014; pp. 1–9.
28. Kim, H.; Lee, H. Fault Detect and Classification Framework for Semiconductor Manufacturing Processes using Missing Data Estimation and Generative Adversary Network. J. Korean Inst. Intell. Syst. **2018**, 28, 393–400.
29. Yoon, J.; Jordon, J.; Schaar, M. GAIN: Missing Data Imputation using Generative Adversarial Nets. arXiv **2018**, arXiv:1806.02920.
30. Kim, H.; Lee, H. Generative Adversarial Networks based Data Generation Framework for Overcoming Imbalanced Manufacturing Process Data. J. Korean Inst. Intell. Syst. **2019**, 29, 1–8.
31. Shang, C.; Palmer, A.; Sun, J.; Chen, K.; Lu, J.; Bi, J. VIGAN: Missing view imputation with generative adversarial networks. In Proceedings of the IEEE International Conference on Big Data, Boston, MA, USA, 11–14 December 2017; pp. 766–775.
32. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.K.; Wang, Z.; Smolley, S.P. Least Squares Generative Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2794–2802.
33. Zhao, J.; Mathieu, M.; LeCun, Y. Energy-based generative adversarial network. arXiv **2016**, arXiv:1609.03126.
34. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual Generative Adversarial Networks for Small Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, HI, USA, 21–26 July 2017; pp. 1222–1230.

**Figure 6.** Backpropagation-based learning process for the generative adversarial network (GAN): (**a**) architecture of the discriminator from real data; (**b**) architecture of the discriminator from fake data; (**c**) architecture of the generator.

**Figure 7.** A time series-based numerical analysis using GPR-GAN: (**a**) randomly generated original data; (**b**) comparison between actual data and data estimated using GPR.

**Figure 8.** Numerical analysis using the proposed framework: (**a**) randomly generated data and their pass/fail outputs; (**b**) randomly selected missing values in the data set; (**c**) newly generated data using the proposed framework; (**d**) classification results between the original data and the generated data using the proposed framework.

**Figure 9.** GAN-based DNN structure: (**a**) structure of the proposed framework; (**b**) classification structure using DNN.

**Figure 10.** Comparison of the classification models: (**a**) actual passes versus predicted passes (negative class); (**b**) actual failures versus predicted failures (positive class).

Missing Value Handling Method | Research Studies and Applications | Main Estimation Techniques and Frameworks |
---|---|---|
Imputation | Paul [2] | Multiple imputation (MI)-based missing value estimation |
Imputation | Dempster, Laird, and Rubin [3] | Probability modeling; maximum likelihood estimation (MLE)-based estimation; expectation maximization-based estimation [4] |
Imputation | Hastie et al. [5], Troyanskaya et al. [6] | Singular value decomposition (SVD) and K-nearest neighbor (KNN)-based missing data imputation |
Imputation | Zhang [7] | Regression-based imputation |
Imputation | Gondara and Wang [8] | Deep denoising autoencoder-based imputation |
Imputation | Gemmeke et al. [9] | Missing data imputation using sparse imputation-based compressive sensing (CS) |
Estimation | Oba et al. [10] | Bayesian network-based preprocessing; gene expression profiling |
Estimation | Little and Rubin [11] | Least squares-based missing data analysis |
Estimation | Gondek, Hafner, and Sampson [12] | Missing value imputation using random forest and feature engineering |
Estimation | Perepu and Tangirala [13] | Missing value estimation using a CS method with adaptive dictionary |
Estimation | Chodosh, Wang, and Lucey [14] | Estimating a dense depth map using a CS method and alternating direction neural networks |

Research Studies Using GPR | Application Areas |
---|---|
Jochem et al. [20] | Automated spectral band analysis |
Ak et al. [21] | Spatiotemporal prediction of infectious diseases |
Luttinen and Ilin [22] | Sea level temperature reconstruction using GPR |
Nguyen and Peters [23] | Kinetics model estimation |
Nguyen, Hu, and Spanos [24] | Efficient building field formation using an estimation of indoor environment fields |
Chen et al. [25] | Wind prediction for energy efficiency |
Oh and Lee [26] | Estimation of pheromone values based on ant colony optimization |

Research Studies Using GAN | Application and Characteristics |
---|---|
Kim and Lee [28] | Missing data generation for semiconductor manufacturing process data; method: oversampling → GAN-based data generation |
Yoon, Jordon, and Schaar [29] | Missing data imputation for breast cancer, spam, letter recognition, credit, and news data; GAN-based hint generation |
Kim and Lee [30] | Missing data generation for steel plate fault data; estimates the missing values by adding a missing term to the GAN |
Shang et al. [31] | Image generation; GAN-based missing view imputation |
Mao et al. [32] | Image generation; least squares loss function-based discriminator in a GAN |
Zhao, Mathieu, and LeCun [33] | Image generation; GAN discriminator that allocates energy values according to data density |
Li et al. [34] | Object detection; GAN-based high-quality image generation |

Categories | | Total Size | Data Items with Missing Values | Percentage of Missing Items |
---|---|---|---|---|
Training data | Number of attributes | 170 | 170 | 100% |
Training data | Data set | 60,000 | 59,998 | 99.99% |
Test data | Number of attributes | 170 | 169 | 99.41% |
Test data | Data set | 16,000 | 16,000 | 100% |
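
The table above reports two kinds of counts: attributes that contain at least one missing entry, and records that contain at least one missing entry. As a minimal sketch of how such a summary could be computed, the snippet below uses a small made-up record list in which `None` marks a missing value. The function name `missing_summary`, the toy values, and the `None` convention are assumptions for illustration only; they are not taken from the APS data or from the authors' code.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[float]]


def missing_summary(rows: Sequence[Row]) -> dict:
    """Count attributes (columns) and records (rows) with at least one missing value."""
    n_records = len(rows)
    n_attributes = len(rows[0])
    # A record counts as incomplete if any of its entries is None.
    records_with_missing = sum(any(v is None for v in row) for row in rows)
    # An attribute counts as incomplete if any record lacks a value for it.
    attributes_with_missing = sum(
        any(row[j] is None for row in rows) for j in range(n_attributes)
    )
    return {
        "n_attributes": n_attributes,
        "attributes_with_missing": attributes_with_missing,
        "pct_attributes_with_missing": 100.0 * attributes_with_missing / n_attributes,
        "n_records": n_records,
        "records_with_missing": records_with_missing,
        "pct_records_with_missing": 100.0 * records_with_missing / n_records,
    }


# Hypothetical toy data: 4 records, 3 attributes, None = missing.
toy_rows = [
    [1.0, 0.5, 10.0],
    [None, 0.7, 11.0],
    [3.0, None, 12.0],
    [4.0, 0.9, 13.0],
]
summary = missing_summary(toy_rows)
print(summary)
```

On the toy data, two of the three attributes and two of the four records are incomplete. Applied to a real data set, the same two counts correspond to the "Number of attributes" and "Data set" rows of the table.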

**Table 5.** Models and parameters for each testing algorithm. Note: CART = classification and regression tree; CS = compressed sensing.

Tested Frameworks | Equation | Parameter |
---|---|---|
GPR-based GAN (Proposed framework) | $\min_{G} \max_{D} V(D,G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$ | Max epochs = 100; Learning rate = 0.1; Momentum = 0.5 |
CART | $I_{G}(f) = 1 - \sum_{i=1}^{m} f_{i}^{2}$; $\text{Cost function} = \frac{m_{left}}{m} I_{G}(f_{left}) + \frac{m_{right}}{m} I_{G}(f_{right})$ | Tree’s depth = 50,000; Costs of misclassification = 1:1; Occupation percentage = 1:1 |
GPR | $f \sim \mathrm{GP}(m(x), k(x, x_{*}))$ | Nonparametric |
K-means | $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_{i}} \Vert x - \mu_{i} \Vert^{2}$ | Maximum number of runs = 100; Distance calculation method = Euclidean; Surface pretreatment = Normalization |
Mean-based GAN | $\min_{G} \max_{D} V(D,G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$ | Max epochs = 100; Learning rate = 0.1; Momentum = 0.5 |
CS | $\min_{x} \Vert x \Vert_{1} \text{ subject to } y = wx$ | Max epochs = 100; Learning rate = 0.1 |
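
To make three of the objectives in Table 5 concrete, the following sketch evaluates the Gini impurity and CART split cost, the K-means within-cluster sum of squares, and a sample-average estimate of the GAN value function $V(D,G)$ on tiny hand-made inputs. All function names and input values here are hypothetical and chosen for illustration. The code only evaluates the formulas; it does not train any model and is not the authors' implementation.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """I_G(f) = 1 - sum_i f_i^2, where f_i is the fraction of labels in class i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def cart_split_cost(left, right):
    """Size-weighted Gini impurity of the two child nodes produced by a split."""
    m = len(left) + len(right)
    return (len(left) / m) * gini_impurity(left) + (len(right) / m) * gini_impurity(right)


def kmeans_objective(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        mu = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(sum((p[d] - mu[d]) ** 2 for d in range(dim)) for p in points)
    return total


def gan_value(d_real, d_fake):
    """Sample-average estimate of V(D,G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real holds discriminator outputs D(x) on real samples,
    d_fake holds discriminator outputs D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical inputs.
mixed_node = ["pass", "pass", "fail", "fail"]
impurity = gini_impurity(mixed_node)                               # evenly mixed node
pure_split = cart_split_cost(["pass", "pass"], ["fail", "fail"])   # perfectly separating split
sse = kmeans_objective([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]])
v_uninformed = gan_value([0.5, 0.5], [0.5, 0.5])                   # D outputs 0.5 everywhere

print(impurity, pure_split, sse, v_uninformed)
```

An evenly mixed two-class node has impurity 0.5, and a split that separates the classes perfectly has cost 0. When the discriminator outputs 0.5 for every sample, the value function equals $-\log 4$, the value at which the discriminator can no longer tell real from generated data.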

Framework | Actual Class | Pass (Predicted) | Fail (Predicted) |
---|---|---|---|
Proposed framework | Pass | 15,418 | 207 |
Proposed framework | Fail | 59 | 316 |
CART | Pass | 14,084 | 1,541 |
CART | Fail | 259 | 116 |
GPR | Pass | 14,625 | 1,000 |
GPR | Fail | 214 | 161 |
K-means | Pass | 11,249 | 4,376 |
K-means | Fail | 295 | 80 |
Mean-based GAN | Pass | 14,654 | 971 |
Mean-based GAN | Fail | 146 | 229 |
CS | Pass | 14,782 | 843 |
CS | Fail | 106 | 269 |

Metric | Proposed Framework | CART | GPR | K-Means | Mean-Based GAN | CS |
---|---|---|---|---|---|---|
Precision | 0.996 | 0.981 | 0.985 | 0.974 | 0.99 | 0.992 |
Recall | 0.986 | 0.901 | 0.936 | 0.719 | 0.937 | 0.946 |
Fall-out | 0.157 | 0.690 | 0.570 | 0.786 | 0.389 | 0.282 |
Accuracy | 0.983 | 0.887 | 0.924 | 0.708 | 0.93 | 0.94 |
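
The four metrics can be recomputed from the confusion-matrix counts reported above. The sketch below does so under one assumption that is not stated in the surviving text: "pass" is treated as the reference label when computing precision, recall, and fall-out. With that choice, the recomputed values agree with the tabulated ones to within 0.001, and the tabulated figures appear to be truncated rather than rounded to three decimals. The function and variable names are illustrative.

```python
# Counts per framework, copied from the confusion matrices in this article:
# (actual pass -> predicted pass, actual pass -> predicted fail,
#  actual fail -> predicted pass, actual fail -> predicted fail)
CONFUSION = {
    "Proposed framework": (15418, 207, 59, 316),
    "CART": (14084, 1541, 259, 116),
    "GPR": (14625, 1000, 214, 161),
    "K-means": (11249, 4376, 295, 80),
    "Mean-based GAN": (14654, 971, 146, 229),
    "CS": (14782, 843, 106, 269),
}


def metrics(pass_pass, pass_fail, fail_pass, fail_fail):
    """Precision, recall, fall-out, and accuracy with 'pass' as the reference label (assumption)."""
    total = pass_pass + pass_fail + fail_pass + fail_fail
    return {
        # Share of predicted passes that are actual passes.
        "precision": pass_pass / (pass_pass + fail_pass),
        # Share of actual passes that are predicted as passes.
        "recall": pass_pass / (pass_pass + pass_fail),
        # Share of actual failures that are wrongly predicted as passes.
        "fall_out": fail_pass / (fail_pass + fail_fail),
        # Share of all records classified correctly.
        "accuracy": (pass_pass + fail_fail) / total,
    }


results = {name: metrics(*counts) for name, counts in CONFUSION.items()}

for name, m in results.items():
    print(
        f"{name:20s} precision={m['precision']:.4f} recall={m['recall']:.4f} "
        f"fall-out={m['fall_out']:.4f} accuracy={m['accuracy']:.4f}"
    )
```

Under this convention, fall-out is the share of the 375 actual failures in the test set that a model lets through as passes, which is the error type that matters most for a failure detection task on imbalanced data.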

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Oh, E.; Lee, H.
An Imbalanced Data Handling Framework for Industrial Big Data Using a Gaussian Process Regression-Based Generative Adversarial Network. *Symmetry* **2020**, *12*, 669.
https://doi.org/10.3390/sym12040669
