An Imbalanced Data Handling Framework for Industrial Big Data Using a Gaussian Process Regression-Based Generative Adversarial Network

Abstract: The developments in the fields of industrial Internet of Things (IIoT) and big data technologies have made it possible to collect a large amount of meaningful industrial process and quality data. The gathered data are analyzed using contemporary statistical methods and machine learning techniques, and the extracted knowledge can then be used for predictive maintenance or prognostic health management. However, it is difficult to gather complete data due to several issues in IIoT, such as devices breaking down, running out of battery, or undergoing scheduled maintenance. Data with missing values are often ignored, as they may contain insufficient information from which to draw conclusions. In order to overcome these issues, we propose a novel, effective missing data handling mechanism based on the concepts of symmetry principles. While other existing methods only attempt to estimate missing parts, the proposed method generates a whole data set using Gaussian process regression and a generative adversarial network. In order to demonstrate the effectiveness of the proposed framework, we examine a real-world industrial case involving an air pressure system (APS), where we use the proposed method to make quality predictions and compare the results with existing state-of-the-art estimation methods.


Introduction
Failure analysis and process predictions for manufacturing processes have significant impacts on improving product quality and process reliability. A number of studies on failure cause analysis and classification using machine learning or deep learning methods are actively being conducted, which will lead to improvements in product quality and reliability. For instance, a failure of the brakes in a vehicle may lead to a significant accident, so it is important to predict any potential failures. In heavy vehicles such as trucks and excavators, the performance of the brakes is very important due to the heavy weight they must bring to a halt. The brakes of a truck are driven by an air pressure system, and therefore require pressurized air to operate. If the air pressure in the system does not reach a certain level, the brakes may not work precisely and serious accidents can occur. This research focuses on quality prediction of an air compressor and pressure system found in vehicles. The air pressure system (APS), an essential part of the vehicle, stops or decelerates the vehicle by applying compressed air to the brakes. Figure 1 depicts the APS structure.
As shown in Figure 1, the air compressor compresses air that is initially at atmospheric pressure and maintains the air at optimum pressure through the air compressor governor. The compressed air moves to the air reservoir. When the driver presses the brake pedal, the brake valve closes and compressed air stored in the air reservoir is used to brake by applying pressure to the brake chamber through the line.
It is important to predict defects and malfunctions in the APS. Well-considered analysis of data leads to reliable predictions of potential APS failures, which helps to minimize dangers and reduce maintenance costs. In this study, the data set for the failure analysis is from APS operations and the relevant quality data are from the Scania trucks data set [1]. This data set consists of the operating sensor attributes from a broken Scania truck, in which the attributes are anonymized. In order to investigate the causes of any APS failure, these attributes and their data are analyzed using several machine learning and statistical methods.
However, the data have a number of missing values due to sensor failures. Missing data can leave insufficient information for analysis by existing methods and can make it difficult to identify the actual cause of a failure. In addition, missing values cause disproportionately distributed data, which reduces the classification accuracy of models that are trained on the assumption of properly proportioned data. Figure 2 shows the issue of missing values in a real APS failure incident in the Scania trucks data set. Missing values are marked as not available (na). The data are used for the failure study of a vehicle. The positive class (pos) of the data set indicates that the failure is in the APS, and the negative class (neg) indicates that the failure is not related to the APS.
In APS failure analysis, the classification performance depends heavily on the data's completeness level. The more missing values in the data, the lower the completeness level is. Unfortunately, the Scania APS data contain a number of missing values, which may cause a lack of data for analysis. Therefore, it is important to have a framework that handles incomplete data. Missing values and data imbalances are a primary cause of less accurate quality prediction. In addition, several kinds of multivariate statistical analysis cannot be applied to this data set due to the missing values. This situation makes missing value estimation the most important preprocessing step. A number of existing studies [2][3][4][5][6][7][8][9][10][11][12][13] have proposed methods to overcome this data imbalance issue. Table 1 summarizes several existing missing value estimation methods and their applications.
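The notion of a completeness level can be made concrete with a short sketch. The paper does not give a formula, so the definition used here (the fraction of observed cells) and the helper names `parse_row`, `completeness`, and `missing_per_attribute` are assumptions made for illustration; only the `na` marker comes from the data set description.

```python
from typing import List, Optional

NA = "na"  # missing-value marker described for the Scania APS data


def parse_row(cells: List[str]) -> List[Optional[float]]:
    """Convert raw string cells to floats, mapping the 'na' marker to None."""
    return [None if c.strip().lower() == NA else float(c) for c in cells]


def completeness(rows: List[List[Optional[float]]]) -> float:
    """Assumed definition: fraction of cells that are observed (1.0 = complete)."""
    total = sum(len(r) for r in rows)
    observed = sum(1 for r in rows for v in r if v is not None)
    return observed / total if total else 1.0


def missing_per_attribute(rows: List[List[Optional[float]]]) -> List[int]:
    """Count the missing cells in each attribute (column)."""
    n_cols = len(rows[0])
    return [sum(1 for r in rows if r[j] is None) for j in range(n_cols)]


# Hypothetical toy records, not taken from the real data set
raw = [
    ["1.0", "na", "3.0"],
    ["na", "na", "0.5"],
    ["2.0", "4.0", "1.5"],
]
rows = [parse_row(r) for r in raw]
print(completeness(rows))           # 6 of 9 cells are observed
print(missing_per_attribute(rows))  # missing counts per column
```

A per-attribute count of this kind is one simple way to see which sensors contribute most to the incompleteness before any estimation is attempted.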

Table 1. Existing missing value estimation methods and their applications.
Research studies | Methods | Applications
Paul [2] | Multiple imputation (MI)-based missing value estimation | Imputation
Dempster, Laird, and Rubin [3] | Probability modeling maximum likelihood estimation (MLE)-based estimation | -
[4] | Expectation maximization-based estimation | -
Hastie et al. [5], Troyanskaya et al. [6] | Singular value decomposition (SVD) and K-nearest neighbor (KNN)-based missing data imputation | -
Zhang [7] | Regression-based imputation | -
Gondara and Wang [8] | Deep denoising autoencoder-based imputation | -
Gemmeke et al. [9] | Missing data imputation using sparse imputation-based compressive sensing (CS) | Estimation
Oba et al. [10] | Bayesian network-based preprocessing | Gene profiling expression
Little and Rubin [11] | Least squares-based missing data analysis | -
Gondek, Hafner, and Sampson [12] | Missing value imputation using random forest and feature engineering | -
Perepu and Tangirala [13] | Missing value estimation using a CS method with adaptive dictionary | -
Chodosh, Wang, and Lucey [14] | Estimating a dense depth map using a CS method and alternating direction neural networks | -

Most research methods resort to data imputation, where a data set with missing values is ignored. However, these kinds of methods distort the probability distribution of the captured data. In addition, the variance of the data that are used becomes heavily distorted. In order to overcome this issue, this paper proposes a novel, effective framework to generate a complete data set using a generative adversarial network (GAN) and Gaussian processes regression (GPR). The proposed framework is based on the symmetry principles that the original data and the generated data share. Scania APS data are used as exemplary test data and the proposed method is compared with other existing methods.
Section 2 examines the background of the key techniques we use (GPR and GAN) and reviews the relevant literature. Section 3 proposes an overall framework for generating a data set using GPR and GAN. Section 4 shows the proposed method's effectiveness with experimental analysis of the proposed framework and provides comparisons with other classification models using the APS data.

Gaussian Processes Regression
GPR [15][16][17][18][19] is a Bayesian algorithm that provides a statistical uncertainty measure along with its predictions. Because it can quantify prediction uncertainty even in changing environments, the GPR algorithm has been applied in various research fields, as shown in Table 2.
Table 2. Research studies and applications using Gaussian processes regression (GPR).

Research Studies Using GPR | Application Areas
Jochem et al. [20] | Automated spectral band analysis
Ak et al. [21] | The time and space prediction of an infectious disease
Luttinen and Ilin [22] | Sea level temperature reconstruction using GPR
Nguyen and Peters [23] | Kinetics model estimation
Nguyen, Hu and Spanos [24] | Efficient building field formation using an estimation of indoor environment fields
Chen et al. [25] | Wind prediction for energy efficiency
Oh and Lee [26] | Estimation of pheromone values based on ant colony optimization

Figure 3 depicts the general process of GPR. The latent variable $f_i$ is derived from the input value $X_i$ with the observed value $Y_i$. The distribution of the test observation value $Y_*$ is estimated using the Gaussian field $f_*$ for the input value $X_*$.
Symmetry 2020, 12, x FOR PEER REVIEW 5 of 21
The training data set is $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, while Equations (1) and (2) summarize the general GPR model.
$$Y = f(X) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_y^2 I) \tag{1}$$
$$f(x) \sim \mathcal{GP}\left(m(x), k(x, x_*)\right) \tag{2}$$
In Equation (1), $\varepsilon$ is a noise parameter that follows a Gaussian distribution with a variance of $\sigma_y^2$ and a mean of 0. Here, $I$ is an identity matrix constructed according to the data's dimensions, $f$ is the transformation relation-based value for the equivalent input vector $X$, and $Y$ is the observed output vector. Equation (2) is the "distribution over functions" [19] used to derive the test target value using the Gaussian model. The distribution consists of a mean $m(x)$ and a covariance $k(x, x_*)$ derived by sampling from a multivariate Gaussian distribution. The covariance function $k(x, x_*)$ is commonly parameterized by a kernel parameter and models the dependence between the existing observed input points $x$ and the new test input points $x_*$ to be predicted. Equation (3) is a radial basis kernel function that calculates a similarity measure between two data instances. The kernel is a closeness measure of data points. It is used to model the dependence not just between observed and unobserved points, but between all points.
$$k(x, x_*) = \exp\left(-\gamma \lVert x - x_* \rVert^2\right) \tag{3}$$
where $\gamma$ is a hyperparameter of the kernel function. The parameter $\gamma$ is related to the variance of a data set. After the mean function and kernel type are selected, the Gaussian process makes predictions based on previous observations. However, actual data obtained from measurements generally include a lot of noise. Therefore, an observed output vector $Y$ with its noise is expressed as in Equation (4).
In Equation (4), the observed output vector $Y$, with a mean of 0 and a covariance of $k + \sigma_y^2 I$, follows the properties of a multivariate Gaussian distribution and is used as a prediction model. Equation (5) denotes the prior distribution for $f_*(X_*)$ with the noise condition. Here, $f_*(X_*)$ is a function that outputs a predicted vector $Y_*$ for a vector $X_*$, which contains a new input point $x_*$. It is a prior distribution model that considers the noise generated when the test data vector $X_*$ is input and the output data are predicted through the function value $f_*(X_*)$.
The maximum likelihood estimator (MLE) and the variance of the predicted distribution derived using the newly updated Gaussian process are expressed as seen in Equations (6) and (7).
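For reference, the expressions that Equations (4)-(7) refer to can be sketched in the usual textbook notation for GPR [19]. This is a reconstruction under that assumption rather than the paper's own typesetting; the shorthand $K = k(X, X)$, $K_* = k(X, X_*)$, and $K_{**} = k(X_*, X_*)$ is introduced here for brevity.

```latex
% (4) Noisy observations under a zero-mean GP prior
Y \sim \mathcal{N}\left(0,\; K + \sigma_y^2 I\right)

% (5) Joint prior of the observations and the test function values
\begin{bmatrix} Y \\ f_* \end{bmatrix}
\sim \mathcal{N}\left(0,\;
\begin{bmatrix} K + \sigma_y^2 I & K_* \\ K_*^{\top} & K_{**} \end{bmatrix}\right)

% (6) Predictive mean
\bar{f}_* = K_*^{\top}\left(K + \sigma_y^2 I\right)^{-1} Y

% (7) Predictive covariance
\operatorname{cov}(f_*) = K_{**} - K_*^{\top}\left(K + \sigma_y^2 I\right)^{-1} K_*
```

Conditioning the joint Gaussian in (5) on the observed $Y$ is what yields (6) and (7), which is why the same matrix $K + \sigma_y^2 I$ appears in every expression.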
In this study, the introduced GPR framework is used to estimate missing values in our real-world imbalanced data set. The GPR process helps produce more accurate modeling from the data. The following section explains the GAN method, which generates a new data set using the GPR-based missing value estimations.

Generative Adversarial Network
GAN [27] is a generative model that uses a neural network architecture. A general GAN framework includes two models, both of which are neural networks: a generator (G) and a discriminator (D). By competing with each other, the two models produce generated data that become progressively closer to the real data set. GANs have been used in many research fields, as shown in Table 3.
Table 3. Research studies and applications using a generative adversarial network (GAN).

Research Studies Using GAN | Application | Characteristics
Kim and Lee [28] | Missing data generation of semiconductor manufacturing processes data | Method: oversample → GAN-based data generation
Yoon, Jordon, and Schaar [29] | Missing data imputation of breast cancer, spam, letter recognition, credit, news data | GAN-based hint generation
Kim and Lee [30] | Missing data generation of steel plates faults data | Estimate the missing value by adding a missing term based on the GAN
Shang et al. [31] | Image generation | GAN-based missing view imputation
Mao et al. [32] | Image generation | Least squares loss function-based discriminator in a GAN
Zhao, Mathieu, and LeCun [33] | Image generation | Energy value allocation according to data density-based discriminator in a GAN
Li et al. [34] | Object detection | GAN-based high-quality image generation

As shown in Figure 4, the generator G takes a vector $z$ extracted from random noise as its input and attempts to generate data that are close to the real data. The discriminator D learns how to distinguish between real data and the generated fake data. During training, while this process is being repeated, the generator minimizes the probability that the discriminator can distinguish real from generated data, and the discriminator maximizes the probability of distinguishing real data from the generated data.
As shown in Figure 4, the sample data extracted from the real data are represented by $x$, and the distribution of the real data is $p_{data}(x)$. The distribution of the data from the generator is $p_g$, and the input noise variable $z$ has the prior $p_z(z)$. The discriminator and the generator are differentiable multilayer perceptrons with $\theta_d$ and $\theta_g$ as parameters, respectively.
$D(x)$ is the probability that $x$ comes from the real data distribution. $D(G(z))$ is the probability that the discriminator assigns to a generated sample $G(z)$, which comes from $p_g$ and not from the real data distribution, being real. $D(x)$ should point to 1 and $D(G(z))$ should point to 0, resulting in a min-max problem. The objective function of GAN is shown in Equation (8).
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{8}$$
According to the distributions ($p_g$ and $p_{data}(x)$), the discriminator learns to distinguish between the real and the fake, and the generator learns to produce a distribution that is similar to the real data in order to prevent the discriminator from easily distinguishing what is fake. If this learning process is repeated, $p_g = p_{data}(x)$ is reached, at which point the discriminator can no longer distinguish real data from generated data. Then, the converged $D(x)$ follows Equation (9).
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \tag{9}$$
In Equation (8), the optimum value is obtained at $p_g = p_{data}(x)$, so the value of $D^*(x)$ is $\frac{1}{2}$. The generator learns in such a way that $D^*(x)$ becomes close to $\frac{1}{2}$. This research applies the GAN framework to the correction of missing data after the GPR correction. The detailed framework is provided in Section 3.
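The convergence argument above can be checked numerically on a small discrete example. The sketch below is illustrative only: the three-point distributions are hypothetical, and the value function is evaluated as a sum over $x$, which is equivalent to the expectations in the GAN objective when the generated samples $G(z)$ are distributed as $p_g$.

```python
import math
from typing import List


def optimal_discriminator(p_data: float, p_g: float) -> float:
    """Optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data / (p_data + p_g)


def value_function(p_data: List[float], p_g: List[float], d: List[float]) -> float:
    """V(D, G) on a discrete support: sum of p_data*log(D) + p_g*log(1 - D)."""
    return sum(
        pd * math.log(dx) + pg * math.log(1.0 - dx)
        for pd, pg, dx in zip(p_data, p_g, d)
    )


# Hypothetical three-point distributions (each sums to 1)
p_data = [0.2, 0.5, 0.3]
p_g_poor = [0.6, 0.2, 0.2]     # generator far from the real distribution
p_g_matched = list(p_data)     # generator that has matched the real data

d_poor = [optimal_discriminator(a, b) for a, b in zip(p_data, p_g_poor)]
d_matched = [optimal_discriminator(a, b) for a, b in zip(p_data, p_g_matched)]

v_poor = value_function(p_data, p_g_poor, d_poor)
v_matched = value_function(p_data, p_g_matched, d_matched)

print(d_matched)            # every entry is 0.5 once p_g equals p_data
print(v_matched, v_poor)    # the matched generator attains the lower value
```

When the two distributions coincide, the value function equals $-\log 4$, which is the minimum the generator can reach against an optimal discriminator.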

Generative Adversarial Network-Based Missing Value Estimation Framework
In general, real data from manufacturing processes contain a number of missing values. This causes a lack of data and, as a result, data imbalance. Issues such as data shortages and data imbalances make it difficult to analyze industrial data accurately. This section explains a new and effective framework to estimate the missing values and generate data that are closer to the real data distribution. Figure 5 shows the detailed procedure for generating a data set from data that include missing values. As shown in Figure 5, the missing values of the original data are indicated as not available (na). First, if na exists, it is replaced with the average value of the attribute in which the value is missing. The average value $e_{hl}$ is derived according to Equation (10).
where $n$ is the number of remaining (non-missing) values in the attribute that contains the missing value, $h$ is the number of instance vectors, and $l$ is the number of attributes. Then, an approximate estimation of the missing value is achieved using GPR. In this case, GPR is applied to an instance vector.
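The first step, replacing each na with the average of the observed values of its attribute, can be sketched as follows. The function names and the toy matrix are hypothetical, and the fallback of 0.0 for an attribute with no observed values is an assumption that the text does not cover.

```python
from typing import List, Optional

Row = List[Optional[float]]


def column_means(rows: List[Row]) -> List[float]:
    """Mean of each attribute over its n observed (non-missing) values."""
    n_cols = len(rows[0])
    means = []
    for l in range(n_cols):
        observed = [r[l] for r in rows if r[l] is not None]
        # Assumption: an attribute with no observed value falls back to 0.0
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def mean_impute(rows: List[Row]) -> List[List[float]]:
    """Replace every missing cell (None) with the mean of its attribute."""
    means = column_means(rows)
    return [[means[l] if v is None else v for l, v in enumerate(r)] for r in rows]


# Hypothetical data: 3 instance vectors with 3 attributes, None marks 'na'
data = [
    [1.0, None, 3.0],
    [3.0, 4.0, None],
    [None, 6.0, 5.0],
]
filled = mean_impute(data)
print(filled)
```

The averages only serve as starting values here; in the framework they are subsequently refined by GPR and then by the GAN.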
Equation (11) is derived using the prediction procedure provided in Section 2.1. The missing value is estimated by predicting a new estimate $Y_*$ through Equations (12)-(14). Missing values are predicted based on GPR, and $e_{hl}$ is updated accordingly.
In Equation (12), $X$ is an input vector with $l$ attributes and $k$ is the index of the missing attribute. Equation (13) is the distribution of the latent variables in the $k$th attribute, where the missing value occurs, and the GPR-based correction value $p_{hk}$ for the missing value is given by Equation (13). Then, $e_{hk}$ is transformed to $Y_k$ using Equation (14).
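One possible reading of this step is sketched below with NumPy: the attribute positions of a single instance vector act as the GP inputs, the observed attribute values act as the outputs, and the missing position $k$ is predicted from the posterior mean. This reading, the zero-mean GP on centered values, and the settings `gamma=0.5` and `noise_var=1e-2` are all assumptions for illustration; the paper's Equations (11)-(14) may define the inputs differently.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """Radial basis kernel exp(-gamma * (x - x*)^2) for all pairs of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-gamma * diff ** 2)


def gpr_predict(x_train, y_train, x_test, gamma=0.5, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP with an RBF kernel."""
    K = rbf_kernel(x_train, x_train, gamma) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, gamma)
    K_ss = rbf_kernel(x_test, x_test, gamma)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)


def impute_instance(values, k, gamma=0.5, noise_var=1e-2):
    """Re-estimate attribute k of one instance vector from its other attributes."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(len(values), dtype=float)
    keep = idx != k
    center = values[keep].mean()  # center so that a zero-mean prior is reasonable
    mean, var = gpr_predict(
        idx[keep], values[keep] - center, np.array([float(k)]), gamma, noise_var
    )
    return float(mean[0] + center), float(var[0])


# Hypothetical instance vector; position 3 holds a placeholder that is ignored
instance = [1.0, 1.2, 1.4, 0.0, 1.8, 2.0]
estimate, variance = impute_instance(instance, k=3)
print(estimate, variance)
```

The returned variance gives a rough indication of how much the estimate should be trusted, which is the uncertainty measure that motivates using GPR for this step.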
Finally, GAN is used to estimate missing values. The discriminator distinguishes the real instance vector distribution from the generated instance vector distribution, and the generator produces a new instance vector distribution based on the error generated by the discriminator.
The discriminator derives gradients using the backpropagation algorithm to maximize Equation (8), while the generator derives the relevant gradient using the backpropagation algorithm to minimize $f_G(x_g)$ during the learning process. In this study, the generator gradient is derived to maximize $\mathbb{E}_{z \sim p_z(z)}[\log(D(G(z)))]$ in order to increase the convergence speed of learning. The learning process using the backpropagation algorithm for the discriminator and generator is shown in Figure 6. Figure 6a shows the discriminator's learning process that is used to distinguish whether the input data are from real data or are generated data. Figure 6b shows the learning process of a discriminator to distinguish whether the input data are real or not. Figure 6c shows the learning process in which the generator generates data from random noise values.
Equations (15)-(17) summarize the backpropagation process of the discriminator, and Equations (18) and (19) summarize the backpropagation process of the generator. The backpropagation process of the discriminator derives the gradients for $D(x)$ and $D(G(z))$, as shown in Equations (16) and (17), respectively.
Here, $w$ is the weight of the neural network, and the sigmoid function is used as its activation function in this paper. Based on the output derived by the discriminator learning, the generator updates $w$ through the gradient to produce newly generated data. When $D(x)$ converges through repetition, the estimation process is terminated.
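The reason that maximizing $\log D(G(z))$ speeds up convergence can be illustrated with a sigmoid output unit. The sketch below is not the paper's Equations (15)-(19); it only compares, under the assumption that $D = \mathrm{sigmoid}(a)$ for a logit $a$, the gradient of the original generator term $\log(1 - D)$ with the gradient of the alternative term $\log D$ when the discriminator confidently rejects a generated sample.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic activation used for the discriminator output."""
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(a: float) -> float:
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): original minimax generator term."""
    return -sigmoid(a)


def non_saturating_grad(a: float) -> float:
    """d/da log(sigmoid(a)) = 1 - sigmoid(a): term that is maximized instead."""
    return 1.0 - sigmoid(a)


def numeric_grad(f, a: float, eps: float = 1e-6) -> float:
    """Central finite difference, used only to check the closed forms."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)


# Hypothetical logit early in training: D(G(z)) is close to 0
logit = -4.0
print(sigmoid(logit))
print(saturating_grad(logit), non_saturating_grad(logit))
```

With a logit of -4, the original term passes back a gradient of magnitude below 0.02, while the alternative term passes back a magnitude close to 1, so the generator receives a much stronger learning signal in the early iterations.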
The missing value $p_{hk}$ in Equation (20) is estimated, and a new data set is generated using Equation (21). The data generated through these processes are considered balanced data.
In order to show the effectiveness of the GPR-based GAN, time series data with missing values were tested. Figure 7a shows the original time series data, generated from an autoregressive moving-average (ARMA) model, ARMA(1,1), with the Gaussian noise N(0, 2). The number of data points is 1024 (N = 1024). Among them, 100 points are randomly picked as missing parts. Then, GPR-GAN is applied to estimate the missing values. Figure 7b shows the data gap between the original data and the newly generated data. In order to measure the accuracy of the generation method, Equation (22) is applied.
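A test series of this kind can be simulated as in the sketch below. Only the model order, the series length of 1024, and the 100 missing points come from the text; the coefficients `phi=0.5` and `theta=0.3`, the seeds, and the reading of N(0, 2) as a variance of 2 are assumptions.

```python
import random
from typing import List, Optional, Tuple


def simulate_arma11(n: int, phi: float, theta: float,
                    noise_var: float, seed: int = 0) -> List[float]:
    """Simulate x_t = phi * x_{t-1} + e_t + theta * e_{t-1}, e_t ~ N(0, noise_var)."""
    rng = random.Random(seed)
    sigma = noise_var ** 0.5  # assumption: N(0, 2) denotes a variance of 2
    series, prev_x, prev_e = [], 0.0, 0.0
    for _ in range(n):
        e = rng.gauss(0.0, sigma)
        x = phi * prev_x + e + theta * prev_e
        series.append(x)
        prev_x, prev_e = x, e
    return series


def mask_random_points(series: List[float], n_missing: int,
                       seed: int = 1) -> Tuple[List[Optional[float]], List[int]]:
    """Mark n_missing randomly chosen positions as missing (None)."""
    rng = random.Random(seed)
    missing = set(rng.sample(range(len(series)), n_missing))
    masked = [None if i in missing else v for i, v in enumerate(series)]
    return masked, sorted(missing)


# phi and theta are hypothetical; the text does not state the coefficients
series = simulate_arma11(1024, phi=0.5, theta=0.3, noise_var=2.0)
masked, missing_idx = mask_random_points(series, 100)
print(len(series), len(missing_idx))
```

Keeping the unmasked `series` alongside `masked` makes it possible to compare any estimate at the positions in `missing_idx` with the true values.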
The proposed method has 91.23% accuracy in the provided numerical test. In order to show the effectiveness of the proposed framework, randomly generated data and their pass/fail outputs are considered. Figure 8a shows a data set from a randomized time series with Gaussian noise N(0, 2). The data set consists of 50 data points, each of which is composed of 20 attributes. Their outputs are divided randomly into 44 passes (1) and 6 fails (0). Then, as shown in Figure 8b, several selected sections are treated as sections with missing values. Figure 8c shows a newly generated data set using the proposed framework. Finally, pass/fail predictions are conducted using the randomly generated original data and the generated data, as shown in Figure 8d.
The following section shows how the proposed framework is effective using the real data set.

Data Issues in Air Pressure System and Numerical Analysis
This section demonstrates the effectiveness of the proposed framework and compares it with other existing methods. As discussed in previous sections, the proposed framework has the advantage of high classification performance, which comes from the accurate interpolation of missing values by the GPR-based GAN framework.
In order to show the effectiveness of the proposed framework, real-world data are used, specifically the APS failure Scania trucks data set [1]. The data consist of 170 attributes and 60,000 instance vectors. The training data are divided into 59,000 negative cases and 1000 positive cases of APS-based faults. The test data are divided into 15,625 negative cases and 375 positive cases. Several attributes in each piece of data have missing values. Table 4 shows the number of missing values in the APS data. In order to estimate the missing values, the proposed framework is applied. Figure 9a shows the applied structure of the proposed framework. The generator regenerates the data by reflecting the discriminator's objective value, which distinguishes whether the data points generated by the generator are real data points or not. The deep neural network (DNN) is trained and tested using the interpolated data, as shown in the red box in Figure 9b. A pass (1) or fail (0) is diagnosed using the DNN model. The DNN model has one input layer, multiple hidden layers, and one output layer. The hyperparameters for the experiment are set to 0.001 for the learning rate, 28 for the mini-batch size, 100 for the maximum number of epochs, and 0.5 for the momentum.
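The degree of class imbalance implied by the case counts quoted above can be worked out directly. The counts are taken from the text; the helper names `positive_share` and `imbalance_ratio` are introduced here for illustration.

```python
def positive_share(neg: int, pos: int) -> float:
    """Fraction of instance vectors that belong to the positive class."""
    return pos / (neg + pos)


def imbalance_ratio(neg: int, pos: int) -> float:
    """Number of negative cases per positive case."""
    return neg / pos


# Case counts quoted in the text
train_neg, train_pos = 59_000, 1_000
test_neg, test_pos = 15_625, 375

print(positive_share(train_neg, train_pos), imbalance_ratio(train_neg, train_pos))
print(positive_share(test_neg, test_pos), imbalance_ratio(test_neg, test_pos))
```

Fewer than 2% of the training vectors are positive, so a classifier that always predicts the negative class would still reach over 98% accuracy on the training data, which is why accuracy alone is a weak indicator for this data set.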
In order to verify the effectiveness of the proposed framework, it is compared with other existing methods, including the classification and regression tree (CART), GPR, K-means, mean-based GAN, and compressed sensing (CS) methods. Table 5 summarizes these models and the relevant parameters.
Table 5. Models and parameters for each testing algorithm. Note: classification and regression tree = CART; compressed sensing = CS.

Tested Frameworks Equation Parameter
GPR-based GAN (Proposed framework)
Table 6 shows the results of the confusion matrix experiment with test data for each classification model. As shown in Table 6 and Figure 10, the proposed GPR-based GAN framework shows the lowest rates of type-I and type-II errors.
Table 7 summarizes several performance evaluation indicators using the confusion matrix from Table 6. A true positive (TP) is given if a pass is indicated in real data and a pass is indicated by the classification model. A false negative (FN) is given if a pass is indicated in real data but a fail is indicated by the classification model. A false positive (FP) is given if a fail is indicated in real data but a pass is indicated by the classification model.
A true negative (TN) is given if a fail is indicated in real data and a fail is indicated by the classification model. As outlined in Table 7, the definition of "precision" is the number of TP divided by the number of TP plus FP, while the "recall" is the number of TP divided by the number of TP plus FN. The "precision" and the "recall" handle the cases where the classification model classifies a pass when the real data indicates a pass. The "fall-out" is the number of FP divided by the number of TN plus FP. It handles misclassifications by the classification model when the real data indicate a fail but the classification model gives a pass. The "accuracy" is the number of TP plus TN divided by the sum of TP, TN, FP, and FN. "Accuracy" handles cases where both passes and fails are correctly classified. This is used as the main performance evaluation indicator.
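The four indicators defined above can be computed directly from the confusion-matrix counts. The sketch below follows the definitions in the text; the function name and the example counts are hypothetical and are not the values reported in Table 6.

```python
def classification_indicators(tp, fn, fp, tn):
    """Return precision, recall, fall-out, and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),                  # TP / (TP + FP)
        "recall": tp / (tp + fn),                     # TP / (TP + FN)
        "fall_out": fp / (tn + fp),                   # FP / (TN + FP)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # (TP + TN) / all cases
    }

# Hypothetical counts for illustration only; these are NOT taken from Table 6.
indicators = classification_indicators(tp=90, fn=10, fp=5, tn=95)
```

A lower fall-out means that fewer real fails are reported as passes, so it is read in the opposite direction to accuracy.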
As shown in Table 7 and Figure 11, the accuracy of the proposed framework is 98.3% and the fall-out is 15.7%. These experimental results indicate that missing value handling with the proposed framework performs better than the other existing methods.
Figure 11. Comparison graph of "fall-out" and "accuracy" among four test models.

Conclusions
Failure analysis and relevant predictions generated from industrial big data are essential processes for industry in order to produce high quality and reliable products. However, most industrial big data sets are incomplete due to various issues. In these situations, existing algorithms fail to provide accurate corrections for the missing data. Therefore, any classification task executed on these kinds of incomplete data sets shows very poor performance.
In order to overcome this issue, the proposed framework generates a new complete data set using a GPR-based GAN. The framework is based on symmetry properties. First, the missing values are replaced with the mean value of the corresponding attribute. Then, the missing value estimates are refined by applying GPR. The data characteristics from this GPR process are linked to the GAN. Finally, the GAN further refines these estimates and generates new data that are similar to the real data. The generated data are used as training data and help to overcome any data imbalances in the input data set.
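The first two steps of this pipeline, mean replacement followed by GPR refinement, can be illustrated on a single attribute. This is a simplified sketch under stated assumptions rather than the framework itself: the squared-exponential kernel, the length scale, the noise level, the use of the row position as the GPR input, the use of the attribute mean as the prior mean, and all function names are hypothetical choices, and the GAN stage is omitted.

```python
import math

def rbf(a, b, length_scale):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def impute_column(values, length_scale=1.5, noise=1e-2):
    """Fill None entries with the column mean, then refine them with a GP posterior mean."""
    obs_idx = [i for i, v in enumerate(values) if v is not None]
    mean = sum(values[i] for i in obs_idx) / len(obs_idx)
    # Step 1: replace missing values with the attribute mean
    mean_filled = [mean if v is None else v for v in values]
    # Step 2: fit a GP to the observed residuals around the mean (assumed prior mean)
    K = [[rbf(i, j, length_scale) + (noise if i == j else 0.0) for j in obs_idx]
         for i in obs_idx]
    alpha = solve(K, [values[i] - mean for i in obs_idx])
    refined = list(mean_filled)
    for t, v in enumerate(values):
        if v is None:
            refined[t] = mean + sum(rbf(t, i, length_scale) * a
                                    for i, a in zip(obs_idx, alpha))
    return mean_filled, refined

column = [1.0, 2.0, None, 4.0, 5.0, None, 7.0]
mean_filled, refined = impute_column(column)
```

In this toy column the mean fill places 3.8 in both gaps, whereas the GPR step pulls each estimate toward its neighbouring observations, which is the kind of refinement the second step is meant to provide.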
In order to evaluate the performance of the proposed framework, it is compared with existing classification models using a real industrial data set related to APS failure in Scania trucks. The numerical analysis shows that the proposed framework has higher accuracy and lower fall-out than the existing classification models, which confirms its effectiveness.
The proposed framework can be used to estimate missing values in a data set with a high frequency of missing data. In addition, industrial data sets that are highly distorted with large data imbalances can be successfully analyzed using the proposed framework. In future studies, we hope to incorporate more efficient computation methods to handle data from multiple industrial areas.
