Manufacturing Quality Prediction Using Intelligent Learning Approaches: A Comparative Study

Abstract: Against the international background of manufacturing transformation and upgrading, the Chinese government proposed the “Made in China 2025” strategy, which focuses on improving quality-based innovation ability. Predicting manufacturing quality is one of the crucial measures for quality management, and accurate prediction is closely related to the feature learning of manufacturing processes. Therefore, two categories of intelligent learning approaches, i.e., shallow learning and deep learning, are investigated and compared for manufacturing quality prediction in this paper. Specifically, the feed forward neural network (FFNN) with one hidden layer and the least squares support vector machine (LSSVM) with no hidden layers are selected as the representatives of shallow learning, and the deep restricted Boltzmann machine (DRBM) and the stack autoencoder (SAE) are chosen as the representatives of deep learning. The manufacturing data are collected from a competition on manufacturing quality control in the Tianchi Data Lab of China. The experiments show that the deep framework outperforms the shallow architecture in terms of mean absolute percentage error, root-mean-square error, and threshold statistics. In addition, the prediction results indicate that performance depends on the size of the training data: the bigger the sample size is, the better the performance is.


Introduction
To achieve the transformation and upgrade of China's manufacturing, the "Made in China 2025" plan [1] proposed a basic guideline with innovation-driven, quality-first, green development, structure optimization, and talent-oriented objectives. Therefore, quality, as the lifeline of manufacturing, has attracted the attention of manufacturers and researchers. To control and improve manufacturing quality, many techniques have been introduced into the manufacturing process. Among them, manufacturing quality prediction, as one of the effective ways to control and improve manufacturing quality, has been developed using various data mining techniques.
Statistical quality control [2] based on cause-effect relationships, e.g., linear regression [3], non-linear regression [4], inference learning [5], and expert systems [6], has been widely used to assess the quality performance of manufacturing processes. The successful application of these approaches relies on stable or constant production processes, which makes them unsuitable for the fast-increasing complexity and high dimensionality of modern manufacturing. To address this issue, artificial intelligence (AI) has attracted researchers' attention because of its self-learning ability, which does not require an explicit model of the manufacturing process [7][8][9][10]. Artificial neural networks (ANNs) and machine learning (ML) are two typical representatives of AI techniques, and have been applied successfully to manufacturing quality prediction, e.g., self-organizing neural networks [11], back propagation neural networks (BPNNs) [12], radial basis function neural networks [13], probabilistic neural networks [14], support vector machines (SVMs) [15], and extreme learning machines [16]. When affected by multiple parameters from multi-stage manufacturing processes, ANN and ML models exhibit feature learning difficulties and computational complexity due to their "shallow" architecture, i.e., the model has one hidden layer or none at all (a traditional ANN has one hidden layer, and classical ML is based on a kernel function without a hidden layer). To improve prediction accuracy, it is thus imperative to enhance the feature learning capability using a "deep" representation technique.
In 2006, the deep learning (DL) technique was proposed [17], and it has since become a hot research topic in AI. It has proven effective in many fields, e.g., fault diagnosis [18], pattern recognition [19], and time series forecasting [20,21]. Compared with the "shallow" models, DL has many hierarchical hidden levels, that is, the information representation is passed from lower levels to higher levels, which makes the representation more abstract and nonlinear at the higher levels. Through these hierarchical representations, the "deeper" features of multi-parameter manufacturing quality can be fitted sufficiently by regression models [22]. To the best of our knowledge, little literature has reported on applications of the deep framework to manufacturing quality prediction. Therefore, the DL technique offers a new possibility for manufacturing quality prediction.
This paper compares two feature learning patterns, shallow and deep, to investigate their performance in predicting manufacturing quality. The models considered are the feed forward neural network (FFNN), the least squares support vector machine (LSSVM), the deep restricted Boltzmann machine (DRBM), and the stack autoencoder (SAE). To reveal the feature learning capacity of the four models, two kinds of manufacturing data with multiple parameters are involved.
The rest of the paper is organized as follows. Section 2 introduces the FFNN, the LSSVM, the DRBM, and the SAE. Section 3 presents the application data. Section 4 gives the results with relevant discussion. Section 5 concludes this study.

Sustainability 2018, 10, 85

Methodologies
As stated in the Introduction, both shallow and deep learning belong to the ANN and related machine learning algorithms. The significant difference is the structure depth (Figure 1), i.e., shallow learning includes only one hidden layer or none at all, and deep learning contains more than one hidden layer. From Figure 1, one can clearly find that deep learning adopts a cascade of many hidden layers for feature extraction and transformation, and higher level features are derived from lower level features to form a hierarchical representation. Hence, deep learning can be regarded as an intensified version of shallow learning. To investigate learning performance, four typical approaches are introduced briefly in the following subsections, i.e., FFNN with one hidden layer, LSSVM with no hidden layers, and DRBM and SAE with many hidden layers.

Feed Forward Neural Network
The classical FFNN propagates inputs through a network with one input layer, one hidden layer, and one output layer to make a prediction (Figure 1a). In the FFNN architecture, the artificial neurons are organized in layers, the information flows strictly forward, and the errors of the network are propagated backwards. The expressions of the FFNN are as follows [23]:
$$h_j = f_{hidden}\left(\sum_{i=1}^{m} w_{ij} x_i\right), \qquad y_k = f_{output}\left(\sum_{j=1}^{n} w_{jk} h_j\right) \qquad (1)$$
where $x_i$ (i = 1, 2, ..., m) represents the inputs, $h_j$ (j = 1, 2, ..., n) represents the outputs of the hidden layer, $y_k$ (k = 1, 2, ..., p) represents the outputs, $w_{ij}$ and $w_{jk}$ represent the weight matrices between adjacent layers, and $f_{hidden}(\cdot)$ and $f_{output}(\cdot)$ are the transfer functions in the hidden layer and the output layer, respectively. To update the weights w effectively, the well-known back propagation (BP) algorithm is used for training the FFNN [24].
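The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration: the logistic hidden transfer function, the linear output transfer function, the absence of bias terms, and the names `sigmoid` and `ffnn_forward` are assumptions, not details taken from [23].

```python
import math


def sigmoid(z):
    """Logistic transfer function (an assumed choice for f_hidden)."""
    return 1.0 / (1.0 + math.exp(-z))


def ffnn_forward(x, w_ih, w_ho, f_hidden=sigmoid, f_output=lambda z: z):
    """Forward pass of a one-hidden-layer FFNN.

    x    : list of m inputs x_i
    w_ih : m x n weights, w_ih[i][j] connects input i to hidden node j
    w_ho : n x p weights, w_ho[j][k] connects hidden node j to output k
    Returns the hidden outputs h and the network outputs y.
    """
    m, n, p = len(x), len(w_ih[0]), len(w_ho[0])
    h = [f_hidden(sum(w_ih[i][j] * x[i] for i in range(m))) for j in range(n)]
    y = [f_output(sum(w_ho[j][k] * h[j] for j in range(n))) for k in range(p)]
    return h, y


# Hypothetical toy network: 2 inputs, 2 hidden nodes, 1 output.
h, y = ffnn_forward([0.0, 0.0], [[0.3, -0.2], [0.1, 0.4]], [[1.0], [1.0]])
```

With all inputs at zero, every hidden node outputs 0.5 under the logistic function, so the single output is the sum of the two output weights times 0.5.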

Least Squares Support Vector Machine
For a given dataset, the goal of the LSSVM for regression is to find an optimal relationship between inputs x and outputs y in the feature space, $y = \omega^T \phi(x) + b$ (Figure 1a), where $\phi(x)$ denotes the nonlinear mapping function, $\omega$ is the weight vector, and b is the bias term. Moreover, the objective function of the LSSVM is given by
$$\min_{\omega, b, \xi} \; \frac{1}{2}\omega^T\omega + \frac{\gamma}{2}\sum_{i=1}^{q}\xi_i^2 \quad \text{s.t.} \quad y_i = \omega^T\phi(x_i) + b + \xi_i, \quad i = 1, 2, \ldots, q \qquad (2)$$
where $\xi$ is the error variable, and $\gamma > 0$ is the penalty coefficient.
Transforming this quadratic programming problem into its corresponding dual optimization problem and introducing a kernel function to achieve non-linearity yields the optimal regression function [25]:
$$y(x) = \sum_{i=1}^{q} \alpha_i k(x, x_i) + b \qquad (3)$$
where q is the length of the dataset, $\alpha_i$ is the Lagrange multiplier, and $k(\cdot)$ represents the kernel function.
Generally, the radial basis function (RBF) with kernel bandwidth λ is chosen as the kernel function.
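A minimal sketch of LSSVM regression is given below, assuming the standard dual formulation in which the multipliers α and the bias b solve one linear system. The RBF form exp(−‖x − x_i‖² / (2λ²)), the helper names (`rbf`, `solve`, `lssvm_fit`, `lssvm_predict`), and the toy data are illustrative assumptions; the experiments in this paper were run in Matlab, and this is not that implementation.

```python
import math


def rbf(a, b, lam):
    """RBF kernel; the 2*lam**2 scaling is an assumed convention."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * lam ** 2))


def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def lssvm_fit(X, y, gamma, lam):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    q = len(X)
    A = [[0.0] + [1.0] * q]
    for i in range(q):
        A.append([1.0] + [rbf(X[i], X[j], lam) + (1.0 / gamma if i == j else 0.0)
                          for j in range(q)])
    z = solve(A, [0.0] + list(y))
    return z[0], z[1:]  # bias b, multipliers alpha


def lssvm_predict(x, X, b, alpha, lam):
    """Regression function: sum_i alpha_i * k(x, x_i) + b."""
    return sum(a * rbf(x, xi, lam) for a, xi in zip(alpha, X)) + b


# Hypothetical toy data: three one-dimensional samples.
X = [[0.0], [0.5], [1.0]]
y = [0.1, 0.4, 0.9]
b, alpha = lssvm_fit(X, y, gamma=1e6, lam=0.3)
```

A large γ penalizes the errors heavily, so the fitted function passes very close to the training targets; smaller values trade training accuracy for smoothness.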

Deep Restricted Boltzmann Machine
As introduced above, a DRBM is a stack of restricted Boltzmann machines (RBMs). After an RBM (Figure 2) has been learned, the activities of its hidden units can be used as the data for learning a higher-level RBM. Note that when l = 1, $h^0 = x$ (also called the visible nodes v in the RBM). For an RBM, the energy function E(v, h|θ) for real-valued data normalized into [0, 1] is given by Equation (5) [26], where θ = (w, b, a) is the parameter set, w is the symmetric weight between the hidden layers l−1 and l, b and a are their biases, σ is the standard deviation, and V and H denote the number of visible and hidden units, respectively.
The conditional probability distributions P are given by Equations (6) and (7), where Z(b, σ) represents a Gaussian probability density function.
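For reference, one commonly used Gaussian-Bernoulli RBM formulation that matches the symbols above is sketched here. The exact scaling by σ, the logistic function written as sigm, and the notation $\mathcal{N}$ for the Gaussian density are assumptions and may differ from the form used in [26].

```latex
E(v, h \mid \theta) = \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2\sigma_i^2}
  - \sum_{i=1}^{V}\sum_{j=1}^{H} \frac{v_i}{\sigma_i} w_{ij} h_j
  - \sum_{j=1}^{H} a_j h_j

P(h_j = 1 \mid v) = \mathrm{sigm}\!\left(a_j + \sum_{i=1}^{V} w_{ij}\frac{v_i}{\sigma_i}\right)

P(v_i \mid h) = \mathcal{N}\!\left(b_i + \sigma_i \sum_{j=1}^{H} w_{ij} h_j,\; \sigma_i^2\right)
```

Under this form, the hidden units are binary with logistic activation probabilities, and each visible unit is Gaussian with a mean shifted by the hidden states.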
To solve the functions above, Hinton [27] proposed a contrastive divergence algorithm: (1) initialize v using the input data, and compute h according to the conditional probability distributions (Equation (6)); (2) obtain the reconstruction state v' based on Equation (7) using h, and repeat Equation (6) to update the hidden nodes using v', obtaining h'. The weight update is the learning rate η multiplied by the difference between the expectation <·> of the product of visible and hidden states over the training data and that over the reconstruction.
Then, one can stack several RBMs together into a DRBM following the structure in Figure 1b, and this process is continued until a prescribed number of hidden layers in the DRBM have been trained.
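The two contrastive divergence steps can be sketched for a single training example as follows. This is a simplified illustration under stated assumptions: unit standard deviation (σ = 1), mean-field activations instead of sampled states, and a single-example update without bias updates. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cd1_update(v, w, a, b, eta):
    """One contrastive divergence (CD-1) weight update for one example.

    v   : visible values in [0, 1]
    w   : V x H weights, w[i][j] links visible unit i and hidden unit j
    a   : hidden biases, b : visible biases
    eta : learning rate
    Assumes sigma = 1 and uses probabilities rather than sampled states.
    """
    V, H = len(v), len(a)
    # Step (1): hidden activations from the data.
    h = [sigmoid(a[j] + sum(v[i] * w[i][j] for i in range(V))) for j in range(H)]
    # Step (2): reconstruction v' (Gaussian mean), then hidden activations h'.
    v_rec = [b[i] + sum(h[j] * w[i][j] for j in range(H)) for i in range(V)]
    h_rec = [sigmoid(a[j] + sum(v_rec[i] * w[i][j] for i in range(V))) for j in range(H)]
    # Weight update: eta * (<v h>_data - <v' h'>_reconstruction).
    new_w = [[w[i][j] + eta * (v[i] * h[j] - v_rec[i] * h_rec[j]) for j in range(H)]
             for i in range(V)]
    return new_w, h


# Hypothetical toy RBM: 2 visible units, 1 hidden unit, zero initial parameters.
new_w, h = cd1_update([1.0, 0.0], [[0.0], [0.0]], a=[0.0], b=[0.0, 0.0], eta=0.1)
```

When stacking RBMs into a DRBM, the returned hidden activations `h` would serve as the visible data of the next RBM, in line with the layer-wise procedure described above.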

Stack Autoencoder Network
Training an SAE for regression is similar to training the DRBM [28]: (1) from the lower to the top layers (layer 1 to layer l), perform generative unsupervised learning layer-wise on the autoencoder (AE) (Figure 3); (2) from the top to the lower layers (layer l to layer 1), fine-tune by a supervised learning method (back propagation algorithm) to tweak the parameter sets (w, b); and (3) from the hidden (top) layer to the output layer, perform regression using the pre-trained parameter sets (w, b).
According to Figure 3, the AE model is briefly described as follows [29]. The purpose of the AE is to reconstruct the inputs $h^{l-1}$ ($h^0 = Y$) into new representations r with a minimum reconstruction error. To solve this problem, the encoder $f_e(\cdot)$ and decoder $f_d(\cdot)$ functions are operated step-by-step until they achieve the optimal parameter sets (w, b) based on a minimal loss function (Equation (11)).
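As an illustration of the encode, decode, and reconstruction-error steps, a minimal single-AE sketch is shown below. The logistic encoder and decoder, the tied weights (the decoder reuses the encoder weights), the squared-error loss, and all names are assumptions made for this sketch; the exact forms in [29] may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ae_reconstruct(x, w, b_enc, b_dec):
    """Encode x into a hidden code, then decode it into a reconstruction r.

    w[i][j] links input i and hidden node j; tied weights are assumed,
    so the decoder uses the same matrix in the opposite direction.
    """
    n_in, n_hid = len(x), len(b_enc)
    code = [sigmoid(b_enc[j] + sum(w[i][j] * x[i] for i in range(n_in)))
            for j in range(n_hid)]
    r = [sigmoid(b_dec[i] + sum(w[i][j] * code[j] for j in range(n_hid)))
         for i in range(n_in)]
    return code, r


def reconstruction_error(x, r):
    """Half the squared Euclidean distance between input and reconstruction."""
    return 0.5 * sum((xi - ri) ** 2 for xi, ri in zip(x, r))


# Hypothetical toy AE: 2 inputs, 1 hidden node, zero parameters.
code, r = ae_reconstruct([0.5, 0.5], [[0.0], [0.0]], [0.0], [0.0, 0.0])
err = reconstruction_error([0.5, 0.5], r)
```

In a stacked arrangement, the `code` of one trained AE would become the input of the next, before the supervised fine-tuning step.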

Dataset
The data are collected from a competition on manufacturing quality control in the Tianchi Data Lab of China (https://tianchi.aliyun.com/competition/gameList.htm). All batches share the same technique parameters (19 process parameters as shown in Table 1) with different settings; thus the quality index (one key quality index with range [0, 1] as shown in Figure 4) varies across batches. There are two kinds of samples: a small sample including 100 batches (Case 1, total sample (19 + 1) × 100, as shown in Figure 4a), and a big sample including 1000 batches (Case 2, total sample (19 + 1) × 1000, Figure 4b). These data are divided into two parts, 80% for training and 20% for testing. Note that all the data have been desensitized.
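The 80%/20% division can be sketched as below. The text does not state how batches were assigned to each part, so the seeded random shuffle and the function name `split_batches` are assumptions for illustration only.

```python
import random


def split_batches(batches, train_fraction=0.8, seed=0):
    """Split a list of batches into training and testing parts.

    The shuffle is an assumption; a chronological split would simply
    skip the shuffling step.
    """
    idx = list(range(len(batches)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(batches)))
    train = [batches[i] for i in idx[:cut]]
    test = [batches[i] for i in idx[cut:]]
    return train, test


# Hypothetical placeholders for the 100-batch and 1000-batch samples.
train_small, test_small = split_batches(list(range(100)))
train_big, test_big = split_batches(list(range(1000)))
```

With this split, the small sample yields 80 training and 20 testing batches, and the big sample yields 800 and 200.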

Model Development
In this subsection, the investigated models are developed using the real manufacturing data. Note that all of the data are first normalized into [0, 1] according to the following equation:
$$\text{Normalization} = \frac{data - data_{min}}{data_{max} - data_{min}} \qquad (12)$$
where $data_{min}$ and $data_{max}$ denote the minimum and maximum of each parameter in the dataset shown in Table 1. Then, the experimental method is applied to establish the four models, and the details are listed in Table 2. The optimal model with the simplest structure is identified based on the paired t-test results [30], except for the LSSVM (it has no hidden layers). For convenience, the models of the DRBM and the SAE are named with a sequence number (18 models in total), e.g., 1 (l = 2, hidden nodes = 10), 2 (l = 2, hidden nodes = 20), 6 (l = 2, hidden nodes = 60), 7 (l = 3, hidden nodes = 10), 12 (l = 3, hidden nodes = 60), 13 (l = 4, hidden nodes = 10), and 18 (l = 4, hidden nodes = 60). All of the results in the following experiments are the best values of ten independent runs. In addition, the computation software is Matlab 2014, and the computation environment is an Intel Core i5-2450M CPU @ 2.50 GHz with 4.00 GB of memory.
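Equation (12) is applied to each parameter (column) separately. A short sketch follows, where the function name and the handling of a constant column (mapped to 0.0 to avoid division by zero) are assumptions.

```python
def min_max_normalize(rows):
    """Scale every column of a row-major table into [0, 1] (Equation (12))."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


# Hypothetical table: three batches, two process parameters.
scaled = min_max_normalize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

In practice the minimum and maximum would typically be taken from the training data and reused for the testing data; the text does not specify this detail.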

Performance Criteria
Three criteria, mean absolute percentage error (MAPE), root-mean-square error (RMSE), and threshold statistics (TS), are employed to assess the forecasting performance. The definitions of the three criteria are as follows:
$$MAPE = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{ob_i - pr_i}{ob_i}\right| \times 100\%, \qquad RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(ob_i - pr_i\right)^2}, \qquad TS_a = \frac{n_a}{N} \times 100\% \qquad (13)$$
where N is the length of the prediction, $ob_i$ and $pr_i$ represent the i-th observation and prediction, respectively, and $n_a$ is the number of predictions with a relative error of less than a%.
In this paper, $TS_a$ is calculated for three levels: 1%, 5%, and 10%. Moreover, a Pearson correlation analysis [31] is employed to evaluate the degree of correlation between the observations and the predictions.
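The criteria can be computed directly from paired observations and predictions. The sketch below uses hypothetical function names and toy values; note that MAPE and TS are undefined when an observation equals zero, which this sketch does not guard against.

```python
import math


def mape(ob, pr):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(ob, pr)) / len(ob)


def rmse(ob, pr):
    """Root-mean-square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(ob, pr)) / len(ob))


def threshold_statistic(ob, pr, a):
    """Percentage of predictions whose relative error is below a percent."""
    n_a = sum(1 for o, p in zip(ob, pr) if abs((o - p) / o) * 100.0 < a)
    return 100.0 * n_a / len(ob)


def pearson(ob, pr):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(ob)
    mo, mp = sum(ob) / n, sum(pr) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(ob, pr))
    so = math.sqrt(sum((o - mo) ** 2 for o in ob))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pr))
    return cov / (so * sp)


# Hypothetical toy values, not data from the paper.
ob = [1.0, 2.0, 4.0, 5.0]
pr = [1.0, 1.0, 5.0, 5.0]
print(mape(ob, pr), rmse(ob, pr), threshold_statistic(ob, pr, 10))
```

For the toy values, the relative errors are 0%, 50%, 25%, and 0%, so MAPE is 18.75% and TS at the 10% level is 50%.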

FFNN Results
Figure 5 plots the MAPE of the FFNN with different numbers of hidden nodes for the two cases. As shown in Figure 5, the numbers of hidden nodes with the lowest MAPE are 10 (Case 1) and 4 (Case 2); these models are regarded as the control models in the multiple comparison procedure [31]. By carrying out the paired t-test, one can choose the simplest model structure that is not significantly different from the control model so as to obtain better generalization ability. Table 3 gives the results of the paired t-test at the 5% significance level. Note that the models in Table 3 are labeled by their number of hidden nodes. From Table 3, one can find that for Case 1, the models with 11-15 hidden nodes are not significantly different from the control model (Significance > 0.05), and those with 4-9 hidden nodes are significantly different from the control model (Significance < 0.05). Therefore, the model with 10 hidden nodes is selected as the optimal model in this paper. The training time is 2.92 s.
For Case 2, the models with 4-5 hidden nodes are not significantly different from the control model, and the models with 6-15 hidden nodes are significantly different from it. The model with four hidden nodes is selected as the optimal model in this paper. The training time is 4.04 s. Figure 6 shows the prediction results using the optimal FFNN for the two cases.
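The paired t-test used for model selection compares two models on matched measurements. The sketch below computes only the t statistic and the degrees of freedom; which quantities were paired in the paper is not stated, so pairing the errors of two models over the same runs is an assumption, and the numbers shown are made up.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = statistics.mean(d)
    sd_d = statistics.stdev(d)  # sample standard deviation of the differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical errors of a candidate model and the control model on four runs.
t, df = paired_t_statistic([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
```

The statistic is then compared with the critical value of the t distribution at the chosen level; for ten paired runs (9 degrees of freedom) and a two-sided 5% level, that critical value is about 2.262.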

LSSVM Results
Figure 7 plots the prediction results of the LSSVM optimized by 10-fold cross-validation for the two cases. The training times of the two cases are 5.65 min and 12.22 min, respectively.

DRBM Results
Figure 8 plots the MAPE of the DRBM with different hidden structures for the two cases. According to Figure 8, model numbers 7 (Case 1) and 8 (Case 2) have the lowest MAPE; thus the hidden structures 10-10-10 and 20-20-20 are chosen as the control models for the paired t-test. Table 4 gives the results of the paired t-test at the 5% significance level. As shown in Table 4, for Case 1, the control model is significantly different from models 1-6, 9-10, 14, and 16; hence model 7 has the simplest structure. The training time is 3.08 s. For Case 2, the control model is not significantly different from models 12 and 13; hence model 8 has the simplest structure. The training time is 8.28 s. Figure 9 shows the prediction results using the optimal DRBM for the two cases.

SAE Results
Figure 10 plots the MAPE of the SAE with different hidden structures for the two cases. According to Figure 10, model numbers 9 (Case 1) and 2 (Case 2) have the lowest MAPE; thus the hidden structures 30-30-30 and 20-20 are chosen as the control models for the paired t-test. Table 5 gives the results of the paired t-test at the 5% significance level. As shown in Table 5, the control models (model 9 for Case 1 and model 2 for Case 2) have the simplest structure following the aforementioned selection principle. The training times of the two cases are 6.19 s and 12.08 s, respectively. Figure 11 shows the prediction results using the optimal SAE for the two cases.

Comparison Studies
As shown in Figures 6, 7, 9, and 11, one can find that: (1) the performances of the four models have clear differences, illustrating that the results depend not on the multi-parameter inputs themselves but on the features learned from the inputs by the different learning patterns; (2) the predictions using the deep learning technique have smaller fluctuations than those using the shallow learning technique, illustrating that the parameters have little impact on the deep learning framework; and (3) all four models fail at the peak values, demonstrating that both shallow and deep learning have insufficient ability to learn peak information. To compare the models' performance quantitatively, a residual analysis and a statistical analysis are presented below.
The residual analysis is plotted in Figure 12. From Figure 12a, one can find that the range of the residual errors is [−0.2, 0.2] for both cases, and there is 1 prediction outlier (accounting for 5%) in Case 1 and 15 outliers (accounting for 7.5%) in Case 2, marked by triangles because the interval around the residual error does not contain zero. This implies that these residual errors, caused by poor fitting, fall outside the 95% confidence interval. As shown in Figure 12b, the ranges of the residual errors are [−0.2, 0.15] and [−0.15, 0.2], respectively, and there is 1 prediction outlier (accounting for 5%) in Case 1 and 16 outliers (accounting for 8%) in Case 2. From Figure 12c, one can find that the ranges of the residual errors are [−0.15, 0.1] and [−0.1, 0.2], respectively, and there are 2 prediction outliers (accounting for 10%) in Case 1 and 11 outliers (accounting for 5.5%) in Case 2. As shown in Figure 12d, the ranges of the residual errors are [−0.15, 0.1] and [−0.1, 0.15], respectively, and there are 2 prediction outliers (accounting for 10%) in Case 1 and 12 outliers (accounting for 6%) in Case 2. Compared with the shallow learning architecture, the deep learning framework has smaller error fluctuations in the two cases, illustrating that deep learning performs better over the entire testing dataset. However, the behavior of the prediction outliers differs: shallow learning is better than deep learning for small samples (Case 1) in terms of the number of outliers, and deep learning is better than shallow learning for big samples (Case 2). This phenomenon can be attributed to the sample size, demonstrating that the feature learning ability of the deep technique is closely related to the sample size. That is, the bigger the sample size is, the better the performance is.
The evaluation criteria are summarized in Table 6. Note that PCC refers to the Pearson correlation coefficient, and the labels ** and * denote significant correlation at the 0.01 and 0.05 levels, respectively. As shown in Table 6, the statistical indexes of the two case applications demonstrate the following. First, in terms of the lowest MAPE and RMSE, the deep framework (DRBM and SAE) has a strong capacity for capturing the features of the manufacturing parameters and the quality, whereas the shallow architecture (FFNN and LSSVM) has a weaker capacity for feature learning and regression. Second, in terms of the highest TS, the error distributions of the deep framework are concentrated in the range of less than 5% (accounting for 90%) and 10% (accounting for 100%) for Case 1, and 5% (accounting for 92%, 92.5%) and 10% (accounting for 99.5%, 100%) for Case 2. However, the shallow architectures perform well in TS 1 and worse in TS 5 and TS 10 compared with deep learning. Third, in terms of the PCC, the degree of correlation is higher using the deep framework (which passed the correlation test at the 0.01 (SAE) and 0.05 (DRBM) levels) than using the shallow architecture.

Additionally, although deep learning outperforms shallow learning according to Table 6, the network complexity and computing burden increase. Therefore, the paired t-test is also applied to evaluate whether the differences between the models are significant, and thus to investigate the feasibility of the deep framework. Table 7 gives the significant differences of the four models at the 5% level. As shown in Table 7, shallow learning is significantly different from deep learning, whereas the models within each set (the FFNN and the LSSVM, and the DRBM and the SAE) have no significant difference. Therefore, the deep framework can be regarded as an effective approach for multi-parameter manufacturing quality prediction.

In conclusion, according to the qualitative and quantitative analyses, deep feature learning is beneficial for exploring sophisticated relationships between multiple manufacturing parameters and quality, and it displays better prediction capacity for manufacturing quality. Moreover, sample size is a vital factor affecting the deep framework's performance.
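The criteria and the paired t-test discussed above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual textbook definitions of MAPE, RMSE, and PCC, takes the threshold statistic TS at level x to be the percentage of samples whose absolute relative error is below x%, and assumes the paired t-test is run on per-sample errors of two models. All function names are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def rmse(actual, predicted):
    """Root-mean-square error, in the units of the quality index."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def threshold_statistic(actual, predicted, level):
    """Percentage of samples whose absolute relative error is below `level` percent.

    Assumed definition: TS_x = 100 * n_x / n, with n_x counted using a strict '<'.
    """
    hits = sum(
        1 for a, p in zip(actual, predicted) if 100.0 * abs((a - p) / a) < level
    )
    return 100.0 * hits / len(actual)


def pcc(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def paired_t_statistic(errors_a, errors_b):
    """t statistic and degrees of freedom of a paired t-test on two error series.

    Assumption: both series hold per-sample errors of two models on the same
    test batches. The p-value is not computed here.
    """
    diffs = [x - y for x, y in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n), n - 1
```

To turn the t statistic into a significance value at the 5% level, it has to be compared with the Student t distribution for the returned degrees of freedom. A statistics package such as SciPy (`scipy.stats.ttest_rel`) performs both steps.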

Conclusions
The capability of shallow and deep learning to predict manufacturing quality is tested and compared in this paper. The candidates include the FFNN with one hidden layer, the LSSVM with no hidden layers, the DRBM, and the SAE. For this purpose, the trial and error method is adopted to select the optimal model with the simplest structure (except for the LSSVM), as specified by the paired t-test results. Two cases, i.e., small samples (100 batches) and big samples (1000 batches), are investigated. The comparison of the model results shows that: (1) the deep framework consisting of two or three hidden layers performs better than the shallow architectures in terms of the MAPE, the RMSE, the TS, and the PCC criteria; (2) the performance of the deep framework depends on the sample size in terms of the number of prediction outliers, i.e., the bigger the sample size is, the better the performance is; and (3) the deep framework and the shallow architecture are significantly different statistically. Based on these findings, the deep learning techniques considered can be successfully applied to establish accurate manufacturing quality prediction models, especially for big data. In a future study, the authors will focus on the popularization and application of the deep learning techniques in other manufacturing enterprises.

Figure 1. Different structure schematic diagrams for feature learning: (a) shallow learning framework and (b) deep learning framework.

Figure 4. Manufacturing quality of different batches: (a) small samples with 100 batches and (b) big samples with 1000 batches.

Figure 5. Experimental results using the FFNN with different hidden nodes. MAPE: mean absolute percentage error.

Figure 8 plots the MAPE of the DRBM with different hidden structures for the two cases. According to Figure 8, model numbers 7 (Case 1) and 8 (Case 2) have the lowest

Figure 8. Experimental results using the DRBM with different hidden structures.

Figure 10. Experimental results using the SAE with different hidden structures.

Table 1. Statistical information of the multiple parameters in different processes.

Table 2. Experimental design of each approach.

are remarked as the hidden nodes.

Table 3. Paired t-test results of the FFNN.

model is not significantly different from models 12 and 13; hence, model 8 has the simplest structure. The training time is 8.28 s.

Table 4. Paired t-test results of the DRBM.
Columns: Sample; Control Model; Paired Model; Significance (Asymptotic); Paired Model; Significance (Asymptotic).

Table 5. Paired t-test results of the SAE.

Table 6. Comparison of the prediction performances using different models.

Table 7. Paired t-test results between each model.