Determination of the Optimal Neural Network Transfer Function for Response Surface Methodology and Robust Design

Abstract: Response surface methodology (RSM) has been widely recognized as an essential estimation tool in many robust design studies investigating the second-order polynomial functional relationship between the responses of interest and their associated input variables. However, there is scope for improvement in the flexibility of estimation models and the accuracy of their results. Although many neural network (NN)-based estimation and optimization approaches have been reported in the literature, a closed functional form is not readily available. To address this limitation, a maximum-likelihood estimation approach for an NN-based response function estimation (NRFE) is used to obtain the functional forms of the process mean and standard deviation. While the estimation results of most existing NN-based approaches depend primarily on their transfer functions, such approaches often require a screening procedure over various transfer functions. In this study, the proposed NRFE introduces a new screening procedure that obtains the best transfer function in an NN structure from a desirability function family while determining its associated weight parameters. A statistical simulation was performed to evaluate the efficiency of the proposed NRFE method; in this particular simulation, the proposed NRFE method provided significantly better results than conventional RSM. Finally, a numerical example is used to validate the proposed method.


Introduction
Among the many quality engineering methodologies, robust design (RD), based on statistical design, analysis, and optimization methods, has contributed significantly to the improvement of product/process quality for more than 20 years. The main objective of RD is to identify the optimal factor settings that can minimize both variability and bias (namely, the deviation between the process mean and the desired target value) in a process/product. To demonstrate this RD principle, Taguchi introduced a two-step model based on new design and analysis approaches, particularly orthogonal arrays and signal-to-noise ratios [1]. However, orthogonal arrays, statistical analysis, and signal-to-noise ratios are highly controversial [2][3][4][5]. Consequently, Vining and Myers [6] improved Taguchi's model by proposing dual-response (DR) estimation and optimization approaches, in which the separate functions of the mean and variance responses are estimated using response surface methodology (RSM) based on the least-squares method (LSM). Their proposed optimization approach can calculate robust factor settings by minimizing process variability while maintaining the process mean at the target value. As a result, the primary procedure of RD comprises three sequential steps: design of experiment (DoE), response function estimation, and optimization. However, Lin and Tu [7] pointed out that process bias and variability are not considered simultaneously in the DR model and proposed a mean square error (MSE) model as an alternative approach providing more model flexibility than the DR approach. These DR and MSE models were further extended by Del Castillo and Montgomery [8], Copeland and Nelson [9], Cho et al. [10], Ames et al. [11], Borror [12], Kim and Lin [13], Koksoy and Doganaksoy [14], Ding et al. [15], Shin and Cho [16][17][18], Robinson et al. [19], Fogliatto [20], Goethals and Cho [21], Truong and Shin [22,23], Nha et al. [24], Baba et al. [25], and Yanıkoğlu et al. [26].
In the response function estimation step of the RD procedure, RSM is used extensively to obtain the estimated function associated with an output response and its associated input factors based on the DoE results. RSM is applied extensively in real-world industrial situations, mainly where several input factors potentially affect a product's or process's performance measures or quality characteristics [27]. The relationship between several input factors and one or more associated output responses can be estimated using RSM. Conventionally, the LSM can estimate the unknown coefficients of a regression model under several specific error assumptions; for example, the errors in the output responses are assumed to be independently and identically normally distributed with zero mean and constant variance. However, these assumptions are often violated in practical problems. Several alternative approaches, such as transformation, weighted least squares (WLS), maximum-likelihood estimation (MLE), Bayesian methods, and inverse problems, can be used to address this issue. For example, the WLS method has been applied to RSM to estimate the model parameters for unbalanced data [28,29], and Lee and Park [30] and Cho et al. [31] have integrated an MLE method and an expectation-maximization algorithm for unknown parameter estimation using incomplete data. In addition, Goegebeur et al. [32] and Chen and Ye [33,34] proposed a Bayesian approach to estimate the coefficients of response functions where the variances follow a log-normal distribution. Truong and Shin [22,23] proposed a new estimation method that used an inverse problem approach based on Bayesian perspectives. Most of the existing estimation methods in the RSM and RD literature attempt to generate functional forms in the input factors for the output responses (namely, the process mean, process variance, and quality characteristic). However, relaxing the error assumptions and increasing the estimation precision would result in further improvements.
Artificial neural networks (ANNs), or neural networks (NNs), with nonlinear mapping structures inspired by the function of the human brain, have been widely used in recent decades as powerful tools in data classification, forecasting, clustering, function approximation, and optimization. Irie and Miyake [35], Hornik et al. [36], Cybenko [37], and Funahashi [38] addressed the question of approximation using feed-forward NNs and proved that NNs are universal function approximators with the desired accuracy. ANNs belong to a class of self-adaptive, data-driven techniques, and therefore the undefined relationships between the input factors and outputs of a process/product can be ascertained effectively. ANNs can provide a linear or nonlinear relationship between inputs and several outputs, without any assumptions, based on the generalizing capacity of the activation function. Consequently, the functional relationships between input factors and their related output responses in an RD procedure can be estimated effectively without any specific assumptions. Most of the literature on NN function approximation has concentrated on network structure, hidden layers, hidden neurons, and training algorithms. By contrast, fewer studies focus on the transfer (or activation) function, which can strongly affect the complexity and performance of NNs. In addition, most of the transfer functions reported in the literature provide only a single, fixed configuration for transferring information from the inputs to the associated outputs of a neuron.
The desirability function (DF) approach is one of the most commonly utilized techniques for multiple-response optimization problems, in which each quality characteristic is converted into an individual DF that takes a value ranging from 0 to 1. The DF concept was introduced by Harrington [39] for the simultaneous optimization of multiple-response problems in industry. This approach can be used to determine the optimal factor settings of the input variables that result in the most desirable output values while keeping every response value within its specified boundaries. Derringer and Suich [40] implemented a further extension of Harrington's DF. Depending on whether the response is to be minimized, maximized, or kept at a desired target value, there are three associated DFs: STB (smaller is better), LTB (larger is better), and NTB (nominal is better). Through their weight parameters, the DF family can provide a flexible transfer function configuration.
Rowlands et al. [41] integrated an NN approach into RD and used the NN to perform the DoE. Su and Hsieh [42], Cook et al. [43], Chow et al. [44], Chang [45], and Chang and Chen [46] combined an NN approach as an estimation method with a genetic algorithm to determine the optimal process parameter settings or the optimal costs in both static and dynamic characteristics in Taguchi's formula without considering the process mean and variance functions. Recently, Arungpadang and Kim [47] presented a feed-forward NN structure based on RSM in the RD concept.
In the robust design (RD) concept, the process mean and standard deviation of an output response in experimental results can be estimated as functions by many different statistical approaches (e.g., LSM and MLE). These process mean and standard deviation functions can then be optimized simultaneously with respect to their desired targets using existing optimization methods. The main objective of this paper is to propose an alternative estimation method for the DR functions (i.e., the mean and standard deviation functions) by integrating NN approaches into RSM and RD modeling, and then to compare this method with the conventional LSM-based estimation method. First, a feed-forward NN is integrated into the RD procedure as a new DR estimation approach to estimate the process mean and standard deviation response functions. In order to develop the new NN-based estimation method, many different types of transfer functions, which can affect the quality of the estimation results, should be considered.
For this reason, DFs can be utilized as a transfer function family because DFs can represent all three different types (i.e., L-type, S-type, and N-type) of quality characteristics. Among these DF-based transfer functions, the best transfer function for the experimental data is then investigated using the identified learning and validation approaches. This best-transfer-function determination procedure for a given DF-based NN structure is derived for the proposed DR estimation method. DFs with their associated weight parameters can provide a variety of transfer function families. In addition, in the RD methodology, the process mean is often kept at the target value, and the process standard deviation is minimized as far as possible. Therefore, the "nominal is best" and "smaller is better" DFs are proposed as new NN transfer functions. Next, the best transfer functions for the process mean response function and the standard deviation response function are determined based on the optimal weight parameters of the proposed transfer functions using MLE. The associated confidence intervals for the optimal weight parameters are also determined. The results of a case study are presented to support the view that better solutions are obtained from the proposed feed-forward NN-based DR estimation method than from the conventional LSM method and the feed-forward NN estimation approach using the conventional log-sigmoid transfer function. An overview of the proposed NN-based DR function estimation method is presented in Figure 1.

RSM

RSM, developed by Box and Wilson [48], consists of mathematical and statistical techniques based on the fitting of empirical models obtained from an experimental design. RSM can estimate the empirical functional relationship between the input factors and their related output responses. The basic theory, estimation methods, and analytical techniques of RSM can be found in the study by Myers and Montgomery [27]. Myers [49] and Khuri and Mukhopadhyay [50] provide insights into the various developmental stages and future directions of RSM. In general, the output response can be identified as a function of the input factors as follows:

y(\mathbf{x}) = f(\mathbf{x}, \boldsymbol{\beta}) + \varepsilon,

where \mathbf{x}, \boldsymbol{\beta}, and \varepsilon denote the vector of control factors, the column vector of model parameters, and the random error, respectively. The estimated second-order models for the process mean and standard deviation functions are represented as follows:

\hat{\mu}(\mathbf{x}) = \hat{\beta}_{\mu 0} + \sum_{i=1}^{k} \hat{\beta}_{\mu i} x_i + \sum_{i=1}^{k} \hat{\beta}_{\mu ii} x_i^2 + \sum_{i<j} \hat{\beta}_{\mu ij} x_i x_j,

\hat{\sigma}(\mathbf{x}) = \hat{\beta}_{\sigma 0} + \sum_{i=1}^{k} \hat{\beta}_{\sigma i} x_i + \sum_{i=1}^{k} \hat{\beta}_{\sigma ii} x_i^2 + \sum_{i<j} \hat{\beta}_{\sigma ij} x_i x_j,

where \hat{\boldsymbol{\beta}}_{\mu} and \hat{\boldsymbol{\beta}}_{\sigma} are the estimators of the unknown parameters in the process mean and standard deviation functions, respectively. These coefficients are estimated using the conventional LSM as follows:

\hat{\boldsymbol{\beta}}_{\mu} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\bar{\mathbf{y}}, \qquad \hat{\boldsymbol{\beta}}_{\sigma} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{s},

where \bar{\mathbf{y}}, \mathbf{s}, and \mathbf{X} are the vector of observation averages, the vector of standard deviations of the replicated responses, and the design matrix, respectively.
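To make the LSM step concrete, the normal-equation estimates of the second-order coefficients can be reproduced numerically. The following Python sketch uses a hypothetical 3^2 design and made-up response averages (the paper's own computations use MATLAB; all data here are illustrative only):

```python
import numpy as np

# Hypothetical 3^2 full factorial design for two control factors (coded -1, 0, 1)
levels = [-1, 0, 1]
x1g, x2g = np.meshgrid(levels, levels)
x1, x2 = x1g.ravel(), x2g.ravel()

# Second-order model matrix: columns 1, x1, x2, x1^2, x2^2, x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2]).astype(float)

# Hypothetical per-run averages of the replicated responses (y-bar)
y_bar = np.array([12.1, 10.3, 11.8, 9.7, 8.9, 9.5, 11.2, 10.0, 11.6])

# Ordinary least squares via the normal equations: beta_hat = (X'X)^{-1} X' y_bar
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_bar)

# Predicted mean response at a hypothetical setting (x1, x2) = (0.5, -0.5)
x_new = np.array([1.0, 0.5, -0.5, 0.25, 0.25, -0.25])
y_pred = x_new @ beta_hat
```

The same call with the vector of per-run standard deviations in place of `y_bar` yields the coefficients of the standard deviation function.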

Feed-Forward NN Structure
NNs comprise a broad class of flexible nonlinear regression, discriminant, and data-reduction models, as well as nonlinear dynamical systems [51]. An NN consists of simple, highly interconnected computational units called artificial neurons. Zainuddin and Pauline [52] demonstrated that feed-forward NNs, which pass information from the input to the output, are the most popular NNs for function approximation. In addition, Cybenko [37], Hornik et al. [36], Funahashi [38], and Hartman et al. [53] demonstrated that a feed-forward NN with one hidden layer can approximate a continuous, arbitrary nonlinear, multidimensional function. Both the process mean and the standard deviation can be treated as output response characteristics of interest in the RD methodology. The proposed feed-forward NN-based method for estimating the functional relationship between the input factors and output responses in RD is depicted in Figure 2. A vector comprising control factors x_1, …, x_i, …, x_k, named the input layer, is the input to each hidden neuron in the hidden layer. At each hidden neuron, the weighted sum of the inputs plus a bias is transformed by the transfer function to generate the output of that hidden neuron. The outputs of the hidden neurons are used as inputs to the neuron in the output layer. Assume that there are h hidden nodes indexed j = 1, …, h, and let w_{ij}, v_j, b_j, and b_o denote the weight connecting input factor i to hidden node j, the weight connecting hidden node j to the final output, the bias at hidden node j, and the bias at the output, respectively. Then, the general form of the response function obtained using the proposed NN-based estimation method is as follows:

\hat{y}(\mathbf{x}) = b_o + \sum_{j=1}^{h} v_j \, f\!\left(b_j + \sum_{i=1}^{k} w_{ij} x_i\right).
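The forward pass of such a single-hidden-layer network can be sketched in a few lines of Python. This is a hedged illustration only: the transfer function `f` and all weights are placeholders, not the paper's fitted values.

```python
import numpy as np

def forward(x, W, b_hidden, v, b_out, f=np.tanh):
    """Single-hidden-layer feed-forward response.

    x:        (k,) input factor settings
    W:        (h, k) weights connecting inputs to hidden nodes
    b_hidden: (h,) biases at the hidden nodes
    v:        (h,) weights connecting hidden nodes to the output
    b_out:    scalar bias at the output neuron
    f:        hidden-layer transfer function (tanh here as a stand-in)
    """
    hidden = f(W @ x + b_hidden)   # hidden-neuron outputs
    return b_out + v @ hidden      # linear output layer
```

With zero hidden weights and biases, tanh maps everything to 0, so the response reduces to the output bias; with an identity transfer function, the network collapses to a linear model.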

Back-Propagation Learning Algorithm
Among the learning algorithms reported in the literature for performing NN estimation, the back-propagation algorithm is commonly used for training a network owing to its simplicity and applicability [54]. This algorithm comprises two sequential steps: first, the weights of the NN are initialized randomly; then, the output of the NN is computed and compared with the actual output. The error at the output layer, that is, the network error, is calculated by comparing the computed output with the desired value in order to iteratively adjust the weights of the hidden and output layers in accordance with the following function:

E = \frac{1}{2}\sum_{p}\left(d_p - o_p\right)^2,

where d_p and o_p denote the desired and computed outputs for training pattern p, respectively. An epoch is a hyperparameter that defines the number of times the entire set of training vectors is used to update the weights. During the training process, the error is minimized. The iterative step of the gradient descent algorithm changes the weights based on the following relationship:

\Delta w = -\eta \, \frac{\partial E}{\partial w}.

The parameter \eta > 0 is called the learning rate. There are numerous variations of the basic algorithm based on other optimization techniques (namely, conjugate gradient and Newton methods).
The Levenberg-Marquardt algorithm (trainlm) approximates Newton's method; it is designed to approach second-order training speed without computing the full Hessian matrix.
The resilient back-propagation training algorithm (trainrp) is a local adaptive learning technique, which performs supervised batch learning in a feed-forward NN. Trainrp removes the negative impact of the dimension of the partial derivative. Therefore, only the sign of the derivative implies the direction of a weight update process.
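As an illustration of the gradient-descent weight update described above, the following Python sketch performs one back-propagation step for a single-hidden-layer tanh network on one hypothetical training sample. It is a simplified stand-in for trainlm/trainrp, with made-up data and initial weights:

```python
import numpy as np

def train_step(x, target, W, b_h, v, b_o, lr=0.05):
    """One gradient-descent update for E = 0.5*(y - target)^2 (sketch)."""
    z = W @ x + b_h
    h = np.tanh(z)
    y = b_o + v @ h
    err = y - target                      # dE/dy
    dv = err * h                          # gradient w.r.t. hidden-to-output weights
    db_o = err                            # gradient w.r.t. output bias
    dh = err * v * (1.0 - h**2)           # back-propagate through tanh
    dW = np.outer(dh, x)                  # gradient w.r.t. input-to-hidden weights
    db_h = dh                             # gradient w.r.t. hidden biases
    return W - lr * dW, b_h - lr * db_h, v - lr * dv, b_o - lr * db_o

# Hypothetical single-sample demonstration
x, target = np.array([0.5, -0.2]), 1.0
rng = np.random.default_rng(0)
W, b_h = 0.5 * rng.normal(size=(3, 2)), np.zeros(3)
v, b_o = 0.5 * rng.normal(size=3), 0.0
err_before = abs(b_o + v @ np.tanh(W @ x + b_h) - target)
for _ in range(500):
    W, b_h, v, b_o = train_step(x, target, W, b_h, v, b_o)
err_after = abs(b_o + v @ np.tanh(W @ x + b_h) - target)
```

Repeated updates drive the squared error toward zero on this single sample, which is the essence of the epoch-by-epoch training loop.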

DF
The DF scales the possible response (y) value to the range [0, 1], where 0 and 1 represent completely undesirable and completely desirable values, respectively. Among the three DFs proposed by Derringer and Suich [40] for the three situations of a particular response, the "nominal is best" and "smaller is better" DFs are utilized as new transfer functions in the hidden NN layers in order to estimate the process mean and standard deviation functions. Assume that there are n runs; the summations at each hidden neuron then form a vector of values y_1, …, y_i, …, y_n, where the input of a hidden neuron at each run is calculated as the weighted sum of the inputs plus the bias. Therefore, the associated DF outputs can be represented as d_1, …, d_i, …, d_n. When a response belongs to the "nominal is best" case, the individual DF is defined as

d(y) = \begin{cases} \left(\dfrac{y - L}{T - L}\right)^{s}, & L \le y \le T, \\ \left(\dfrac{y - U}{T - U}\right)^{t}, & T < y \le U, \\ 0, & \text{otherwise}, \end{cases} (8)

When a response needs to be minimized, the individual DF is

d(y) = \begin{cases} 1, & y \le T, \\ \left(\dfrac{y - U}{T - U}\right)^{r}, & T < y < U, \\ 0, & y \ge U, \end{cases} (9)

where L, T, and U denote the lower specification, target, and upper specification values, respectively. Superscripts s and t in Equation (8) and superscript r in Equation (9) represent the weight parameters of the "nominal is best" and "smaller is better" DFs, respectively. When the values of s, t, and r are set to 1, the DFs become linear; for s > 1, t > 1, and r > 1 the functions are convex, and for s < 1, t < 1, and r < 1 the functions become concave. The integration of the "nominal is best" and "smaller is better" DFs as new transfer functions in the hidden layers to estimate the process mean and standard deviation is depicted in Figure 3.
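The two DFs can be implemented directly. The following Python sketch mirrors Equations (8) and (9) with placeholder specification limits (L, T, U are illustrative values, not from the case study):

```python
import numpy as np

def d_ntb(y, L, T, U, s=1.0, t=1.0):
    """'Nominal is best' desirability in the style of Eq. (8)."""
    y = np.asarray(y, dtype=float)
    d = np.zeros_like(y)
    left = (y >= L) & (y <= T)
    right = (y > T) & (y <= U)
    d[left] = ((y[left] - L) / (T - L)) ** s
    d[right] = ((y[right] - U) / (T - U)) ** t
    return d

def d_stb(y, T, U, r=1.0):
    """'Smaller is better' desirability in the style of Eq. (9)."""
    y = np.asarray(y, dtype=float)
    return np.where(y <= T, 1.0,
                    np.where(y >= U, 0.0, ((y - U) / (T - U)) ** r))
```

With s = t = r = 1 both functions are piecewise linear, matching the linear case noted above; exponents other than 1 bend each ramp toward convex or concave shapes.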
where h, b_j, b_o, v_j, w_{ij}, d^{*}_{\mu}, and d^{*}_{\sigma} denote the number of hidden neurons, the bias at hidden node j, the bias at the output neuron, the weight connecting hidden node j to the output response, the weight connecting input factor i to hidden node j of the proposed feed-forward NN, and the optimal DFs for the process mean and standard deviation functions, respectively. In addition, the log-sigmoid and linear functions are usually used as the conventional transfer functions in the hidden and output layers, respectively. For an input n, the conventional log-sigmoid function and linear function are given as f(n) = 1/(1 + e^{-n}) and f(n) = n, respectively.

Similarly, the estimated process mean and standard deviation functions are given, respectively, as follows:

\hat{\mu}(\mathbf{x}) = b_{\mu o} + \sum_{j=1}^{h_{\mu}} v_{\mu j} \, f\!\left(b_{\mu j} + \sum_{i=1}^{k} w_{\mu ij} x_i\right), \qquad \hat{\sigma}(\mathbf{x}) = b_{\sigma o} + \sum_{j=1}^{h_{\sigma}} v_{\sigma j} \, f\!\left(b_{\sigma j} + \sum_{i=1}^{k} w_{\sigma ij} x_i\right),

where h_{\mu}, h_{\sigma}, b_{\mu j}, b_{\sigma j}, b_{\mu o}, b_{\sigma o}, v_{\mu j}, v_{\sigma j}, w_{\mu ij}, and w_{\sigma ij} denote the numbers of hidden neurons, the biases at hidden node j, the biases at the output neurons, the weights connecting hidden node j to the final output, and the weights connecting input factor i to hidden node j of the feed-forward NNs for the process mean and standard deviation functions, respectively.

Estimation of Weight Parameters for NN Transfer Functions
The DF can provide a flexible NN transfer function family based on various values of its weight parameters. Among the likelihood-based estimation methods, the MLE method is proposed to estimate the optimal weight parameters for the NN transfer functions. On the premise that the DF values follow a normal distribution N(\mu, \sigma^2), the distribution of the summation y can be inferred.

Estimation of Parameter s
Using Equation (8), the parametric family in this case can be given as d = g(y; s) with parameters (\mu, \sigma^2, s). Based on the principles of MLE, the parameters (\mu, \sigma^2, s) can be identified where the log-likelihood function \ell(\mu, \sigma^2, s) or the profile log-likelihood \ell_p(s) is highest. The likelihood in terms of the summation y is obtained by multiplying the normal probability density function of d by the Jacobian of the DF, J = \partial d / \partial y:

L(\mu, \sigma^2, s \mid \mathbf{y}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(d_i - \mu)^2}{2\sigma^2}\right) J_i, \qquad \text{where } J_i = \frac{\partial d_i}{\partial y_i}.

Therefore, the log-likelihood function is

\ell(\mu, \sigma^2, s) = -\frac{n}{2}\ln\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(d_i - \mu\right)^2 + \sum_{i=1}^{n}\ln J_i.

For a fixed value of s, obtaining the derivative of \ell with respect to \mu, setting this derivative equal to zero, and solving for \mu yields

\hat{\mu}(s) = \frac{1}{n}\sum_{i=1}^{n} d_i.

Further, obtaining the derivative of \ell with respect to \sigma^2, setting it equal to zero, and solving for \sigma^2 yields

\hat{\sigma}^2(s) = \frac{1}{n}\sum_{i=1}^{n}\left(d_i - \hat{\mu}(s)\right)^2.

Accordingly, the profile log-likelihood function can be represented as follows:

\ell_p(s) = -\frac{n}{2}\ln\!\left(2\pi\hat{\sigma}^2(s)\right) - \frac{n}{2} + \sum_{i=1}^{n}\ln J_i(s).

Optimal value \hat{s} is specified using the statistical simulation method.
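The profile-likelihood screening of a weight parameter can be sketched as a simple grid search. The example below uses the power transform d = u^s, mirroring the left branch of the NTB DF, applied to hypothetical scaled inputs u in (0, 1); it illustrates the mechanics (Jacobian term plus plugged-in MLEs of the mean and variance), not the paper's exact likelihood:

```python
import numpy as np

def profile_loglik(s, u):
    """Profile log-likelihood of exponent s for the transform d = u**s, u in (0, 1).

    A sketch under the assumption that the transformed values d are normal;
    the Jacobian of d = u**s is s * u**(s - 1).
    """
    d = u ** s
    mu_hat = d.mean()
    sig2_hat = ((d - mu_hat) ** 2).mean()
    n = len(u)
    log_jac = np.log(s) + (s - 1.0) * np.log(u)
    return -0.5 * n * np.log(2 * np.pi * sig2_hat) - 0.5 * n + log_jac.sum()

u = np.linspace(0.1, 0.9, 25)                 # hypothetical scaled hidden-neuron inputs
grid = np.linspace(0.2, 5.0, 97)              # candidate weight-parameter values
s_hat = grid[np.argmax([profile_loglik(s, u) for s in grid])]
```

The value `s_hat` plays the role of the optimal weight parameter selected by simulation.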

Estimation of Parameter t
Similarly, from Equation (8), the log-likelihood function for parameter t is obtained by using the branch of the DF, and its Jacobian, associated with t. The profile log-likelihood function used to estimate parameter t is defined analogously. Optimal value \hat{t} is also identified using the statistical simulation method.

Estimation of Parameter r
Using Equation (9), the log-likelihood function for parameter r is obtained in the same manner. The profile log-likelihood function can then be defined analogously. Optimal value \hat{r} is also determined using the statistical simulation method.

Confidence Intervals for Weight Parameters
The asymptotic distribution of the maximum-likelihood estimator is used to identify confidence intervals for the weight parameters because the distribution of a parameter estimated using MLE is asymptotically normal [55]. An approximate (1 - \alpha)100\% confidence interval for a parameter \theta can be given as \hat{\theta} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{\theta}), where z_{\alpha/2} refers to the 100(1 - \alpha/2) percentile of the standard normal distribution and \mathrm{SE}(\hat{\theta}) represents the estimated standard error of \hat{\theta}. Based on the estimates of the parameters, the standard errors can be defined as the square roots of the diagonal elements of the estimated covariance matrix. First, the distribution of \hat{\theta} is stated, followed by the derivation of the approximate asymptotic distribution of parameters s, t, or r.
The first derivative of the log-likelihood function \ell(\theta \mid \mathbf{y}) can be given as the score function,

U(\theta) = \frac{\partial \ell(\theta \mid \mathbf{y})}{\partial \theta}.

The observed Fisher information matrix is given as

J(\hat{\theta}) = -\left.\frac{\partial^2 \ell(\theta \mid \mathbf{y})}{\partial \theta \, \partial \theta^{\mathrm{T}}}\right|_{\theta = \hat{\theta}}.

Therefore, Fisher's total information matrix can then be defined as follows:

I_n(\theta) = \mathrm{E}\!\left[-\frac{\partial^2 \ell(\theta \mid \mathbf{y})}{\partial \theta \, \partial \theta^{\mathrm{T}}}\right].
The distribution of \hat{\theta} as n \to \infty converges to a normal distribution. In particular,

\sqrt{n}\,\left(\hat{\theta} - \theta\right) \xrightarrow{d} N\!\left(0,\; I(\theta)^{-1}\right),

where I(\theta) is Fisher's total information matrix [55]. I(\theta) is typically not known, but by substituting a consistent estimator for it in Equation (29), the same result can be obtained. Accordingly, \hat{\theta} - \theta \to 0 and, approximately, \hat{\theta} \sim N\!\left(\theta,\; I_n(\hat{\theta})^{-1}\right).
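A Wald-type interval based on the observed information can be sketched as follows. The example applies it to the textbook normal-sample case with simulated data (not to the DF weight parameters themselves), so all numbers are illustrative:

```python
import numpy as np

def wald_ci(theta_hat, observed_info, z=1.96):
    """Approximate 95% interval from the inverse observed Fisher information.

    z = 1.96 is the 97.5th percentile of the standard normal distribution.
    """
    se = np.sqrt(np.diag(np.linalg.inv(observed_info)))
    return theta_hat - z * se, theta_hat + z * se

# Hypothetical example: MLEs of (mu, sigma^2) for a simulated normal sample
rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=200)
n = x.size
mu_hat, s2_hat = x.mean(), x.var()
# Observed information for (mu, sigma^2) under the normal log-likelihood
info = np.diag([n / s2_hat, n / (2.0 * s2_hat**2)])
lo, hi = wald_ci(np.array([mu_hat, s2_hat]), info)
```

The same recipe applies to (\mu, \sigma^2, s): invert the observed information, take the square roots of the diagonal, and form \hat{\theta} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{\theta}).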

Confidence Intervals for Parameter s
The general parameter vector can be written as \theta = (\mu, \sigma^2, s)^{\mathrm{T}}. From Equation (27), the score U(\theta) can be obtained, and from the log-likelihood function in Equation (17), the observed information matrix J(\hat{\theta}) can be calculated as follows (see Appendix A.1). The square roots of the diagonal elements of J(\hat{\theta})^{-1} represent estimates of the standard errors of the estimators. The estimator for the standard error of \hat{s} is of particular interest and can be found using the inverses of the partitioned matrices. First, the partition is

J(\hat{\theta}) = \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}, (33)

where the block J_{22} corresponds to s. Next, one estimator for \operatorname{Var}(\hat{s}) can be obtained as follows:

\widehat{\operatorname{Var}}(\hat{s}) = \left(J_{22} - J_{21} J_{11}^{-1} J_{12}\right)^{-1}.

The associated standard error of \hat{s} can be calculated by

\mathrm{SE}(\hat{s}) = \sqrt{\widehat{\operatorname{Var}}(\hat{s})}.

Therefore, an approximate (1 - \alpha)100\% confidence interval for s can be obtained by using

\hat{s} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{s}).

Confidence Intervals for Parameter t
Similarly, a confidence interval for t can be obtained by using \hat{t} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{t}). The standard error of \hat{t} can be defined analogously (see Appendix A.2).

Confidence Intervals for Parameter r
Similarly, a confidence interval for r can be obtained by using \hat{r} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{r}). The standard error of \hat{r} can be defined analogously (see Appendix A.3).

Case Study
This case study was performed using the printing data example proposed by Vining and Myers [6] and Lin and Tu [7]. The impacts of three input factors, that is, speed (x_1), pressure (x_2), and distance (x_3), on the capability (y) of a printing machine that applies colored inks to package labels were investigated. Each input factor had three levels, resulting in a total of 27 runs, and three replications were performed for each factor-level combination.
In comparative studies between methods, the expected quality loss (EQL) is usually applied as a critical optimization criterion in RD. The expectation of the loss function is defined as

\mathrm{EQL}(\mathbf{x}) = k\left[\left(\hat{\mu}(\mathbf{x}) - \tau\right)^2 + \hat{\sigma}^2(\mathbf{x})\right],

where k represents a positive loss coefficient, and \hat{\mu}(\mathbf{x}), \tau, and \hat{\sigma}(\mathbf{x}) are the estimated process mean function, the desired target value, and the estimated standard deviation function, respectively. In this case study, the target value was 500. Using MATLAB, the estimated process mean and standard deviation functions were obtained using the LSM based on RSM. The structures, training algorithms, transfer functions, and numbers of epochs of the proposed feed-forward NNs used to estimate the process mean and standard deviation functions are summarized in Table 2; based on these settings, the optimal weight parameters \hat{s}, \hat{t}, and \hat{r} were calculated using Equations (20), (22), and (24), respectively. As shown in Table 3, twenty hidden neurons in the hidden layer are used to estimate the mean function, and twenty hidden neurons in the hidden layer are used to estimate the standard deviation function. For each hidden neuron, a desirability function is proposed as the transfer function. In addition, the weight parameters (i.e., s, t, and r) for the DFs identified in Equations (8) and (9), together with their respective 95% confidence intervals calculated using Equations (35), (38), and (40), are presented in Table 3. The weight and bias matrix values of the feed-forward NNs using the proposed transfer functions for the process mean and standard deviation functions are presented in Table 4.
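The EQL criterion itself is a one-line computation. The following Python sketch uses hypothetical values of the estimated mean and standard deviation at a candidate setting (the loss coefficient k defaults to 1):

```python
def eql(mu_hat, sigma_hat, target, k=1.0):
    """Expected quality loss: EQL = k * [(mu_hat - target)^2 + sigma_hat^2]."""
    return k * ((mu_hat - target) ** 2 + sigma_hat ** 2)

# Hypothetical candidate settings for the printing example (target = 500)
loss_a = eql(mu_hat=502.0, sigma_hat=3.0, target=500.0)   # biased, some variance
loss_b = eql(mu_hat=500.0, sigma_hat=0.0, target=500.0)   # on target, zero variance
```

A setting that drives both the squared bias and the variance toward zero, as the proposed DR approach does in this example, minimizes EQL.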
The estimated process mean and standard deviation functions obtained by applying the conventional LSM-RSM, the NN-based estimation method using the conventional log-sigmoid transfer function, and the proposed DR approach with the estimated optimal weight parameters listed in Table 3, along with the associated coefficients of determination of each estimated response function, are illustrated as contour plots in Figures 4-6, respectively.

The optimal control factor settings, the optimal process parameters (i.e., process mean, bias, and variance), and their associated EQL values obtained by the three different approaches (i.e., the conventional LSM based on RSM, the NN-based estimation method using the conventional log-sigmoid transfer functions, and the proposed NN-based DR estimation method) are presented in Table 5. As demonstrated in Table 5, the process variance obtained from the NN-based estimation methods is remarkably smaller than that of the conventional statistical LSM approach. Therefore, the NN-based estimation methods provide considerably smaller EQL values (i.e., a significant criterion for evaluating the level of quality) than the conventional LSM approach in this particular example, although the same RD optimization model is utilized. Of the two NN-based estimation methods, the proposed DR approach provides better optimal solutions than the NN-based estimation method that uses the conventional log-sigmoid transfer function because the variance obtained from the proposed DR approach is almost zero in this particular example. Therefore, in terms of EQL, the best optimal solutions can be obtained using the NN-based DR estimation method. The criterion spaces (squared bias vs. variance) of the three estimated functions obtained using the conventional LSM approach based on RSM, the NN-based estimation method using the log-sigmoid transfer function, and the proposed DR approach are illustrated in Figure 7. A green cross marks the optimal settings in each figure.

Conclusions and Further Studies
This study proposes a feed-forward NN-based DR estimation method to estimate the process mean and standard deviation functions for RSM and RD modeling. The proposed method can avoid the basic error assumptions of the conventional LSM approach in this particular situation. Further, integrating the DFs into the NN structure results in a general and flexible transfer function family for transforming data. The optimal weight parameters of the DF family can be determined using MLE, and the confidence intervals of the optimal weight parameters can be determined using the proposed asymptotic distribution of the maximum-likelihood estimators. The best transfer function in the NN structure was then achieved using the proposed method. The proposed NN-based DR estimation method provides better solutions in terms of EQL than the conventional LSM approach and the NN-based estimation method that uses the conventional log-sigmoid transfer function in the particular case-study example. In order to improve the reliability of the proposed methods, different types of experimental data (i.e., small and large data sets, different DoE results, and small and large sets of replications) should be considered.
In further studies, the assumption that the error follows a normal distribution, and whether the DF values can follow other distributions, will be investigated. Additionally, the optimal weight parameters of the DF-based transfer functions could be estimated using a Bayesian approach or the Newton-Raphson algorithm, and the confidence intervals of the optimal weight parameters could then be determined using a t-distribution. In addition, a more comprehensive comparative study between the proposed NN models and higher-order models of conventional RSM can be a significant further research issue. Based on this comparative study and a Bayesian approach, we may develop a new optimal estimation system by integrating the proposed NN models and conventional RSM models.