This section is divided into three parts. The first part describes the results of conventional FFNN training for selecting the neural network parameters. The second part presents the results of RSM-based FFNN training (FFNN-RSM). In both cases, all data were normalized, and holdout validation and K-fold cross-validation were used to ensure the robustness of the network parameters and to avoid overtraining. Finally, the model validations for permeate flux using the optimal parameters from both techniques (conventional FFNN and FFNN-RSM) are presented and discussed. The accuracy of the FFNN models was measured using training and testing regressions.
3.1. Conventional-Based FFNN Training
This section discusses the selection of the optimum number of training parameters for the conventional FFNN. The training parameters include the number of neurons, the learning rate, the momentum coefficient, and the number of epochs.
Figure 5 shows the MSECV with a varying number of neurons for FFNN-lm and FFNN-gdm. The determination of the optimum network parameters is based on the lowest MSECV value. The number of neurons determines the number of connections between inputs and outputs and may vary depending on the specific case study. If too many neurons are used, the FFNN becomes over-trained and memorizes the training data, which hinders finding the best prediction model [30,35,45,46]. In this case, the number of neurons was varied from 1 to 30, which required 30 runs. The optimum number of neurons was 29 for FFNN-lm and 8 for FFNN-gdm, with the lowest MSECV values of 0.0238 and 0.0533, respectively. As shown in Figure 5, the MSECV values of FFNN-gdm tend to fluctuate as the number of neurons varies, whereas those of FFNN-lm are more consistent.
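As an illustration of this one-factor-at-a-time search, the sketch below sweeps the number of hidden neurons and keeps the count with the lowest cross-validated MSE. The study used MATLAB's trainlm and traingdm training functions; here scikit-learn's MLPRegressor with solver="sgd" and momentum serves only as a rough stand-in for traingdm (scikit-learn offers no Levenberg–Marquardt solver), and the dataset X, y, the fixed parameter defaults, and the 5-fold split are assumptions for illustration, not details from the paper.

```python
# Minimal sketch: sweep the number of hidden neurons and select the
# count with the lowest cross-validated MSE (MSE_CV).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

def mse_cv(n_neurons, X, y, lr=0.5, momentum=0.5, epochs=500, folds=5):
    net = MLPRegressor(hidden_layer_sizes=(n_neurons,),
                       solver="sgd",            # rough stand-in for traingdm
                       learning_rate_init=lr,
                       momentum=momentum,
                       max_iter=epochs,
                       random_state=0)
    scores = cross_val_score(net, X, y, cv=folds,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

X, y = np.random.rand(100, 4), np.random.rand(100)    # placeholder data
results = {n: mse_cv(n, X, y) for n in range(1, 31)}  # 30 runs
best_n = min(results, key=results.get)
print(f"optimum neurons: {best_n}, MSE_CV = {results[best_n]:.4f}")
```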
Figure 6 and Figure 7 present the MSECV with varying values of the learning rate and momentum, respectively. In this case, both the learning rate and the momentum were varied from 0.1 to 1, with 10 runs required for each parameter. As depicted in Figure 6a, learning rates of 0.3 (MSECV = 0.0238), 0.5 (MSECV = 0.0239), and 0.7 (MSECV = 0.0232) would all produce good results for FFNN-lm. Ultimately, the optimum learning rates for the conventional FFNN were 0.7 for lm and 0.2 for gdm, which gave the lowest MSECV values of 0.0232 and 0.3403, respectively.
Figure 7 shows the optimum values of the momentum coefficient for FFNN-lm and FFNN-gdm. Both lm and gdm yield the same optimum momentum of 0.5, at which the lowest MSECV values of 0.0243 and 0.0297 were obtained, respectively. As observed in Figure 7b, the MSECV values of FFNN-gdm remain fairly consistent near their minimum for momentum values from 0.1 to 0.7, with MSECVs of 0.0313, 0.0334, 0.0308, 0.0351, 0.0297, 0.0334, and 0.0303, respectively. Beyond 0.7, however, the MSECV of FFNN-gdm increases sharply to almost 1.
Figure 8 shows the MSECV with a varying number of epochs for FFNN-lm and FFNN-gdm. In this case, the number of epochs was varied from 100 to 1000, requiring 10 runs. As illustrated in Figure 8b, epoch counts of 600 and 1000 produce good outcomes for FFNN-gdm, with MSECVs of 0.0310 and 0.0302, respectively. Therefore, the optimum number of epochs was selected as 300 for lm and 1000 for gdm, which gave the lowest MSECV values of 0.0234 and 0.0302, respectively.
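The learning-rate, momentum, and epoch sweeps follow the same one-factor-at-a-time pattern; continuing the hypothetical sketch above (same assumed helper and placeholder data), they reduce to:

```python
# Continuing the sketch above: hold the other parameters fixed and sweep
# one parameter at a time over the ranges used in the paper (10 runs each).
lr_results = {lr: mse_cv(best_n, X, y, lr=lr)
              for lr in np.arange(0.1, 1.01, 0.1)}
mom_results = {m: mse_cv(best_n, X, y, momentum=m)
               for m in np.arange(0.1, 1.01, 0.1)}
epoch_results = {e: mse_cv(best_n, X, y, epochs=e)
                 for e in range(100, 1001, 100)}
```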
Table 6 summarizes the optimal values of the training parameters obtained for the conventional-based FFNN (FFNN-lm and FFNN-gdm).
3.2. RSM-Based FFNN Training
This section presents the optimum training parameters obtained using the RSM-based FFNN training. The proposed FFNN-RSM method utilized a face-centered central composite design (CCD) with four numerical factors (number of neurons, learning rate, momentum, and number of epochs) and one two-level categorical factor (training function: trainlm and traingdm). The experimental design matrix is shown in Table 3. There are 60 sets of conditions (runs), consisting of 32 factorial points, 16 axial points, and 12 center points. The different conditions of the neural network parameters were designed and trained to model the best MSE performance on the validation dataset using the Design-Expert software.
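Design-Expert generated the design matrix used in the study; purely for illustration, a face-centered CCD with the same point structure can be constructed by hand in coded units. Everything below (the variable names, the split of 6 center points per categorical level) is an assumption consistent with the counts quoted above, not the paper's actual matrix.

```python
# Sketch: face-centered CCD in coded units (-1, 0, +1) for the four
# numeric factors, replicated for each level of the categorical
# training-function factor. Point counts match the paper:
# 16 factorial x 2 = 32, 8 axial x 2 = 16, 6 center x 2 = 12 -> 60 runs.
from itertools import product
import numpy as np

factorial = np.array(list(product([-1, 1], repeat=4)))  # 16 points
axial = np.vstack([r * np.eye(4) for r in (-1, 1)])     # 8 points (face-centered: alpha = 1)
center = np.zeros((6, 4))                               # 6 points

runs = [(*row, fn)
        for fn in ("trainlm", "traingdm")               # categorical factor E
        for block in (factorial, axial, center)
        for row in block]
print(len(runs))  # 60
```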
In this work, a quadratic model was chosen to correlate the effective neural network factors (inputs) with the MSECV response (output). Using the Box-Cox method, the MSECV response was transformed to a natural log (ln(MSECV)), with α equal to 1. The transformation makes the distribution of the response closer to the normal distribution and improves the fit of the model to the data [35]. The quadratic model in terms of coded factors for ln(MSECV) of the FFNN-RSM model is given in Equation (10), where the parameters A, B, C, D, and E are the coded values of the number of neurons, the learning rate, the momentum, the number of epochs, and the training function, respectively, as presented in Table 3. Equation (10) is used to predict the response at a given level of each factor. The coded equation is useful for identifying the relative impact of the factors by comparing the factor coefficients. Negative and positive coefficients represent antagonistic and synergistic effects of each model term, respectively: a positive value causes an increase in the response, while a negative value causes a decrease. The coefficient values are relatively low due to the low MSECV responses of the system [47].
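The coefficients of Equation (10) come from Design-Expert's least-squares fit; a sketch of an equivalent fit with statsmodels is shown below. The `runs` matrix is carried over from the hypothetical CCD sketch above, and `mse_cv_responses` stands in for the measured MSECV of the 60 runs, which the paper reports in Table 5.

```python
# Sketch: fit the full quadratic response-surface model for ln(MSE_CV)
# on the coded factors. E is categorical, so patsy encodes it automatically.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(runs, columns=["A", "B", "C", "D", "E"])
mse_cv_responses = np.random.uniform(0.02, 0.9, len(df))  # placeholder values
df["ln_mse"] = np.log(mse_cv_responses)                   # Box-Cox ln transform

# Main effects, two-factor interactions, and squared terms of the numeric factors.
model = smf.ols("ln_mse ~ (A + B + C + D + E)**2"
                " + I(A**2) + I(B**2) + I(C**2) + I(D**2)", data=df).fit()
print(model.params)  # coded-factor coefficients, as in Equation (10)
```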
The accuracy of the RSM model was determined using ANOVA. The ANOVA contains a set of evaluation terms, such as the coefficient of determination (R²), the adjusted R², the predicted R², adequate precision, the F-value, and the p-value, which are used to assess the significance of the model. The statistical test factor, the F-value, was used to evaluate the significance of the model at the 95% confidence level [30]. The p-value serves as a tool to assess the importance of each coefficient at a specified level of significance: a p-value of less than 0.050 indicates a significant term that contributes substantially to the response, and the smaller the p-value, the more significant the corresponding coefficient, while terms with p-values greater than 0.050 are less significant.
Table 4 presents the ANOVA for the response surface of the quadratic model. The model F-value of 15.35 had a p-value lower than 0.0001, confirming that the model is statistically significant. The lack-of-fit test was insignificant, with an F-value of 0.7680 and a p-value of 0.7262, indicating that the model adequately fits the experimental data. The R² value of 0.8794 shows a good correlation between the predicted and actual values of the responses. The predicted R² of 0.7183 is in reasonable agreement with the adjusted R² of 0.8221, also indicating the significance of the model [33]. The closer the R² value is to unity, the better the model, as it yields predicted values closer to the actual values. Adequate precision measures the signal-to-noise ratio; a ratio greater than 4 is desirable, and the value of 14.8721 obtained in this analysis indicates an adequate signal. Thus, the model can be used to navigate the design space [30].
The p-values less than 0.050 (A, B, C, E, AE, BE, CE) indicate terms with significant effects on the prediction process. The statistical analysis showed that the first-order (linear) term of the training function (E) is the most significant term in the ln(MSECV) response, followed by the number of neurons, the momentum, and the learning rate. The number of epochs had a less significant effect on the response, with an F-value of 0.0464 and a p-value of 0.8305. In addition, the p-values of AB, AC, AD, BC, BD, CD, DE, A², B², C², and D² are greater than 0.050 and, hence, these terms are less important in the ANN training process.
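Design-Expert reports these ANOVA statistics directly; with the hypothetical statsmodels fit sketched above, an equivalent significance screen is a few lines:

```python
# Sketch: screen the model terms at the 95% confidence level using the
# fitted response-surface model from the previous sketch.
alpha = 0.050
print(model.pvalues[model.pvalues < alpha])  # significant terms
print(f"R^2 = {model.rsquared:.4f}, adj. R^2 = {model.rsquared_adj:.4f}")
```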
Figure 9 shows the plots of the response (ln(MSECV)) against the interaction factors (number of neurons, learning rate, momentum, and number of epochs) obtained from the model graph of the Design-Expert version 12.0 software. Figure 9a shows the interaction plot of the number of neurons versus ln(MSECV) for each training function, with the other factors held constant at their midpoints. In Figure 9a, the shape of the curve depends on the type of training function (E). It can be observed that, with an increasing number of neurons (from 1 to 30) and traingdm as the training function, the ln(MSECV) increases up to 0.7. This indicates that, for traingdm, too many hidden neurons give the network excess flexibility in the weight adjustment, causing it to fit the noise present in the system and degrading the cross-validation performance.
Almost similar trends can be seen for an increasing learning rate and momentum. However, increasing the number of epochs (i.e., from 700 to 1000) has little effect on the ln(MSECV) response, as depicted by the curves in Figure 9d. These findings confirm the statistical results in Table 4: the number of neurons, the learning rate, and the momentum are significant variables for ln(MSECV), while the number of epochs is not. The ln(MSECV) of FFNN-lm shows very little change or interaction compared with that of FFNN-gdm. This is because the MSECV values produced by FFNN-lm are very small, since the Levenberg–Marquardt (lm) algorithm is among the fastest training functions and provides good function fitting (non-linear regression) with a lower mean square error (MSE).
The comparison of the actual and predicted ln(MSECV) responses for the 60 runs of various network parameter conditions in the CCD is given in Table 5. The actual values of ln(MSECV) generally match the predicted values, indicating that the quadratic model in Equation (10) adequately captures the relationship between the MSECV and the network parameters.
Figure 10a,b show the perturbation plots for the lm and gdm training functions, respectively. The perturbation plot shows how the response changes as each factor moves from the reference point, with all other factors held at constant reference values; by default, the reference point is the middle of the design space (the coded zero level of each factor). Figure 10a presents good interaction of the variables with the ln(MSECV) response for FFNN-lm. At the center point, factors A (number of neurons), B (learning rate), and C (momentum) produce relatively large changes in the response as they move from the reference point, while factor D (number of epochs) produces only a small effect.
Notice that the optimum values of ln(MSECV) for the network parameters (A, B, C, and D) can be found at 0.0000 (coded value), as shown in Figure 10a for the lm training function and in Figure 10b for the gdm training function. In this case, the optimum values for A, B, C, and D are, respectively, 16, 0.16, 0.75, and 850, as presented in Table 6. It can be observed that Figure 9 and Figure 10 present similar plots of ln(MSECV), but the former shows the interaction graphs, which use the actual values of the variables, while the latter shows the perturbation graphs, which use coded values; both describe the relationships among the network parameters. Table 6 presents the best network parameters obtained from the conventional and proposed methods. The optimum values suggested by RSM were as follows: number of neurons = 16, learning rate = 0.16, momentum = 0.75, and number of epochs = 850, with trainlm selected as the training function. The best network parameters for FFNN-lm gave the minimum MSECV when the number of neurons, the learning rate, the momentum, the number of epochs, and the training function were 29, 0.7, 0.5, 300, and trainlm, respectively. Furthermore, the minimum MSECV for FFNN-gdm was obtained with the following optimum network parameters: number of neurons = 8, learning rate = 0.2, momentum = 0.5, number of epochs = 1000, and training function = traingdm. These optimum values were applied for predicting the permeate flux of POME.
It may be concluded that too few hidden neurons limit the ability of the FFNN-lm to model the process well, while too many hidden neurons cause over-fitting and increase the computation time. The learning rate determines the time needed to find the minimum in the weight space. A learning rate that is too small leads to small steps in the weight space, a slow learning process, and a network that is less capable of escaping local minima in the error surface. A learning rate that is too high increases the magnitude of the oscillations in the mean square error, resulting in slow convergence to a lower error state. Moreover, a momentum that is too small lengthens the learning process.
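These effects follow directly from the standard gradient-descent-with-momentum weight update implemented by traingdm, in which the learning rate η scales the gradient step and the momentum coefficient μ carries over a fraction of the previous step:

```latex
\Delta w(t) = -\eta \,\frac{\partial E}{\partial w} + \mu \,\Delta w(t-1),
\qquad w(t+1) = w(t) + \Delta w(t)
```

A small η or μ shrinks the effective step size and slows convergence, while a large η amplifies the oscillations in the mean square error described above.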
3.3. Model Validation
Figure 11a shows the training results for the permeate flux outputs of the FFNN-lm, FFNN-gdm, and FFNN-RSM models, plotted using the best ANN training parameters. It can be observed that the predicted datasets for all ANN models follow trends similar to the actual (measured) dataset. The permeate flux model for FFNN-lm has a slightly different shape from that of FFNN-gdm but is almost identical to that of FFNN-RSM, because FFNN-RSM also uses the trainlm training function in its model settings.
The dotted lines in Figure 11b–d represent the perfect result (outputs = targets), and the solid lines represent the best-fit linear regressions between the targets and the outputs of FFNN-lm, FFNN-gdm, and FFNN-RSM, respectively, using the training data. The FFNN model trained with trainlm using the conventional method (FFNN-lm) showed the highest accuracy, with R and MSE of 0.9888 and 0.0223, respectively, followed by FFNN-RSM (0.9881 and 0.0237) and FFNN-gdm (0.9851 and 0.0296) (refer to Figure 11b–d). In terms of accuracy, FFNN-lm and FFNN-RSM are comparable.
The trained models were then validated using the testing dataset, and good agreement with the actual dataset was achieved, as shown in Figure 12a–d. Figure 12a shows the plots of the FFNN-lm, FFNN-gdm, and FFNN-RSM permeate flux models for the SMBR filtration system on the testing dataset. From Figure 12b–d, it can be seen that all the models demonstrate good prediction, with slightly higher accuracy for FFNN-lm, followed by FFNN-RSM and FFNN-gdm. The FFNN-lm model achieved R and MSE of 0.9873 and 0.0253, respectively; the R and MSE for FFNN-RSM are 0.9866 and 0.0265, respectively; and the FFNN-gdm testing model achieved 0.9847 and 0.0303 for R and MSE, respectively. Table 7 summarizes the R and MSE of the FFNN-lm, FFNN-gdm, and FFNN-RSM models for the training and testing datasets.
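For reference, the R and MSE reported here are the linear correlation coefficient and the mean squared error between the measured and predicted flux; a minimal sketch of their computation (the flux array names are hypothetical) is:

```python
# Sketch: compute the correlation coefficient R and the MSE between
# measured and predicted permeate flux for a given dataset split.
import numpy as np

def r_and_mse(measured, predicted):
    r = np.corrcoef(measured, predicted)[0, 1]
    mse = np.mean((measured - predicted) ** 2)
    return r, mse

# e.g. r_test, mse_test = r_and_mse(flux_measured_test, flux_predicted_test)
```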
From Table 7, it was found that all FFNN models produced comparable training and testing accuracy. Nevertheless, in terms of the number of repetitions and the training time, the proposed method (FFNN-RSM) required only 60 runs in 233 s (00:03:53) to determine the optimal ANN training parameters for the FFNN model. The conventional method required 60 runs for each model (a total of 120 runs), with a total training time of 543 s (00:09:03): 151 s (00:02:31) for FFNN-lm and 392 s (00:06:32) for FFNN-gdm. These results show that the RSM technique delivers high performance and the fastest model training when compared with the conventional method.
Despite the well-known advantage of neural networks in predicting larger datasets, these results show that the combined FFNN-RSM model predicts well and provides results comparable to the conventional method in this case. The FFNN-RSM shows a robust generalization ability with a small generalization error. With a smaller number of repetitions, RSM is also effective in avoiding the monotonous task of constructing, training, and testing many different network parameter combinations. Moreover, RSM can analyze the significant parameters and the interaction effects of the parameters that affect the output response, namely the MSECV of the model. Hence, RSM satisfies the requirements for optimizing ANN training parameters to obtain a good prediction model.