Machine Learning Alternatives to Response Surface Models

In the Design of Experiments, we seek to relate response variables to explanatory factors. Response Surface Methodology (RSM) approximates the relation between output variables and a polynomial transform of the explanatory variables using a linear model. Some researchers have tried to fit other types of models, mainly nonlinear and nonparametric. We present a large panel of Machine Learning approaches that may be good alternatives to the classical RSM approximation. The state of the art of such approaches is given, including classification and regression trees, ensemble methods, support vector machines, neural networks and also direct multi-output approaches. We survey the subject and illustrate the use of ten such approaches using simulations and a real use case. In our simulations, the underlying model is linear in the explanatory factors for one response and nonlinear for the others. We focus on the advantages and disadvantages of the different approaches and show how their hyperparameters may be tuned. Our simulations show that even when the underlying relation between the response and the explanatory variables is linear, the RSM approach is outperformed by the direct neural network multivariate model for any sample size (n ≤ 50), and much more so for very small samples (n = 15 or 20). When the underlying relation is nonlinear, the RSM approach is outperformed by most of the machine learning approaches for small samples (n ≤ 30).


Introduction
Design of Experiments (DoE) is used in any manufacturing process where, on the one hand, input parameters called factors affect a process or a formula and can be controlled, and on the other hand, output parameters of this process, which therefore represent the results, are called responses. For example, input parameters can be a temperature or a pH, and output parameters can be a yield or an impurity rate. Experiments must be performed to determine a relationship between factors and responses. DoE allows us to select these experiments in such a way that the position of the experimental points in the space to be studied is optimal. In particular, spatial modeling can be used to better understand the process and to predict the variation of responses throughout the variation range of the factors. In these experiments, responses and factors may be quantitative, qualitative, or a mix of both. The number of experiments is often very small, because each may need several days or even weeks, and the cost can be considerable. Among the important tasks that are used in the DoE approach, response surface methodology (RSM) estimates the values of all responses for any combination of factors' values. This is achieved by means of a regression of the responses on the factors. The most common approach considers each response separately and then uses polynomial regression, including or excluding interactions between factors. 
Polynomial regression suffers from several limitations, as follows: (i) the model is linear in the coefficients; (ii) to get a good approximation of the surfaces, higher degrees of polynomials may be necessary together with interactions, and this considerably increases the number of variables to include in the model and therefore the number of experiments to be performed; (iii) the model may only be estimated if the number of experiments is larger than the number of factors included in the model; and (iv) polynomial models are inappropriate in the case of nonlinear systems [1].
Several papers have suggested machine learning (ML) approaches as an alternative to polynomial regression. The approaches that are considered in the literature include Support Vector Machines (SVM) [1][2][3][4][5][6], Neural Networks (NN) [1,2,4,[6][7][8][9], Random Forests (rf) [1,3,4,10], Boosting and its extension [1,3], Extra Tree regression (ET) [3,4] and Classification And Regression Trees (CART) [1]. In these papers, the authors compared some of these approaches only on real datasets, using specific settings with respect to the nature and dimensions of the data. As well as focusing on a specific dataset, the comparisons considered very few approaches at the same time and were based on very different metrics. In this work, we will mainly consider the contribution of ML with respect to the RSM; these approaches have been shown to be very efficient in various fields, such as ecology [11] and epidemiology [12].
In this paper, we give a state of the art of the ML approaches that have often been used as alternatives to polynomial regression in RSM, including direct multi-output approaches. We will briefly describe these approaches and compare them using simulation models in various situations with different sample sizes and over a real DoE case study. The simulations were actually limited to numerical variables; including categorical ones as input or output did not have any significant effect on our main results or conclusions. The advantage and the contribution of our work lies, on one hand, in the choices of the simulations; some responses are in favor of the RSM approach and others are not. On the other hand, to our knowledge, direct multivariate output approaches have never been used in this context. This paper is organized as follows. Section 2 gives a summary of the polynomial RSM-based approach and a brief description of the most well-known ML approaches. Section 3 gives the state of the art for the applications of these approaches in various domains where DoE is practiced. In Section 4, we compare all of the ML approaches to polynomial regression on simulated multiple output datasets, varying sample sizes and optimizing the models' hyperparameters. A real dataset is used for similar experiments. The last section includes our discussions and conclusion.

Response Surface Methodology
Among the many objectives of experimental design, Response Surface Methodology (RSM) [13] is used to determine the value of one or more outputs at any point in the experimental domain of interest [14] without carrying out an infinite number of experiments. This is achieved using a regression model of the form:

Y = f(X_1, X_2, ..., X_K) + ε,

where f is the unknown regression function. To estimate f, we need to make some hypothesis about its shape. Polynomial models are most commonly used to address this issue. Using high-degree polynomials, f may be approximated correctly. This is a linear model in the coefficients. With K factors and degree d, the number of coefficients of the model is

(K + d)! / (K! d!).

According to the type of study, polynomial models of degree 1 or 2 are used, because the number of observations is, in general, very small (often less than 20).
Let U_k be the original K controlled factors, also called the natural variables, whose space is called the experimental domain. These variables are generally on different scales, and they may bias the modeling results. Thus, they are often normalized and linearly transformed into codified variables X_1, X_2, ..., X_K, which are dimensionless quantities with the same range. The most commonly used transformation is

X_k = (U_k − U_k^0) / ΔU_k,

where U_k^0 = (max U_k + min U_k)/2 is called the central value and ΔU_k = (max U_k − min U_k)/2 is the step. The transformed variables (X_k)_{k=1,...,K} lie in the interval [−1, 1], and the n × K matrix X = (X_k)_{k=1,...,K} is called the experimental design, where n is the sample size. A polynomial model of degree 2 may be written as:

Y = β_0 + Σ_{k=1}^{K} β_k X_k + Σ_{k=1}^{K} β_{kk} X_k² + Σ_{k<l} β_{kl} X_k X_l + ε,

where Y is the response and the β are the unknown coefficients. Once the design of the experiments is carried out and the data are available, the coefficients of the model are estimated by ordinary least squares, and their estimate is given by:

β̂ = (XᵀX)⁻¹ XᵀY.

Test points may be used to validate the model, or an analysis of variance may be performed. After validation, the model is used to predict the variation of responses throughout the experimental domain, represented by a response surface (Figure 1). These predictions can then be used to construct a Design Space, which is a region in which the response specifications will be achieved with a fixed probability [15]. Nevertheless, due to the small sample sizes available in experimental designs, polynomial models of degree 2 may give very poor estimates of the response surfaces. Alternative models have been proposed for several years. ML models provide more flexible ways of estimating a response surface function: they are nonparametric and nonlinear, and may remain efficient even when the number of experiments is small compared to the number of factors.
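As an illustration, the codification of the natural variables can be sketched in a few lines (a minimal example; the function and variable names are ours, not from the original study):

```python
def codify(u, u_min, u_max):
    """Map a natural variable U_k onto the coded interval [-1, 1]."""
    center = (u_max + u_min) / 2.0  # central value U_k^0
    step = (u_max - u_min) / 2.0    # step Delta U_k
    return (u - center) / step

# e.g., a temperature factor ranging over [60, 100] degrees C
print(codify(60, 60, 100))   # -1.0
print(codify(80, 60, 100))   # 0.0
print(codify(100, 60, 100))  # 1.0
```

Centering and scaling in this way puts all factors on the same range, so the estimated coefficients become directly comparable across factors.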

Machine Learning Approaches
In this section, we will give a brief description of the most commonly used approaches in ML, as follows: k-nearest neighbors (knn), CART, Ensemble models (Bagging, Boosting, and rf), SVM, and NN.

knn
This is one of the simplest ML approaches that may be used in regression and in classification [17]. It is also quite different from all of the other approaches, in that the model estimation and prediction are embedded. Once the number of neighbors k is fixed, for any new observation x, the method seeks its k nearest neighbors within the learning set. The prediction for x is the average of the neighbors' outputs in regression, or the majority vote in classification. The decision rule may be refined using, for instance, weighted averages or weighted majority votes, where weights may be inversely proportional to the distances of the neighbors from x. Note that the distance used to identify neighbors is a hyperparameter for this approach, together with k.
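The prediction step can be sketched as follows (a minimal illustration assuming Euclidean distance and a plain average of the neighbors' outputs):

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Predict the response at x as the average of the k nearest
    training outputs (Euclidean distance)."""
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )
    neighbors = [yi for _, yi in dists[:k]]
    return sum(neighbors) / k

X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0.0, 1.0, 4.0, 9.0]
print(knn_predict(X, y, (1.1,), k=2))  # average of y at x=1.0 and x=2.0 -> 2.5
```

A weighted variant would simply replace the plain average by a distance-weighted one.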

CART
Classification and Regression Trees [18] are based on recursive partitioning of the input space using rectangles. These partitions are represented by a sequence of binary splitting rules of the form X m < s, where X m is one of the input variables and s is a threshold over it. The split at each stage of the partitioning is optimized by an exhaustive search looking for the couple variable and threshold that minimizes the heterogeneity of the obtained subsamples. Heterogeneity is measured with respect to the output y. When the output is continuous, the deviance is used (the variance multiplied by the sample size). For discrete output, entropy or Gini criteria are used.
CART is the most popular nonlinear and nonparametric method when one wants to understand how the model relates the outputs to the explanatory variables. It is one of the few methods that are graphically representable.
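For a single continuous input, the exhaustive split search with the deviance criterion can be sketched as follows (an illustrative implementation, not the CART package itself):

```python
def deviance(ys):
    """Deviance = variance * n = sum of squared deviations from the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Exhaustive search for the threshold s minimizing the summed
    deviance of the two subsamples {x < s} and {x >= s}."""
    best = (None, float("inf"))
    for s in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        score = deviance(left) + deviance(right)
        if score < best[1]:
            best = (s, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
print(best_split(xs, ys))  # splits at s=10, with near-zero total deviance
```

With several inputs, the same search is run over every (variable, threshold) pair, and the whole procedure is applied recursively to each subsample.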

Ensemble Methods
The idea of ensemble methods, in their simplest approach, is to build K bootstrap samples of the dataset at hand, adjust a model chosen from within a class of functions (e.g., decision trees) for each bootstrap sample (denoted f k ), and then use the K models for predictions. In regression, the prediction by the ensemble is a weighted average of the K predictions given from each f k . In classification, it is a weighted majority vote over the K predictions given by the models f k .
Several variants of ensemble methods exist, differing either in the way in which the bootstrap samples are generated, in the estimation of each f k , in the way in which the individual models f k are combined, or in the algorithm used to estimate the whole process.
Random Forests (rf) [19] are among the most well-known and widely used approaches. This approach combines trees built over bootstrap samples by simple averages in regression and majority vote in classification. The trees that are built in rf may be very big, because they are not pruned, and the choice of the splits at each node is made on a small subset of variables chosen at random. Because bootstrap samples are independent, random forests may be parallelized. Boosting [20] is quite different, because the trees are built sequentially: weights w_i are associated to each observation in the original dataset, and at each step of the algorithm, a tree f_k is built and tested over the original dataset. Observations whose prediction at step k is wrong have their weights increased, and the modified weights are used at step k + 1 to draw the next bootstrap sample. Each tree in the ensemble has a weight W_k related to its performance, and the final output is a weighted majority vote in classification or a weighted average in regression. The trees' estimates and their weights may be obtained using a gradient approach, as is applied in the gradient boosting method (gbm) [21].
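The bootstrap-and-average principle shared by these methods can be illustrated with a toy bagging of regression stumps (one-split trees). This is a simplified sketch, not the rf algorithm itself; in particular, there is no random subsetting of variables:

```python
import random

def stump_fit(xs, ys):
    """Fit a one-split regression stump: a threshold and a mean on each side."""
    if len(set(xs)) == 1:  # degenerate bootstrap sample: constant predictor
        m = sum(ys) / len(ys)
        return lambda x: m
    best = None
    for s in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Fit n_models stumps on bootstrap resamples; predict by simple average."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]  # one bootstrap sample
        models.append(stump_fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
ensemble = bagging_fit(xs, ys)
print(ensemble(2))   # close to 1.0 (the low-cluster mean)
print(ensemble(11))  # close to 5.0 (the high-cluster mean)
```

Averaging over bootstrap resamples reduces the variance of the individual stumps, which is the key benefit of bagging-type ensembles.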
Extreme gradient boosting (xgboost [22]) is another gradient boosting algorithm, which employs some additional tricks to estimate the parameters (i.e., trees and their weights). The optimization is carried out globally at the level of each split in the different trees and, as in rf, splits are optimized over a randomly sampled subset of covariates. The loss function optimized in the algorithm uses regularization over the weights and, optionally, a shrinkage of each tree added to the ensemble.

Support Vector Machines
Support vector machines [23] seek a linear separation of observations, like linear regression, but minimize a loss function based on the margins. In binary classification, the optimal separator is the hyperplane that perfectly separates the two classes while staying as far as possible from its nearest points in either class. This hyperplane may be easily found if the classes are linearly separable. If they are not, then the dataset is projected, using nonlinear transformations, into a much higher-dimensional space where linear separation may be guaranteed. The nonlinear transformations used in this case are expressed implicitly through kernel functions, among which the most commonly used are the polynomial and radial kernels. These kernels, together with the optimization algorithm used to find the optimal hyperplane, depend in general on several hyperparameters that must be fixed or tuned by the user. We will denote by svmPoly (resp. svmRadial) the support vector regression using the polynomial (resp. radial) kernel. Support Vector Machines may show high performance when the data are linearly separable, and are very efficient for large datasets. They also have a very solid mathematical justification.
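The two kernels mentioned above have simple closed forms; a minimal sketch (the hyperparameter values `degree`, `c` and `gamma` are arbitrary illustrative choices):

```python
import math

def poly_kernel(u, v, degree=2, c=1.0):
    """Polynomial kernel: (u . v + c) ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + c) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """Radial (Gaussian) kernel: exp(-gamma * ||u - v||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq)

print(poly_kernel((1.0, 0.0), (1.0, 1.0)))  # (1 + 1)^2 = 4.0
print(rbf_kernel((0.0, 0.0), (0.0, 0.0)))   # identical points -> 1.0
```

The kernel value plays the role of an inner product in the implicit higher-dimensional space, so the separating hyperplane never has to be computed there explicitly.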

Neural Networks
NNs, and specifically the multi-layer perceptron (mlp) [17], are designed to handle multiple outputs. An mlp is composed of successive layers, each having a fixed number of neurons. Each neuron of a layer receives information (its inputs) from all of the neurons of the previous layer and outputs a nonlinear transformation φ of a linear combination of its inputs. In Figure 2, the left-hand panel shows a simple neuron having d inputs x_1, x_2, ..., x_d. This neuron will output the value:

y = φ( Σ_{j=0}^{d} w_j x_j ),

where the w_j are the weights, w_0 = b, x_0 = 1, and φ is a nonlinear activation function. The right-hand panel in Figure 2 shows an mlp with one hidden layer containing three neurons. Various types of layers may also be used to design a NN. We shall mainly use fully connected layers (also called dense layers). Once the structure of the network is specified (number of layers and number of neurons per layer), all of the unknown weights of the NN are estimated. To achieve this, a loss function must be specified (mean squared error for regression), together with an optimization algorithm (typically gradient-based). In some situations, specific activation layers are added to the network, mainly for the last layer (softmax for multi-class learning, sigmoid for multi-label learning).
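The forward pass of such a network can be written compactly; below is a minimal sketch of one tanh hidden layer feeding a linear output neuron (the weights are arbitrary illustrative values, not trained ones):

```python
import math

def neuron(x, w, b, phi=math.tanh):
    """Output phi(b + sum_j w_j * x_j) for one neuron with activation phi."""
    return phi(b + sum(wj * xj for wj, xj in zip(w, x)))

def mlp_one_hidden(x, hidden, out_w, out_b):
    """One hidden layer of tanh neurons feeding a linear output neuron."""
    h = [neuron(x, w, b) for w, b in hidden]
    return out_b + sum(wj * hj for wj, hj in zip(out_w, h))

# three hidden neurons, each given as (weights, bias)
hidden = [([1.0, -1.0], 0.0), ([0.5, 0.5], 0.1), ([-1.0, 1.0], 0.0)]
pred = mlp_one_hidden([0.0, 0.0], hidden, [1.0, 1.0, 1.0], 0.0)
print(pred)
```

Training consists of adjusting all of the weights and biases by gradient descent on the chosen loss; only the forward computation is shown here.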
Neural networks are among the most difficult models to tune, because of the large number of hyperparameters that may be involved in the structure of the network and in the choice of the optimization algorithm. For small datasets, they may be unstable, due to the random initialization of the weights during training.

Multidimensional Output Approaches
In many situations, the output variable to be predicted may be multidimensional, of dimension q: multi-target regression, multi-label learning, distribution learning, semantic segmentation (for images), or time series. Various approaches may be used to tackle this situation. Direct multi-output regression is the simplest one; it consists of using one model per output component, and thus assumes that the output components are independent. Chained regression is another approach, which accounts for dependence among the output variables. The output variables are first ranked, and one model is learnt for each output, as follows: for output variable j, the original explanatory variables X are used as input, together with the predictions of the j − 1 outputs already modelled. In this approach, the result depends strongly on the order chosen for the responses.
Some supervised learning algorithms also have the ability to jointly consider the q output variables within the same model, which is the case for the example of NN, and also regression trees [24].
In classical regression trees, the output variable y is used to compute the deviance, that is, the splitting criterion of a node t:

D(t) = Σ_{i∈t} (y_i − ȳ_t)²,

where ȳ_t is the empirical mean of y in node t (which corresponds to a subset of the original sample). When y is multidimensional, y ∈ R^q, the deviance is easily generalized [24]:

D(t) = Σ_{i∈t} ∥y_i − ȳ_t∥²,

where ∥·∥ stands for the Euclidean norm in R^q. In this case, the values assigned to each node and each leaf of the tree are the q-dimensional vectors ȳ_t. The rest of the algorithm is similar to the one-dimensional output case. We will denote this approach as CARTmv.
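The multivariate deviance is straightforward to compute; a minimal sketch for a node containing q-dimensional outputs:

```python
def mv_deviance(ys):
    """Multivariate deviance of a node: sum over observations of the
    squared Euclidean distance to the node's mean vector."""
    n, q = len(ys), len(ys[0])
    mean = [sum(y[j] for y in ys) / n for j in range(q)]
    return sum(sum((y[j] - mean[j]) ** 2 for j in range(q)) for y in ys)

# a node containing three 2-dimensional outputs; mean vector is (2, 3)
node = [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]
print(mv_deviance(node))  # 2 + 2 + 4 = 8.0
```

With q = 1 this reduces exactly to the univariate deviance, so the split search itself is unchanged.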
The main advantage of direct multiple-output approaches is that they may account for dependencies between the different outputs. CARTmv and its variants have been widely used in various applications.

Hyperparameters and Their Tuning
The performance of any ML algorithm depends strongly on the choice of its hyperparameters. Each approach uses several hyperparameters, from one to almost ten. In most cases, these parameters have to be optimized, which is often achieved using cross-validation. Tuning may require excessive computation time, depending on the number of hyperparameters to be adjusted. Details about the tuning process and the hyperparameters used are given in the experiments section.

Metrics Used to Compare the Models
Analysis of variance is generally used to evaluate the performance of polynomial models in RSM. Depending on the problem under study, it can be observed that the variation in outputs may not be due to the variation in inputs. Different criteria may be used to assess and compare the accuracy of the prediction obtained using any regression model. Let n be the size of the sample used to evaluate the models, y_i be the observed value of the response for observation i, ȳ be the average of the responses in the sample, and ŷ_i be the predicted value of the response for observation i. Among the most used metrics, we can find:
• The Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE) and the Akaike Information Criterion (AIC);
• The coefficient of determination R² and the Explained Variance (EV);
• The Nash and Sutcliffe Efficiency (NSE), which is equivalent to the R² but uses absolute differences rather than quadratic ones;
• The agreement index d, a standardized measure of the degree of model prediction error;
• The average absolute deviations from a central point (AAD; this metric is defined and used in [8], but the name given by the authors does not seem appropriate to us, as it is not in line with their definition).
The objective is to minimize RMSE, MAE, MAPE, AIC, and AAD, and to maximize R², EV, NSE, and d. The results encountered in the literature show that, in general, when different metrics are used for comparisons, they are very often concordant with respect to the objectives.
When the output y is multidimensional, any of these criteria may be used by computing its average over all the dimensions.
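Three of the most common metrics (RMSE, MAE and R²) can be computed directly from their definitions; a minimal sketch:

```python
import math

def rmse(y, yhat):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 3.8]
print(round(rmse(y, yhat), 4), round(mae(y, yhat), 4), round(r2(y, yhat), 4))
```

For a q-dimensional output, each metric is simply computed column by column and averaged over the q dimensions.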

Comparisons between RSM and ML Approaches
A comparison of these response surface models with ML has been made, mainly in the field of mechanics and materials development.
ML models were compared to the traditional RSM polynomial approximation on a mechanical engineering case study [1]. The authors studied 63 continuous factors and 8 continuous responses combined to produce a one-dimensional output, with 56,512 observations. Polynomial models of degree 1 and 2, the Least Absolute Shrinkage and Selection Operator [25], a Generalized Linear Model [26] and nonlinear ML models (Random Forest, Gradient Boosting Decision Tree [17], Multiple Layer Perceptrons and Support Vector Regression) were tested. Three evaluation criteria were used for the comparisons: Explained Variance (EV), Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE). The results obtained show that all ML models outperform the polynomial models of the RSM approach. The authors also note that for larger training set sizes, nonlinear ML models were more accurate. A simulation of the data by the Monte Carlo (MC) method is proposed to improve the estimation of the response surface function. The comparisons made in this work are based on a real dataset with large samples, in which the true underlying relation between the response and the factors is unknown; moreover, the eight original responses were combined to get a one-dimensional output.
In [2], ML models were used to predict coating thickness in a nonlinear electrostatic spray deposition process and compared with the conventional RSM model. Three continuous factors and one continuous response were considered with 30 observations. A polynomial model of degree 2 (RSM), NN (Back-Propagation algorithm [27]) and Support Vector Machine models were used. MAPE was calculated to evaluate the performance of these models. The results suggest that the SVM model gives the best prediction accuracy.
The objective of [3] was to predict the viscosity of nano-polymers used in an enhanced oil recovery method, using the response surface methodology and supervised machine learning approaches. Five continuous factors and one continuous response were available, with 57 observations. The authors applied a polynomial model of degree 2 and analysis of variance, which did not show inadequacy for this model. However, the covariance matrix analysis showed that there is no linear relationship between the factors. Therefore, the authors also used nonlinear ML models, namely ridge regression [28], lasso regression, Support Vector Machine, Decision Tree, random forest, Extra tree regression [29], Gradient Boosting Regression and eXtreme Gradient Boosting [22]. Akaike Information Criterion (AIC), Mean Absolute Error (MAE), RMSE and coefficient of determination were used as evaluation and comparison criteria. The results show that the ensemble models give more accurate results than the RSM. Furthermore, the eXtreme Gradient Boosting model was the best model for response prediction.
In another study, where the objective was to predict the microhardness of a synthesized electroless Ni-P-TiO 2 coated aluminium composite, five continuous factors and one continuous response were considered with 36 observations [4]. The authors tested a polynomial model of degree 2 and four ML models (NN, SVM, rf and Extra Trees). Mean Square Error (MSE), MAE and determination coefficient were used as evaluation criteria to compare these models. The results show that the ET model presents the lowest MSE and MAE values. This model also had the greatest R 2 .
Hybrid regression and ML models were tested to predict the ultimate condition of fiber-reinforced polymer-confined concrete [5]. The authors used an open database with eight continuous factors, two continuous responses and 765 observations. The authors compared some existing physical empirical models developed by other authors with the RSM and SVR models, together with a hybrid model combining SVR and RSM models. The following metrics were used to evaluate these models: RMSE, MAE, Nash and Sutcliffe Efficiency (NSE) and agreement index d. The authors show that the hybrid model presents a lower RMSE, lower MAE, higher NSE and higher d than other models.
Similar comparisons with ML approaches may also be found in other fields, such as pharmaceutical development. Ref. [6] worked on the effect of the core/shell technique on improving powder compactability. In this paper, two continuous factors, two binary factors and two continuous responses were measured in 28 experiments. The authors used the RSM approach, adjusting one model for each combination of the two binary factors' levels, thus, four models for each response. For the ML approach, the authors compared an SVR model and four types of NN models: Backpropagation Neural Network (BPNN), Genetic Algorithm Based BPNN (GA-BPNN) [30], Mind Evolutionary Algorithm Based BPNN (MEA-BPNN) [31], and Extreme Learning Machine (ELM) [32]. The evaluation criteria were the variance coefficient of the RMSE and the RMSE for all models, and the AIC for NN models only. The results show that NN models provide better prediction accuracy than the other models.
The objective of the authors of [10] was to apply RSM and ML combined with data simulation to estimate metal recovery from freshwater sediments. In this work, three continuous factors were studied, six continuous responses were measured and 18 experiments were carried out. A polynomial model of degree 2 was estimated and used for data simulation. ML models, namely Lazy KStar [33] and rf algorithms, were also tested. A comparison of the models was made by means of RMSE and coefficient of determination. The results show that RSM models overestimate the observed responses by 19% compared to ML models.
Ref. [8] compared RSM and ML approaches (NN) to predict the efficient extraction of artemisinin (a precursor molecule of the most powerful antimalarial drugs on the market). In this work, three continuous factors were studied, one continuous response was measured, and a central composite design was constructed with 20 experiments. For the RSM approach, an ANOVA was computed and showed that the variability of the response could not be adequately explained by the RSM model as a function of the factors studied. This result can be explained by a complex relationship between the variables. Multiple Layer Perceptrons (feed-forward NN) were also tested. A comparison between these two models was performed by means of the Absolute Average Deviation (AAD). The authors concluded that the NN model has better prediction accuracy.
Ref. [9] analysed NN models as an alternative modeling technique for datasets showing nonlinear relationships, using data from a tablet compression study. In this work, six continuous factors were studied, two continuous responses were measured, and 102 experiments were performed. The authors calculated a polynomial model of degree 2 with only important terms (p-value < 0.05) for the RSM approach and calculated a "generalized feed forward multiple layer perceptron network" for the ML approach. A comparison of the determination coefficients of these two models suggests that the NN model can more accurately predict the variation of the response.
A summary of these papers comparing RSM and ML approaches is given in Table 1.

Experiments
In this section, we compare all of the ML approaches described in Section 3 to the polynomial model over a simulated dataset and a real dataset. We describe the datasets that we used, and how the different models are tuned and compared. All of the experiments were conducted using R and RStudio [34], together with the package "caret" [35].

Simulated Model
The simulated model that we use is based on two independent continuous factors, X_1 and X_2, following a uniform distribution over [−1, 1]. The three responses, denoted Y_1, Y_2, and Y_3, are computed as functions of the factors plus a random noise ε with zero mean and standard deviation σ. The responses were chosen deliberately to present various types of function: polynomial of degree 2 (Y_1), nonlinear (Y_2), and nonlinear with discontinuities (Y_3). A graphical representation of these responses is shown in Figure 3. Using this simulated model, we generated a training sample of size n and a test sample of the same size. Different values of n were tested: 15, 20, 30 and 50 observations. Small samples were used for conformity with true DoE applications. For each simulated dataset of size n, we computed the RMSE metric. Simulations were repeated K = 50 times, and RMSE values were averaged over the different runs. We also varied the standard deviation of the random noise, testing two values, σ = 1 and σ = 2.
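The simulation scheme can be sketched as follows; note that the response functions below are hypothetical stand-ins with the same qualitative shapes (degree-2 polynomial, smooth nonlinear, discontinuous), not the exact formulas used in the paper:

```python
import math
import random

rng = random.Random(42)

def responses(x1, x2, sigma):
    """Hypothetical stand-ins for the paper's Y1, Y2, Y3."""
    eps = lambda: rng.gauss(0.0, sigma)
    y1 = 1.0 + 2.0 * x1 - x2 + 0.5 * x1 * x2 + x1 ** 2 + eps()  # degree 2
    y2 = math.sin(3.0 * x1) * math.exp(x2) + eps()              # nonlinear
    y3 = (1.0 if x1 > 0 else -1.0) + x2 ** 2 + eps()            # discontinuous
    return y1, y2, y3

def simulate(n, sigma=1.0):
    """Draw n points uniformly on [-1, 1]^2 and compute the three responses."""
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(((x1, x2), responses(x1, x2, sigma)))
    return data

train, test = simulate(30), simulate(30)
print(len(train), len(test))  # 30 30
```

In the experiments, a pair of samples like this is drawn K = 50 times for each (n, σ) setting, and RMSE values are averaged over the runs.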

Tuning the Models
We compared the RSM approach with all of the ML approaches described in Section 3: knn, CART, rf, gbm, xgboost, svmPoly, svmRadial, NN, CARTmv and NNmv. The models were trained using the model matrix as input, thus, the true factors together with their squares and interactions.
All ML models depend on several hyperparameters. To optimize the values of the hyperparameters, we applied the grid search approach. Thus, a grid was defined for each model's hyperparameter, and three-fold cross-validation with five repeats was used for the optimization. The optimal values of the hyperparameters were then used to learn the model over the training set and evaluate it on the test sample. Table 2 lists the hyperparameters for each method, together with a short description and the grid used for its optimization. The average of the optimal values of the hyperparameters and their standard deviation for each approach and each response are given in Appendix A.1.
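The tuning loop (grid search with repeated k-fold cross-validation) can be sketched for a single hyperparameter, here the number of neighbors k of a one-dimensional knn regressor (an illustrative re-implementation, not the caret internals):

```python
import math
import random

def knn_predict(train, x, k):
    """Average of the k nearest training outputs (1-D inputs)."""
    dists = sorted((abs(xi - x), yi) for xi, yi in train)
    return sum(yi for _, yi in dists[:k]) / k

def cv_rmse(data, k, folds=3, repeats=5, seed=0):
    """Repeated k-fold cross-validated RMSE for one hyperparameter value."""
    rng = random.Random(seed)
    errs = []
    for _ in range(repeats):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        for f in range(folds):
            held = set(idx[f::folds])  # indices of the held-out fold
            fit = [data[i] for i in range(len(data)) if i not in held]
            val = [data[i] for i in held]
            se = [(knn_predict(fit, x, k) - y) ** 2 for x, y in val]
            errs.append(math.sqrt(sum(se) / len(se)))
    return sum(errs) / len(errs)

def grid_search(data, grid):
    """Return the grid value with the lowest cross-validated RMSE."""
    return min(grid, key=lambda k: cv_rmse(data, k))

rng = random.Random(1)
data = [(x, x ** 2 + rng.gauss(0, 0.05)) for x in
        [i / 10 - 1 for i in range(21)]]
best_k = grid_search(data, [1, 2, 3, 5, 7])
print(best_k)
```

For models with several hyperparameters, the grid is simply the Cartesian product of the individual grids, which is why tuning cost grows quickly with their number.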

Results
Tables 3-6 show the average and standard deviation of the RMSE scores computed over the test samples for each response and averaged over K = 50 repetitions (columns, S 1 , S 2 and S 3 ), and the average RMSE over the responses (column S ave ). We also reported the rank of each model with respect to its RMSE for each response (R 1 , R 2 and R 3 ) and for the average over responses (R ave ). A model has rank 1 for a response if it has the smallest RMSE, and thus, is the best. Each table corresponds to a different sample size n. Boxplots of the RMSE over the K = 50 runs for all the models and the outputs are shown in Appendix A.2.
For n = 50 (Table 3), the best models for the three responses were multivariate NN (for Y 1 ), svm with polynomial kernel (for Y 2 ) and NN (for Y 3 ). The linear model had positions (2,5,5). For n = 30 (Table 4), the best models for the three responses were, respectively: multivariate NN (for Y 1 ) and NN (for Y 2 and Y 3 ). With respect to the average of all response errors, svm with polynomial kernel is the best model in this case. Linear regression is at position 2 for Y 1 , but at position 8 for the other two responses. These first results seem to be in agreement with the state of the art: ML approaches may provide better prediction estimates for the responses. We obtained similar results for smaller sample sizes, n = 20 (Table 5), and n = 15 (Table 6): multivariate NN (for Y 1 ), knn (for Y 2 ) and NN (for Y 3 ) were the best models. Linear regression gets the worst results as the sample size decreases, particularly for Y 2 and Y 3 , which are nonlinear responses.
Globally, the RSM approach seems to be inefficient when compared to ML approaches, mainly with small sample sizes and for responses that are originally nonlinear in the factors. Even for large samples (n = 50, compared to what is available in DoE), RSM is outperformed in our experiments for linear response. Finally, multivariate NNs show excellent performance in various situations, mainly for the linear response (Y 1 ) where they outperform the RSM for any sample size.

Use Case
We used the pharmaceutical application that is described in [16]. This article is based on the development of the high-performance liquid chromatography analytical method to quantify verapamil hydrochloride (which is a chemical molecule involved in headaches) and its impurities in tablets. In this use case, three continuous factors (buffer pH, ammonium hydroxide concentration in mobile phase and injection volume of test solutions) and five continuous responses (capacity factor for the first eluted peak denoted Y 1 , resolution between two impurities denoted Y 2 , signal-to-noise for an impurity denoted Y 3 , resolution between two other impurities denoted Y 4 , and retention time difference between oxidative impurity peak and the closest impurity denoted Y 5 ) were available. The authors used the RSM approach to determine optimal chromatographic conditions.
For the real case study, we conducted similar experiments as for the simulated dataset, replacing random generation with random splitting. Thus, for each of the 50 repetitions, the original dataset was randomly split using proportions 0.7 and 0.3 for learning and testing, respectively. The results that we obtained are given in Table 7. The best models with respect to the RMSE were xgb for Y 1 , gbm for Y 2 , multivariate NN for Y 3 , NN for Y 4 and knn for Y 5 . For the overall error, multivariate NN had the smallest error. For this real dataset, and for all responses, ML models gave better results than the RSM approach.

Conclusions
In this work, using extensive simulations and a real use case, we have demonstrated that ML approaches present a very interesting alternative to response surface modeling in the DoE. We have tested a large panel of ML approaches, together with direct multi-output regression models (decision trees and NNs), which had never previously been used in this context. Clearly, various ML approaches outperformed RSM in all our experiments. RSM is very simple to use compared to ML approaches, due to the fact that several hyperparameters must be tuned correctly in ML algorithms. This requires optimization procedures over grids. The caret package is a good wrapper for this task and for most of the approaches that we have used, except for multi-output regression, where we implemented the grid search algorithm directly. Another constraint of caret is that not all the hyperparameters involved in each approach may be tuned, and thus it may be necessary to use other packages for this issue.
In our implementation, we used the factors together with their squares and interactions in the ML approaches. Very few papers are clear about this choice. We also tested the models including only the factors, and in some cases a loss in performance appeared for some ML algorithms, but the main conclusion was observed: several ML approaches outperformed classical RSM.
Many ML and deep learning approaches were not explored in this work (like [36][37][38][39]) and may have some advantages and drawbacks compared to those we have used. Due to the very large choice of such approaches, it is impossible to be exhaustive, so we have selected the ones most used and cited in the literature. In addition, further higher-polynomial-order factors could be included as input in the different models. Together with feature selection approaches, this may give better performance for various ML approaches.

Appendix A. Supplementary Material
Appendix A.1. Mean and Standard Deviation of Hyperparameter Values
In each of the K = 50 runs, the optimized values of the parameters were computed. The following tables give the mean and standard deviation of the hyperparameters for each response.