Solving Regression Problems with Intelligent Machine Learner for Engineering Informatics

Abstract: Machine learning techniques have been used to develop many regression models that make predictions based on experience and historical data. They may be used singly or in ensembles. Single models are either classification or regression models that use one technique, while ensemble models combine various single models. Constructing or finding the best model is complex and time-consuming, so this study develops a new platform, called intelligent Machine Learner (iML), to automatically build popular models and identify the best one. The iML platform is benchmarked against WEKA by analyzing publicly available datasets. Four industrial experiments are then conducted to evaluate the performance of iML. In all cases, the best models determined by iML are superior to those of prior studies in terms of accuracy and computation time. Thus, iML is a powerful and efficient tool for solving regression problems in engineering informatics.

J.-S.C.; investigation, J.-S.C., D.-N.T. C.-F.T.; methodology, J.-S.C. and C.-F.T.; project administration, J.-S.C.; resources, J.-S.C. and C.-F.T.; software, D.-N.T.; supervision, J.-S.C.; validation, J.-S.C., D.-N.T. and C.-F.T.; visualization, J.-S.C. D.-N.T.; writing-original J.-S.C., C.-F.T.; and J.-S.C. D.-N.T.


Introduction
Machine Learning (ML)-based methods for building prediction models have attracted abundant scientific attention and are extensively used in industrial engineering [1][2][3], design optimization of electromagnetic devices, and other areas [4,5]. The ML-based methods have been confirmed to be effective for solving real-world engineering problems [6][7][8]. Various supervised ML techniques (e.g., artificial neural network, support vector machine, classification and regression tree, linear (ridge) regression, and logistic regression) are typically used individually to construct single models and ensemble models [9,10]. To construct a series of models and identify the best one among these ML techniques, users need a comprehensive knowledge of ML and spend a significant effort building advanced models.
The primary objective of this research is to develop a user-friendly and powerful ML platform, called intelligent Machine Learner (iML), that helps its users solve real-world engineering problems with a shorter training time and greater accuracy than before. The iML can automatically build and scan all regression models, and then identify the best one. Novice users with no experience of ML can easily use this system. Briefly, the iML (1) helps users build prediction models easily; (2) provides an overview of the parameter settings so that objective choices can be made; and (3) yields clear performance indicators that make the results easy to read and understand, and on which decisions can be based.
Four experiments were carried out to evaluate the performance of iML, and the results were compared with those of previous studies. In the first experiment, empirical data concerning enterprise resource planning (ERP) for software projects by a leading Taiwan software provider over the last five years were collected and analyzed [1]. The datasets in the other three experiments were published on the UCI website [11][12][13]. Specifically, the purpose of the second experiment was to train a regression model for comparing the performance of CPU processors using selected characteristics as inputs. The third experiment involved forecasting demand in support of structured productivity and high levels of customer service, and the fourth experiment involved estimating the total number of bikes rented per day.
The rest of this paper is organized as follows. Section 2 reviews application of machine learning techniques in various disciplines. Section 3 presents the proposed methodology and iML framework. Section 4 introduces the evaluation metrics to measure accuracy of the developed system. Section 5 demonstrates iML's interface. Section 6 shows benchmarks between iML and WEKA (a free, open source program). Section 7 exhibits the applicability of iML in numerical experiments. Section 8 draws conclusions, and provides managerial implications and suggestions for future research.

Literature Review
Numerous researchers in various fields, such as ecology [14,15], materials properties [16][17][18], water resources [19], energy management [20], and decision support [21,22], use data-mining techniques to solve regression problems, and especially project-related problems [23,24]. Artificial neural network (ANN), support vector machine/regression (SVM/SVR), classification and regression tree (CART), linear ridge regression (LRR), and logistic regression (LgR) are the most commonly used methods for this purpose and are all considered to be among the best machine learning techniques [25][26][27]. Similarly, four popular ensemble models, including voting, bagging, stacking, and tiering [28][29][30], can be built based on the meta-combination rules of the aforementioned single models. Chou (2009) [31] developed a generalized linear model-based expert system for estimating the cost of transportation projects. Dandikas et al. (2018) [32] assessed the advantages and disadvantages of regression models for predicting biomethane potential. The results indicated that the regression method could predict variations in the methane yield and could be used to rank substrates for production quality. However, least squares-based regression usually leads to overfitting, failure to find unique solutions, and difficulties in dealing with multicollinearity among the predictors [33], so ridge regression, a type of regularized regression, is integrated in this study to avoid these problems. Additionally, Sentas and Angelis (2006) [34] investigated the possibility of using machine learning methods for estimating categorical missing values in software cost databases. They concluded that multinomial logistic regression was the best for imputation owing to its superior accuracy.
The general regression neural network was originally designed chiefly to solve regression problems [24,35]. Caputo and Pelagagge (2008) [36] compared the ANN with parametric methods for estimating the cost of manufacturing large, complex-shaped pressure vessels in engineer-to-order manufacturing systems. Their comparison demonstrated that the ANN was more effective than the parametric models, presumably because of its better mapping capabilities. Rocabruno-Valdés et al. (2015) [37] developed ANN-based models for predicting the density, dynamic viscosity, and cetane number of methyl esters and biodiesel. Similarly, Ganesan et al. (2015) [38] used ANN to predict the performance and exhaust emissions of a diesel electricity generator. SVM was originally developed by Vapnik (1999) for classification (SVM) and regression (SVR) [39,40]. Jing et al. (2018) [41] used SVM to classify air balancing, which is a key element of heating, ventilating, and air-conditioning (HVAC) and variable air volume (VAV) system installation, and is useful for improving energy efficiency by minimizing unnecessary fresh air supplied to the air-conditioned zones. The results demonstrated that SVM achieved a relative error of 4.6% and is a promising approach for air balancing. García-Floriano et al. (2018) [42] used SVR to model software maintenance (SM) effort prediction. The SVR model was superior to regression, neural networks, association rules, and decision trees at the 95% confidence level.
The classification and regression tree (CART) method, introduced by Breiman et al. (2017) [43], is an effective method for solving classification and regression problems [42]. Choi and Seo (2018) [44] predicted the fecal coliform in the North Han River, South Korea, using CART models; the test results showed that the total correct classification rates of the four models ranged from 83.7% to 93.0%. Ru et al. (2016) [45] used the CART model to predict cadmium enrichment levels in reclaimed coastal soils. The results showed that cadmium enrichment levels were predicted with an accuracy of 78.0%. Similarly, Li (2006) [16] used CART to predict materials properties and behavior. Chou et al. [26,46] utilized the CART method to model steel pitting risk and corrosion rate and to forecast project dispute resolutions. In addition to the aforementioned single models, Elish (2013) [47] used a voting ensemble for estimating software development effort. The ensemble model outperformed all the single models in terms of Mean Magnitude of Relative Error (MMRE), and achieved competitive results for the percentage of observations whose Magnitude of Relative Error (MRE) is less than 0.25 (PRED (25)) and for the recently proposed Evaluation Function (EF). Wang et al. (2018) demonstrated that the ensemble bagging tree (EBT) model could accurately predict hourly building energy usage with MAPE ranging from 2.97% to 4.63% [48]. Compared with the conventional single prediction model, EBT is superior in prediction accuracy and stability. However, it requires more computation time and lacks interpretability owing to its sophisticated model structure.
Chen et al. (2019) [49] showed that the stacking model outperformed the individual models, achieving the highest $R^2$ of 0.85, followed by XGBoost (0.84), AdaBoost (0.84), and random forest (0.82). For the estimation of hourly PM2.5 in China, the stacking model exhibited relatively high stability, with $R^2$ ranging from 0.79 to 0.92. Basant et al. (2016) [50] proposed a three-tier quantitative structure-activity relationship (QSAR) model. This model can be used for screening chemicals in future drug design and development and for safety assessment of chemicals. In comparison with previous studies of QSAR models on the same endpoint property, the proposed models showed encouraging statistical quality.
According to the reviewed literature, various machine learning platforms have been developed over the past decades, such as the Scikit-Learn Python libraries, Google's TensorFlow, WEKA, and Microsoft Research's CNTK. Users can readily apply a machine learning tool and/or framework to solve numerous problems as per their needs [51]. ML-based approaches have been confirmed to be effective in providing decisive information. Since no single model is best suited to all problems (the "No Free Lunch" theorem [52,53]), a comprehensive comparison of single and ensemble models embedded within an efficient forecasting platform for solving real-world engineering problems is urgently needed. The iML platform proposed in this study can efficiently address this issue.

Classification and Regression Model
Neural networks (or artificial neural networks) comprise information-processing units, which are similar to the neurons in the human brain, except that a neural network is composed of artificial neurons (Figure 1) [54]. In particular, back-propagation neural networks (BPNNs) are widely used, and are known to be the most effective network models [55,56]. Equation (1) uses the sigmoid function to activate each neuron in a hidden output layer, and the Scaled Conjugate Gradient algorithm is used to calculate the weights of the network. BPNNs are trained until the stopping criteria are reached under the default settings in MATLAB.

$$net_k = \sum_{j} w_{kj} O_j, \qquad y_k = f(net_k) = \frac{1}{1 + e^{-net_k}} \tag{1}$$

where $net_k$ is the activation of the $k$th neuron; $j$ is the set of neurons in the preceding layer; $w_{kj}$ is the weight of the connection between neuron $k$ and neuron $j$; $O_j$ is the output of neuron $j$; and $y_k$ is the sigmoid or logistic transfer function.
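As a minimal sketch of Equation (1), the forward pass of one layer can be written in plain Python. The function names and the example weights are hypothetical and only illustrate the weighted sum and sigmoid activation; training with the Scaled Conjugate Gradient algorithm, as done in MATLAB for iML, is not shown.

```python
import math


def sigmoid(net):
    """Logistic transfer function y = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + math.exp(-net))


def layer_forward(weights, inputs):
    """Activate one layer of neurons.

    weights[k][j] plays the role of w_kj and inputs[j] the role of O_j,
    the outputs of the preceding layer.
    """
    outputs = []
    for w_k in weights:
        # net_k = sum_j w_kj * O_j
        net_k = sum(w_kj * o_j for w_kj, o_j in zip(w_k, inputs))
        outputs.append(sigmoid(net_k))
    return outputs


# Hypothetical weights for a layer with two neurons and two inputs.
example_weights = [[0.5, -0.5], [1.0, 1.0]]
example_outputs = layer_forward(example_weights, [1.0, 1.0])
```

The first neuron receives a net input of 0 and therefore outputs 0.5, the midpoint of the sigmoid; the second receives a net input of 2 and outputs a value closer to 1.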

Support Vector Machine (SVM) and Support Vector Regression (SVR)
Developed by Cortes and Vapnik (1995) [57], SVM is used for binary classification problems. The SVM is based on decision hyper-planes that determine decision boundaries in an input space or a high-dimensional feature space [40,58]. Binary classification can only assign samples to a negative or a positive class, whereas multi-class classification problems are more complex (Figure 2). In this study, One Against All (OAA) is used to solve multi-class classification problems.

The OAA-SVM constructs $m$ SVM models for an $m$-class classification problem; the $i$th SVM model is trained with the examples of the $i$th class as the positive class and all remaining examples as the negative class. Given a training set of $l$ data points $\{(x_j, y_j)\}_{j=1}^{l}$, where $x_j \in R^n$ is the input data and $y_j \in \{1, 2, \ldots, m\}$ is the class label of $x_j$, the $i$th SVM model is obtained by solving the following optimization problem [59]:

$$\min_{w_i, b_i, \xi^i} \; \frac{1}{2} w_i^T w_i + C \sum_{j=1}^{l} \xi_j^i$$

subject to:

$$w_i^T \phi(x_j) + b_i \ge 1 - \xi_j^i \quad \text{if } y_j = i,$$
$$w_i^T \phi(x_j) + b_i \le -1 + \xi_j^i \quad \text{if } y_j \ne i,$$
$$\xi_j^i \ge 0, \quad j = 1, \ldots, l.$$

When the SVM models have been solved, the class label of example $x$ is predicted as follows:

$$y(x) = \arg\max_{i = 1, \ldots, m} \left( w_i^T \phi(x) + b_i \right)$$

where $i$ denotes the $i$th SVM model; $w_i$ is a vector normal to the hyper-plane; $b_i$ is a bias; $\phi(x)$ is a nonlinear function that maps $x$ to a high-dimensional feature space; $\xi_j^i$ is the misclassification error; and $C \ge 0$ is a constant that specifies the trade-off between the classification margin and the cost of misclassification.

To train the SVM model, the radial basis function (RBF) kernel maps samples non-linearly into a feature space with more dimensions. In this study, the RBF kernel is used as the SVM kernel function:

$$K(x_i, x) = \exp\left( -\frac{\lVert x_i - x \rVert^2}{2\sigma^2} \right)$$

where $\sigma$ is a positive parameter that controls the radius of the RBF kernel function.

Support vector regression (SVR) [40] is one version of SVM. SVR computes a linear regression function in the higher-dimensional feature space using the $\varepsilon$-insensitive loss while simultaneously reducing model complexity by minimizing $\lVert w \rVert^2$. This process can be implemented by introducing non-negative slack variables $\xi_i, \xi_i^*$ to measure the deviation of training samples outside the $\varepsilon$-insensitive zone. The SVR can be formulated as the following minimization problem:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{l} \left( \xi_i + \xi_i^* \right)$$

subject to:

$$y_i - \left( w^T \phi(x_i) + b \right) \le \varepsilon + \xi_i^*,$$
$$\left( w^T \phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i,$$
$$\xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, l.$$

When the SVR model has been solved, the value of example $x$ is predicted as follows:

$$f(x) = \sum_{i=1}^{l} \left( \alpha_i^* - \alpha_i \right) K(x_i, x) + b$$

where $K(x_i, x)$ is the kernel function and $\alpha_i^*, \alpha_i$ are Lagrange multipliers in the dual function.
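The prediction step of SVR and the one-against-all decision rule can be illustrated with a short sketch. It assumes the RBF kernel takes the common form exp(-||x_i - x||^2 / (2σ^2)) and that the dual coefficients (α_i* - α_i), the bias b, and the per-class decision values have already been obtained by a solver; all function names are hypothetical, and the optimization itself is not shown.

```python
import math


def rbf_kernel(x_i, x, sigma):
    """RBF kernel K(x_i, x) = exp(-||x_i - x||^2 / (2 * sigma^2)) (assumed form)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def svr_predict(support_vectors, dual_coefs, bias, x, sigma):
    """Evaluate f(x) = sum_i (alpha_i^* - alpha_i) K(x_i, x) + b.

    dual_coefs[i] holds the already-solved difference alpha_i^* - alpha_i.
    """
    return sum(
        coef * rbf_kernel(sv, x, sigma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias


def oaa_predict(decision_values):
    """Return the 1-based class whose SVM model gives the largest decision value.

    decision_values[i - 1] stands for w_i^T phi(x) + b_i of the i-th model.
    """
    best = max(range(len(decision_values)), key=lambda i: decision_values[i])
    return best + 1
```

For example, `oaa_predict([-0.2, 1.3, 0.4])` selects class 2, because the second model returns the largest decision value.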

Classification and Regression Tree (CART)
The classification and regression tree technique is described as a tree in which each internal (non-leaf) node represents a test of an attribute, each branch represents a test result, and each leaf (or terminal) node has a class label and class result (Figure 3) [60]. The tree is "trimmed" until the total error is minimized, which optimizes the predictive accuracy of the tree while minimizing the number of branches. The CART is trained using the Gini index. The formulas are as follows:

$$g(t) = \sum_{j \ne i} p(j \mid t)\, p(i \mid t)$$
$$p(j \mid t) = \frac{p(j, t)}{p(t)}$$
$$p(j, t) = \frac{p(j)\, N_j(t)}{N_j}$$
$$p(t) = \sum_{j} p(j, t)$$

where $i$ and $j$ are the categorical variables in each item; $N_j(t)$ is the recorded number of nodes $t$ in category $j$; $N_j$ is the recorded number of the root nodes in category $j$; and $p(j)$ is the prior probability value for category $j$.
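To make the Gini index concrete, the following sketch evaluates g(t) for one node from the quantities named above. It assumes the usual CART definitions p(j, t) = p(j) N_j(t) / N_j, p(t) = Σ_j p(j, t), and p(j | t) = p(j, t) / p(t); the function name and the example counts are hypothetical, and the node is assumed to be non-empty.

```python
def gini_index(node_counts, root_counts, priors):
    """Gini index g(t) of a node t.

    node_counts[j] = N_j(t), root_counts[j] = N_j, priors[j] = p(j).
    """
    # p(j, t) = p(j) * N_j(t) / N_j
    joint = [p * n_t / n for p, n_t, n in zip(priors, node_counts, root_counts)]
    # p(t) = sum_j p(j, t)
    p_t = sum(joint)
    # p(j | t) = p(j, t) / p(t)
    conditional = [p_jt / p_t for p_jt in joint]
    # Summing p(j|t) * p(i|t) over all ordered pairs with i != j
    # is the same as 1 - sum_j p(j|t)^2.
    return 1.0 - sum(p * p for p in conditional)


# Hypothetical node holding 10 of 50 cases of class 1 and 30 of 50 of class 2.
example_gini = gini_index([10, 30], [50, 50], [0.5, 0.5])
```

A node that contains cases of a single category has a Gini index of 0, so a split is preferred when it produces child nodes with lower values than the parent.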

Linear Ridge Regression (LRR) and Logistic Regression (LgR)
Statistical models of the relationship between dependent variables (response variables) and independent variables (explanatory variables) are developed using linear regression (Figure 4). The general formula for multiple regression models is as follows:

$$y = \beta_0 + \sum_{j=1}^{n} \beta_j x_j + \varepsilon$$

where $y$ is a dependent variable; $\beta_0$ is a constant; $\beta_j$ is a regression coefficient ($j = 1, 2, \ldots, n$); $x_j$ is an independent variable; and $\varepsilon$ is an error term.

Linear ridge regression (LRR) is a regularization technique that can be used together with generic regression algorithms to model highly correlated data [61,62]. The least squares method is a powerful technique for training the LRR model, which chooses $\beta$ to minimize the residual sum of squares (RSS) function. Therefore, the cost function is presented as below:

$$J_{LRR}(\beta) = \sum_{i} \left( y_i - y'_i \right)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$

where $\lambda$ is a pre-chosen constant that multiplies the penalty term, i.e., the squared norm of the $\beta$ vector of the regression method, and $y'$ is the predicted value.

Statistician David Cox developed logistic regression in 1958 [63]. An explanation of logistic regression begins with an explanation of the standard logistic function. Equation (17) mathematically represents the logistic regression model:

$$p(x) = \frac{1}{1 + e^{-\left( \beta_0 + \sum_{j=1}^{n} \beta_j x_j \right)}} \tag{17}$$

where $p(x)$ is the probability that the dependent variable equals a "success" or "case" rather than a failure or non-case. $\beta_0$ and $\beta_j$ are found by minimizing the cost function defined in Equation (18):

$$J_{LgR}(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p(x_i) + (1 - y_i) \log\left( 1 - p(x_i) \right) \right] \tag{18}$$

where $m$ is the number of cases and $y_i$ is the observed outcome of case $x_i$, having 0 or 1 as possible values [64].
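The two cost functions described in this subsection can be sketched as follows. The sketch assumes that the ridge penalty applies to the coefficients β_j but not to the constant β_0, and that the logistic cost is the mean negative log-likelihood; both are common conventions rather than details confirmed by the text, and the function names are hypothetical.

```python
import math


def linear_predict(beta0, beta, x):
    """Linear predictor beta_0 + sum_j beta_j * x_j."""
    return beta0 + sum(b * v for b, v in zip(beta, x))


def ridge_cost(beta0, beta, X, y, lam):
    """Residual sum of squares plus lam * squared norm of beta.

    Assumption: the constant beta0 is left out of the penalty.
    """
    rss = sum((y_i - linear_predict(beta0, beta, x_i)) ** 2 for x_i, y_i in zip(X, y))
    return rss + lam * sum(b * b for b in beta)


def logistic_probability(beta0, beta, x):
    """p(x) = 1 / (1 + exp(-(beta_0 + sum_j beta_j * x_j)))."""
    return 1.0 / (1.0 + math.exp(-linear_predict(beta0, beta, x)))


def logistic_cost(beta0, beta, X, y):
    """Mean negative log-likelihood for outcomes y_i in {0, 1} (assumed form)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = logistic_probability(beta0, beta, x_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return -total / len(y)
```

With a perfect linear fit, `ridge_cost` reduces to the penalty alone, which shows how a larger λ pushes the minimizer toward smaller coefficients even when the residuals are zero.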

Ensemble Regression Model
In this study, several ensemble schemes, including voting, bagging, stacking, and tiering were investigated using the input data and described as below.

• Voting: The voting ensemble model combines the outputs of the single models using a meta-rule. The mean of the output values is used in this study. According to the adopted ML models, 11 voting models are trained in this study.
• Tiering: Figure 5d illustrates the tiering ensemble model. There are two tiers inside a tiering ensemble model in this study. The first tier classifies the data into k classes on the basis of the T value [18]. The machine learning technique used in the first tier for classifying data needs to be identified. After the data are classified, a regression machine learning technique is used to train on the data (Sub Data) of each class (second tier) to predict results. In the iML, we developed three types of models, including 2-class, 3-class, and 4-class.

The equation for calculating the T value is:

$$T = \frac{y_{max} - y_{min}}{k}$$

where $T$ is the standard value, $k$ is the number of classes, and $y_{max}$ and $y_{min}$ are the maximum and minimum of the actual values, respectively.
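A brief sketch may help to picture the voting meta-rule and the first tier of the tiering model. It assumes that T is the width of k equal intervals between the minimum and maximum actual values, and that an observation is assigned to the interval its actual value falls in; this reading of the T value, the function names, and the sample values are assumptions for illustration only.

```python
def vote_mean(predictions):
    """Voting meta-rule used in this study: the mean of the single-model outputs."""
    return sum(predictions) / len(predictions)


def tier_width(actual_values, k):
    """Assumed T value: (y_max - y_min) / k."""
    return (max(actual_values) - min(actual_values)) / k


def tier_label(y, y_min, width, k):
    """Assign a 1-based class label using equal-width intervals (an assumption)."""
    if width == 0:
        return 1  # all actual values are identical, so only one class exists
    index = int((y - y_min) // width)
    # The maximum value would fall just past the last interval; clamp it.
    return min(index, k - 1) + 1


# Hypothetical actual values split into k = 4 classes.
example_values = [10.0, 20.0, 35.0, 50.0]
example_width = tier_width(example_values, 4)
example_labels = [tier_label(y, 10.0, example_width, 4) for y in example_values]
```

In a complete tiering model, a classifier would be trained to predict these labels from the inputs, and one regression model would then be trained on the data of each class.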

K-Fold Cross Validation
K-fold cross-validation is used to compare two or more prediction models. This method randomly splits a sample into K subsets; K-1 subsets are used to train the model while the remaining subset is used to test it, and this process is repeated K times (Figure 6). To compare models, the average of the performance results (e.g., RMSE and MAPE) is computed. Kohavi (1995) stated that K = 10 provides analytical validity, computational efficiency, and optimal deviation [65]. Thus, K = 10 is used in this study. Performance metrics are explained in detail in Section 4.
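The K-fold procedure can be sketched in a few lines of Python. The helper names, the fixed random seed, and the `fit_and_score` callback are hypothetical; the callback stands for any routine that trains a model on the training indices and returns one performance value (such as RMSE) on the test indices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the sample
    folds = [indices[i::k] for i in range(k)]  # k subsets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [
        fit_and_score(train, test)
        for train, test in k_fold_indices(n_samples, k, seed)
    ]
    return sum(scores) / len(scores)
```

Each observation appears in exactly one test fold, so every data point is used for testing once and for training K-1 times.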

Intelligent Machine Learner Framework
Figure 7 presents the structure of iML. In stage 1 (data preprocessing), the data is classified distinctly for particular use in the tiering ensemble model. Meanwhile, all data is divided into two main data groups, namely, learning data and test data, and the learning data is duplicated for training ensemble models.

At the next stage, all retrieved data is automatically used for training models, which include single models (ANN, SVR, LRR, and CART) and ensemble models (voting, bagging, stacking, and tiering). Notably, the tiering ensemble model needs to employ a classification technique to assign a class label to the original input at the first tier. A corresponding regression model for the particular class is then adopted at the second tier to obtain the predictive value [17,26].

Finally, in stage 3 (find the best model), the predictive performances of all the models learned (trained) in stage 2 are compared using the test dataset to identify the best models. Section 4 describes the performance evaluation metrics in detail.

Mathematical Formulas for Performance Measures
To measure the performance of classification models, the accuracy, precision, sensitivity, specificity, and area under the curve (AUC) are calculated. For the regression models, five performance measures (i.e., correlation coefficient (R), mean absolute error (MAE), mean absolute percentage error (MAPE), root mean squared error (RMSE), and total error rate (TER)) are calculated. Table 1 presents a confusion matrix and Table 2 exhibits those performance measures [17,66]. In Table 2, MAE is the mean absolute difference between the prediction and the actual value. MAPE represents the mean percentage error between the prediction and the actual value; the smaller the MAPE, the better the prediction result achieved by the model. MAPE is the index typically used to evaluate the accuracy of prediction models. RMSE represents the dispersion of errors of a prediction model. R is the statistical index that shows the linear correlation between two variables. Lastly, TER is the total difference between the predicted and actual values [17].

The goal is to identify the model that yields the lowest error on the test data. To obtain a comprehensive performance measure, the five statistical measures (RMSE, MAE, MAPE, 1-R, and TER) were combined into a synthesis index (SI) using Equation (20). Based on the SI values, the best model is identified.

$$SI = \frac{1}{m_p} \sum_{i=1}^{m_p} \frac{P_i - P_{min,i}}{P_{max,i} - P_{min,i}} \tag{20}$$

where $m_p$ is the number of performance measures; $P_i$ is the $i$th performance measure; and $P_{min,i}$ and $P_{max,i}$ are the minimum and maximum of the $i$th measure. The SI ranges from 0 to 1; an SI value close to 0 indicates better accuracy of the predictive model.
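The synthesis index can be illustrated with a small sketch that ranks several candidate models. It uses the standard definitions of MAE, MAPE (in percent), and RMSE, and it assumes that every measure passed to the SI is of the lower-is-better kind, as with 1-R. The function names, the model names, and the numbers are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def synthesis_index(measures):
    """Compute SI for each model.

    measures maps a model name to its list of m_p measures P_i, all of
    which are assumed to improve as they decrease.
    """
    names = list(measures)
    m_p = len(measures[names[0]])
    si = {name: 0.0 for name in names}
    for i in range(m_p):
        column = [measures[name][i] for name in names]
        p_min, p_max = min(column), max(column)
        for name in names:
            if p_max > p_min:  # a measure that is equal for all models adds 0
                si[name] += (measures[name][i] - p_min) / (p_max - p_min)
    return {name: value / m_p for name, value in si.items()}


# Hypothetical values of two measures for three candidate models.
example_si = synthesis_index({"A": [1.0, 10.0], "B": [2.0, 30.0], "C": [3.0, 20.0]})
best_model = min(example_si, key=example_si.get)
```

Here model A has the smallest value of both measures, so its SI is 0 and it is selected as the best model.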

Design and Implementation of iML Interface
The iML was developed in MATLAB R2016a on a PC with an Intel Core i5-750 CPU, a clock speed of 3.4 GHz, and 8 GB of RAM, running Windows 10. Figure 8 presents the user-friendly interface of iML. First, users select models on the parameter-setting board and set the parameters for the chosen models, which will be trained and analyzed. Next, users choose whether to test with "K-Fold Validation" or "Percentage Split" before uploading the data. Notably, if "Percentage Split" is selected, the user only has to input the percentage of learning data. Then, users click on the "Run" button to train the models. Finally, the "Make Report" function creates a report containing the performance metrics of all selected models and the identified best model. Figure 9 displays a snapshot of the report file in Notepad.

[Figure 8. iML interface. Panels: setting parameters, including single models, classification, and ensemble models; visualization of data for evaluation and prediction; purpose of user (evaluation or prediction); performance or prediction results.]

Table 3 shows the publicly available datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/mL/index.php; accessed 1 March 2021). The iML is benchmarked with WEKA (a free, open source program) using hold-out validation and K-fold cross-validation on the target datasets. All algorithm parameters are set to their defaults for both the iML and WEKA platforms.

Hold-Out Validation
In this test, the datasets are randomly partitioned into 80% for learning and 20% for testing. Tables 4-8 show the one-time performance results on these five datasets. A model with a normalized SI value of 0.000 is the best prediction model among all the models tested by iML and WEKA. Notably, the best model can be automatically identified by iML with "one-click", whereas users of WEKA need to build each model individually. Moreover, iML gives better test results for single, voting, and bagging models than WEKA does. Based on the benchmark results, iML is effective at finding the best model in hold-out validation.

Note: (*) is (ANN + SVR + CART + LRR); bold value denotes the best overall performance.

K-Fold Cross-Validation
Tenfold cross-validation is used to evaluate the generalized performance of WEKA and iML. Tables 9-13 show the average performance measures for the five datasets, respectively. Similarly, iML identifies better models in the single, voting, and bagging schemes than those trained by WEKA. The best model for each dataset is automatically determined by iML. Therefore, iML is a powerful tool for finding the best model under K-fold cross-validation.

Note: (*) is (ANN + SVR + CART + LRR); bold value denotes the best overall performance.

Discussion
Single, voting, bagging, and stacking models are compared using WEKA and iML; the tiering method is excluded because it is not available in WEKA. Additionally, unlike the manual construction of individual models in the WEKA interface, iML can automatically build models and identify the best one for the imported datasets. Hold-out validation and tenfold cross-validation are used to evaluate the performance results (R, MAE, RMSE, and MAPE) in each scheme (single, voting, bagging, and stacking). The analytical results of both validations show that most of the models trained by iML are superior to those trained by WEKA using the same datasets. Hence, iML is an effective platform for solving regression problems.

Numerical Experiments
This section validates iML by using various industrial datasets, including (1) enterprise resource planning data [1], (2) CPU performance data [12], (3) customer data for a logistics company [13], and (4) daily bike rental data [11]. Table 14 presents the initial parameter settings for these problems.

Enterprise Resource Planning Software Development Effort
Enterprise Resource Planning (ERP) data for 182 software projects of a leading Taiwan software provider over the last five years were collected, analyzed, and tested with K-fold cross-validation.

Variable Selection
Experienced in-house project managers were interviewed to identify factors that affect the ERP software development effort (SDE). There are 182 samples and 17 attributes, and Table 15 summarizes the descriptive statistics in detail. The input and output attributes are defined by Chou et al. (2012) [1].

iML Results
iML automatically trains the models and calculates the performance values. Then, it compares the SI values (SI_local and SI_global) among the selected modeling types (single, voting ensemble, bagging ensemble, stacking ensemble, and tiering ensemble). Table 16 presents the detailed results of iML, and Figure 10 plots the RMSE of the best models for the studied case. Both the SI_local and SI_global values of the bagging ANN ensemble are equal to zero, which indicates that the bagging ANN ensemble is the best model in terms of prediction accuracy. Three models (single, voting, and bagging) provided better results in terms of R (0.94 to 0.99) than the tiering and stacking ensemble models, which had R values of 0.58 to 0.95. Among these three models, the bagging model exhibited the best balance of MAPE between the learning and test data (21.45% and 19.50%, respectively). The single and voting models showed unbalanced MAPEs for the learning and test data (19.91% and 30.65% for the single model; 16.83% and 33.90% for the voting model). Thus, the bagging model was the best model for predicting the ERP software development effort.
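The ranking by SI can be illustrated with a rough sketch. The SI definition is given earlier in the paper and is not reproduced in this section, so the code below rests on an assumption: that the index averages min-max normalized error measures across the candidate models, so that a model that is best on every measure scores 0. The function name `normalized_index`, the choice of measures, and all numeric values are hypothetical; a measure for which larger is better (such as R) would first need to be converted to a lower-is-better form.

```python
def normalized_index(models):
    """Average min-max normalized measures per model (lower is better).

    models maps a model name to a dict of measures, where a smaller
    value means a better model for every measure (an assumption).
    """
    measures = sorted(next(iter(models.values())))
    scores = {}
    for name, values in models.items():
        total = 0.0
        for m in measures:
            column = [v[m] for v in models.values()]
            lo, hi = min(column), max(column)
            # If all models tie on a measure, it contributes nothing.
            total += 0.0 if hi == lo else (values[m] - lo) / (hi - lo)
        scores[name] = total / len(measures)
    return scores


# Hypothetical measures for three candidate models.
candidates = {
    "single": {"MAE": 3.0, "RMSE": 4.0, "MAPE": 12.0},
    "voting": {"MAE": 2.5, "RMSE": 3.5, "MAPE": 11.0},
    "bagging": {"MAE": 2.0, "RMSE": 3.0, "MAPE": 10.0},
}
scores = normalized_index(candidates)
best = min(scores, key=scores.get)
print(best, scores)
```

Under this assumed definition, an index of zero means that no other candidate beats the model on any of the included measures, which matches the way a zero SI value is interpreted in the text.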
The first experiment indicates that iML not only identifies the best model but also reports the performance values of all the trained models. For comparison, Chou et al. (2012) [1] obtained training and test RMSE values of 234.0157 h and 97.2667 h, respectively. iML yields the bagging ensemble model with MAPEs of 21.45% and 19.50% and RMSEs of 70.28 h and 65.58 h for the same training and test data, respectively. As a result, iML is effective at finding the best model among the popular regression models.
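The difference between two RMSE values can also be expressed as a relative reduction. The sketch below applies that arithmetic to the ERP figures reported in this paper (70.28 h and 65.58 h for the iML bagging ensemble; 234.0157 h and 97.2667 h for the earlier study [1]). The helper name `relative_reduction` is hypothetical, and the percentages it prints are derived here for illustration rather than reported by the authors.

```python
def relative_reduction(reference, new):
    """Fractional reduction of an error measure relative to a reference value."""
    return (reference - new) / reference


# RMSE values (hours) quoted in this paper for the ERP case.
learn_cut = relative_reduction(234.0157, 70.28)
test_cut = relative_reduction(97.2667, 65.58)
print(f"learning RMSE reduction: {learn_cut:.1%}")
print(f"test RMSE reduction: {test_cut:.1%}")
```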

Experiments on Industrial Datasets
Three additional experiments were performed to evaluate iML. To ensure a fair comparison, 70% of the data was used for learning and the remaining 30% was used for testing.

Performance of CPU Processors
This experiment concerns the performance of CPU processors. The data were taken from Maurya and Gupta (2015) [12]. The dataset contained 209 samples with a total of 6 attributes (Table 17). The attributes are as follows: X1: Machine cycle time in nanoseconds (integer, input); X2: Minimum main memory in kilobytes (integer, input); X3: Maximum main memory in kilobytes (integer, input); X4: Cache memory in kilobytes (integer, input); X5: Minimum channels in units (integer, input); X6: Maximum channels in units (integer, input); and Y: Estimated relative performance (integer, output).

Daily Demand Forecasting Orders
This experiment concerns daily demand forecasting of orders. The data were taken from Ferreira et al. (2016) [13]. Table 18 shows a statistical analysis of the data. There were 60 samples with 12 attributes, including X1: Week of the month (first, second, third, or fourth week of the month, input); X2: Day of the week (Monday to Friday, input); X3: Urgent orders (integer, input); X4: Non-urgent orders (integer, input); X5: Type A orders (integer, input); X6: Type B orders (integer, input); X7: Type C orders (integer, input); X8: Orders from the tax sector (integer, input); X9: Orders from the traffic controller sector (integer, input); X10: Orders from the banking sector 1 (integer, input); X11: Orders from the banking sector 2 (integer, input); X12: Orders from the banking sector 3 (integer, input); and Y: Total orders (integer, output).

Total Hourly-Shared Bike Rental per Day
This experiment concerns the total number of hourly-shared bike rentals per day. The data were adopted from Fanaee-T and Gama (2014) [11] and are statistically analyzed in Table 19. In total, there were 731 samples and 11 attributes, defined as follows: X1: Season (1: spring, 2: summer, 3: fall, 4: winter, input); X2: Month (1 to 12, input); X3: Year (0: 2011, 1: 2012, input); X4: Whether the day is a holiday or not (input); X5: Day of the week (input); X6: Working day (1 if the day is neither a weekend nor a holiday, otherwise 0, input); X7: Weather condition (1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog, input); X8: Normalized temperature in Celsius, with values divided by 41 (max) (input); X9: Normalized feeling temperature in Celsius, with values divided by 50 (max) (input); X10: Normalized humidity, with values divided by 100 (max) (input); X11: Normalized wind speed, with values divided by 67 (max) (input); and Y: Count of total rental bikes, including both casual and registered users (output). In this study, to calculate MAPE, the output was normalized and 0.1 was added to prevent a zero value.
$$y_i^{\mathrm{norm}} = \frac{y_i - y_{\min}}{y_{\max} - y_{\min}} + 0.1$$
where $y_i$, $y_{\min}$, and $y_{\max}$ are the actual value and the minimum and maximum of the actual values, respectively.

Performance Results
Table 20 presents the performance results of all models for the three additional datasets. Using the same dataset in experiment No. 2, Maurya and Gupta (2015) [12] trained ANN models with maximum R-learn and R-test values of 0.98146 and 0.98662, respectively. Meanwhile, iML identifies the single ANN model as the best model, with R-learn and R-test values of 0.99990 and 0.99629, respectively. iML thus yields a slightly better model than the previous research in this numerical experiment.
In experiment No. 3, which uses the data of Ferreira et al. (2016) [13], the stacking ANN ensemble also performs well, with MAPEs of 0.026% and 0.010% for the learning and test data, respectively.
Finally, in experiment No. 4, iML achieves R-learn and R-test values of 0.97660 and 0.94790, with bagging ANN as the best model. In contrast, Fanaee-T and Gama (2014) obtained a maximum R value of 0.91990 [11].
As shown in the numerical experiments above, the best models trained and identified by iML match or outperform those reported in the previous studies.
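The MAPE adjustment used for the bike rental data (min-max normalization of the output plus an offset of 0.1) can be sketched as follows. The function names are hypothetical, and one detail is assumed rather than stated in the paper: predictions are rescaled with the same minimum and maximum as the actual values before the percentage error is taken.

```python
def normalize_output(value, y_min, y_max, offset=0.1):
    """Min-max normalize a value and shift it by `offset` to avoid zeros."""
    return (value - y_min) / (y_max - y_min) + offset


def adjusted_mape(actual, predicted, offset=0.1):
    """MAPE (%) computed on normalized, offset outputs.

    Assumption: predictions are rescaled with the minimum and maximum
    of the actual values.
    """
    y_min, y_max = min(actual), max(actual)
    total = 0.0
    for a, p in zip(actual, predicted):
        a_n = normalize_output(a, y_min, y_max, offset)
        p_n = normalize_output(p, y_min, y_max, offset)
        total += abs((a_n - p_n) / a_n)
    return 100.0 * total / len(actual)


# Hypothetical values; the zero count would break an unadjusted MAPE.
result = adjusted_mape([0, 50, 100], [10, 50, 90])
print(round(result, 2))
```

With the offset in place, the smallest actual value maps to 0.1 instead of 0, so the division in the percentage error is always defined.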

Conclusions and Future Work
This study develops an iML platform to efficiently operate data-mining techniques. iML is designed to be user-friendly, so users can obtain results with only "one click". The numerical experiments demonstrate that iML is a powerful soft computing tool for identifying the best prediction model by automating the comparison among diverse machine learning techniques.
To benchmark the effectiveness of iML against WEKA, five datasets collected from the UCI Machine Learning Repository were analyzed via hold-out validation and tenfold cross-validation. The performance results indicate that iML can find a more accurate model than WEKA on the publicly available datasets. The best prediction model identified by iML is also the best model among all the models trained by iML and WEKA. Notably, iML requires minimal effort from users to build single, voting, bagging, and stacking models in comparison with WEKA.
Four industrial experiments were carried out to validate the performance of iML. The first experiment involved training a model to predict ERP development effort, in which iML yielded RMSEs of 70.28 h for the learning data and 65.58 h for the testing data by using the bagging ANN ensemble (best model). In contrast, Chou et al. (2012) [1] obtained training and testing RMSE values of 234.0157 h and 97.2667 h, respectively.
In the second experiment on the performance of CPU processors, iML yielded 0.99990 for R-learning and 0.99629 for R-testing, which are better than those reported in Maurya and Gupta (2015) [12], and confirmed that the single ANN was the best model. In the third experiment on daily demand forecasting orders, iML achieved MAPE values of 0.026% (learning) and 0.010% (testing). These results are as good as those obtained in Ferreira et al. (2016) [13]. In the fourth experiment on total hourly-shared bike rental, R-learning and R-testing values of 0.97660 and 0.94790 were reached using iML. The test performance was 6% better than that obtained by Fanaee-T and Gama (2014) [11]. In addition to the enhanced prediction performance, iML can determine the best models on the basis of multiple evaluation metrics.
In conclusion, iML is a powerful and promising prediction platform for solving diverse engineering problems. Since the iML platform currently handles only regression problems, future research should extend it to complex classification and time series problems by automatically presenting alternative models for practical use in engineering applications, and should add other advanced ML methods (such as deep learning models). Moreover, metaheuristic optimization algorithms could be integrated with iML to help users fine-tune the hyperparameters of the chosen machine learning models.

Data Availability Statement:
The data that support the findings of this study are available from the UCI Machine Learning Repository or corresponding author upon reasonable request.