Machine Learning-Based Predictive Modelling of Biodiesel Production—A Comparative Perspective

Abstract: Owing to the ever-growing impetus towards the development of eco-friendly and low carbon footprint energy solutions, biodiesel production and usage have been the subject of tremendous research efforts. The biodiesel production process is driven by several process parameters, which must be maintained at optimum levels to ensure high productivity. Since biodiesel productivity and quality also depend on the various raw materials involved in transesterification, physical experiments are necessary to make any estimation regarding them. However, a brute force approach of carrying out physical experiments until the optimal process parameters have been found will not succeed, owing to the large number of process parameters and the underlying non-linear relation between the process parameters and the responses. In this regard, a machine learning-based prediction approach is used in this paper to quantify the response features of the biodiesel production process as a function of the process parameters. Three machine learning algorithms (linear regression, random forest regression and AdaBoost regression) are comprehensively studied in this work. Furthermore, two separate examples are illustrated: one involving biodiesel yield, the other involving biodiesel free fatty acid conversion percentage. It is seen that both random forest regression and AdaBoost regression can achieve high accuracy in the predictive modelling of biodiesel yield and free fatty acid conversion percentage. However, AdaBoost may be a more suitable approach for biodiesel production modelling, as it achieves the best accuracy amongst the tested algorithms. Moreover, AdaBoost can be deployed more quickly, as it was seen to be insensitive to the number of regressors used.


Introduction
Biodiesel is a nontoxic and biodegradable fuel that is free of aromatics and sulfur. It is produced by the transesterification reaction between waste oil and alcohols such as methanol or ethanol [1,2]. In recent years, the importance of biodiesel has increased owing to the decline of the world's petroleum reserves and growing concern over global warming. It has been observed that the use of petroleum is increasing exponentially each year, which could lead to the complete depletion of its reserves by 2042 [3]. This alarming situation is motivating researchers to work actively on substitutes for petroleum products. Biodiesel can be considered an alternative to the fossil fuels used in diesel engines, without any major modification. Nowadays, researchers are showing interest in the use of vegetable oil to produce biodiesel, which is [...]

Linear Regression
The general formula for the linear regression is:
$$y = a + bx \tag{1}$$
where a and b represent the y-intercept and the slope of the fitting line of the linear regression, respectively. In case the best fit line passes through the origin, the equation becomes
$$y = bx \tag{2}$$
The generalized linear regression model for any response (y) and its predictors (x_1, x_2, ..., x_n), i.e., the biodiesel production process parameters in this case, can be stated as:
$$y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \tag{3}$$
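As an illustration of the generalized linear regression model, the following minimal sketch estimates the intercept and slopes by ordinary least squares with NumPy. The data are synthetic and hypothetical (they are not the values of Table A1), and the helper name `fit_linear` is introduced here purely for illustration.

```python
import numpy as np

# Hypothetical illustration data (NOT the experimental values of Table A1):
# two predictors and a response generated from a known linear rule plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(30, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.1, size=30)


def fit_linear(X, y, intercept=True):
    """Least-squares fit of y = a + b1*x1 + ... + bn*xn.

    With intercept=False the fitted line (or plane) is forced through the origin.
    Returns the coefficient vector, intercept first when it is present.
    """
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(X)), X]) if intercept else X
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


coef = fit_linear(X, y)
print("intercept and slopes:", np.round(coef, 2))
```

Calling `fit_linear(X, y, intercept=False)` gives the special case in which the best fit passes through the origin.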

Random Forest Regression
Random forest regression is an ensemble technique based on the concepts of bagging and random subspace methods. Bagging, or bootstrap aggregation, is responsible for creating an ensemble of learner trees. These learner trees are trained on separate and independent bootstrap samples drawn from the original training dataset. From an original training dataset D with N examples, a bootstrap sample (D_b) is constituted by randomly drawing n examples from D with replacement. Once duplicate examples are discounted, D_b contains approximately two-thirds of the examples in D. K independent regression trees are created from the bootstrap samples with input vector x. These regression trees generally have low bias and high variance. The random forest ensemble prediction is then obtained by calculating the mean of the predictions of the K regression trees, h_k(x) [17]:
$$\hat{y}(x) = \frac{1}{K}\sum_{k=1}^{K} h_k(x) \tag{4}$$
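The two-thirds figure quoted above can be checked numerically. The sketch below assumes that the bootstrap sample has the same size as the original dataset; the dataset size of 10,000 is a hypothetical choice made only so that the fractions are stable.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000  # hypothetical dataset size, chosen large so the fractions are stable

# Draw N indices with replacement: one bootstrap sample D_b of the index set of D.
bootstrap_idx = rng.integers(0, N, size=N)

in_bag = np.unique(bootstrap_idx)                 # distinct examples that were drawn
out_of_bag = np.setdiff1d(np.arange(N), in_bag)   # examples that were never drawn

unique_fraction = in_bag.size / N
oob_fraction = out_of_bag.size / N
print(f"distinct in-bag fraction: {unique_fraction:.3f}")
print(f"out-of-bag fraction:      {oob_fraction:.3f}")
```

For large N the distinct in-bag fraction tends to 1 - 1/e (about 0.632), which is the "approximately two-thirds" in the text; the remaining examples form the out-of-bag set for that tree.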
Bagging reduces the variance of the ensemble model as compared to the individual decision trees, which also helps to limit overfitting. It is important that the regression trees are not correlated. Samples other than those selected for training the kth regression tree during bagging are grouped into another subset called the out-of-bag (OOB) data. OOB data constitute roughly one-third of D. The performance of the regression trees is evaluated on the OOB data using the following equation,
$$MSE_{OOB} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}_i^{OOB}\right)^2 \tag{5}$$
where MSE_OOB is the mean squared error, y_i is the ith observed response and \bar{y}_i^{OOB} is the mean of the OOB predictions for the ith example from all the trees. Similarly, R²_OOB is computed using the following equation,
$$R^2_{OOB} = 1 - \frac{MSE_{OOB}}{Var_y} \tag{6}$$
where Var_y is the total variance of the response.
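A minimal sketch of out-of-bag evaluation is given below, using scikit-learn's `RandomForestRegressor` with `oob_score=True`. The data are synthetic and hypothetical, and the OOB R² is computed under the assumption that it is defined as one minus the ratio of the OOB mean squared error to the variance of the response.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data standing in for (process parameters, response).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 4))
y = 10.0 * X[:, 0] + 5.0 * np.sin(2.0 * np.pi * X[:, 1]) + rng.normal(0.0, 0.5, 200)

forest = RandomForestRegressor(
    n_estimators=200,  # K regression trees
    bootstrap=True,    # each tree is trained on its own bootstrap sample
    oob_score=True,    # score each example using only trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

# oob_prediction_[i] is the mean prediction of the trees for which example i was out-of-bag.
mse_oob = np.mean((y - forest.oob_prediction_) ** 2)
r2_oob = 1.0 - mse_oob / np.var(y)

print(f"MSE_OOB = {mse_oob:.3f}")
print(f"R2_OOB  = {r2_oob:.3f} (scikit-learn reports {forest.oob_score_:.3f})")
```

Because the OOB predictions never use a tree that saw the example during training, these scores act as a built-in validation estimate without a separate hold-out set.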

Adaptive Boosting Regression
AdaBoost, or adaptive boosting, is a sequential ensemble technique based on the principle of developing several weak learners using different training subsets drawn randomly from the original training dataset. During each training round, weights are assigned to the instances and are used when learning each hypothesis. The weights are used to compute the error of the hypothesis on the dataset and indicate the comparative importance of each instance. The weights are recalculated after every iteration, such that instances incorrectly classified by the last hypothesis receive higher weights. This enables the algorithm to focus on the instances that are more difficult to learn. Assigning revised weights to the incorrectly classified instances is the most vital task of the algorithm. Unlike in classification, instances in regression are not simply correct or incorrect; rather, each carries a real-valued error. By comparing the computed error to a predefined threshold prediction error, an instance can be labelled as an error or not an error, and thus the AdaBoost classifier can be used. Instances with a larger error on previous learners are more likely (i.e., have a higher probability) to be selected for training the subsequent base learner. Finally, a weighted average or median of the individual base learner predictions is used to provide the ensemble prediction [17].
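The effect of boosting can be sketched with scikit-learn's `AdaBoostRegressor`. Note that this class implements the AdaBoost.R2 variant, which reweights instances by their relative loss and combines the learners through a weighted median; it is therefore not necessarily the thresholded variant described above. The data are synthetic and hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for (process parameters, response).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(150, 4))
y = 10.0 * X[:, 0] + 5.0 * np.sin(2.0 * np.pi * X[:, 1]) + rng.normal(0.0, 0.5, 150)

# A single shallow tree plays the role of one weak learner.
weak = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# By default AdaBoostRegressor boosts depth-3 regression trees.
# n_estimators is the maximum number of boosting rounds (the "regressors").
boosted = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X, y)

r2_weak = weak.score(X, y)        # R^2 of one weak learner on the training data
r2_boosted = boosted.score(X, y)  # R^2 of the boosted ensemble on the training data

print(f"single weak learner R2: {r2_weak:.3f}")
print(f"AdaBoost ensemble R2:   {r2_boosted:.3f}")
```

Both scores are computed on the training data, so they describe goodness of fit rather than generalization to unseen experiments.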

Predictive Model Performance Evaluation Metrics
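The residual, ε_i, between the ith original value, y_i, and the predicted value, ŷ_i, is calculated as
$$\varepsilon_i = y_i - \hat{y}_i \tag{7}$$
The coefficient of determination, R², is computed using y_i, ŷ_i and the mean of the dataset, ȳ:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{8}$$
The mean-absolute-error, MAE, is computed based on the number of samples, n, as
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{9}$$
The mean-squared-error, MSE, is computed as
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{10}$$
The root-mean-squared-error, RMSE, is computed as
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{11}$$
The maximum error is computed as
$$\text{Max. Error}(y, \hat{y}) = \max_{i}\left|y_i - \hat{y}_i\right| \tag{12}$$
The median error is computed as
$$\text{Med. Error}(y, \hat{y}) = \mathrm{median}\left(\left|y_1 - \hat{y}_1\right|, \ldots, \left|y_n - \hat{y}_n\right|\right) \tag{13}$$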
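The evaluation metrics described above can be sketched in a few lines of NumPy. The function name `regression_metrics` and the five yield values are hypothetical and serve only to illustrate the arithmetic, assuming the standard definitions of each metric.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE, RMSE, maximum error and median absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred          # residual of each sample
    abs_err = np.abs(residuals)
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total variation
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": abs_err.mean(),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MaxError": abs_err.max(),
        "MedAE": np.median(abs_err),
    }


# Hypothetical yields (%), purely illustrative.
y_true = [90.0, 92.0, 94.0, 96.0, 98.0]
y_pred = [91.0, 92.0, 93.0, 96.0, 100.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these five values the residuals are -1, 0, 1, 0 and -2, which gives MAE = 0.8, MSE = 1.2, a maximum error of 2, a median error of 1 and R² = 1 - 6/40 = 0.85.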

Example 1: Biodiesel Production Yield Estimation
The experimental data (presented in Appendix A, Table A1) are taken from Ahmad et al. [13,18], where alkali (KOH) is used as a catalyst. The production of biodiesel from flaxseed oil followed a two-step transesterification reaction. The flaxseed oil was purchased from Bale-Robe, Ethiopia. KOH, sulfuric acid (H2SO4) and methanol (CH3OH) were purchased from Sigma Aldrich. To remove the moisture content, the oil was pre-heated for 30 min at 110 °C. To prepare the various catalyst percentages, a certain amount of KOH pellets was dissolved in the methanol. The pre-heated flaxseed oil was cooled to a specific temperature (30, 50 or 65 °C), after which the potassium methoxide (CH3KO) solution was mixed with it in a 250 mL three-neck glass reactor fixed over a magnetic stirrer, with constant stirring at that temperature. The mixture was then held for one of three reaction times (30, 45 or 60 min). For proper settling, the mixture was kept overnight in a different funnel.

The biodiesel was purified by the water washing method. To prevent the formation of foam, the mixture was stirred slowly. After that, the mixture was left overnight to settle into two phases, i.e., the biodiesel phase and the water impurity phase. To ensure the elimination of contaminants from the biodiesel, this process was carried out three times. The biodiesel was then heated for 1 h at 100 °C to evaporate the residual water. To quantify the biodiesel, titration of the biodiesel fraction with 0.1 N sulphuric acid was used [19]. Standard procedures were followed for the physicochemical characterization of the flaxseed oil and the biodiesel, as explained by Ahmad et al. [13,18]. The flaxseed oil was also analysed to determine its fatty acid constituents; the complete procedure is discussed by Ahmad et al. [13,18].

Energies 2021, 14, 1122
In the present case, a central composite design (CCD) has been considered for the design of experiments. There are two ways to perform a CCD: the face-centred central composite design (FCCD) and the rotatable central composite design (RCCD). Here, FCCD is used to find the influence of the different process parameters on the maximum biodiesel yield. For the design of the experiment, two levels and four factors with five centre points have been considered. A total of 29 experiments have been performed by considering the FCCD. The process parameters chosen for the experiments are reaction time, methanol/oil molar ratio (M_r), reaction temperature and catalyst weight percentage, as discussed by Ahmad et al. [13,18].
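The run count of 29 follows from the structure of a face-centred design with four factors: 16 factorial corner points, 8 axial points on the faces and 5 centre points. The sketch below builds such a design in coded units (-1, 0, +1); the function name `face_centred_ccd` is hypothetical, and the mapping from coded levels to physical settings is not reproduced here.

```python
from itertools import product


def face_centred_ccd(n_factors, n_center):
    """Return the runs of a face-centred CCD in coded units (-1, 0, +1)."""
    # Two-level full factorial: every corner of the design cube.
    factorial = [list(p) for p in product((-1, 1), repeat=n_factors)]

    # Axial points: one factor at -1 or +1, all others at the centre.
    # Placing them at distance 1 (on the cube faces) is what makes the design face-centred.
    axial = []
    for i in range(n_factors):
        for level in (-1, 1):
            point = [0] * n_factors
            point[i] = level
            axial.append(point)

    # Replicated centre points.
    center = [[0] * n_factors for _ in range(n_center)]
    return factorial + axial + center


design = face_centred_ccd(n_factors=4, n_center=5)
print(len(design))  # 16 + 8 + 5 runs
```

Because the axial points sit on the faces of the cube, every factor takes only three distinct levels, which is consistent with a three-level spread of the process parameters.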

Effect of Process Parameters
The training data based on the experiments conducted by Ahmad et al. [13,18] are shown in the form of scatter plots in Figure 1 and in Table A1. Since a central composite design was used for the experimental design, the process parameters for biodiesel production are seen to be distributed primarily across three levels in each case. The yield versus methanol/oil molar ratio (M_r) scatter plot indicates that the yield is better at higher M_r. However, the other three scatter plots of yield versus process parameters are inconclusive regarding the effect of the various levels of the process parameters. The histogram of the yield indicates that the median of all the experimentally calculated yields is around 96%.
The correlation between the various process parameters and the yield is studied in Figure 2. There is a negligible correlation among the process parameters, which confirms the lack of multicollinearity. Multicollinearity, if present, could lower the statistical power of the regressions. The methanol/oil molar ratio (M_r) shows a moderately strong positive correlation (0.68) with the yield. The yield is also seen to have a negligible correlation with catalyst weight and reaction time. However, as seen from Figure 2, a very weak positive correlation (0.19) exists between the temperature and the yield. This indicates that, on average, low temperature levels will not lead to a very high biodiesel yield. Thus, the analysis of the experimental data on biodiesel yield, as shown in Figures 1 and 2, suggests that a brute force approach to finding the best process parameter combination to maximize the yield will be extremely cost intensive. Due to the lack of a clear trend in the scatter plots in Figure 2, an exhaustive experimental analysis of all possible process parameter combinations may be needed. This justifies the need for advanced machine learning algorithms to build predictive models that quantify the yield as a function of the process parameters.
Figure 2. Correlation between the various biodiesel production process parameters and yield.
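A correlation study of this kind can be sketched with NumPy's `corrcoef`. The data below are synthetic and hypothetical (coded parameter levels and a made-up yield rule, not the values of Table A1), so the printed correlations are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n_runs = 29

# Hypothetical coded levels for the four process parameters (NOT Table A1).
params = rng.choice([-1.0, 0.0, 1.0], size=(n_runs, 4))
# Hypothetical yield rule: strong dependence on the first parameter only.
yield_pct = 90.0 + 3.0 * params[:, 0] + 0.5 * params[:, 2] + rng.normal(0.0, 1.0, n_runs)

names = ["M_r", "W_c", "T", "T_r", "yield"]
data = np.column_stack([params, yield_pct])

# Pearson correlation matrix; rowvar=False treats each column as one variable.
corr = np.corrcoef(data, rowvar=False)

# Correlation of each process parameter with the response (last column).
for name, r in zip(names[:-1], corr[:-1, -1]):
    print(f"corr({name}, yield) = {r:+.2f}")

# Largest absolute correlation between two different process parameters,
# a quick screen for multicollinearity.
predictor_block = corr[:-1, :-1] - np.eye(4)
print(f"max |corr| among parameters = {np.abs(predictor_block).max():.2f}")
```

Small off-diagonal values in the predictor block indicate that the parameters vary independently of one another, which is the property that a designed experiment is intended to provide.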

Linear Regression Predictive Model
Based on the training data presented in Table A1, a linear regression predictive model of the yield % is developed. An empirical relation describing the yield % as a function of the methanol/oil molar ratio (M_r), catalyst weight (W_c), temperature (T) and reaction time (T_r) is developed as shown in Equation (14). Figure 3 shows the predictive performance of the linear regression predictive model. The actual yield % is plotted against the yield % predicted by the linear regression predictive model in Figure 3a. The data points lying above the identity line indicate that the linear regression predictive model has overpredicted the yield %, while data points lying below the identity line indicate underprediction. The best-fit line for the data points is generated and the R² is calculated, which indicates the goodness of fit of the predictive model on the training data. It is seen that the predictive power of the linear regression predictive model is moderate, with an R² of about 52%. The performance of the linear regression predictive model is further analysed using a predicted value versus residuals plot in Figure 3b. It is seen that the residuals are highest at low predicted values.

Random Forest Regression Predictive Model
Using the training data, random forest regression predictive models are developed to quantify the yield % as a function of the methanol/oil molar ratio (M_r), catalyst weight (W_c), temperature (T) and reaction time (T_r). Since the number of regressors in the random forest has a significant effect on the accuracy and performance of the predictive model, a sensitivity study on the effect of the number of regressors is carried out. The number of regressors in the random forest approach is varied from 100 to 900 and the various error metrics are recorded. Figure 4 shows the effect of the number of regressors on R², MAE, MSE, RMSE, Max. Error and MedAE. It is seen from Figure 4a that the R² is best for 800 regressors. Similarly, the lowest MAE, MSE, RMSE and Max. Error are achieved for 800 regressors. However, the MedAE is found to be lowest for the random forest predictive model with 900 regressors. Figure 5a shows the predictive pattern of the selected random forest predictive model. A high R² of approximately 89% is achieved with the 800-regressor random forest predictive model. Most of the data points are seen to lie on the identity line, indicating a high correlation between the actual and predicted results. The residuals plot in Figure 5b shows that the error of the random forest predictive model is much lower than that of the linear regression predictive model.
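A sensitivity study of this kind can be sketched as a loop over the number of trees. The example below uses scikit-learn with synthetic, hypothetical data (29 runs at coded levels, not the values of Table A1) and, like the goodness-of-fit figures discussed above, evaluates the metrics on the training data; the best setting it reports therefore has no bearing on the value of 800 found in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    max_error,
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

# Hypothetical data: 29 runs, four coded process parameters, made-up yield rule.
rng = np.random.default_rng(7)
X = rng.choice([-1.0, 0.0, 1.0], size=(29, 4))
y = 92.0 + 3.0 * X[:, 0] + 1.0 * X[:, 2] ** 2 + rng.normal(0.0, 1.0, 29)

results = {}
for n_trees in range(100, 1000, 100):  # 100, 200, ..., 900 regressors
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    pred = model.fit(X, y).predict(X)  # predictions on the training data
    mse = mean_squared_error(y, pred)
    results[n_trees] = {
        "R2": r2_score(y, pred),
        "MAE": mean_absolute_error(y, pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MaxError": max_error(y, pred),
        "MedAE": median_absolute_error(y, pred),
    }

# Pick the forest size with the highest training R2.
best_n = max(results, key=lambda n: results[n]["R2"])
for n_trees, m in results.items():
    print(f"{n_trees:4d} trees: R2={m['R2']:.4f}  MAE={m['MAE']:.4f}  MedAE={m['MedAE']:.4f}")
print("best number of regressors by R2:", best_n)
```

Different metrics need not agree on the best forest size, which mirrors the observation above that MedAE favours a different number of regressors than the other metrics.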

Figure 5. Predictive performance of the random forest regression predictive model in biodiesel production yield estimation: (a) R²; (b) predicted value versus residuals.

AdaBoost Predictive Model
An AdaBoost regression predictive model is developed using 100 regressors. A sensitivity study shows that the number of regressors has a negligible impact on the predictive power of the AdaBoost regression predictive model. Figure 6a shows that the R² of the AdaBoost regression predictive model is approximately 91%, indicating that the model can successfully explain 91% of the variance in the training data. The residuals plot in Figure 6b shows an almost similar pattern in the negative and positive residuals, indicating a balanced prediction for most of the cases, without much underprediction or overprediction.

Comparison of Various ML Predictive Models
The comparison of the machine learning predictive models in Figure 7 shows that the AdaBoost predictive model performs slightly better than the random forest regression predictive model and comprehensively outperforms the linear regression predictive model. The performance of the machine learning predictive models on various error metrics, namely MSE, MAE, Max. Error and MedAE, is shown in Figure 7b. It is observed that the MSE and the MAE are lowest for the AdaBoost regression predictive model, whereas the Max. Error and MedAE are lowest for the random forest regression predictive model. Thus, overall, the performance of the random forest regression predictive model and the AdaBoost regression predictive model is seen to be on par. However, in this study the AdaBoost regression predictive model has a distinct advantage over the random forest regression predictive model in that it is almost insensitive to the number of regressors.

Example 2: Biodiesel FFA Conversion Percentage Estimation
The experimental data (presented in Appendix A, Table A2) are taken from Karmakar et al. [20], where commercial-grade castor oil was considered as the raw material to produce biodiesel. To remove moisture and impurities, the oil was dried using hot air in the temperature range of 100-110 °C. Propan-2-ol, sulphuric acid, methanol and potassium hydroxide were considered as the solvent. Deionized water was obtained from an Arium 611 DI ultra-pure water system (Sartorius AG). The castor oil was characterized using ASTM standardized experiments, and ASTM Method D974 was used to estimate its acid value. In the present case, an L16 orthogonal array was used to design the experiments.

The esterification of castor oil was done using either a homogeneous acid catalyst (H2SO4) or a heterogeneous sulfonated carbon catalyst. During esterification, one neck of the reactor was fitted with a thermometer to measure the temperature. To decrease the evaporative loss of methanol, another neck was fitted with a reflux condenser. The middle neck was used to add the reagents. The methanol was transferred to the reactor after the oil had been heated to the set temperature. Owing to the transfer of methanol, the temperature within the reactor dropped. The plate temperature and mixture temperature were adjusted with the help of a magnetic stirrer hotplate, and the temperature was fixed according to the experimental design. The acid catalyst was added once the desired temperature was reached, and the reaction was then allowed to proceed for the time duration fixed by the experimental design. Whatman series 40 filter paper was used to remove the sulfonated carbon catalyst, and distillation was used to retrieve the unused methanol. After that, the mixture was kept for 10 h to separate into two layers. The top layer was the FAME-rich phase, while the bottom layer was an aqueous phase, which was drained out. Then, the FAME was washed several times in deionized water to remove the acid catalyst and residual alcohol. Refining was done to purify the crude biodiesel according to the international standard.

Effect of Process Parameters
The training data presented in Table A2 are from the experiments conducted by Karmakar et al. [20] and are shown in the form of scatter plots in Figure 8. Since an L16 array was used for the experimental design, the process parameters are considered at four equally spaced levels. The scatter of the response (i.e., FFA conversion percentage) is seen to be highly dependent on the levels of the process parameters. To further understand the effect of the process parameters on the biodiesel FFA conversion percentage, a correlation study is carried out, as shown in Figure 9. It is seen that the process parameters are not correlated with one another; thus, multicollinearity is avoided. The correlation plot of the response and the process parameters shows that the methanol/oil molar ratio (M_r) has a moderately high correlation with the response. Agitation speed is seen to have a moderately low effect on the biodiesel FFA conversion percentage.

Linear Regression Predictive Model
Based on the training data listed in Table A2 from Karmakar et al. [20], a linear polynomial regression model of the following form is developed,
$$\text{FFA conversion } \% = 23.62 - 0.0175\,x_1 - 0.9485\,x_2 - 3.8585\,x_3 + 1.5635\,x_4 + 0.03407\,x_5 \tag{15}$$
where x_1, ..., x_5 denote the five process parameters. Figure 10a shows the scatter of the predicted versus the actual biodiesel FFA conversion percentage. The scatter of the residuals versus the predicted biodiesel FFA conversion percentage is shown in Figure 10b. It is seen that the model achieves moderately high accuracy in terms of R². The scatter of the residuals is random, indicating no ties in the data. Moreover, the random underprediction and overprediction, i.e., data points lying randomly under and over the identity line, indicate that the model is not biased in a certain direction.

Random Forest Regression Predictive Model
As established in Section 3.1.3, random forest regression is sensitive to the number of regressors. Thus, a pilot study is carried out to determine the optimal number of regressors, which in this case is 500. The predicted versus actual biodiesel FFA conversion percentage in Figure 11a shows that high accuracy is obtained by the random forest regression predictive model. However, Figure 11b shows that, barring a few data points, the residuals are within ±4. This indicates that the three remaining outliers in Figure 11b are responsible for the loss in the predictive power of the model.

AdaBoost Predictive Model
Before developing predictive models with AdaBoost, a sensitivity study is undertaken on the number of regressors, which indicates a negligible change in the predictive power of the AdaBoost regression predictive model with the number of regressors. Figure 12a shows the performance of the AdaBoost regression predictive model. It is seen that the predictive model achieves almost ideal estimation. The residuals versus the predicted biodiesel FFA conversion percentage in Figure 12b reveal that the maximum residual is ±2. Moreover, barring a couple of data points, the remaining residuals are within ±0.5.

AdaBoost Predictive Model
Before developing predictive models with AdaBoost, a sensitivity study is undertaken on the number of regressors. It indicates that the predictive power of the AdaBoost regression predictive model changes negligibly with the number of regressors. Figure 12a shows the performance of the AdaBoost regression predictive model. It is seen that the predictive model provides almost ideal estimation. The residuals versus the predicted biodiesel FFA conversion percentage in Figure 12b reveal that the maximum residual is ±2. Moreover, barring a couple of data points, the remaining residuals are within ±0.5.
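
The residual statements above (maximum residual, and how many points fall within a given band) can be computed directly from actual and predicted values. The following is a minimal sketch, assuming residuals are defined as actual minus predicted; the function name `residual_summary` and the four sample values are hypothetical and are not taken from the study's dataset.

```python
from typing import Dict, Sequence


def residual_summary(
    actual: Sequence[float],
    predicted: Sequence[float],
    band: float,
) -> Dict[str, float]:
    """Summarise residuals (actual - predicted) against a +/- band."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_residuals = [abs(a - p) for a, p in zip(actual, predicted)]
    inside = sum(1 for r in abs_residuals if r <= band)
    return {
        "max_abs_residual": max(abs_residuals),
        "n_outside_band": len(abs_residuals) - inside,
        "fraction_inside_band": inside / len(abs_residuals),
    }


if __name__ == "__main__":
    # Made-up FFA conversion percentages, for illustration only.
    actual = [90.0, 85.0, 70.0, 95.0]
    predicted = [90.25, 84.75, 72.0, 95.5]
    print(residual_summary(actual, predicted, band=0.5))
```

With these made-up values, one of the four points lies outside the ±0.5 band and the largest absolute residual is 2.0.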

Comparison of Various ML Predictive Models
The comparison of R² for the three predictive models in Figure 13a reveals that the AdaBoost regression predictive model has the best estimation power. The MSE of the AdaBoost regression predictive model is almost zero, whereas for the random forest regression predictive model it is around 13. A comparison of the predictive models based on all the error metrics, shown in Figure 13b, indicates that the AdaBoost regression predictive model is comprehensively superior to the random forest regression and linear regression predictive models.
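
For reference, the two metrics named in this comparison can be computed as follows. This is a sketch using the standard textbook definitions of MSE and R², which are assumed here rather than quoted from the paper; the helper names and the two sets of predictions are hypothetical and do not reproduce the values in Figure 13.

```python
from typing import Sequence


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error: average of squared residuals."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up values: one near-perfect model and one weaker model.
    actual = [1.0, 2.0, 3.0, 4.0]
    predictions = {
        "model_a": [1.0, 2.0, 3.0, 4.0],
        "model_b": [2.0, 2.0, 3.0, 3.0],
    }
    for name, pred in predictions.items():
        print(name, "R2 =", round(r2(actual, pred), 3), "MSE =", round(mse(actual, pred), 3))
```

A model with MSE close to zero necessarily has R² close to one on the same data, which is why the two panels of such a comparison tend to rank the models identically.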

Conclusions
Building predictive models for the quantification of biodiesel yield based on the levels of the process parameters is a realistic goal with a tremendous impact on sustainable development. In this paper, in one illustrative example, the biodiesel yield is modelled as a function of methanol/oil molar ratio ($M_r$), catalyst weight ($W_c$), reaction temperature ($T$) and reaction time ($T_r$). Similarly, in another illustrative example, the free fatty acid conversion percentage is estimated based on reaction temperature ($T$), reaction time ($T_r$), catalyst weight ($W_c$), methanol/oil molar ratio ($M_r$) and agitation speed ($S_a$). A comprehensive comparative analysis is carried out to ascertain the utility of three machine learning techniques (linear regression, random forest regression and AdaBoost regression) for developing predictive models in biodiesel production. A wide range of accuracy and error metrics is used to quantify the efficacy of the machine learning algorithms. The analysis shows that the linear regression approach achieves only moderate accuracy, whereas both random forest regression and AdaBoost regression show very high accuracy in predictive modelling of the biodiesel yield. However, a sensitivity study on the effect of the number of regressors on the predictive performance of random forest regression and AdaBoost regression shows that, while AdaBoost may be insensitive to the number of regressors, random forest regression may be significantly affected by a change in the number of regressors. Thus, AdaBoost regression may be preferred for the predictive modelling of biodiesel yield. Such predictive models would lead to significant savings in the time and effort needed to identify the optimum process parameters for increasing the yield or FFA conversion percentage of the biodiesel production process.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available in the article.

Conflicts of Interest: The authors declare no conflict of interest.