A Review Unveiling Various Machine Learning Algorithms Adopted for Biohydrogen Productions from Microalgae

Abstract: Biohydrogen production from microalgae is a potential alternative energy source that is now being intensively researched. The complex nature of the biological processes involved has compromised the accuracy of traditional modelling and optimization, which are also costly. Accordingly, machine learning algorithms have been employed to overcome these setbacks, as they can predict nonlinear interactions and handle multivariate data from microalgal biohydrogen studies. This review therefore focuses on the recent applications of machine learning techniques in microalgal biohydrogen production. The working principles of random forests, artificial neural networks, support vector machines, and regression algorithms are covered. The applications of these techniques are analyzed and compared for their effectiveness, advantages, and disadvantages in relationship studies, classification of results, and prediction of microalgal hydrogen production. These techniques have shown great performance despite limited data sets that are complex and nonlinear. However, the current techniques are still susceptible to overfitting, which could reduce prediction performance. This could be resolved or mitigated by comparing the methods, should the input data be limited.


Introduction
Hydrogen that is produced from microalgae, either through photo-fermentation or dark fermentation, is known as microalgal hydrogen. It is a subset of biohydrogen, defined as hydrogen that is produced biologically by microorganisms using renewable biomass materials [1,2]. Microalgal hydrogen production has garnered considerable interest from academia as well as industry due to its potential as an alternative energy source. However, the complexity of the biological processes and factors involved has made studies and process modelling very arduous. Researchers have recently employed machine learning (ML) to overcome this concern. ML is defined as building algorithms that can predict an outcome based on a statistical analysis of input data. Applied to such studies, ML can generate regression models that describe the relationship between independent variables and dependent variables [3,4]. These algorithms come in various forms, depending on their purposes and effectiveness.
Numerous studies have deployed ML to predict biohydrogen production. For example, one study developed artificial neural networks (ANN) to predict biohydrogen production based on dark fermentation time and volatile fatty acid production, which yielded highly accurate results (R² > 0.987) [5]. The ANN approach was also used to model biohydrogen generation from biomass gasification based on biomass characteristics and operating conditions, where the results were in accordance with the input data (R² > 0.999, RMSE < 0.25) [6].
The use of ML in the field of microalgae has also been widely reported. For instance, ML was used to explore critical factors that affect algal biomass productivity and to generate scenarios involving multiple combinations of these factors that yield very high biomass production [7]. Bhola et al. (2017) generated a fuzzy inference model that closely described the correlation between input parameters (peak biomass concentration, CO₂ uptake rate, and maximum relative electron transport rate) and the metabolic yields of Chlorella sp., achieving an R² > 0.985 [8].
The use of ML in microalgal hydrogen production has also sparked interest recently. For instance, one study predicted microalgal hydrogen yield from microalgal biomass based on duration, sulfuric content, and biomass concentration [9]. The authors further stated that ML techniques are significantly advantageous over conventional methods such as response surface methodology (RSM) or one-variable-at-a-time (OVAT) analysis [9]. Meanwhile, Salameh et al. (2022) used an ML model to optimize their microalgal biohydrogen production, resulting in a 7% increase in biohydrogen as compared with the input parameters derived from RSM [10]. All of these support the idea and practicality of incorporating ML into a microalgal biohydrogen production study. While conventional methods such as OVAT cannot take into consideration the interactive effects among variables, which are non-linear and complicated, ML algorithms are not constrained by this limitation, allowing for a better understanding of potential correlations among the variables. Considering the practicality, potential, and recent interest in using ML to predict microalgal biohydrogen production, this review paper discusses the types of ML techniques available, along with comparative analyses of their effectiveness in fulfilling specific research goals.

Artificial Neural Networks
ANNs are defined as information processing paradigms inspired by how the human brain processes information. The complex networks are built from multiple simple processing units that function similarly to a neuron and are arranged in three distinct layer categories, namely, the input, hidden, and output layers. The input neurons receive data through data files that are manually entered or in real time from measuring instruments. The output layer sends out the information after the data has gone through one or multiple hidden layers, each composed of various interconnected neurons. The arrangement of layers and the connections between the processing units are what make each ANN different from the others. There are two types of connections between neurons, in which the strength of the input is either increased or reduced before being passed to the next neuron [11,12]. Figure 1 illustrates the structure of a neural network [13,14]. After the ANN model is established for a particular application, it requires training, which involves determining the adjustable weights, akin to determining the coefficients of a polynomial equation via regression. A common supervised training algorithm is backpropagation, where the calculated error between the outputs generated by the model and the actual results is reduced by adjusting the weights after the error is propagated backward through the network. This process is repeated until the error falls below a pre-established criterion [15,16].
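The training loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the cited studies: the network size, the tanh activation, the learning rate, the stopping tolerance, the toy data, and the function name `train_tiny_ann` are all assumptions made for the example.

```python
import math
import random

def train_tiny_ann(samples, hidden=3, lr=0.1, tol=1e-3, max_epochs=5000, seed=0):
    """Train a 1-input, one-hidden-layer, 1-output network by backpropagation.

    Returns the per-epoch mean squared error. Sizes, rates, and the
    stopping tolerance are illustrative assumptions.
    """
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden weights
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden biases
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0                                          # output bias
    n = len(samples)
    history = []
    for _ in range(max_epochs):
        g_w1 = [0.0] * hidden
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        sq_err = 0.0
        for x, y in samples:
            # Forward pass: tanh hidden layer, linear output neuron.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            out = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = out - y
            sq_err += err * err
            # Backward pass: propagate the output error to every weight.
            g_b2 += err
            for j in range(hidden):
                g_w2[j] += err * h[j]
                delta = err * w2[j] * (1.0 - h[j] ** 2)
                g_w1[j] += delta * x
                g_b1[j] += delta
        history.append(sq_err / n)
        if history[-1] < tol:  # pre-established stopping criterion
            break
        # Gradient-descent update using the averaged gradients.
        for j in range(hidden):
            w1[j] -= lr * g_w1[j] / n
            b1[j] -= lr * g_b1[j] / n
            w2[j] -= lr * g_w2[j] / n
        b2 -= lr * g_b2 / n
    return history

# Toy data (hypothetical): y = x^2 sampled on [0, 1].
samples = [(i / 10, (i / 10) ** 2) for i in range(11)]
history = train_tiny_ann(samples)
```

The returned error history makes the stopping rule visible: training ends either when the error falls below the tolerance or when the epoch budget is exhausted.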

Random Forest
The random forest (RF) algorithm combines individual decision trees and aggregates the results they produce by taking the average of the results. This is achieved by generating bootstrapped copies of the original data and growing one tree, with some form of randomization, on each bootstrapped copy [17,18]. Bootstrapping refers to resampling with replacement, such that each bootstrapped copy has the same number of data points as the original. A decision tree is a hierarchically organized series of conditions where an instance of data is classified by following the path of satisfied conditions from the root of the tree, through chains of nodes (branches), until it reaches an endpoint (leaf) that corresponds to a class label. Each node represents an attribute that may describe the particular data input [19,20]. Hence, a class label is only applied to the input after it has been shown to fulfil all of the respective attributes. Combining multiple decision trees makes the RF algorithm an ensemble learning method, which can be useful for large data sets. For any RF algorithm, the parameters that need to be established are the number of variables in the random subset at each node of the tree and the number of trees in the RF [14,21]. Having a sufficient number of trees within the RF allows for stable estimates of variable importance, which indicate the extent to which each predictor increases or decreases model accuracy compared to the actual results obtained. Figure 2 illustrates an overview of the RF algorithm [22,23].
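The bootstrap-and-average idea can be illustrated with a deliberately simplified sketch. It grows one-split regression trees (stumps) on a single predictor, so the random selection of predictor subsets at each node, which a full RF also performs, does not arise here; the function names and the toy data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single predictor."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # no valid split: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Grow one stump per bootstrapped copy of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: n draws with replacement, same size as the original.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Aggregate the trees by averaging their predictions."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)

# Toy data (hypothetical): a step change in the response at x = 5.
xs = list(range(10))
ys = [0.0 if x < 5 else 10.0 for x in xs]
forest = fit_bagged_stumps(xs, ys)
low, high = predict_forest(forest, 1), predict_forest(forest, 8)
```

In practice, the two tuning parameters named above (the number of trees and the size of the random predictor subset) would be exposed by whichever RF library is used.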

Support Vector Machines
Support vector machines (SVM) are designed for binary classification in a multidimensional space. The working principle of SVM involves the identification of a hyperplane, a boundary that separates outcome categories to their full extent [18]. SVM applies a data transformation to the sample data and projects it to a desired higher-dimensional space via a kernel function. A kernel function is defined as a function that returns the inner product (dot product) between the images of two data points (x, x') in the higher-dimensional space. ML then takes place in this space [24]. For two data points x and x', this can be written as:

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle$$

where Φ denotes the mapping into the higher-dimensional space. There are multiple kernel functions available depending on the data set, as it needs to have its dimensionality increased to obtain the hyperplane (Table 1) [25]. These kernel functions of two data points all aim to reach the target space T. Among these equations, Karatzoglou and Meyer (2006) stated that the Gaussian radial basis function (RBF) is the most suitable when there is no pre-existing knowledge available regarding a data set [24]. They also stated that the linear kernel function is beneficial for large, sparse data sets. The performance of SVM depends on the regularization parameter C (box constraint) and the kernel parameter (scaling factor), which together make up the hyperplane parameters. A high value of C causes the SVM to create a complex prediction function that greatly reduces the misclassification of data points. In contrast, a low value of C leads to a simple prediction function [24]. Training an SVM algorithm involves mapping the decision boundary for each outcome category and specifying the hyperplane that separates the categories. The algorithm then attempts to find the optimal hyperplane that has the highest margin between classes, which is proportional to the classification accuracy [14]. Figure 3 shows a simple 2-D illustration of the SVM algorithm.
Any SVM algorithm aims to find the maximum margin hyperplane, situated at the maximum margin between all possible positive and negative hyperplanes that can be defined, which separates the support vectors into two distinct categories. Misclassifications occur when a data point is mapped onto the wrong side of the hyperplane, which is affected by the box constraint [25].
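The statement that a kernel returns an inner product in a higher-dimensional space can be checked numerically. The sketch below uses a degree-2 homogeneous polynomial kernel because its feature map is small enough to write out by hand; this kernel choice, the function names, and the sample points are illustrative assumptions, not the kernels tabulated in [25]. Linear and Gaussian RBF kernels are included for comparison, with a hypothetical scaling factor `gamma`.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, y):
    return dot(x, y)

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian radial basis function kernel with scaling factor gamma."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel, evaluated in the input space."""
    return dot(x, y) ** 2

def poly2_feature_map(x):
    """Explicit map from 2-D inputs to the 3-D space implied by poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
in_input_space = poly2_kernel(x, y)
in_feature_space = dot(poly2_feature_map(x), poly2_feature_map(y))
# Both routes give the same number (up to rounding): the kernel computes
# the higher-dimensional inner product without building the mapping.
```

This is the practical appeal of kernels: the higher-dimensional representation never has to be constructed explicitly.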

Table 1. Kernel functions by type of classifier (recoverable entry: inverse multiquadric function) [25].

Regression
Another ML technique is regression analysis, a conventional method used to determine the correlation between a dependent variable and one (univariate) or multiple (multivariate) independent variables [26,27]. Since the nature of the correlation between variables differs, multiple types of regression have been designed to cater to these relationships. These regression techniques all attempt to achieve the same objective, which is to express the variable of interest as a mathematical function of the independent variables that affect its value. The most straightforward type is the simple linear regression (SLR) method, which aims to fit the data to a straight line that can be expressed as follows:

$$y = mx + c$$

where y is the dependent variable, x is the independent variable, m is the slope of the fitted straight line, and c is the intercept. Data that can be fitted by this type of regression indicate that as the independent variable increases, the dependent variable changes in a linear fashion. Multiple linear regression (MLR) is similar to SLR in establishing a straight-line fit, with the caveat that there are multiple independent variables involved that each have a linear relationship with the dependent variable [27,28]. The MLR model with k independent variables can be written as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

where β_0 is the intercept and β_1, ..., β_k are the coefficients of the independent variables x_1, ..., x_k. Furthermore, polynomial regression describes y as a function of x that is represented as a polynomial equation where x is raised to the power of n. It is considered a special case of MLR where the model can be expressed as:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$$

where n is the polynomial degree [29,30]. A relationship between the outcome and its factors that can be fitted via polynomial regression is described as curvilinear. Last but not least, non-linear regression describes independent variables that affect the dependent variable in a manner that is not linear or straightforward.
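As a brief illustration of the simplest case, the slope m and intercept c of a straight-line fit can be obtained in closed form by ordinary least squares. The sketch below assumes that estimator (the text does not prescribe one), and the function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope of the fitted line
    c = mean_y - m * mean_x  # intercept
    return m, c

# Noise-free toy data lying exactly on y = 2x + 1.
m, c = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The multiple and polynomial cases generalize this by solving for several coefficients at once, which is typically delegated to a linear algebra routine.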
The application and study of non-linear regression have been gaining traction, as the population growth models of living organisms are often expressed as non-linear equations [31]. This is due to the complex biological processes involving dynamic factors that drive growth. These can be observed in Table 2, which lists the equations, parameters, and definitions of growth models for a particular organism [32]. These equations can also be used with other biological growth models to perform non-linear regression, with the caveat that they need to be adjusted to account for the independent variables and their relationships. For instance, Wang et al. [33] used the Gompertz equation modified for microalgal hydrogen production as shown below:

$$P(t) = P_{\max} \exp\left\{-\exp\left[\frac{R_{\max}\, e}{P_{\max}}(L - t) + 1\right]\right\}$$

where P(t) (L H₂/kg microalgae) is the cumulative microalgal hydrogen production at time t, P_max (L H₂/kg microalgae) is the microalgal hydrogen production potential, R_max (L H₂/kg microalgae/d) is the maximum microalgal hydrogen production rate, e is the base of natural logarithms (approximately 2.718), L (d) is the lag time during which the microalgae have not yet begun producing hydrogen under anaerobic conditions, and t (d) is the experimental time of dark-fermentative microalgal hydrogen production [33]. A comparison can be drawn with the Gompertz equation in Table 2, where fundamental components such as the double exponent, the asymptotic growth limit (P_max), and the growth rate (R_max) are retained. The modification can be seen in the removal of the natural logarithm (ln) and the initial value, in this case the microalgal hydrogen production at t = 0, as it is nonexistent, followed by the addition of new variables and constants such as L and e, respectively. Despite the changes, the individual models look similar in form, with different variables being described in the process.

Table 2. The equations, parameters, and definitions of growth models [32].
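To make the roles of the parameters concrete, the sketch below evaluates the modified Gompertz model in its commonly used form, P(t) = P_max · exp{−exp[(R_max · e / P_max)(L − t) + 1]}. That this is the exact form used in [33] is an assumption based on the parameter definitions given in the text, and the parameter values are hypothetical.

```python
import math

def modified_gompertz(t, p_max, r_max, lag):
    """Cumulative hydrogen production P(t) from the modified Gompertz model.

    t     : time (d)
    p_max : hydrogen production potential (L H2/kg microalgae)
    r_max : maximum hydrogen production rate (L H2/kg microalgae/d)
    lag   : lag time before production starts (d)
    """
    inner = r_max * math.e / p_max * (lag - t) + 1.0
    return p_max * math.exp(-math.exp(inner))

# Hypothetical parameter values, chosen only for illustration.
p_max, r_max, lag = 100.0, 20.0, 1.0
curve = [modified_gompertz(day, p_max, r_max, lag) for day in range(0, 31, 5)]
```

Under this form, production is still close to zero at the end of the lag phase and then rises monotonically toward the asymptote P_max, which is the behaviour the comparison with Table 2 describes.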

Importance of Machine Learning in Biohydrogen Production
The usage of machine learning (ML) techniques has become more widespread in microalgal hydrogen production and related studies in recent years. ML has reportedly demonstrated great potential as a data-driven method, because ML algorithms can handle complex multivariate data, predict nonlinear connections, and manage missing data [9]. ML has now become a significant tool in microalgal hydrogen production studies since it can be adopted in various applications, which include studying the relationship between operating parameters and production outputs, classifying factors as being significantly impactful to the overall process, and predicting the produced microalgal hydrogen based on the set initial conditions.

Relationship Study
An essential step in optimizing microalgal hydrogen production is the modelling of its production system to study how certain parameters influence the overall process. Wang et al. [34] reported that various ANNs were utilized in correlating microalgal hydrogen production and critical operating parameters. In the same article, a multilayer perceptron ANN (MLPANN) was proposed as a modelling framework to describe the kinetics of microalgal hydrogen production from a dark fermentation process. The MLPANN is a type of ANN that contains more than one hidden layer to accommodate the complexity of the system. It was reported that the MLPANN was able to reliably model the metabolites, including microalgal hydrogen, with limited experimental kinetic data [34]. Hosseinzadeh et al. [35] developed multiple ML algorithms, including RF and SVM, to model microalgal hydrogen production from wastewater via a dark fermentation process. The relative importance of the factors fed into each algorithm was studied via the permutation variable importance (PVI) procedure, which considers the errors of the developed models in predicting the results after a random permutation of a particular input [35]. This procedure highlighted the degree of importance of each factor fed into a particular ML algorithm, leading to better clarity on its relationship with the overall production process. The PVI procedure indicated that ethanol was of significant importance as a factor in all of the proposed ML models for microalgal hydrogen production [35,36]. This was justified as ethanol, as a solvent, has bactericidal effects, which may negatively impact microalgal hydrogen production [37]. In anaerobic fermentation, hydrogen is formed by accepting electrons from the process. Ethanol is also capable of being an electron acceptor, implying that hydrogen production is reduced as there are fewer electrons available [38].
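The PVI procedure can be sketched generically as follows: permute one input column at a time, re-evaluate the model, and record how much the error grows. The sketch is a minimal illustration under stated assumptions, not the procedure as implemented in [35]: the error measure (mean squared error), the number of repeats, the stand-in model, and the toy data are all hypothetical.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of a model over a set of input rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Mean increase in MSE when each input column is randomly permuted.

    rows must be a list of tuples, one tuple of inputs per observation.
    """
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between input j and the target
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Stand-in "trained model": depends on input 0 only and ignores input 1.
model = lambda r: 3.0 * r[0]
rows = [(float(i), float((7 * i) % 10)) for i in range(10)]
targets = [3.0 * r[0] for r in rows]
importances = permutation_importance(model, rows, targets)
```

An input the model relies on produces a large error increase when shuffled, whereas an input the model ignores produces none. As noted elsewhere in this review, correlated predictors can bias this measure.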

Classification of Results
Literature involving microalgal hydrogen production has become saturated over the years as its potential has been recognized by researchers and academia. However, ML algorithms can utilize data from the literature and analyze quantitative correlations between input data and obtained outputs. This is far superior to a traditional comparative analysis as it reduces the time required to analyze the data set from each study. An example of this was the development of ANNs integrated with statistical analysis using response surface methodology (RSM) to study the enhancement of microalgal hydrogen production via chemical addition [39]. It was concluded that, in the case of the addition of Fe-based nanoparticles, the nanoparticle size together with the concentration of nanoparticles added was classified as statistically significant to the microalgal hydrogen yield, with the optimal value being approached when the nanoparticle size ranged between 81 and 100 nm. However, the same parameter was classified as statistically insignificant for the hydrogen evolution rate. The explanation given by the authors for this finding was that nanoparticle sizes ranging between 81 and 100 nm were more thermodynamically stable during fermentation as compared with smaller sizes. Monroy and Buitron [40] used the SVM method to diagnose undesired scenarios in microalgal hydrogen production by photo-fermentation. Five classes were set up, each with a different set of optimum values for light intensity and pH, and 250 scenarios with varying operating conditions were classified by the SVM. Diagnosis performances of 100% and 55% were attained for batches where the light intensity and pH values, respectively, deviated from an optimum operating range. The poor pH diagnosis was reportedly due to the photo-fermentation process being highly sensitive to pH changes [40].

Prediction of Microalgal Hydrogen Production
The most prominent use of ML in the microalgal hydrogen production literature is predicting the outcome of a particular production system. Outputs such as hydrogen yield and hydrogen evolution rate have been extensively studied to determine their optimal values and the variables required to achieve them. Alalayah et al. [41] developed an ANN model that was able to predict microalgal hydrogen production through a dark fermentation process based on three inputs, namely, initial substrate concentration, initial medium pH, and temperature. The ANN model performed better than a traditional Box-Wilson Design (BWD) statistical model as it provided a higher level of accuracy with fewer errors [41]. Another ANN model was constructed using backpropagation in conjunction with a cross-out validation approach, which was able to predict the optimal hydrogen yield (3 mol H₂/mol substrate) based on the optimal composition of glucose (14 g/L) and acetate (1.3 g/L). The MSE value of merely 1.193 suggests that the training outcome of the ANN was good [42]. Sharma et al. [43] developed a novel ML-based optimization approach to predict microalgal hydrogen yield from microalgal biomass based on duration, sulfuric content, and biomass concentration. The validation test for this prediction model indicated an acceptable error of merely 4.52%. This approach was capable of studying multiple factors simultaneously and specifying at which point the best output of microalgal hydrogen was achieved [9,43]. Last but not least, Salameh et al. [10] designed a variation of an ANN known as the Adaptive Network Fuzzy Inference System (ANFIS), capable of predicting the optimal microalgal biohydrogen production based on operating parameters, namely, initial pH (9.0), N/C ratio (0.1862), xylose concentration (25 g/L), and operating temperature (36.12 °C). This study highlighted that the optimum value generated for microalgal hydrogen production was 200 mL/L higher than the value attained from ANOVA, demonstrating that ML can perform better in prediction studies [10].
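Since the studies above are compared through goodness-of-fit statistics such as MSE, RMSE, and R², a short sketch of how these are conventionally computed may be helpful. The definitions below are the standard ones and are assumed, not confirmed, to match those used in each cited study; the function name and data are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return MSE, RMSE, and the coefficient of determination R^2."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

# Hypothetical hydrogen yields: measured values versus model predictions.
observed = [1.0, 2.0, 3.0]
perfect = regression_metrics(observed, [1.0, 2.0, 3.0])    # exact fit
mean_only = regression_metrics(observed, [2.0, 2.0, 2.0])  # predicts the mean
```

Note that MSE and RMSE carry the units (squared or not) of the output variable, so their magnitudes are only comparable between models predicting the same quantity, whereas R² is dimensionless.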

Comparative Analyses among ML Techniques
For each of the ML techniques mentioned earlier, there are benefits and drawbacks to employing them, depending on the nature of the study being conducted in the field of biohydrogen production. A comparison can be made among the ML techniques in terms of their advantages and disadvantages in determining which scenarios are most suitable for each technique.
The ANN is capable of managing and modelling complex interactions among components of a system. Flexibility is an additional benefit, which allows for adaptation to new information that may change over time [12]. This makes ANN a strong technique in studies on microalgal hydrogen, which involve complex interactive processes. Despite these benefits, there are also a few disadvantages. Hossain et al. [44] stated that ANN requires a large amount of training data in order to operate, which is time-consuming and costly to provide. Another significant setback is that ANNs are unable to predict outputs based on inputs that are beyond the training data space [44]. This implies that the performance of the ANN depends largely on the training data provided to the network. Based on these attributes, it can be inferred that ANN is most suitable for studies that already have a lot of training data available, whether from the literature or obtained directly from research work. ANN has been used in a variety of applications, indicating that it is a versatile technique in microalgal hydrogen research.
The RF algorithms share a similar strength with ANN in that they are able to estimate results that are derived from complex functions of predictors with many interactions. In addition, RF has the distinctive strength of being suitable for multivariate data sets that have a large number of predictors and a small number of observations [45]. This implies that in scenarios where training data is limited, the RF will outperform the ANN in terms of predicting the outputs of a particular system. A common issue suffered by ML techniques is overfitting, where a model fits its training data too closely, making it unable to predict future observations reliably. RF has a built-in safeguard against this phenomenon, as it uses the part of the data that each decision tree in the forest has not observed to calculate its goodness-of-fit [46,47]. This attribute was highlighted in the work of Hosseinzadeh et al. (2022), where the mean squared error (MSE) attained in the training and validation phases showed a broadly decreasing trend, indicating that there was no overfitting in the constructed RF model [37]. On the other hand, the development of an RF model can be computationally intensive. Furthermore, if the predictors within the data set are correlated, the PVI procedure may be biased [45].
In conclusion, RF algorithms are most suitable in microalgal hydrogen studies that have limited observations, provided that the computational strength is available to execute the ML technique.
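The safeguard mentioned above relies on the observations that a given tree never saw, usually called out-of-bag (OOB) observations. The sketch below only shows how such a hold-out set arises from bootstrapping; the function name and sample size are hypothetical. For large n, roughly (1 − 1/n)^n ≈ 36.8% of the observations are left out of each bootstrapped copy.

```python
import random

def bootstrap_with_oob(n, rng):
    """Draw a bootstrap sample of size n and list the out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]  # sampled with replacement
    seen = set(in_bag)
    oob = [i for i in range(n) if i not in seen]   # never drawn for this tree
    return in_bag, oob

rng = random.Random(0)
n = 1000
in_bag, oob = bootstrap_with_oob(n, rng)
oob_fraction = len(oob) / n  # expected to be close to 0.368
```

Evaluating each tree only on its own OOB observations yields an error estimate that does not reuse the data the tree was grown on, which is why it can flag overfitting without a separate validation set.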
The SVM is often implemented in microalgal hydrogen studies as a regression model, known as support vector regression (SVR). The working principle of SVR is similar to that of SVM in relying on hyperplanes, except that the hyperplane is fitted to the data rather than used to separate classes. A major advantage of SVR is that it allows for the setting of tolerable errors in the model [48]. This is achieved using an error-tolerance margin (commonly denoted ε) together with the box constraint variable outlined earlier. This gives users more control over the complexity of the function, which is important as the desired result from SVR may vary depending on the study being conducted. Another strong advantage of SVR is that, by using the appropriate kernel function, it can manage highly complex and unstructured data, even in instances where the number of predictors is greater than the number of observations. The disadvantages of SVR include being prone to overfitting as compared with other ML techniques [18]. This can be observed in the work of Hossain et al. (2022), where the R² value for SVM-based models developed to model the microalgal hydrogen production from palm oil mill effluents and activated sludge waste ranged between 0.01 and 0.34 [48]. Another setback of SVR is that choosing the wrong kernel function to construct the model can give an inaccurate depiction of the results. Furthermore, training can be time-intensive when using large data sets [49,50]. To summarize, SVMs are more suitable when users need more control over the results in terms of error tolerance and the kernel function used.
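The notion of a tolerable error in SVR is commonly formalized as the ε-insensitive loss, in which deviations smaller than ε are ignored and larger ones are penalized in proportion to the box constraint C. The sketch below illustrates that standard formulation; the function names, the default values of ε and C, and the numbers used are hypothetical.

```python
def epsilon_insensitive_loss(observed, predicted, epsilon=0.1):
    """Zero inside the tolerance tube, linear in the excess outside it."""
    return max(0.0, abs(observed - predicted) - epsilon)

def svr_objective(weights, pairs, c=1.0, epsilon=0.1):
    """Standard SVR trade-off: model flatness plus C times the total loss.

    weights : coefficients of a (linear) prediction function
    pairs   : iterable of (observed, predicted) values
    """
    flatness = 0.5 * sum(w * w for w in weights)
    penalty = sum(epsilon_insensitive_loss(o, p, epsilon) for o, p in pairs)
    return flatness + c * penalty

inside = epsilon_insensitive_loss(1.0, 1.05)  # within tolerance, no penalty
outside = epsilon_insensitive_loss(1.0, 1.5)  # 0.5 deviation minus 0.1 tolerance
```

Widening ε tolerates larger residuals and simplifies the fitted function, while raising C penalizes residuals outside the tube more heavily, which is the control over complexity described above.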
The attributes of regression as an ML technique in microalgal hydrogen studies vary depending on the type of regression deployed. Regression models are very easy to interpret, allowing for better visualization of the relationships between the variables within the system. Similar to the other ML techniques, most regression models are also prone to overfitting, especially if the independent variables are collinear [18]. An example of a regression model that overcomes this common weakness is Gaussian Process Regression (GPR), which can accurately evaluate its level of uncertainty. Hossain et al. (2022) reported that GPR-based models developed to model the microalgal hydrogen production from palm oil mill effluents and activated sludge waste had R² values above 0.9, indicating good modelling performance [48]. Regression models are most suitable in relationship studies between predictors and predictions of microalgal hydrogen production. The advantages and disadvantages of the machine learning techniques discussed are summarized in Table 3.

Conclusions
In conclusion, ML presents itself as an essential element in microalgal biohydrogen production. Multiple methods that have been used in the literature were evaluated in terms of their efficacy in applications such as relationship studies, classification of results, and prediction of microalgal hydrogen production. The ML techniques applied to microalgal biohydrogen production have shown potential in capturing the nonlinear and complex interactions among the variables involved. RFs are very useful when data is limited, while SVMs offer more control over error tolerance in classification scenarios. Regression is effective in relationship studies and prediction, and ANNs offer the most versatility. Indeed, different studies must adopt different approaches to address specific problems. Future studies could look into developing ML techniques that can overcome issues that arise when current methods are employed, such as overfitting and high computational time.