Maximizing Biogas Yield Using an Optimized Stacking Ensemble Machine Learning Approach

Abstract: Biogas is a renewable energy source that comes from biological waste. In the biogas generation process, various factors such as feedstock composition, digester volume, and environmental conditions are vital in ensuring promising production. Accurate prediction of biogas yield is crucial for improving biogas operation and increasing energy yield. The purpose of this research was to propose a novel approach to improve the accuracy of biogas yield prediction using a stacking ensemble machine learning approach. This approach integrates three machine learning algorithms: the light gradient-boosting machine (LightGBM), categorical boosting (CatBoost), and an evolutionary strategy to attain high performance and accuracy. The proposed model was tested on environmental data collected from biogas production facilities. It employs optimal parameter selection and stacking ensembles and showed better accuracy and lower variability. A comparative analysis of the proposed model against others, such as k-nearest neighbor (KNN), random forest (RF), and decision tree (DT), was performed. The study's findings demonstrated that the proposed model outperformed the existing models, with a root-mean-square error (RMSE) of 0.004 and a mean absolute error (MAE) of 0.0024 for the accuracy metrics. In conclusion, an accurate predictive model cooperating with a fermentation control system can significantly increase biogas yield. The proposed approach stands as a pivotal step toward meeting the escalating global energy demands.


Introduction
Biogas is a renewable energy source that is produced through the decomposition of organic matter in an anaerobic environment [1]. It is primarily composed of methane (CH4) and carbon dioxide (CO2), along with small amounts of other gases such as hydrogen sulfide (H2S) and trace compounds [2,3]. Biogas can be used as a versatile fuel for various purposes, including electricity generation and heating, and even as a transportation fuel. Biogas production is a complex process influenced by multiple interconnected factors, including feedstock composition, environmental parameters, and organic loading rate [4]. Different feedstocks have varying levels of biodegradability and methane potential. Common feedstocks include animal manure, agricultural residues, food waste, and wastewater sludge [5,6]. Further, environmental parameters such as temperature, humidity, pH, and moisture level play a vital role during the biogas production process, where the optimal temperature range is typically between 35 °C and 55 °C [7,8]. Higher temperatures can accelerate digestion, but extreme temperatures can inhibit microbial activity [8]. The pH level of the digester is crucial for maintaining optimal microbial activity. Most biogas production occurs in a slightly acidic-to-neutral pH range from 6.5 to 7.5 [9]. The length of time the organic matter remains in the digester, known as the retention time, affects biogas production. Longer retention times allow for a more complete degradation of the feedstock and increased gas production [10]. The availability of essential nutrients, such as nitrogen and phosphorus, plays a role in microbial activity and biogas production, where the carbon-to-nitrogen (C/N) ratio must be maintained in the optimum range for efficient biogas production [11].
With the evolution of technology, artificial intelligence (AI), and the internet of things (IoT), it is feasible to predict biogas generation from a dataset of the available influencing parameters. Feedstock composition can vary significantly, even within the same category. This makes it difficult to establish a standardized prediction model that applies to all types of organic matter. However, environmental parameters are a common factor that contributes to the overall biogas generation process. This research aimed to investigate the contribution of environmental parameters to biogas prediction and to propose a new prediction algorithm that guarantees higher accuracy in estimating biogas output compared to the existing methods.
Recent studies have highlighted the remarkable advancements made by AI and IoT techniques in enhancing the renewable energy sector [12,13]. From a biogas perspective, research studies on AI in biogas prediction were performed to enhance the biogas generation process [14,15]. For example, the support vector machine (SVM) has been presented as the most popular machine learning algorithm to predict biogas output in several studies on wastewater treatment plants. The study findings showed that SVMs were able to achieve an accuracy of 95% [16-18]. Another researcher explored the contribution of the artificial neural network (ANN) algorithm to biogas prediction and reported a highest accuracy of 92% [19]. Another paper investigated the application of the decision tree algorithm to biogas prediction by dividing the data into small groups until each group could be predicted with a high degree of accuracy; in this way, a model accuracy of 89% was achieved. Further, the RF algorithm was used, combining multiple decision trees to improve the accuracy of the prediction, and an accuracy of 91% was reported [20].
Prior research performed predictions based on single machine learning models, demonstrating their distinct strengths. Although a single prediction model can enhance prediction accuracy by adjusting parameters and choosing forecasting variables in the prediction process, it also carries uncertainties related to its structure and therefore faces challenges when adapting to various environments [21,22]. Stacking is a widely adopted ensemble learning technique that adeptly reduces bias and variance by blending weaker models to form a more robust one and has become prevalent in the machine learning field [23,24]. Recent research studies proposed the integration of multiple single prediction models to construct an ensemble model that can effectively leverage the strengths of these diverse models, ultimately enhancing the dependability and precision of biogas prediction [25,26]. The combined models exhibited superior performance compared to the single models. However, the stacking ensemble model for biogas prediction still has significant unexplored potential.
This research aimed to optimize the accuracy of biogas yield prediction using a stacking ensemble approach. In this context, a triadic ensemble model combining the LightGBM, CatBoost, and evolutionary strategy models was used, since these models are considered robust and have complementary features that enable their combination to achieve the lowest loss at high processing speed. In the biogas generation process, major factors such as environmental parameters, feedstock composition, organic loading rate, and digester size can effectively predict biogas yield. The scope of this research was to optimize the accuracy of biogas yield prediction using environmental parameters. Accuracy metrics such as the MAE and RMSE were adopted to evaluate the performance improvement of the proposed model compared to the other models explored in the research. In addition, the R-squared metric was used to evaluate the fitness of the model.

Materials and Methods
This section describes the materials and methodology used in this study. Regarding the materials, data were collected through an IoT framework designed and deployed at a home digester in a previous study. The data were subjected to pre-processing procedures involving the elimination of errors and outliers, the imputation of missing values, and normalization to ensure consistent scaling across all features. The proposed model was developed by merging three base models, i.e., LightGBM, CatBoost, and an evolutionary strategy. Finally, the proposed model was compared to other machine learning techniques. Figure 1 presents the proposed method used in this research.

Data Collection
This research is part of other ongoing research work. In a previous study, an IoT framework was developed to monitor and control the status of a biogas digester; it served as the data collection tool in this study. The data were collected over three months, starting in March 2023, on a home digester. The dataset contains 3000 records encompassing operational parameters such as digester temperature (T), pH level, moisture level, and gas volume. The description of the IoT platform and data collection process was reported in our previous study [27]. The temperature is important, as it affects the production rate. The pH level is vital for determining the stability and corrosiveness of the biogas. Gas volume provides insights into the energy content of the gas. The moisture level measurement is important, as moisture influences the movement of microorganisms. Table 1 presents an explanation of the variables considered in the proposed biogas prediction model.
In the machine learning modeling process, data pre-processing must be conducted to ensure the accuracy of the model. In a previous study [27], data pre-processing was performed, and rows with missing or extreme peak values were substituted with the mean values of all available data. Timestamp values were converted from the 12 h system to the 24 h system using the strftime function from the datetime library to facilitate the consideration of the time factor.
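The mean imputation and 12 h to 24 h conversion described above can be sketched with the standard library alone; the sample values and timestamp format below are illustrative assumptions, not the study's actual data layout.

```python
# Sketch of the pre-processing steps: mean imputation of missing readings
# and 12 h -> 24 h timestamp conversion via strftime.
from datetime import datetime
from statistics import mean

def impute_with_mean(values):
    """Replace missing readings (None) with the mean of the available ones."""
    available = [v for v in values if v is not None]
    fill = mean(available)
    return [v if v is not None else fill for v in values]

def to_24h(timestamp_12h):
    """Convert a 12 h timestamp such as '03/15/2023 02:30 PM' to 24 h form."""
    parsed = datetime.strptime(timestamp_12h, "%m/%d/%Y %I:%M %p")
    return parsed.strftime("%m/%d/%Y %H:%M")

temps = [35.2, None, 36.1, 35.8]
cleaned = impute_with_mean(temps)            # None replaced by the mean
converted = to_24h("03/15/2023 02:30 PM")    # '03/15/2023 14:30'
```

In practice the same steps would run over every sensor column of the 3000-record dataset before model training.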

Machine Learning Models
In this research, we propose a triadic ensemble machine learning model, which integrates three distinct algorithms: LightGBM, CatBoost, and an evolutionary strategy. The model engages supervised machine learning regression, where a set of input data is employed to predict the output data [28]. The proposed model was compared with existing regression models, namely, random forest, KNN, and decision tree. The most effective model is recommended to predict biogas production. The predicted values can be utilized to optimize biogas plant operations or devise strategies for future biogas production.

K-Nearest Neighbor (KNN)
The KNN algorithm is a machine learning technique that can be used for regression tasks. It relies on the idea that similar data points tend to have similar values. Throughout the training process, the KNN algorithm stores the whole training dataset as a reference to perform predictions, calculating the distance between the input data point and the stored training data using the Euclidean distance [29,30].
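As a minimal illustration of the distance-based prediction just described, a pure-Python KNN regressor (a didactic sketch, not the library implementation used in the study) might look like this:

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a value for `query` as the mean target of its k nearest
    training points under Euclidean distance."""
    dists = [
        (math.dist(x, query), y)      # math.dist computes Euclidean distance
        for x, y in zip(train_X, train_y)
    ]
    dists.sort(key=lambda pair: pair[0])
    nearest = dists[:k]
    return sum(y for _, y in nearest) / k

# Toy usage: two clusters of points with distinct target values.
pred = knn_predict([(0, 0), (1, 0), (10, 10), (11, 10)],
                   [1.0, 1.0, 9.0, 9.0],
                   (0.5, 0), k=2)
```

Here the query point sits in the first cluster, so the two nearest neighbors both carry the target 1.0.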

Random Forest
Random forest is a powerful machine learning (ML) algorithm. It can handle both classification and regression problems. Figure 2 shows how random forest combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption. The random forest algorithm is an expansion of the bagging method that employs both bagging and feature randomness to produce an uncorrelated forest of decision trees [31].

Decision Tree
Decision trees (DTs) are a popular supervised learning method that works for both regression and classification problems. Decision trees build a model that predicts the value of a target variable by inferring basic decision rules from data features. A decision tree is a hierarchical decision support model that displays options and their probable outcomes, including chance occurrences, resource expenses, and utility [32].

Light Gradient-Boosting Machine (LightGBM)
LightGBM, as suggested by Microsoft [33], is an advanced supervised algorithm built on the foundation of gradient-boosted decision trees. It has found applications in various domains, including medicine [28], economy [34], and agriculture [35]. LightGBM is a gradient-boosting framework that uses tree-based learning algorithms and relies on a loss function that measures the discrepancy between the predicted and the actual values of the target variable, as shown in Equation (1) [36,37]:

L(Θ) = Σ_i l(y_i, F(x_i)) + Ω(F) + Ψ(Θ)    (1)

where L(Θ) is the loss function that depends on the model parameters Θ. The goal of machine learning is to find the optimal values of Θ that minimize the loss function. F(x_i) is the model output, or prediction, for the input x_i; F is a function that maps the input space to the output space and is determined by the model parameters Θ. The sum over all training samples (x_i, y_i) is denoted by Σ. The loss term l(y_i, F(x_i)) measures the difference between the predicted value F(x_i) and the true value y_i. The regularization term Ω(F) is a function of the model output F and penalizes the complexity of the model. Additionally, there is an optional regularization term, Ψ(Θ), which is a function of the model parameters Θ and penalizes the magnitude of the parameters.

Gradient Boosting (CatBoost)
CatBoost is a gradient-boosting framework introduced in 2017, with the ability to handle categorical features effectively [38]. CatBoost relies on a loss function that measures the discrepancy between the predicted values and the actual values of the target variable. The CatBoost algorithm minimizes the loss function by updating the ensemble in each iteration. As indicated in Equation (2), at the t-th iteration, the predicted value of the ensemble for a specific sample x_i is denoted by F_{t−1}(x_i), and the update equation for F_t(x_i) is:

F_t(x_i) = F_{t−1}(x_i) + γ_t h_t(x_i)    (2)

The learning rate γ_t in Equation (2) corresponds to the learning rate for the t-th iteration, while h_t(x_i) represents the prediction made by the t-th decision tree for the sample x_i [38-40].
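The update in Equation (2) can be illustrated with a generic gradient-boosting loop for squared-error loss, using a depth-1 "stump" as the weak learner h_t. This is a didactic sketch of the boosting update, not CatBoost's actual ordered-boosting implementation.

```python
# Minimal illustration of F_t(x) = F_{t-1}(x) + γ_t · h_t(x) with a
# single-feature decision stump as the weak learner.

def fit_stump(xs, residuals):
    """Best single-threshold split minimizing squared error on the residuals."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, rounds=20, lr=0.5):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]      # negative gradient
        h = fit_stump(xs, residuals)
        stumps.append(h)
        pred = [p + lr * h(x) for p, x in zip(pred, xs)]   # F_t = F_{t-1} + γ·h_t
    return lambda x: sum(lr * h(x) for h in stumps)

model = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```

Each round fits a stump to the current residuals, so the ensemble's error shrinks geometrically on this toy dataset.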

Evolutionary Strategy
Evolution strategy is a global optimization algorithm that incorporates stochastic elements, inspired by the biological principle of evolution through natural selection [41]. The evolutionary strategy algorithm optimizes the parameters θ_1, θ_2, ..., θ_n of a model M to minimize the loss function L(M, θ). It generates a population of N models with random parameters θ_1, θ_2, ..., θ_N and evaluates the fitness of each model in the population based on the loss function f(θ_i) = L(M, θ_i). It then selects the top-performing models, i.e., the top k models in the population, based on their fitness scores [42,43].
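The select-and-mutate loop just described can be sketched as follows; the population size, mutation scale, and toy loss function here are illustrative choices, not those used in the study.

```python
# Minimal evolution strategy: random population, fitness = loss,
# keep the top k elites, refill with Gaussian mutations of the elites.
import random

def evolve(loss, dim, pop_size=30, top_k=5, generations=60, sigma=0.3, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-2, 2) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=loss)                 # lower loss = fitter
        elites = population[:top_k]               # select the top k models
        offspring = [
            [p + rng.gauss(0.0, sigma) for p in rng.choice(elites)]
            for _ in range(pop_size - top_k)      # mutate the elites
        ]
        population = elites + offspring           # elitism keeps the best intact
    return min(population, key=loss)

# Toy loss: squared distance of the parameter vector from (1, -1).
best = evolve(lambda t: (t[0] - 1) ** 2 + (t[1] + 1) ** 2, dim=2)
```

Because the elites survive unchanged each generation, the best loss in the population never increases.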

Proposed Triadic Ensemble Model
The triadic ensemble model utilizes the synergies of the described base models to enhance their performance and boost their generalization capabilities. The triadic ensemble model learning was broken down into two phases: the training phase of the base models and the training phase of the merged model. During the initial phase, the original dataset was partitioned into a training set and a testing set. The training set was then used for training through k-fold cross-validation. In k-fold cross-validation, the training set was divided into k subsets, with each subset serving in turn as a validation set, while the remaining (k − 1) subsets were utilized for training the model and generating predictions for that specific validation subset. Table 2 indicates the proposed model workflow.

Step 5: Utilize the predictions for optimization and planning.
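The two-phase workflow above (out-of-fold base-model predictions, then a meta-model trained on them) can be sketched with scikit-learn. A decision tree and a KNN regressor stand in for LightGBM and CatBoost, and synthetic data stands in for the digester dataset.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))               # stand-in features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.05, 200)

base_models = [DecisionTreeRegressor(max_depth=4, random_state=0),
               KNeighborsRegressor(n_neighbors=5)]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Phase 1: out-of-fold predictions, so the meta-model never sees
# predictions a base model made on its own training data.
oof = np.zeros((len(y), len(base_models)))
for train_idx, val_idx in kf.split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict(X[val_idx])

# Phase 2: train the meta-model on the stacked predictions,
# then refit each base model on the full training set.
meta = LinearRegression().fit(oof, y)
for model in base_models:
    model.fit(X, y)

def stack_predict(X_new):
    level1 = np.column_stack([m.predict(X_new) for m in base_models])
    return meta.predict(level1)
```

The out-of-fold construction is what keeps the meta-model from simply rewarding whichever base model overfits hardest.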

Evaluation Metrics
The evaluation metrics used in this paper included the RMSE, the MAE, and the coefficient of determination (R-squared). These metrics are used to assess the performance of regression models. The RMSE measures the average squared difference between predicted and actual values, with a lower RMSE indicating a better fit [44,45]. Similarly, the MAE measures the average absolute difference between predicted and actual values, with a lower MAE also indicating a better fit [46]. The R-squared gauges how well a model fits the data, with a higher R-squared value indicating a better fit. We compared different regression models based on these metrics. The model achieving the lowest RMSE and MAE as well as the highest R-squared was considered the best model for the task. The mathematical formulas are as follows:

RMSE(y, ŷ) = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )

MAE(y, ŷ) = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|

R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²

where ŷ_i is the predicted value of the ith sample, y_i is the corresponding true value, ȳ is the mean of the true values, and n is the total number of samples.

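Implemented directly from their definitions, the three metrics read:

```python
# RMSE, MAE, and R-squared computed from their textbook definitions.
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives RMSE = MAE = 0 and R² = 1; predicting the mean of the targets gives R² = 0.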

Hyperparameter Optimization
In machine learning, it is essential to set hyperparameters before initiating the model training process. This is especially important for algorithms like LightGBM, CatBoost, and evolutionary strategies, as their hyperparameters significantly impact each model's predictive accuracy. This study used hyperparameter optimization techniques to improve the performance of the proposed model. RandomizedSearchCV is a tool used to search the hyperparameter space and improve prediction accuracy. RandomizedSearchCV identified the best hyperparameters for the LightGBM model. The 'n_iter' parameter of RandomizedSearchCV was set to 10, indicating that 10 sets of hyperparameters would be randomly sampled from the identified parameter space. Moreover, the 'cv' parameter was set to 5, meaning that nested 5-fold cross-validation would be used to measure the performance of each hyperparameter arrangement. Once the best hyperparameter configuration was known, the LightGBM model was trained on the entire training dataset using these settings. The trained model was then employed to make predictions on the unseen test dataset. Figure 3 shows how nested 5-fold cross-validation was performed; for each iteration, one of the 5 folds was considered the testing dataset.
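A sketch of this search is given below, with scikit-learn's GradientBoostingRegressor standing in for LightGBM and an illustrative parameter space; the study's actual search space is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 4))               # stand-in training data
y = X[:, 0] * 3.0 - X[:, 1] + rng.normal(0, 0.1, 150)

param_space = {                                    # illustrative search space
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_space,
    n_iter=10,          # 10 randomly sampled configurations, as in the text
    cv=5,               # 5-fold cross-validation per configuration
    random_state=0,
)
search.fit(X, y)
best_model = search.best_estimator_   # refit on the full training set by default
```

With `refit=True` (the default), the best configuration is automatically retrained on the whole training set, matching the procedure described above.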

Results and Discussion
This section details the prediction results obtained from modeling the biogas digester environmental data collected from March to May 2023, using the proposed stacking ensemble model. For the performance analysis, the model was compared with the other models presented in this study, using performance evaluation metrics such as the RMSE, the MAE, and the R-squared. Additionally, the correlation of the different variables was explored.

Proposed Model Prediction Results
The dataset used in this study comprises 3000 records. These data regard environmental parameters that have an impact on biogas yield. The model was set up to predict the volume of biogas yielded in the following hours, referring to five input values previously measured in the experiment (ambient temperature, indoor temperature, moisture, pH level, and time). For selecting the training and testing datasets, the k-fold cross-validation approach was used. We chose k = 5 to balance the computational cost associated with high values of k against the bias caused by k = 3. Among the k-fold cross-validation methods, nested cross-validation was selected to optimize hyperparameter tuning on the dataset and reduce bias, and five-fold cross-validation was carried out within each outer fold. For each iteration, the dataset was divided into four folds (80%) of training data and one fold (20%) of testing data. The performance of the models was evaluated using the three metrics mentioned in Section 2.3. The cross-validation results showed only minor changes for different values of k, as indicated in Table 3.
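The fold arithmetic above (k = 5, so each iteration trains on 80% of the records and tests on 20%) can be checked directly with scikit-learn's KFold; an index array stands in for the 3000-record dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

records = np.arange(3000)                # stand-in for the 3000-record dataset
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(records), start=1):
    # 4 folds (80%) for training, 1 fold (20%) for testing, per iteration
    assert len(train_idx) == 2400 and len(test_idx) == 600
```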

Comparative Analysis of Machine Learning Models
For the performance analysis of the model, the results of the proposed model were compared to those of three machine learning models, i.e., KNN, DT, and RF, using the same five-fold cross-validation on the same dataset. The results for different values of k were computed; Table 4 shows only the average results in terms of the R², RMSE, and MAE values. The RMSE measures the average distance between the predicted values and the actual values; a lower RMSE indicates better accuracy. The MAE measures the average absolute difference between the predicted values and the actual values; like the RMSE, a lower MAE suggests better accuracy. The average differences between the predicted values and the actual values of the biogas yield were 0.0040, 0.0055, 0.0062, and 0.0059 for the proposed model and the RF, DT, and KNN models, respectively, whereas the average absolute differences between the predicted values and the actual biogas yield were 0.0024, 0.0044, 0.0049, and 0.0047 for these models, respectively, as presented in Figure 4. Overall, the proposed method demonstrated the highest accuracy, with the lowest RMSE and MAE values, followed by the random forest and KNN models, while the decision tree model showed relatively lower accuracy.

Variable Importance
Scatterplots offer a valuable visualization of the relationships between the various environmental parameters that affect biogas yield. However, it is possible to enhance the design of these diagrams and the information they contain. This could be achieved by providing clearer labels for the axes and, in this study, by reducing the number of data points reported in Figure 6. The ordinate axis represents the moisture content of the biogas, measured as a percentage; a higher percentage indicates a greater amount of moisture in the biogas. The abscissa axis represents the temperature of the biogas, measured in degrees Celsius; a higher value indicates a hotter biogas. Each data point on the scatterplot represents a single measurement of moisture and temperature for a specific sample of biogas. The color of the data points represents the gaz_change value for the corresponding sample, with darker shades indicating higher values. Additionally, the size of the data points represents the pH value, with larger points indicating higher values. By analyzing the scatterplot, several observations can be made. Firstly, there is a general trend of increasing moisture content with increasing temperature. Furthermore, the gaz_change value appears to be negatively correlated with the moisture content, suggesting that biogas with a higher moisture content tended to have a lower gaz_change value. Lastly, there seems to be a positive correlation between the pH value and the temperature, indicating that biogas at higher temperatures tended to have higher pH values.

Conclusions
Biogas is one of the most promising sources of energy available to local communities due to its characteristics. Recently, biogas operators have been facing several challenges, such as a lack of technology to monitor the indoor production process. In the biogas generation process, environmental parameters are among the major factors affecting biogas production. With artificial intelligence (AI) and internet of things (IoT) technology, it is possible to increase biogas yield using an accurate predictive model cooperating with an IoT-based fermentation control system. In prior research, an IoT-based data collection tool was designed to collect environmental data on a home digester. The dataset comprises indoor and ambient temperature, moisture level, and biogas yield. In this research, we proposed a biogas yield prediction model that guarantees high accuracy by applying a stacking-based learning model. The triadic ensemble model combining the LightGBM, CatBoost, and evolutionary strategy models was developed and compared with other machine learning models. The stacking model outperformed the KNN, RF, and DT models in terms of the RMSE, R², and MAE metrics, and the proposed model performed the best on both the training and testing datasets.
The findings of this research indicated that the triadic ensemble machine learning approach significantly improved biogas yield prediction. The proposed method outperformed all the other models, achieving the lowest RMSE and MAE values of 0.0040 and 0.0024, respectively. It also showed the highest R-squared value of 0.7808, indicating superior predictive accuracy and precision. This advancement has significant implications for enhancing biogas plant design and operation, increasing energy output, and addressing environmental challenges.
Due to the testbed scenario, the model must be adapted to newly collected IoT-based time-series data. Therefore, in the future, it will be important to integrate transfer learning models and build an intelligent re-training pipeline with experimentally determined thresholds.

Figure 2. The random forest model.

Figure 4. Comparison of the RMSE and MAE of the different models.

Figure 5 presents a comparative analysis using the R-squared metric. The graph illustrates that the different models had varying R-squared values. The random forest model achieved an R-squared value of 0.6863, indicating that approximately 68.63% of the variance in the target variable could be explained by the model. The decision tree model obtained an R-squared value of 0.6240, indicating that approximately 62.40% of the variance could be explained by this model. The KNN model achieved an R-squared value of 0.6540, indicating that approximately 65.40% of the variance could be explained by this model. The proposed method obtained the highest R-squared value of 0.7808, indicating that approximately 78.08% of the variance could be explained by this model. Overall, the proposed method demonstrated the best model fit, with the highest R-squared value, indicating that it could better explain the variance in the target variable compared to the other models.

Figure 5. Comparison of the R-squared values of the different models.

Figure 6. Comprehensive insights: scatterplot matrix analysis of the biogas dataset.

Table 1. Explanatory variables in the biogas prediction model.
o Initialize the population of models with random parameters: P = [M1, M2, ..., MN].
o For each generation (g = 1 to G):
• Evaluate the fitness of each model in the population based on prediction accuracy.
• Select the top-performing models (e.g., based on the highest fitness scores) for reproduction.
• Generate offspring models through variation and crossover operations.
• Replace the least fit models in the population with the offspring.
o Select the best model from the final population based on fitness.
Step 4: Prediction with the trained model:
o Use the best model to predict biogas production for new data inputs: y_pred = best_model.predict(X_new), where X_new represents the new data inputs.

Table 3. Model results through the cross-validation test.

Table 4. Comparison of the models' results through cross-validation.