2. Development of an Algorithm for Determining an Effective Model for Forecasting Volumes of Biogas Production from Household Organic Waste and Data Collection and Preparation
The creation of an effective model for forecasting the volume of biogas production from household organic waste is performed based on the algorithm presented in
Figure 1. It envisages the use of machine learning methods to substantiate an effective model for forecasting the volume of biogas production from household organic waste based on the collected data. The aim is to build a model that provides accurate prediction of biogas yield (SGP) based on input data: type of waste (food FW or yard YW), total organic solids (TS), and volatile organics (TVS).
At the initial stage, the collection and preparation of the necessary data is carried out, which is the basis for the justification of an effective model for forecasting the volume of biogas production from organic household waste. At the same time, there are specific features that are reflected in the algorithm we develop (
Figure 1). The target variable that we forecast is the amount of biogas output (SGP, m
3/kg TVS) from organic household waste, which is determined by the following factors:
- –
type of waste (FW and YW)—a categorical variable that determines the type of organic waste;
- –
volume of solid organic substances (TS, kg/m3)—the amount of solid organic substances in organic waste;
- –
by the content of volatile organic substances (TVS, % of TS)—the percentage of volatile organic substances from the total amount of solid organic substances.
The data upload process is performed in a CSV (Comma-Separated Values) format, which is quite standard and simple. Using CSV files is convenient for exchanging data between programs and for storing and processing data tables.
If necessary, the next step is component data analysis, which is performed based on the method of reducing the dimensionality of the data set, increasing interpretability, and minimizing the risk of data loss [
31,
32,
33]. Thanks to this, a smaller number of sets of variables is created which belongs to the variables described above and, accordingly, contributes to the increase in variation in the data [
34,
35,
36]. This analysis ensures the creation of linear combinations of the original observed variables that explain the largest variance in the data of a given component (
Figure 2). Component analysis of the data is performed using the PCA package of the Scikit-learn library in the Python 3.11 programming language.
At the next stage, a description of the data preparation process is carried out, with the detection of gaps and the filling of missing values. Data preparation is an important process of any machine learning method [
37,
38,
39,
40,
41,
42]. It includes a number of tasks such as cleaning, standardizing, scaling, and sampling data. At the same time, one of the important tasks of data preparation is the identification and filling of gaps. Data gaps can occur for a variety of reasons, such as data entry errors, incomplete data sets, or failure to measure some values.
Data preparation is a key step in the machine learning process and involves several steps. Let us present a general overview of these steps. First of all, data collection is performed, as a result of which the desired data set
is formed with the selection of input features and the outline
of answers (labels)
:
where
—data set;
—input characters,
—answers (marks).
The next step is to perform a data cleanup:
where
—cleaned data set.
After that, processing and filling of missing values is performed:
where
—a data set with recovered (filled) missing values.
The Jupyter Notebook environment is used to prepare data on biogas production from household organic waste. This makes it possible to visualize the obtained results in a convenient format, which increases the efficiency of working with data despite performing separate data preparation operations. Based on the data analysis, we construct a histogram for two categories of waste (FW and YW) (
Figure 3). Using the Matplotlib and Seaborn libraries, we plot the distributions of the volume of solid organic substances (TS, kg/m
3), the content of volatile organic substances (TVS, % of TS), and the volume of biogas output (SGP, m
3/kg TVS) (
Figure 4).
The resulting histogram (
Figure 3) indicates the distribution of categories in the “Waste” column. There are two unique data categories in our data set, which characterize food (FW) and yard waste (YW). At the same time, there are 1.818 instances of food (FW) and 615 of yard waste (YW) data.
It is established that the characteristics of food (FW) and yard waste (YW) change in different ways, which requires their separate analysis. For this purpose, we construct diagrams of changes in the volume of solid organic substances (TS, kg/m
3), the content of volatile organic substances (TVS, % of TS), and the volume of biogas output (SGP, m
3/kg TVS) (
Figure 5).
It is established that the amount of organic solids (TS) in FW food waste has an average value of TS of 247.0 kg/m3 and a small standard deviation of 18.9 kg/m3. At the same time, this indicator in YW yard waste has an average TS value of 738.6 kg/m3, and a much larger standard deviation of 138.0 kg/m3. Yard waste (YW) is found to have significantly higher mean and maximum TS compared to food waste (FW). This is due to a greater amount of wood, branches, and leaves in the yard mass.
Regarding the content of volatile organic substances (TVS), in FW food waste, the average value of TVS is 90.0%, the standard deviation is 2.2%. In YW yard waste, the average TVS value is 89.7%, and the standard deviation is 1.4%. The TVS content of both types of waste is found to be very similar. This indicates that they have a similar propensity for biodegradation.
Regarding the volume of biogas output (SGP), in FW, the mean value of SGP is 0.848 m3/kg TVS, and the standard deviation is 0.108 m3/kg TVS. Meanwhile, in YW yard waste, the mean value of SGP is 0.249 m3/kg TVS, and the standard deviation is 0.067 m3/kg TVS. It is found that FW has a significantly higher biogas generation potential compared to YW. This may be due to the higher concentration of easily decomposable organic substances in FW.
FW presents significantly higher biogas generation potential, making it more valuable. The obtained results indicate that the characteristics of FW and YW are significantly different. YW has higher TS and greater TS variability for anaerobic processing.
In order to perform a detailed analysis, we present the statistical characteristics of numerical variables regarding the volume of biogas production from household organic waste in
Table 1.
The amount of TS using FW has an average value (mean) of 247.0 kg/m3 and a standard deviation (std) of 18.9 kg/m3. The range of changes in the volume of solid organic substances (TS) ranges from 214.1 kg/m3 to 279.9 kg/m3. As for yard waste, the mean is 738.6 kg/m3, and the standard deviation (std) is 138.0 kg/m3. The range of changes in the volume of solid organic substances for this type of waste is from 505.4 kg/m3 to 971.8 kg/m3.
The content of volatile organic substances (TVS, % of TS) using FW has an average value (mean) of 90.0% and a standard deviation (std) of 2.2%. The range of changes in the content of volatile organic substances is from 86.1% to 93.8%. As for YW, the average value (mean) of the content of volatile organic substances is 89.7%, and the standard deviation (std) is 1.4%. The range of changes in the content of volatile organic substances for this type of waste is from 87.3% to 91.9%.
The amount of biogas output (SGP, m3/kg TVS) using FW has an average value (mean) of 0.848 m3/kg TVS and a standard deviation (std) of 0.108 m3/kg TVS. The range of changes in the volume of biogas output (SGP) is from 0.657 m3/kg TVS to 1.035 m3/kg TVS. Regarding YW, the average value (mean) of the volume of biogas output is 0.249 m3/kg TVS, and the standard deviation (std) is 0.067 m3/kg TVS. The amount of biogas output for this type of waste ranges from 0.134 m3/kg TVS to 0.364 m3/kg TVS.
The obtained statistical characteristics make it possible to gain an understanding of the distribution and features of each variable for both considered types of waste which affect the projected amount of biogas production from organic household waste.
In the next step, normalization and standardization of individual data features is performed, described by the following formulas [
32]:
where
—normalized value of the indicator;
—empirical value of the indicator;
—smallest value of the indicator in the sample;
—largest value of the indicator in the sample.
where
—standardized value of the indicator;
—average value of the indicator value in the sample;
—standard deviation of the indicator in the sample.
After that, the coding of categorical features is performed [
32]:
where
—coded value of categorical features.
In our research, data preprocessing is performed using the Scikit-learn library in the Python programming language. This makes it possible to prepare data for forecasting the volume of biogas production from household organic waste, a fragment of which is shown in
Table 2.
In the next step, features are selected:
where
—data set with selected features.
After that, the input parameters of the forecasting model of biogas production volumes from household organic waste are determined. We select the attributes that have the highest correlation with the target attribute “SGP”. For this, a correlation matrix is used (
Figure 6). For each input parameter, their average value is determined (according to
Table 2):
where
—average value of the input parameter;
—current value of the parameter;
—number of data instances;
—number of data attributes.
Quantitative values of the components of correlation matrix
are determined by formula [
27]
where
—correlation between the input data and target feature
.
The correlation between the quantitative values of the input factors (
x1,
…,
x3) and the target characteristic “SGP” (
) is determined by formula [
27]
The correlation matrix (
Figure 6) represents the relationships between the input attributes and the target feature in the data set
. For the target trait “SGP”, the correlation with other traits is determined by Formula (10).
The obtained results of the calculations regarding the determination of the quantitative values of the correlation coefficients are presented in
Table 3.
Based on the obtained values of the correlation coefficients between the input attributes and the target feature “SGP”, it should be noted that the attributes “Waste” and “TS” have a strong correlation with the attribute “SGP”, and the attribute “TVS” has a correspondingly weak correlation with the vector of target values (output variable).
In the future, the data are divided into training and test samples:
We suggest using 20% of the data for testing and the rest for training. That is, the ratio between training and test samples is .
This helps assess the ability of the model to generalize knowledge on new, previously unseen data. We describe a general approach to preparing data for machine learning in order to forecast volumes of biogas production from household organic waste.
3. Results of the Substantiation of an Effective Model for Forecasting Volumes of Biogas Production from Household Organic Waste and to Evaluate Its Accuracy Indicators
In order to train the forecasting model of biogas production volumes from household organic waste, we selected the following algorithms: (1) Linear Regression; (2) Ridge Regression; (3) Lasso Regression; (4) Random Forest Regression; (5) Gradient Boosting Regression. Let us briefly describe each of these algorithms and their basic concept.
Widely used in the practice of machine learning when solving forecasting problems is Linear Regression, which is described by the following formula:
where
—target variable (biogas production);
—signs (properties of organic waste);
—intercept;
—regression coefficients;
—an error.
The following Ridge Regression is an adaptation of the popular and widely used Linear Regression algorithm. It improves upon conventional Linear Regression by slightly modifying its cost function, which is described by the following formula:
where α—parameter of regularization.
The specified Equation (13) adds a regularization term containing squared coefficients ().
The Lasso Regression algorithm, widely used for solving forecasting problems, deserves attention:
Equation (14) adds a regularization term containing the absolute values of the coefficients ().
Random forest for regression is considered a very powerful and robust machine learning algorithm because it handles multivariate data, missing values, and outliers well. Random Forest Regression (Random Forest for regression) involves a combination of many decision trees (Random Forest trees). We mark
as a prediction from the
ith tree; then,
where
—as a prediction from i tree;
—the number of trees in the ensemble.
Each tree offers, from Expression (15), its prediction, and as a result, the Random Forest averages or votes for these predictions. At the same time, the Random Forest reduces overtraining and ensures the importance of features.
Gradient Boosting Regression involves using a combination of weak models (usually decision trees). Each new model corrects the mistakes of the previous model. The forecast is formed as a weighted sum of the forecasts of all with the number of models
, with the number of models
:
Each model is added with a weight, depending on the learning rate:
where
—prediction from i of the model;
—number of models in the ensemble;
—the weight m—the model has in gradient boosting.
We use the specified algorithms to train a model for forecasting the volume of biogas production from household organic waste. As a result of our research, for each of the created models, we built graphs for residual analysis (residual) on training data (
Figure 7,
Figure 8,
Figure 9,
Figure 10 and
Figure 11).
Based on the graphs of the residual values on the training data for the researched models of forecasting the volumes of biogas production from organic household waste, it can be noted that the models provide different results in terms of their accuracy. For a more accurate assessment of the obtained models, we calculated the following indicators: (1) coefficient of determination for the training data set, ; (2) coefficient of determination for the test data set, ; (3) mean cross-validation score for the training data set, .
The coefficient of determination (
) for the training data set indicates the percentage of variation of the dependent variable:
where
—number of observations;
—real values of the target variable “SGP”;
—predicted values of “SGP” are derived from the model;
—mean value of target variable “SGP”.
The coefficient of determination (
) for the test data set indicates how effectively the model matches the test data, i.e., evaluates the model’s fit with new, previously unseen data:
The average cross-validation score for the training data set
provides an estimate of the model’s performance based on cross-validation, taking into account the average value of the accuracy scores on different subsamples:
where
—the number of convolutions in cross-validation;
—coefficient of determination of the training data set
for each convolution
.
This means that indicator helps to evaluate the model’s resistance to variability in the training set using cross-validation.
The obtained results of the calculations regarding the model training accuracy indicators are presented in
Table 4.
Considering the obtained results, it should be noted that the “Gradient Boosting Regressor” model provides the highest coefficient of determination, , on the training data, which indicates the good ability of the model to explain the variation in the training data. However, on test data, this indicator is smaller and amounts to , which indicates retraining of the model on training data and incomplete generalization on new data. At the same time, the “Random Forest Regressor” model provides the highest coefficient of determination, , on the training data, and on the test data this indicator is somewhat smaller and is , which indicates retraining of the model on training data and incomplete generalization on new data. As for the root mean square error, all the studied models have the same indicator, , on the training data.
However, the assessment of which model is best for predicting biogas production from household organic waste depending on the specific requirements and context of the problem regarding its accuracy. At the same time, one should take into account the interpretability of the model for forecasting volumes of biogas production from household organic waste and the speed of its learning.
To determine the accuracy of “Random Forest Regressor” and “Gradient Boosting Regressor” models, we chose MSE and MAE indicators. They provide an assessment of the accuracy of the specified model during its training and testing. They provide an evaluation of the model’s performance during different training epochs. The definition of MSE (mean squared error) involves the calculation of the root mean square difference between the predicted values of biogas production volumes from household organic waste using the model and the real biogas production values:
where
—value of biogas production volume from household organic waste, m
3/kg TVS;
—forecast value of biogas production volume from household organic waste, m
3/kg TVS;
—number of examples in the initial data array, units.
The next indicator that characterizes the accuracy of the model is the MAE (mean absolute error), which characterizes the average value of the difference between the predicted values of the model and the real values of the volume of biogas production from household organic waste:
On the basis of the conducted research, we constructed the dependence of the change in the MSE indicator on the number of learning epochs using the “Random Forest Regressor” and “Gradient Boosting Regressor” models to forecast the volume of biogas production from household organic waste (
Figure 12).
The obtained dependences of the change in the MSE indicator on the number of training epochs for the training and test data samples using the “Random Forest Regressor” and “Gradient Boosting Regressor” models indicate slightly different trends in the change in the indicated indicator. In particular, in the “Random Forest Regressor” model, a higher and unstable value of the indicated indicator is observed. At the same time, in the “Gradient Boosting Regressor” model, after 40 training epochs, the MSE indicator on the test sample increases compared to the training sample, which indicates that the model is retrained. We perform an analysis of the quantitative values of MSE and MAE indicators when forecasting the volume of biogas production from household organic waste using the studied models according to the data in
Table 5.
Based on the obtained data in
Table 5, it was established that for the training sample, the “Random Forest Regressor” model has a smaller value of MSE = 0.001637, which indicates a good fit for the training data. For the test sample, the “Random Forest Regressor” model also has the smallest value of MSE = 0.011698, which confirms the effectiveness of the model on new data. At the same time, both models show low MSE values, but the “Random Forest Regressor” model looks more efficient for the test sample.
For the training sample, the “Random Forest Regressor” model also has the smallest value of MAE = 0.033248, which indicates accurate prediction on the training data. For the test sample, the “Random Forest Regressor” model has the smallest value of MAE = 0.088905, which is low and shows the effectiveness of the model on new data. In both cases, the Random Forest Regressor model has a smaller MAE value, indicating better accuracy compared to the Gradient Boosting Regressor model. Overall, the Random Forest Regressor model shows better performance for both metrics on the test sample compared to the Gradient Boosting Regressor model.
On the basis of the developed models, a graph of the actual and predicted values of biogas production from household organic waste was plotted using the “Random Forest Regressor” and “Gradient Boosting Regressor” models (
Figure 13).
The resulting graph (
Figure 13) shows how the developed models predict the actual values of the volume of biogas production from household organic waste. The observed actual values and the predicted values are close to each other, which means that the proposed model does a good job of forecasting. Based on the conducted research, it can be stated that the “Random Forest Regressor” model is the most effective for forecasting the volume of biogas production from household organic waste. This is confirmed by the fact that it has a smaller MAE value on the test data, which indicates smaller absolute errors in the predictions.
4. Discussion of Research Results
The obtained research results made it possible to justify the “Random Forest Regressor” model for forecasting the volume of biogas production from household organic waste based on machine learning. It will be useful in practice for information technology professionals and project managers who develop decision support systems for organic waste management and justify energy production strategies from this waste. The proposed model analyzes the input data which include the type of waste (FW and YW), the amount of solid organic matter (TS, kg/m3), and the content of volatile organic matter (TVS, % of TS) which affect the amount of biogas output (SGP, m3/kg TVS) from household organic waste. Through a balanced analysis of these factors, the model provides forecasts of biogas yield (SGP, m3/kg TVS) from household organic waste with high accuracy, helping to optimize the use of resources and increase production efficiency.
We selected five machine learning algorithms to substantiate an effective model for forecasting volumes of biogas production from household organic waste. On the basis of the performed research, the main advantages and disadvantages of the used algorithms were determined. The advantage of the “Linear Regression” algorithm is the ease of interpretation. In particular, Linear Regression is easily interpreted, which allows understanding of the effect of each characteristic on the target variable. Also, this algorithm provides adequate learning speed, and it has wide application in various fields. Its main drawback is that the resulting model assumes a linear relationship between the features and the target variable, which may not be sufficient for complex data. Also, it has an increased tendency to overtraining if there is a large number of correlational signs. The “Ridge Regression” algorithm involves the introduction of an additional term to the model loss function in order to limit the values of the model parameters. This helps to avoid overtraining. It has stability as it works well with multi-collinearity. The main disadvantage of the specified algorithm is the need to adjust the parameters. That is, it requires selection of the regularization parameter.
The “Lasso Regression” algorithm has an advantage over the other mentioned algorithms, as it provides automatic feature selection. At the same time, regularization can automatically select important features, which helps to avoid retraining and makes it possible to obtain a general model. Its main drawback is that there is a need to determine the optimal value of the regularization parameter, which will ensure better overall model efficiency.
The “Random Forest Regressor” algorithm provides high accuracy and is well suited for complex data with a large number of features. It is less prone to overtraining compared to some other algorithms. Its main drawback is the difficulty of interpretation. That is, a more complex process of interpretation is inherent compared to linear models.
The “Gradient Boosting Regressor” algorithm provides high accuracy, especially on data sets with a large number of features. It is suitable for the automatic selection of important features. The disadvantage is that this algorithm is prone to overtraining. Overlearning may occur with insufficiently configured hyperparameters. In addition, it requires fine-tuning of hyperparameters to achieve optimal performance.
Based on the conducted research, it was established that two models, “Random Forest Regressor” and “Gradient Boosting Regressor”, show the best accuracy indicators. The other three models (Linear Regression, Ridge Regression, Lasso Regression) are inferior in accuracy and were not considered further. To determine the accuracy of “Random Forest Regressor” and “Gradient Boosting Regressor” models, we chose MSE and MAE indicators. The Random Forest Regressor model was found to be a more accurate model compared to the Gradient Boosting Regressor. This is confirmed by the fact that the MSE of the “Random Forest Regressor” model on the training data set is 7.14 times smaller than that of the “Gradient Boosting Regressor” model. At the same time, MAE is 2.67 times smaller in the “Random Forest Regressor” model than in the “Gradient Boosting Regressor” model. MSE and MAE in both models are worse on the test data set, which indicates a tendency to over-train. The Gradient Boosting Regressor model has worse MSE and MAE than the Random Forest Regressor model on both the training and test data sets.
The proposed model can determine the optimal conditions for biogas production, which allows to increase the production under appropriate conditions. Finding optimal parameters can help maximize biogas yield and reduce costs. Also, the model can serve as a tool for forecasting future volumes of biogas production based on various scenarios and variables. Accurate forecasts allow planning production, allocating resources and performing strategic planning based on expected demand.
The use of machine learning to predict the volume of biogas output from household organic waste allows for more accurate regulation of biogas production while reducing the impact on the environment. Optimum management of biogas production from these wastes can contribute to reducing procurement costs and waste generation.
So, it can be claimed that the article fulfils the purpose of the research, which is confirmed by the fact that, based on the comparison of models based on five machine learning algorithms, it is established that the best is the “Random Forest Regressor” model. It has the smallest MAE value on the training data, indicating smaller absolute errors in predictions.
Further systematic improvement of the “Random Forest Regressor” model for forecasting biogas production volumes from household organic waste based on new data will ensure its accuracy and maintain competitive advantages. Using a machine learning model to predict biogas production volumes from household organic waste is the basis for creating an efficient, sustainable, and dynamic biogas production system that maximizes benefits for household residents and reduces the negative impact on the environment.
5. Conclusions
The proposed algorithm for creating a model for forecasting the volume of biogas production from household organic waste involves the implementation of 10 main and 3 auxiliary steps including component data analysis which is performed on the basis of the method of reducing the size of the data set, increasing interpretability, and minimizing the risk of data loss. It provides a systematic collection and preparation of data on the type of waste (FW or YW), total organic solids (TSs), and content of volatile organic substances (TVS) and the search for relationships with the volumes of biogas output (SGP) from household organic waste, which ensures, on the basis of the use of machine learning methods, the performance of the justification of an effective model for forecasting the volume of biogas production from organic household waste based on the collected data. Based on the analysis of 2433 data sets that characterize the formation of biogas from food (FW) and yard waste (YW) by four features, data preparation was performed using the Jupyter Notebook environment in Python, which is the basis of machine learning to substantiate an effective volume forecasting model production of biogas from organic household waste.
On the basis of the developed algorithm using the prepared data, an effective model for forecasting the volume of biogas production from organic household waste was justified and its accuracy indicators were evaluated. For machine learning, five regression algorithms (Linear Regression, Ridge Regression, Lasso Regression, Random Forest Regression, Gradient Boosting Regression) were selected, which were used to train forecasting models of biogas production volumes from household organic waste. The obtained research results indicate that the “Gradient Boosting Regressor” model provides the highest coefficient of determination, , on the training data, which indicates the good ability of the model to explain the variation in the training data. However, on test data, this indicator is smaller and amounts to , which indicates retraining of the model on training data and incomplete generalization on new data. At the same time, the “Random Forest Regressor” model provides the highest coefficient of determination, , which indicates retraining of the model on training data and incomplete generalization on new data. At the same time, the “Random Forest Regressor” model provides the highest coefficient of determination which indicates retraining of the model on training data and incomplete generalization on new data. At the same time, the “Random Forest Regressor” model provides the highest coefficient of determination, , which indicates retraining of the model on training data and incomplete generalization on new data. As for the root mean square error, all the studied models have the same indicator, , on the training data.
The results of the conducted research indicate that the model using the “Random Forest Regressor” algorithm is the most effective for forecasting the amount of biogas production from organic household waste. This is confirmed by the fact that it provides MAE = 0.088 on test data, which indicates the smallest absolute errors in predictions. Further systematic improvement of the “Random Forest Regressor” model for forecasting biogas production volumes from household organic waste based on new data will ensure its accuracy and maintain competitive advantages. Using a machine learning model to predict biogas production volumes from household organic waste is the basis for creating an efficient, sustainable, and dynamic biogas production system that maximizes benefits for household residents and reduces the negative impact on the environment.