Influence of Thermal Pretreatment on Lignin Destabilization in Harvest Residues: An Ensemble Machine Learning Approach

Đurđica Kovačić; Dorijan Radočaj; Danijela Samac; Mladen Jurišić

doi:10.3390/agriengineering6010011

Faculty of Agrobiotechnical Sciences Osijek, Josip Juraj Strossmayer University of Osijek, Vladimira Preloga 1, 31000 Osijek, Croatia

Author to whom correspondence should be addressed.

AgriEngineering 2024, 6(1), 171-184; https://doi.org/10.3390/agriengineering6010011

This article belongs to the Section Computer Applications and Artificial Intelligence in Agriculture

Abstract

The research on lignocellulose pretreatments is generally performed through experiments that require substantial resources, are often time-consuming and are not always environmentally friendly. Therefore, researchers are developing computational methods which can minimize experimental procedures and save money. In this research, three machine learning methods, including Random Forest (RF), Extreme Gradient Boosting (XGB) and Support Vector Machine (SVM), as well as their ensembles, were evaluated to predict acid-insoluble detergent lignin (AIDL) content in lignocellulose biomass. Three different types of harvest residue (maize stover, soybean straw and sunflower stalk) were first pretreated in a laboratory oven with hot air at two temperatures (121 and 175 °C) for two durations (30 and 90 min) with the aim of disintegrating the lignocellulosic structure, i.e., delignification. Based on the leave-one-out cross-validation, XGB resulted in the highest accuracy for all individual harvest residues, achieving a coefficient of determination (R²) in the range of 0.756–0.980. The relative variable importances for all individual harvest residues strongly suggested the dominant impact of pretreatment temperature in comparison to its duration. These findings demonstrated the effectiveness of machine learning prediction in the optimization of lignocellulose pretreatment, leading to a more efficient lignin destabilization approach.

Keywords:

harvest residue; Klason lignin; hot air pretreatment; decision trees; leave-one-out cross-validation

1. Introduction

Because lignocellulose has a high energy potential, which is 80 times higher than the world uses annually, it is thought that its use for energy production can overcome global warming and the energy crisis as well as create pathways for different sustainable and environmentally friendly product streams [1,2]. However, lignocellulosic biomass has a very complex structure which is hard to degrade [3,4]. It consists of long cellulose microfibrils linked with hemicellulose while pectin and lignin fill the remaining spaces in the structure acting as natural cement. Because of its rigidity and strength, lignin is the main obstacle to the utilization of lignocellulose in biofuel production [5]. Therefore, scientists worldwide are striving to conceive a feasible pretreatment that will facilitate lignocellulose degradation to improve its applicability in the production of biogas [3,4]. Yet, lignocellulose pretreatment methods are expensive, time and energy-consuming, and still often insufficiently efficient. Hence, the greatest challenge nowadays is to establish sustainable low-cost methods that would simplify the pretreatment process, would not harm the environment and would enhance the biogas production process [6].

Thermal pretreatment is a physical process that involves heating biomass to a certain temperature (50 to 240 °C) and/or pressure [7]. Thermal pretreatment is effective in lignin and hemicellulose degradation. Heat causes the hydrogen bonds in crystalline complexes to break, which causes the biomass to expand, hence increasing the accessible surface area [8]. Several studies reported on the hot air pretreatment of lignocellulose aimed at biogas production, all resulting in disruption of the lignocellulosic structure and making the cellulose more accessible to methanogens [9]. Two similar studies applied different heating processes (hot air oven, autoclave, microwave and water bath) to pretreat pulp and paper mill sludge [10] and water hyacinth [11] in order to enhance the hydrolysis stage of anaerobic fermentation. In both studies, chemical analyses showed that all the pretreatment methods affected the solubilization of organic matter. Among all pretreatments, the hot air oven had the highest impact on substrate solubilization, particularly of lignin, suggesting enhanced delignification. Kainthola et al. [12] thermally pretreated rice straw before anaerobic co-digestion with cow dung. The applied pretreatment methods were the same as in the two previous studies, while maximum solubilization of recalcitrant biomass as well as methane yield was attained after microwave pretreatment. Physicochemical analysis of pretreated rice straw showed increased porosity and a reduced crystallinity index compared to untreated rice straw. In another study [13] of hot air oven pretreatment of lignocellulose (water hyacinth) aimed at enhancing anaerobic co-digestion, pretreated substrates showed a higher level of digestibility, resulting in higher biogas yields in comparison to the untreated substrate.

Because systematic research on lignocellulose pretreatment, particularly determining lignocellulose component content, can be time-consuming, expensive, and often not environmentally friendly, researchers are developing computational methods to minimize experimental procedures and save money [14,15,16,17]. Random Forest (RF), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM) are among the state-of-the-art machine learning methods used for various regression and classification problems in biogas production [18,19], including lignocellulosic biomass pretreatment [14,20], as well as in a wide range of renewable energy studies [21,22]. Compared to individual techniques, ensemble machine learning has several advantages. Although each model may have its own limitations and biases, ensemble learning potentially mitigates their effects and provides more accurate predictions [23]. In addition, ensemble approaches are typically more resistant to overfitting than individual models. By combining multiple models, ensemble learning reduces the risk of fitting noise or outliers in the data, resulting in predictions that are theoretically more accurate and more widely applicable [24]. However, previous studies have found that the effectiveness of ensembles depends on the characteristics of the training datasets [25] and the selection of individual methods used in the ensemble [26]. Consequently, their robustness relative to individual machine learning methods in terms of prediction accuracy remains unconfirmed, especially when working with small numbers of input samples.

The objective of this study was to apply four different methods to predict acid-insoluble detergent lignin (AIDL) content: conventional Multiple Linear Regression (LM) and three machine learning methods, namely RF, XGB and SVM. First, hot air thermal pretreatment of harvest residues under varying parameters was conducted in a laboratory oven. Afterward, AIDL content was determined with the gravimetric assay for the direct quantification of lignin. Based on these measurements, the four computational methods were then applied to predict AIDL content under other pretreatment conditions. Efforts are being made to develop more cost-effective and efficient pretreatment methods for the utilization of energy-rich lignocellulosic biomass. Energy input, environmental impact and economic viability need to be carefully evaluated and optimized. In that manner, innovations in biotechnology, such as the use of machine learning methods, aim to find a balance between efficiency, cost-effectiveness, and environmental impact in lignocellulosic biomass pretreatment, making it more accessible for biogas production without overly taxing resources or time. This study suggests that machine learning methods may more accurately predict lignin content and capture complex interactions between multiple parameters. However, challenges such as dataset quality, interpretability of models, and the need for domain expertise in feature selection and interpretation should be considered.

2. Materials and Methods

The workflow of the study is presented in Figure 1, consisting of three major steps: (1) collection of harvest residues (maize stalks, soybean straw and sunflower stalks) and their thermal pretreatment; (2) creation of four training/test datasets for the prediction of AIDL; (3) evaluation of individual and ensemble machine learning methods for AIDL prediction with accuracy assessment and relative variable importance analysis.

Figure 1. Workflow of the machine learning AIDL prediction based on thermal pretreatment of harvest residues.

2.1. Experimental Set-Up

Three harvest residue types (maize stalks, sunflower stalks, and soybean straw) were collected from the local farm fields in Slavonia and Baranja County and brought to the laboratory. All samples were chopped by shears into smaller pieces and were then oven-dried at 60 °C for 24 h to a moisture content of less than 10%. After drying, all samples were further ground into particles with an average diameter of 3 mm using a laboratory knife mill (Retsch SM 100, GmbH, Haan, Germany) and finally packed in plastic bags and stored at 4 °C until the start of the experiment.

Prior to experimental set-up, harvest residues were pretreated with hot air in a laboratory electric drying/heating oven (Memmert UFE 600, Schwabach, Germany) at two different temperatures: 121 °C and 175 °C, for 30 and 90 min. All harvest residues (pretreated and untreated) were analyzed for total solids (TS), volatile solids (VS) and acid-insoluble detergent lignin (AIDL or Klason lignin) content. TS and VS were determined according to standard methods [27]. TS were determined after drying at 105 °C in the laboratory oven (Memmert UFE 600) to constant weight. VS were determined by complete combustion in a muffle furnace during 4 h at 550 °C. Determination of the proportion of AIDL was carried out gravimetrically according to the Goering & Van Soest method [28] which consisted of three parts. In the first part, extraction was carried out using a neutral detergent solution, and in the second part, hydrolysis with 72% sulfuric acid followed by vacuum filtration. The dried residue on the sinter crucible represented the AIDL fraction.

The training/test datasets were created by combining samples per harvest residue, including three control samples and three samples per thermal pretreatment approach (121 °C for 30 min, 121 °C for 90 min, 175 °C for 30 min and 175 °C for 90 min). Each sample contained an AIDL value with two independent variables, designating the temperature used for thermal pretreatment of harvest residues (“Temperature”) and the duration of the pretreatment (“Duration”). This approach produced three datasets of individual harvest residues (“Maize”, “Soybean” and “Sunflower”), each containing a total of 15 samples. Despite the relatively low number of samples per dataset, a comprehensive study by Gilbertson and van Niekerk [29] showed that a variety of machine learning methods, including those based on decision trees, nearest neighbors and support vector machines, produced highly accurate predictions with as few as ten features per dataset. Additionally, these samples were grouped into a combined dataset “All”, evaluating the ability of machine learning algorithms to simultaneously process heterogeneous samples with regard to crop type.
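The dataset layout described above can be sketched in a few lines of Python. This is only an illustration of the design (one control plus four temperature-duration combinations, three replicates each); the field names, the coding of control samples as 0 °C and 0 min, and the empty "aidl" placeholder are assumptions, and no measured AIDL values are reproduced here.

```python
from itertools import product

RESIDUES = ("Maize", "Soybean", "Sunflower")
TEMPERATURES_C = (121, 175)
DURATIONS_MIN = (30, 90)
REPLICATES = 3  # three samples per condition, as stated in the text


def build_design(residue):
    """Return the sample rows for one harvest residue, with AIDL left unfilled."""
    # Assumption: control samples are coded as temperature 0 and duration 0.
    conditions = [(0, 0)] + list(product(TEMPERATURES_C, DURATIONS_MIN))
    return [
        {"residue": residue, "temperature": t, "duration": d,
         "replicate": r, "aidl": None}  # placeholder, to be filled with lab results
        for (t, d) in conditions
        for r in range(1, REPLICATES + 1)
    ]


# Three per-residue datasets plus the combined "All" dataset.
datasets = {name: build_design(name) for name in RESIDUES}
datasets["All"] = [row for name in RESIDUES for row in datasets[name]]

for name, rows in datasets.items():
    print(name, len(rows))
```

Each per-residue dataset then holds 5 conditions × 3 replicates, and the combined dataset simply concatenates them.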

2.2. Machine Learning Prediction and Accuracy Assessment

Four individual prediction methods were evaluated for AIDL prediction: conventional LM and three machine learning methods commonly used in similar studies (RF, XGB, and SVM). RF is an ensemble learning technique that creates predictions by combining many decision trees using bootstrap aggregation (bagging) [30]. The hyperparameter “mtry” controls the diversity of the trees and helps prevent overfitting by dictating how many features are randomly selected as candidates at each split [31]. XGB is a similar decision tree-based method that uses boosting instead of bagging, sequentially building an ensemble of weak learners to produce a strong learner [32]. The hyperparameter “nrounds” determines the number of boosting rounds in XGB, “lambda” and “alpha” control the regularization to prevent overfitting, while “eta” adjusts the learning rate, affecting how much each tree contributes to the final prediction [33]. The SVM computes the best hyperplane to divide the data points into distinct groups, with the hyperparameter “sigma” defining the kernel width and affecting the flexibility of the decision boundary, while the cost hyperparameter “C” adjusts the trade-off between optimizing the margin and reducing the prediction error [34]. The ensemble machine learning approach was assessed by evaluating all pairwise combinations of the individual machine learning methods (RF + XGB, RF + SVM and XGB + SVM). This approach was chosen to assess its ability to exploit the strengths of the individual methods evaluated for AIDL prediction across multiple different crop residues [23]. Two libraries in R v4.2.2, “caret” [35] and “caretEnsemble” [36], were used for AIDL prediction, resulting in a total of seven prediction methods (four individual methods and three ensembles) per harvest residue training/test dataset. All tuning hyperparameters were optimized using the automatic optimization approach in “caret”.
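The count of seven prediction methods follows directly from the pairwise combinations, as the short Python sketch below illustrates. The `blend` helper is a hypothetical stand-in that averages two prediction vectors with a fixed weight; it is not the combiner used by “caretEnsemble”, which is not reproduced here.

```python
from itertools import combinations

BASELINE = ("LM",)
INDIVIDUAL_ML = ("RF", "XGB", "SVM")

# All pairwise ensembles of the three machine learning methods.
ensembles = [" + ".join(pair) for pair in combinations(INDIVIDUAL_ML, 2)]
methods = list(BASELINE) + list(INDIVIDUAL_ML) + ensembles


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction vectors.

    Hypothetical simplification: a fixed weight replaces whatever
    combination rule the ensemble library actually fits.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction vectors must have equal length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


print(methods)
print(blend([1.0, 3.0], [3.0, 5.0]))
```

A fixed equal weight is the simplest possible choice; a fitted combiner would instead learn the weights from held-out predictions.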

Accuracy assessment was performed using leave-one-out cross-validation, in which each sample is held out once for testing while the model is trained on all remaining samples. Because its splits are deterministic, this approach does not depend on random partitioning of the data and is well suited to smaller datasets [37]. The metrics used included relative (coefficient of determination, R²) and absolute accuracy assessment metrics (root mean square error, RMSE and mean absolute error, MAE), calculated according to the following Formulas (1)–(3):

R^{2} = 1 - \frac{SSR}{SST},

(1)

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_{pred} - y_{obs})^{2}}{n}},

(2)

MAE = \frac{\sum_{i=1}^{n} |y_{pred} - y_{obs}|}{n},

(3)

where SSR indicates the residual sum of squares between predicted and observed AIDL, SST indicates the total sum of squares, n indicates the sample count, y_pred indicates predicted AIDL, and y_obs indicates observed (input) AIDL values. R² measured how well the machine learning model explained the observed variability in the AIDL data, providing a relative measure of accuracy [38]. The RMSE is the square root of the mean squared residual between predicted and observed AIDL values, so large residuals are weighted more heavily, whereas the MAE averages the absolute residuals without squaring them and is therefore more robust to outliers than the RMSE [39]. Higher R² and lower RMSE and MAE values indicated a higher accuracy of the predicted AIDL values.
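As a minimal sketch of how Formulas (1)–(3) combine with leave-one-out cross-validation, the Python code below holds out each sample once and scores the pooled held-out predictions. The toy model (`fit_condition_mean`, which predicts the mean of training samples sharing the same temperature and duration) and the four AIDL values are hypothetical and serve only to make the example runnable; they are not the study's models or data.

```python
import math


def r2(y_obs, y_pred):
    """Coefficient of determination, 1 - SSR/SST (Formula 1)."""
    mean_obs = sum(y_obs) / len(y_obs)
    ssr = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    sst = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ssr / sst


def rmse(y_obs, y_pred):
    """Root mean square error (Formula 2)."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))


def mae(y_obs, y_pred):
    """Mean absolute error (Formula 3)."""
    return sum(abs(p - o) for o, p in zip(y_obs, y_pred)) / len(y_obs)


def loocv_predictions(X, y, fit):
    """Hold out each sample once; fit(X_train, y_train) must return a predict(x) callable."""
    preds = []
    for i in range(len(y)):
        predict = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(X[i]))
    return preds


def fit_condition_mean(X_train, y_train):
    """Toy model (assumption): mean of training samples with the same inputs."""
    groups = {}
    for x, target in zip(X_train, y_train):
        groups.setdefault(tuple(x), []).append(target)
    overall = sum(y_train) / len(y_train)

    def predict(x):
        values = groups.get(tuple(x))
        return sum(values) / len(values) if values else overall

    return predict


# Hypothetical (temperature, duration) inputs and AIDL values, not from the study.
X = [(0, 0), (0, 0), (121, 30), (121, 30)]
y = [10.0, 12.0, 20.0, 22.0]
preds = loocv_predictions(X, y, fit_condition_mean)
print(preds, r2(y, preds), rmse(y, preds), mae(y, preds))
```

Any model can be substituted for the toy one, as long as `fit` returns a prediction function for a single input row.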

3. Results and Discussion

Descriptive characteristics of the input training/test datasets by AIDL value distribution per crop residue and temperature and duration of thermal pretreatment are shown in Figure 2. Harvest residues often contain a considerable amount of lignin which is, with cellulose and hemicellulose, a key component of plant cell walls. In harvest residues, such as crop stems, stalks, leaves and other plant leftovers, lignin content can vary significantly depending on the plant species, maturity at harvest and the specific plant part [40]. Generally, depending on the type of harvest residue, the amount of lignin content may vary from around 15 to 40% of the dry weight. Certain residues, such as those originating from woody plants or specific grasses, may possess a greater concentration of lignin, whereas residues originating from leguminous plants might have comparatively lower lignin content [41]. For instance, crop residues like different straws or plant stovers typically have lignin contents in the range of 15 to 25% [42], while woody residues (from trees or shrubs) may contain higher lignin percentages, often exceeding 25% [43].

Figure 2. The boxplots of AIDL values per harvest residue and thermal pretreatment parameters, including temperature and duration, with black dots representing outliers.

Based on the median values, soybean residues exhibited the highest median AIDL (20.52), followed by sunflower (19.15) and maize (19.06). This suggests that, given the heat pretreatment conditions, soybean residues may have a slightly higher AIDL on average. The asymmetry of the AIDL distributions is described by the skewness values: soybean residues had a negatively skewed distribution (−0.37), indicating a longer left tail, while the residues from sunflower and maize showed positively skewed distributions, with longer right tails. All three crops had distributions with kurtosis values relatively close to zero, suggesting an approximately normal (mesokurtic) shape, although sunflower residues had a more platykurtic distribution (−0.94), indicating lighter tails compared to the normal distribution. Maize had the widest spread of AIDL values, as represented by the largest interquartile range, while producing the smallest median AIDL among all four training/test datasets. Median AIDL values increased with increasing thermal pretreatment temperature, with a particularly large AIDL value range for the 175 °C treatment. Similarly, median AIDL values increased with treatment duration, although the 30-min treatment produced some of the highest AIDL values in the study.
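For readers who want to reproduce this kind of summary, the following Python sketch computes the median, interquartile range, skewness and excess kurtosis of a sample. It assumes the plain population-moment definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2² − 3); statistical packages often apply small-sample corrections, so the values reported above may have been computed slightly differently. The input values are hypothetical.

```python
import statistics


def describe(values):
    """Median, IQR, moment-based skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "median": statistics.median(values),
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,          # < 0: longer left tail
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # < 0: platykurtic (lighter tails)
    }


# Hypothetical, perfectly symmetric sample (not study data).
summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```

A symmetric sample yields zero skewness, and a flat, evenly spaced sample like this one yields negative excess kurtosis, i.e., a platykurtic shape.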

The Kruskal-Wallis one-way analysis of variance test (Table 1) showed statistically significant differences in AIDL values according to the temperature of thermal pretreatment variations (0 °C, 121 °C, and 175 °C) at the 0.01 significance level, confirming previous observations. Although the p-value was relatively low, the thermal pretreatment duration variations (0 min, 30 min, and 90 min) did not produce statistically significant differences in AIDL values.
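The rank-based logic of the Kruskal-Wallis test can be sketched as follows. This is a simplified Python illustration, assuming no tie correction and using the chi-square approximation, which for three groups (two degrees of freedom) reduces to exp(−H/2); with small samples this approximation is only rough. The three groups are hypothetical values, not the measured AIDL data.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic (without tie correction) for a list of groups."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)


def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2.0)


# Hypothetical groups, e.g. one per temperature level (not study data).
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
p = p_value_three_groups(h)
print(h, p)
```

For these fully separated groups the sketch gives H = 7.2, which corresponds to p ≈ 0.027 under the chi-square approximation.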

Table 1. The results of the Kruskal-Wallis one-way analysis of variance test for AIDL and two thermal pretreatment parameters.

The accuracy assessment of the evaluated machine learning methods for AIDL prediction showed that analyzing individual crop residues was strongly preferable to grouping them in the “All” training/test dataset (Table 2). This might be influenced by the classification of six maize samples as outliers when using the interquartile range as a criterion for their detection. Previous research by Smuga-Kogut et al. [44], which evaluated mugwort and hemp as lignocellulosic biomass for bioethanol prediction, also suggested that machine learning methods can provide accurate predictions only for individual harvest residues, owing to their heterogeneous chemical composition. The results from this study also point to the challenges of using machine learning as a universally applicable solution for the optimization of biofuel production regardless of input data properties, despite its superior accuracy to conventional prediction methods. The AIDL prediction for maize showed a very high relative prediction accuracy represented by R² for the most accurate machine learning method, followed by moderately high R² values for soybean and sunflower. Meanwhile, absolute accuracy represented by RMSE and MAE showed that the highest accuracy was achieved for sunflower, followed by soybean and maize. This was confirmed by the normalized RMSE values of the most accurate machine learning method, which take into account the average AIDL values per harvest residue dataset and thus allow their mutual comparison. The normalized RMSE values of 3.65%, 3.00% and 2.19% for maize, soybean and sunflower, respectively, indicated a very high absolute prediction accuracy.
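The text does not state the exact normalization formula. A common choice, assumed here, is to express the RMSE as a percentage of the mean observed value, which can be sketched in Python as follows; the function name and the example numbers are hypothetical.

```python
def normalized_rmse(rmse, mean_observed):
    """RMSE expressed as a percentage of the mean observed value.

    Assumption: normalization by the dataset mean; the study may have
    used a different denominator.
    """
    if mean_observed == 0:
        raise ValueError("mean of observed values must be non-zero")
    return 100.0 * rmse / mean_observed


# Hypothetical example: an RMSE of 0.5 on a dataset with mean AIDL of 20.
print(normalized_rmse(0.5, 20.0))
```

Normalizing in this way makes errors comparable across datasets whose AIDL values lie on different levels.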

Table 2. Accuracy assessment metrics of evaluated machine learning methods for AIDL prediction based on leave-one-out cross-validation.

With the exception of the combined dataset containing all crop residues, the ensemble machine learning approach was unable to outperform individual machine learning methods in terms of prediction accuracy. Previous studies provided very limited knowledge on the effectiveness of ensemble machine learning in optimizing individual components of biogas production. The available research indicated its ability to improve on the prediction of individual methods [45,46], which contradicts the results of this study. However, while several broader studies generally agreed on the effectiveness of ensemble machine learning, they also provided mixed observations regarding its robustness relative to individual methods. These studies noted the dependence of ensemble prediction accuracy on the characteristics of the input samples [25,47] and the prediction principles used by the individual methods in the ensemble [26]. XGB was shown to be a superior prediction method to RF and SVM, demonstrating robustness and resistance to overfitting as shown by the comprehensive leave-one-out cross-validation approach [48]. Notably, the machine learning approach provided superior prediction accuracy to conventional multiple linear regression, indicating a need to identify non-linear relationships between AIDL and the temperature and duration of thermal pretreatment of crop residues. Both tree-based machine learning methods (RF and XGB) provided robust prediction accuracy despite the small sample sizes of individual harvest residue datasets, with accuracy likely to increase further with the inclusion of additional samples in future studies [29].

A feature selection parameter (mtry = 2) for RF and a moderate learning rate (eta = 0.3) for XGB were found to be optimal for all crop residues (Table 3). Because each dataset contained only two independent variables, mtry = 2 meant that the RF considered both temperature and duration as candidates at every split. Among the remaining XGB parameters, the regularization parameters lambda and alpha showed considerable variability in the strength of L2 and L1 regularization to avoid overfitting [33], while maize required additional rounds of boosting compared to other crop residues. The SVM used a balanced cost parameter for all residues except soybean, for which the selected value allowed a wider, more permissive regression margin to account for the intrinsic variability in the soybean data.

Table 3. The optimal hyperparameters used for prediction with each machine learning method.

The scatterplots of each prediction method (Figure 3 and Figure 4) show the ability of each machine learning method to differentiate small AIDL variations by modeling non-linear relationships compared to LM. In particular, RF and XGB provided a better fit for both large-interval (maize) and small-interval (soybean and sunflower) AIDL values than SVM and the ensemble machine learning approach. The main cause of the lower prediction accuracy for the dataset containing all crop residues was the tendency of all methods to produce the same predicted AIDL for repetitions of the same thermal pretreatment, so that samples from different crops could not be accurately distinguished when considered together.
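The limitation described above follows from the structure of the inputs, which the Python sketch below makes explicit. Assuming, as in the dataset description, that temperature and duration are the only two predictors and that controls are coded as (0, 0), the combined dataset offers very few distinct input vectors, and a single fitted model can return only one prediction per distinct input. Variable names are illustrative.

```python
from itertools import product

CROPS = ("Maize", "Soybean", "Sunflower")
# Control (coded 0, 0 by assumption) plus the four temperature-duration pairs.
conditions = [(0, 0)] + list(product((121, 175), (30, 90)))

# Three replicates per crop and condition; the crop label is NOT a predictor.
combined = [(crop, t, d) for crop in CROPS for (t, d) in conditions for _ in range(3)]
features = [(t, d) for _, t, d in combined]

unique_inputs = set(features)
samples_per_input = len(features) // len(unique_inputs)

print(len(features), len(unique_inputs), samples_per_input)
```

In this layout, each distinct input is shared by replicates from all three crops, so between-crop differences in AIDL cannot be captured without an additional predictor such as crop type.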

Figure 3. The scatterplots of predicted and observed AIDL values obtained using multiple linear regression and individual machine learning methods.

Figure 4. The scatterplots of predicted and observed AIDL values obtained using an ensemble machine learning approach.

The relative variable importance values produced results consistent with the significance levels of the Kruskal-Wallis test, providing a more detailed approach to assessing their impact per crop residue (Figure 5). This is primarily reflected in the relationship of temperature-duration importance ratio per crop residue when observed for XGB as the most accurate AIDL prediction method. Duration of thermal pretreatment was most important relative to temperature for sunflower, followed by soybean, while temperature was dominant for maize pretreatment. In addition, RF, as the second most accurate prediction method for all individual crop residues, produced the same temperature-duration relationship across crops as XGB, while taking the duration of thermal pretreatment more into account than XGB. Machine learning models varied in their ability to identify key characteristics based on how they handle non-linearity and variable interactions. The variable importance metrics were influenced by model assumptions and their resilience to noise, as variations in performance are caused by model hyperparameters, as well as the distribution and existence of outliers in the input dataset [49].
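One model-agnostic way to obtain comparable importance values is permutation importance, sketched below in Python. This is a swapped-in technique shown for illustration and not necessarily the importance measure computed in this study; the `predict` function and its coefficient are hypothetical. Each feature column is shuffled in turn, and the resulting increase in RMSE is taken as its importance.

```python
import math
import random


def rmse(y_obs, y_pred):
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE after shuffling each feature column of X (rows are tuples)."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[col] for row in X]
            rng.shuffle(column)
            X_perm = [row[:col] + (value,) + row[col + 1:] for row, value in zip(X, column)]
            increases.append(rmse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def relative_importance(importances):
    """Rescale non-negative importances so that they sum to 100."""
    clipped = [max(v, 0.0) for v in importances]
    total = sum(clipped)
    return [100.0 * v / total if total else 0.0 for v in clipped]


# Hypothetical model that depends on temperature only (column 0), not duration.
def predict(row):
    return 15.0 + 0.03 * row[0]


X = [(0, 0), (121, 30), (121, 90), (175, 30), (175, 90)]
y = [predict(row) for row in X]
imp = permutation_importance(predict, X, y)
rel = relative_importance(imp)
print(imp, rel)
```

Because the hypothetical model ignores duration, shuffling that column leaves the error unchanged and its importance is zero, while all relative importance is assigned to temperature.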

Figure 5. The relative variable importance metrics per harvest residue and individual prediction method.

The parameters for substrate treatment in the field of thermal pretreatment are naturally limited, so the temperature and pretreatment duration were selected as the only factors in this study. However, pretreatment procedures, such as chemical and biological methods, offer a wide range of options for modifying various parameters and conditions, providing a great deal of flexibility and allowing for a variety of pretreatment conditions. The research primarily focuses on the pretreatment of lignocellulose, as no studies have been conducted on its post-treatment yet. It is challenging to determine the potential outcomes of applying post-treatment to substrates that have already undergone pretreatment. Additionally, there is currently no established economic justification for the use of post-treatments.

4. Conclusions

This study aimed to provide a methodology for using machine learning methods to predict the lignin content in lignocellulosic biomass following pretreatment in a hot air oven. The data used for testing comprised the pretreatment parameters (temperature and duration) and the lignin content in harvest residues before and after pretreatment. Chemical gravimetric analysis showed a certain degree of lignin destabilization at different pretreatment parameters in all three types of harvest residues, while the highest degree of lignin destabilization was found in maize samples, particularly at the higher temperature (175 °C). The main conclusions from the optimization of thermal pretreatment of harvest residues from three sources (maize stover, soybean straw and sunflower stalk) based on the ensemble machine learning from this study were that:

all evaluated machine learning methods, including three individual methods (RF, XGB and SVM), as well as their ensembles provided a higher AIDL prediction accuracy in comparison to the conventional LM;
high prediction accuracy of AIDL was achieved only when harvest residues were considered in separate training/test datasets, while their combination resulted in a very low accuracy;
the individual machine learning algorithms based on decision trees (XGB and RF) were superior to the ensemble machine learning approach in terms of accuracy for AIDL prediction from separate harvest residue sources, achieving R² up to 0.980;
pretreatment temperature had the dominant relative variable importance in comparison to its duration on lignin destabilization, leading to the basis for the optimization of pretreatment process.

Although this study demonstrated the effectiveness of machine learning methods in pretreatment processes for lignin destabilization, future studies will further explore the impact of a wider range of pretreatment temperatures and durations. This will lead to larger input datasets, providing a more robust approach to the optimization process and further insights into the effectiveness of the machine learning approach for this purpose.

Author Contributions

Conceptualization, Đ.K. and D.R.; methodology, Đ.K. and D.R.; software, D.R.; validation, Đ.K., D.R., D.S. and M.J.; formal analysis, Đ.K.; investigation, Đ.K. and D.R.; resources, Đ.K.; data curation, Đ.K. and D.R.; writing—original draft preparation, Đ.K. and D.R.; writing—review and editing, Đ.K., D.R., D.S. and M.J.; visualization, D.R.; supervision, M.J.; project administration, M.J.; funding acquisition, D.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Soltanian, S.; Aghbashlo, M.; Almasi, F.; Hosseinzadeh-Bandbafha, H.; Nizami, A.-S.; Ok, Y.S.; Lam, S.S.; Tabatabaei, M. A Critical Review of the Effects of Pretreatment Methods on the Exergetic Aspects of Lignocellulosic Biofuels. Energy Convers. Manag. 2020, 212, 112792.
2. Mohammad Rahmani, A.; Gahlot, P.; Moustakas, K.; Kazmi, A.A.; Shekhar Prasad Ojha, C.; Tyagi, V.K. Pretreatment Methods to Enhance Solubilization and Anaerobic Biodegradability of Lignocellulosic Biomass (Wheat Straw): Progress and Challenges. Fuel 2022, 319, 123726.
3. Roy, R.; Rahman, M.S.; Raynie, D.E. Recent Advances of Greener Pretreatment Technologies of Lignocellulose. Curr. Res. Green Sustain. Chem. 2020, 3, 100035.
4. Abraham, A.; Mathew, A.K.; Park, H.; Choi, O.; Sindhu, R.; Parameswaran, B.; Pandey, A.; Park, J.H.; Sang, B.-I. Pretreatment Strategies for Enhanced Biogas Production from Lignocellulosic Biomass. Bioresour. Technol. 2020, 301, 122725.
5. Mirmohamadsadeghi, S.; Karimi, K.; Azarbaijani, R.; Parsa Yeganeh, L.; Angelidaki, I.; Nizami, A.-S.; Bhat, R.; Dashora, K.; Vijay, V.K.; Aghbashlo, M.; et al. Pretreatment of Lignocelluloses for Enhanced Biogas Production: A Review on Influencing Mechanisms and the Importance of Microbial Diversity. Renew. Sustain. Energy Rev. 2021, 135, 110173.
6. Rahmati, S.; Doherty, W.; Dubal, D.; Atanda, L.; Moghaddam, L.; Sonar, P.; Hessel, V.; Ostrikov, K. (Ken) Pretreatment and Fermentation of Lignocellulosic Biomass: Reaction Mechanisms and Process Engineering. React. Chem. Eng. 2020, 5, 2017–2047.
7. Rajput, A.A.; Zeshan; Visvanathan, C. Effect of Thermal Pretreatment on Chemical Composition, Physical Structure and Biogas Production Kinetics of Wheat Straw. J. Environ. Manag. 2018, 221, 45–52.
8. Rodriguez, C.; Alaswad, A.; Mooney, J.; Prescott, T.; Olabi, A.G. Pre-Treatment Techniques Used for Anaerobic Digestion of Algae. Fuel Process. Technol. 2015, 138, 765–779.
9. Mirmasoumi, S.; Khoshbakhti Saray, R.; Ebrahimi, S. Evaluation of Thermal Pretreatment and Digestion Temperature Rise in a Biogas Fueled Combined Cooling, Heat, and Power System Using Exergo-Economic Analysis. Energy Convers. Manag. 2018, 163, 219–238.
10. Veluchamy, C.; Kalamdhad, A.S. Enhancement of Hydrolysis of Lignocellulose Waste Pulp and Paper Mill Sludge through Different Heating Processes on Thermal Pretreatment. J. Clean. Prod. 2017, 168, 219–226.
11. Barua, V.B.; Kalamdhad, A.S. Effect of Various Types of Thermal Pretreatment Techniques on the Hydrolysis, Compositional Analysis and Characterization of Water Hyacinth. Bioresour. Technol. 2017, 227, 147–154.
12. Kainthola, J.; Shariq, M.; Kalamdhad, A.S.; Goud, V.V. Comparative Study of Different Thermal Pretreatment Techniques for Accelerated Methane Production from Rice Straw. Biomass Convers. Biorefinery 2021, 11, 1145–1154.
13. Barua, V.B.; Rathore, V.; Kalamdhad, A.S. Anaerobic Co-Digestion of Water Hyacinth and Banana Peels with and without Thermal Pretreatment. Renew. Energy 2019, 134, 103–112.
14. Gao, W.; Zhou, L.; Liu, S.; Guan, Y.; Gao, H.; Hui, B. Machine Learning Prediction of Lignin Content in Poplar with Raman Spectroscopy. Bioresour. Technol. 2022, 348, 126812.
15. Kartal, F.; Özveren, U. An Improved Machine Learning Approach to Estimate Hemicellulose, Cellulose, and Lignin in Biomass. Carbohydr. Polym. Technol. Appl. 2021, 2, 100148.
16. Löfgren, J.; Tarasov, D.; Koitto, T.; Rinke, P.; Balakshin, M.; Todorović, M. Machine Learning Optimization of Lignin Properties in Green Biorefineries. ACS Sustain. Chem. Eng. 2022, 10, 9469–9479.
17. Kardani, N.; Hedayati Marzbali, M.; Shah, K.; Zhou, A. Machine Learning Prediction of the Conversion of Lignocellulosic Biomass during Hydrothermal Carbonization. Biofuels 2022, 13, 703–715.
18. Yildirim, O.; Ozkaya, B. Prediction of Biogas Production of Industrial Scale Anaerobic Digestion Plant by Machine Learning Algorithms. Chemosphere 2023, 335, 138976.
19. Chiu, M.-C.; Wen, C.-Y.; Hsu, H.-W.; Wang, W.-C. Key Wastes Selection and Prediction Improvement for Biogas Production through Hybrid Machine Learning Methods. Sustain. Energy Technol. Assess. 2022, 52, 102223.
20. Dong, Z.; Bai, X.; Xu, D.; Li, W. Machine Learning Prediction of Pyrolytic Products of Lignocellulosic Biomass Based on Physicochemical Characteristics and Pyrolysis Conditions. Bioresour. Technol. 2023, 367, 128182.
21. Demir, S.; Şahin, E.K. Liquefaction Prediction with Robust Machine Learning Algorithms (SVM, RF, and XGBoost) Supported by Genetic Algorithm-Based Feature Selection and Parameter Optimization from the Perspective of Data Processing. Environ. Earth Sci. 2022, 81, 459.
22. Fan, J.; Wang, X.; Wu, L.; Zhou, H.; Zhang, F.; Yu, X.; Lu, X.; Xiang, Y. Comparison of Support Vector Machine and Extreme Gradient Boosting for Predicting Daily Global Solar Radiation Using Temperature and Precipitation in Humid Subtropical Climates: A Case Study in China. Energy Convers. Manag. 2018, 164, 102–111.
Polikar, R. Ensemble Learning. In Ensemble Machine Learning: Methods and Applications; Zhang, C., Ma, Y., Eds.; Springer: New York, NY, USA, 2012; pp. 1–34. ISBN 978-1-4419-9326-7. [Google Scholar]
Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2012; ISBN 978-1-4398-3003-1. [Google Scholar]
De Clercq, D.; Wen, Z.; Fei, F.; Caicedo, L.; Yuan, K.; Shang, R. Interpretable Machine Learning for Predicting Biomethane Production in Industrial-Scale Anaerobic Co-Digestion. Sci. Total Environ. 2020, 712, 134574. [Google Scholar] [CrossRef] [PubMed]
Zounemat-Kermani, M.; Batelaan, O.; Fadaee, M.; Hinkelmann, R. Ensemble Machine Learning Paradigms in Hydrology: A Review. J. Hydrol. 2021, 598, 126266. [Google Scholar] [CrossRef]
American Public Health Association (APHA). Standard Methods for the Examination of Water and Wastewater; American Public Health Association: Washington, DC, USA, 1998. [Google Scholar]
Goering, H.K.; Van Soest, P.J. Forage Fiber Analyses (Apparatus, Reagents, Procedures, and Some Applications); US Agricultural Research Service: Beltsville, MD, USA, 1970.
Gilbertson, J.K.; van Niekerk, A. Value of Dimensionality Reduction for Crop Differentiation with Multi-Temporal Imagery and Machine Learning. Comput. Electron. Agric. 2017, 142, 50–58. [Google Scholar] [CrossRef]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Hatwell, J.; Gaber, M.M.; Azad, R.M.A. CHIRPS: Explaining Random Forest Classification. Artif. Intell. Rev. 2020, 53, 5747–5788. [Google Scholar] [CrossRef]
Ferreira, A.J.; Figueiredo, M.A.T. Boosting Algorithms: A Review of Methods, Theory, and Applications. In Ensemble Machine Learning: Methods and Applications; Zhang, C., Ma, Y., Eds.; Springer: New York, NY, USA, 2012; pp. 35–85. ISBN 978-1-4419-9326-7. [Google Scholar]
Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. Xgboost: Extreme Gradient Boosting. Available online: https://CRAN.R-project.org/package=xgboost (accessed on 23 October 2023).
Awad, M.; Khanna, R. Support Vector Machines for Classification. In Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers; Awad, M., Khanna, R., Eds.; Apress: Berkeley, CA, USA, 2015; pp. 39–66. ISBN 978-1-4302-5990-9. [Google Scholar]
Kuhn, M.; Wing, J.; Weston, S.; Williams, A.; Keefer, C.; Engelhardt, A.; Cooper, T.; Mayer, Z.; Kenkel, B.; Benesty, M.; et al. Caret: Classification and Regression Training. Available online: https://CRAN.R-project.org/package=caret (accessed on 30 May 2022).
Deane-Mayer, Z.A.; Knowles, J.E. caretEnsemble: Ensembles of Caret Models. Available online: https://cran.r-project.org/web/packages/caretEnsemble/index.html (accessed on 24 December 2023).
Wong, T.-T. Performance Evaluation of Classification Algorithms by K-Fold and Leave-One-out Cross Validation. Pattern Recognit. 2015, 48, 2839–2846. [Google Scholar] [CrossRef]
Chicco, D.; Warrens, M.J.; Jurman, G. The Coefficient of Determination R-Squared Is More Informative than SMAPE, MAE, MAPE, MSE and RMSE in Regression Analysis Evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
Chai, T.; Draxler, R.R. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)?—Arguments against Avoiding RMSE in the Literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
Yoo, C.G.; Meng, X.; Pu, Y.; Ragauskas, A.J. The Critical Role of Lignin in Lignocellulosic Biomass Conversion and Recent Pretreatment Strategies: A Comprehensive Review. Bioresour. Technol. 2020, 301, 122784. [Google Scholar] [CrossRef] [PubMed]
Kai, D.; Chow, L.P.; Loh, X.J. Lignin and Its Properties. In Functional Materials from Lignin; Sustainable Chemistry Series; World Scientific (Europe): London, UK, 2017; Volume 3, pp. 1–28. ISBN 978-1-78634-520-2. [Google Scholar]
Buranov, A.U.; Mazza, G. Lignin in Straw of Herbaceous Crops. Ind. Crops Prod. 2008, 28, 237–259. [Google Scholar] [CrossRef]
Tursi, A. A Review on Biomass: Importance, Chemistry, Classification, and Conversion. Biofuel Res. J. 2019, 6, 962–979. [Google Scholar] [CrossRef]
Smuga-Kogut, M.; Kogut, T.; Markiewicz, R.; Słowik, A. Use of Machine Learning Methods for Predicting Amount of Bioethanol Obtained from Lignocellulosic Biomass with the Use of Ionic Liquids for Pretreatment. Energies 2021, 14, 243. [Google Scholar] [CrossRef]
Ma, Z.; Wang, R.; Song, G.; Zhang, K.; Zhao, Z.; Wang, J. Interpretable Ensemble Prediction for Anaerobic Digestion Performance of Hydrothermal Carbonization Wastewater. Sci. Total Environ. 2024, 908, 168279. [Google Scholar] [CrossRef] [PubMed]
Sun, J.; Xu, Y.; Nairat, S.; Zhou, J.; He, Z. Prediction of Biogas Production in Anaerobic Digestion of a Full-Scale Wastewater Treatment Plant Using Ensembled Machine Learning Models. Water Environ. Res. 2023, 95, e10893. [Google Scholar] [CrossRef] [PubMed]
Radočaj, D.; Jurišić, M.; Tadić, V. The Effect of Bioclimatic Covariates on Ensemble Machine Learning Prediction of Total Soil Carbon in the Pannonian Biogeoregion. Agronomy 2023, 13, 2516. [Google Scholar] [CrossRef]
Yates, L.A.; Aandahl, Z.; Richards, S.A.; Brook, B.W. Cross Validation for Model Selection: A Review with Examples from Ecology. Ecol. Monogr. 2023, 93, e1557. [Google Scholar] [CrossRef]
ElSahly, O.; Abdelfatah, A. An Incident Detection Model Using Random Forest Classifier. Smart Cities 2023, 6, 1786–1813. [Google Scholar] [CrossRef]

Figure 1. Workflow of the machine learning AIDL prediction based on thermal pretreatment of harvest residues.

Figure 2. The boxplots of AIDL values per harvest residue and thermal pretreatment parameters, including temperature and duration, with black dots representing outliers.

Figure 3. The scatterplots of predicted and observed AIDL values obtained using multiple linear regression and individual machine learning methods.

Figure 4. The scatterplots of predicted and observed AIDL values obtained using an ensemble machine learning approach.

Figure 5. The relative variable importance metrics per harvest residue and individual prediction method.

Table 1. The results of the Kruskal-Wallis one-way analysis of variance test for AIDL and two thermal pretreatment parameters.

Parameter	χ²	p	Significance Level
Temperature	9.285	0.0096	**
Duration	4.420	0.1097

** p < 0.01.
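
The p-values in Table 1 can be checked by hand. The following is a minimal sketch, assuming each parameter was compared across three groups (two degrees of freedom), which the table itself does not state. For df = 2 the upper-tail chi-square probability reduces to exp(−χ²/2). The function names and the toy groups are hypothetical and are not the study's data.

```python
import math
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Average 1-based rank for each distinct value (handles ties).
    first_pos = {}
    for pos, value in enumerate(pooled, start=1):
        first_pos.setdefault(value, pos)
    counts = Counter(pooled)
    avg_rank = {v: first_pos[v] + (counts[v] - 1) / 2 for v in counts}
    # Squared rank sum of each group, divided by the group size.
    total = sum(sum(avg_rank[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction factor (equals 1 when there are no ties).
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n ** 3 - n))


def p_value_df2(chi2):
    """Upper-tail chi-square probability; valid only for df = 2 (assumed)."""
    return math.exp(-chi2 / 2)


# Hypothetical toy groups, one list per pretreatment level.
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h = kruskal_wallis_h(toy_groups)
print(round(h, 3), round(p_value_df2(h), 4))

# Statistics copied from Table 1 (temperature, duration).
print(round(p_value_df2(9.285), 4), round(p_value_df2(4.420), 4))
```

Under the df = 2 assumption, the last line gives 0.0096 and 0.1097, matching the tabulated p-values for temperature and duration.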

Table 2. Accuracy assessment metrics of evaluated machine learning methods for AIDL prediction based on leave-one-out cross-validation.

Harvest Residues	Method	R²	RMSE	MAE
All	LM	0.066	3.56	2.76
	RF	0.098	3.54	2.91
	XGB	0.101	3.53	2.90
	SVM	0.121	3.73	2.25
	Ensemble (RF + XGB)	0.092	3.44	2.71
	Ensemble (RF + SVM)	0.049	3.53	2.74
	Ensemble (XGB + SVM)	0.100	3.73	2.92
Maize	LM	0.360	4.68	4.26
	RF	0.946	1.33	1.08
	XGB	0.980	0.80	0.67
	SVM	0.328	5.69	4.25
	Ensemble (RF + XGB)	0.967	1.04	0.85
	Ensemble (RF + SVM)	0.910	1.76	1.27
	Ensemble (XGB + SVM)	0.948	1.29	0.91
Soybean	LM	0.006	1.33	1.04
	RF	0.633	0.75	0.60
	XGB	0.756	0.62	0.41
	SVM	0.001	1.28	0.99
	Ensemble (RF + XGB)	0.489	0.78	0.57
	Ensemble (RF + SVM)	0.565	0.88	0.64
	Ensemble (XGB + SVM)	0.671	0.77	0.57
Sunflower	LM	0.537	0.57	0.47
	RF	0.600	0.54	0.40
	XGB	0.768	0.41	0.31
	SVM	0.512	0.67	0.56
	Ensemble (RF + XGB)	0.511	0.56	0.42
	Ensemble (RF + SVM)	0.514	0.60	0.44
	Ensemble (XGB + SVM)	0.532	0.58	0.45
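
The three metrics in Table 2 are computed from predictions pooled over the leave-one-out folds. The following is a minimal sketch of that bookkeeping under stated assumptions: the toy records and the group-mean predictor are hypothetical stand-ins for the study's data and models, and R² is computed as the squared Pearson correlation, which is the default in the caret package but is not confirmed by the table.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2_corr(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


def loo_predictions(records):
    """Leave-one-out: predict each record from all the others.

    Stand-in model (hypothetical): mean target of the training records
    that share the same (temperature, duration); falls back to the
    overall training mean when no such record exists.
    """
    preds = []
    for i, (temp, dur, _) in enumerate(records):
        train = records[:i] + records[i + 1:]
        same = [y for t, d, y in train if (t, d) == (temp, dur)]
        pool = same or [y for _, _, y in train]
        preds.append(sum(pool) / len(pool))
    return preds


# Hypothetical (temperature, duration, AIDL value) records.
toy = [(121, 30, 10.2), (121, 30, 10.8), (121, 90, 9.9), (121, 90, 10.1),
       (175, 30, 13.5), (175, 30, 14.1), (175, 90, 15.0), (175, 90, 15.6)]
observed = [y for _, _, y in toy]
predicted = loo_predictions(toy)
print(rmse(observed, predicted), mae(observed, predicted),
      r2_corr(observed, predicted))
```

Because a squared correlation ignores bias and scale, it can rank methods differently from RMSE, which may explain why the two metrics disagree in some rows of Table 2.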

Table 3. The optimal hyperparameters used for prediction with each machine learning method.

Harvest Residues	RF	XGB	SVM
All	mtry = 2	nrounds = 50, lambda = 0.0001, alpha = 0, eta = 0.3	sigma = 0.778, C = 1
Maize	mtry = 2	nrounds = 100, lambda = 0.1, alpha = 0.1, eta = 0.3	sigma = 0.944, C = 1
Soybean	mtry = 2	nrounds = 50, lambda = 0, alpha = 0.0001, eta = 0.3	sigma = 0.136, C = 0.5
Sunflower	mtry = 2	nrounds = 50, lambda = 0.1, alpha = 0.0001, eta = 0.3	sigma = 0.800, C = 1
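
The SVM column of Table 3 lists `sigma` and `C`, the pair tuned for a radial basis function kernel by caret's `svmRadial` method. That the authors used this method is an assumption inferred from the parameter names. In the kernlab parameterization, sigma multiplies the squared Euclidean distance directly, as sketched below with hypothetical, standardized inputs.

```python
import math


def rbf_kernel(x, y, sigma):
    """k(x, y) = exp(-sigma * ||x - y||^2), the kernlab-style RBF form."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)


# sigma values copied from Table 3; the two points are hypothetical
# standardized (temperature, duration) pairs, not the study's data.
sigmas = {"All": 0.778, "Maize": 0.944, "Soybean": 0.136, "Sunflower": 0.800}
a, b = (-1.0, -1.0), (1.0, -1.0)
for residue, sigma in sigmas.items():
    print(residue, rbf_kernel(a, b, sigma))
```

A smaller sigma, as reported for soybean, gives a flatter kernel in which observations from distant pretreatment conditions still contribute noticeably to each other's predictions.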

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
