Biogas Prediction Enhancement for a Swine Farm Bio-Digester Using a Lag-Based Surrogate Machine Learning Model

Montes-Carmona, María Estela; Burgos-Castro, Ivan Andres; Portillo-Vélez, Rogelio de Jesús; García-Ramírez, Pedro Javier; Marín-Urías, Luis Felipe; Hernández-Pérez, Miguel Ángel

doi:10.3390/pr14091452


by María Estela Montes-Carmona ¹, Ivan Andres Burgos-Castro ²,*, Rogelio de Jesús Portillo-Vélez ²,³, Pedro Javier García-Ramírez ¹, Luis Felipe Marín-Urías ²,³ and Miguel Ángel Hernández-Pérez

¹ Instituto de Ingeniería, Universidad Veracruzana, S. S. Juan Pablo II, Zona Universitaria, Boca del Río 94294, Mexico

² Facultad de Ingeniería de la Construcción y el Hábitat, Universidad Veracruzana, Boca del Río 94294, Mexico

³ Facultad de Ingeniería Eléctrica y Electrónica, Universidad Veracruzana, Boca del Río 94294, Mexico

* Author to whom correspondence should be addressed.

Processes 2026, 14(9), 1452; https://doi.org/10.3390/pr14091452

Submission received: 29 March 2026 / Revised: 20 April 2026 / Accepted: 28 April 2026 / Published: 30 April 2026

(This article belongs to the Special Issue Artificial Intelligence (AI) and Automation-Driven Innovations in Chemical Engineering)


Abstract

Biogas production estimation has been one of the most important and challenging objectives for anaerobic digestion processes due to the complexity of its dynamics and the lack of high-quality open-access datasets. This study presents a hybrid modeling framework that combines a mechanistic model, based on ordinary differential equations (ODEs), with a machine learning model. Rather than relying exclusively on experimental data, the proposed approach leverages physics-informed synthetic data generation, complemented by lag-based feature engineering that captures the temporal dependencies of the process dynamics using information available in the operational data of a bio-digester. Two configurations were evaluated: a baseline model and an enhanced version incorporating lag features and a simplified temperature profile. This computational enhancement provides a robust predictive core that avoids the severe predictive degradation observed in purely mechanistic approaches at high spatial discretizations. While the improved surrogate model achieved high predictive performance ($R^{2} = 0.9788$, $RMSE = 131.80$ L/d), additional analyses reveal that this resilience is driven by temporal memory and remains sensitive to noise and feature composition. Instead of presenting the model as a final independent physical validation, this work is framed as a proof-of-concept digital twin core, acknowledging the gap that still exists between simulation-based ODE emulation and unstructured real-world reliability.

Keywords:

biogas production; machine learning; surrogate model; digital twin; bio-digester; biogas prediction; methane production

1. Introduction

The biogas generation process is a natural phenomenon that occurs in a variety of anaerobic environments [1]. These include marine and freshwater environments, sediments, wastewater treatment plant sludge, etc. Interest in the process stems primarily from the following reasons. On the one hand, a high degree of organic matter reduction is achieved with a small increase—compared to aerobic processes—in bacterial biomass. On the other hand, biogas production can be used to generate various forms of energy (heat and electricity) or processed as automotive fuel. Nevertheless, biogas has a lower calorific value than natural gas, and in specific applications, such as automotive fuel, treatment is necessary to improve its quality [2].

Energy production from anaerobic digestion has been used worldwide for over 30 years [3]. Its viability and profitability depend not only on the amount of biogas produced, the available technology, and the efficiency of the wastewater treatment operation, but also on external parameters such as the local cost of energy production and available energy resources [4]. Besides the economic advantages, biogas has also yielded environmental benefits [5]. This aligns with broader sustainable wastewater management strategies that increasingly prioritize environmentally friendly alternatives to mitigate ecological impacts [6]. Anaerobic digestion represents a mature yet still evolving technology for renewable energy production. Predicting methane yield remains challenging due to strong nonlinearities and environmental coupling. Anaerobic digestion is a biological process in which organic carbon is converted through successive oxidation and reduction steps to its most oxidized state ($CO_{2}$) and its most reduced state ($CH_{4}$). A wide range of microorganisms catalyzes the process in the absence of oxygen. Nitrogen, hydrogen, ammonia, and hydrogen sulfide are also generated in smaller quantities (typically less than 1% of the total gas volume). The mixture of gaseous products is called biogas, and the anaerobic degradation process is often also referred to as the biogas process [7]. Estimating biogas production in an anaerobic digester is a complicated problem because the system is neither purely chemical nor purely biological, but a microbial ecosystem closely coupled to biochemical reactions, mass transfer, and variable physicochemical conditions [8]. Current societal requirements demand high-efficiency biogas production in the anaerobic digesters of wastewater treatment plants [9]. Therefore, innovative solutions must be developed in the short term [10].

Recent advances in machine learning, particularly deep learning architectures, have shown promising results in modeling temporal dynamics [11]. Nevertheless, some open problems remain challenging. In [12], the optimization of biogas production for treating domestic wastewater is studied using different machine learning approaches (XGBoost and PSO), pointing out that the limited volume of training data influenced the performance of the model predictions. Schulz et al. [13] investigated the carbon-to-nitrogen ratio of substrates that can be used for long-term continuous anaerobic co-digestion. In general, sensor calibration and fouling, modeling and optimization approaches, and the lack of high-quality standardized datasets are current topics in the literature [14]. Solutions using frontier technologies such as digital twins [15] might help to shorten the implementation curve, provided that AI-based models are accurate and simple enough to be implemented in real-time biogas production processes.

Herein we aim to improve biogas estimation using lag-based training vectors. To address the lack of sufficient and accurate datasets, we propose a biogas estimation surrogate model based on machine learning. Unlike prior studies, our approach explicitly addresses data scarcity through synthetic data generation, temporal dependencies via lag-based modeling, and mechanistic consistency through ODE constraints. The model was trained using data obtained from the solution of a set of differential equations with experimentally validated parameters. Compared to a baseline approach relying solely on the operational input variables, the incorporation of lag-based temporal memory and thermodynamic perturbations significantly improves short-term predictive accuracy, enabling operational forecasting capabilities suitable for digital twin environments. Our analysis demonstrates a concrete quantitative improvement over the baseline model, raising the coefficient of determination ($R^{2}$) from 0.6875 to 0.9788 (an absolute gain of approximately 0.29) and reducing the Root Mean Square Error (RMSE) from 480.02 L/d to 131.80 L/d. It is important to emphasize that this framework is presented as a digital twin-oriented proof-of-concept. At this stage, the model serves as an intermediate step toward real-world deployment rather than a fully validated predictive tool for immediate industrial application. The rest of the work is structured as follows. Section 2 describes the proposed methodology, while Section 3 presents the corresponding results to validate our approach. Section 4 presents a brief discussion and, finally, Section 5 closes the paper with conclusions and future work.

2. Methodology

2.1. Experimental Design

To systematically evaluate the predictive capacity of the proposed surrogate model for the swine farm bio-digester, where the biogas of interest is methane, the methodological framework was structured into two progressive experiments:

Base model: 30,000 records were generated by means of ODEs (1)-(4), using stoichiometric and kinetic parameters based on the real bio-digester. The features were built from the inflow and the organic substrate concentration $S_{1}$, and biogas production was calculated through Equation (5). An Extreme Gradient Boosting (XGBoost) model was trained and tested using a sequential 80/20 split.
Lag-based improved model: 10,000 records were generated incorporating a dynamic seasonal temperature factor to perturb the microbial kinetic rates and simulate real-world environmental exposure. To capture this physical complexity, thermodynamic interaction variables and historical lags of the biogas production were integrated into the feature space. Including this auto-regressive feature is valid because this information is available in real-world applications; in this study it is calculated through Equation (5). An XGBoost model was trained and tested using a sequential 80/20 split, as well as an adapted time series cross-validation, to evaluate the system's inertial memory and prevent data leakage.

2.2. Mathematical Modeling and Mass Balance

The physical and biochemical dynamics of the anaerobic digestion process (acidogenic and methanogenic phases) were simulated using a system of ODEs (1), (2), (3), (4).

2.2.1. Spatial Discretization Justification

Given that the real rural bio-digester operates with a plug-flow hydrodynamic regime, the total reactor volume ($V_{total}$) was spatially discretized into a series of Continuous Stirred-Tank Reactors (CSTRs). For this study, the system was divided into $N = 3$ interconnected sub-reactors, following the validated mathematical framework established by Cardona [16].

This configuration was selected based on a targeted sensitivity evaluation comparing the proposed lag-based model against Cardona's purely mechanistic framework. As shown in Table 1, while the purely mechanistic approach suffers from severe numerical instability when increasing the spatial resolution (with $R^{2}$ dropping to 0.2933 for $N = 5$), the proposed lag-based model maintains high predictive fidelity ($R^{2} = 0.9428$). Therefore, $N = 3$ represents the optimal balance between physical consistency and numerical stability for the hybrid framework.

2.2.2. Mass Balance Equations

The general mass balance for any given state variable within the i-th sub-reactor is consistent with the fundamental principle of accumulation (Accumulation = Inflow − Outflow + Net Reaction) which is described by the following system of ODEs:

$$\frac{dX_{1,i}}{dt} = D_{sub} (X_{1,i-1} - X_{1,i}) + \mu_{1,i} X_{1,i} \tag{1}$$

$$\frac{dX_{2,i}}{dt} = D_{sub} (X_{2,i-1} - X_{2,i}) + \mu_{2,i} X_{2,i} \tag{2}$$

$$\frac{dS_{1,i}}{dt} = D_{sub} (S_{1,i-1} - S_{1,i}) - K_{1} \mu_{1,i} X_{1,i} \tag{3}$$

$$\frac{dS_{2,i}}{dt} = D_{sub} (S_{2,i-1} - S_{2,i}) + K_{2} \mu_{1,i} X_{1,i} - K_{3} \mu_{2,i} X_{2,i} \tag{4}$$

where $X_{1,i}$ and $X_{2,i}$, for $i = 1, 2, \dots, N$, denote the acidogenic and methanogenic biomass concentrations, respectively, in the i-th sub-reactor. $S_{1,i}$ represents the organic substrate concentration measured as Chemical Oxygen Demand (COD), and $S_{2,i}$ is the Volatile Fatty Acid (VFA) concentration. $D_{sub}$ is the local dilution rate for each sub-reactor, calculated as $D \times N$, where the global dilution rate $D$ is defined as the flow rate $Q_{in}$ divided by the total volume $V_{total}$. This formulation implies a reduction in the effective residence time within each sub-reactor, consistent with the representation of reactors in series.

For the first sub-reactor ($i = 1$), the feed concentrations correspond to the raw inputs ($S_{1,0} = S_{1in}$ and $S_{2,0} = S_{2in}$), assuming a biomass-free feed stream ($X_{1,0} = X_{2,0} = 0$).

Methane production is calculated using Equation (5):

$$CH_{4} = \sum_{i=1}^{N} K_{4} \mu_{2,i} X_{2,i} V_{sub} \tag{5}$$
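The mass balances in Equations (1)-(4) and the methane sum in Equation (5) can be sketched in vectorized form as follows. This is a minimal illustration rather than the authors' implementation: the function names and the `feed` tuple are hypothetical, and the growth rates are passed in as arrays so that any kinetic law can be plugged in.

```python
import numpy as np

def mass_balance(X1, X2, S1, S2, mu1, mu2, feed, D_sub, K1, K2, K3):
    """Right-hand side of Equations (1)-(4) for N sub-reactors in series.

    X1, X2, S1, S2, mu1, mu2 are arrays of length N (one entry per sub-reactor).
    feed = (X1_0, X2_0, S1_0, S2_0) is the stream entering sub-reactor 1.
    """
    def upstream(values, inlet):
        # Concentration entering tank i: the feed for i = 1, tank i-1 otherwise.
        return np.concatenate(([inlet], values[:-1]))

    dX1 = D_sub * (upstream(X1, feed[0]) - X1) + mu1 * X1
    dX2 = D_sub * (upstream(X2, feed[1]) - X2) + mu2 * X2
    dS1 = D_sub * (upstream(S1, feed[2]) - S1) - K1 * mu1 * X1
    dS2 = D_sub * (upstream(S2, feed[3]) - S2) + K2 * mu1 * X1 - K3 * mu2 * X2
    return dX1, dX2, dS1, dS2

def methane_production(mu2, X2, K4, V_sub):
    """Equation (5): sum of the methane generated in every sub-reactor."""
    return float(np.sum(K4 * mu2 * X2 * V_sub))
```

Shifting each state array by one position gives the upstream concentration for every tank at once, which keeps the code independent of the chosen number of sub-reactors.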

2.3. Microbial Kinetics and Thermodynamic Perturbation

Microbial growth kinetics were modeled using the Monod equation for the acidogenic phase (6) and the Haldane inhibition model for methanogenesis (7). To improve the performance of the surrogate model, two scenarios were formulated. The base model analysis considered standard kinetic rates without environmental perturbations as expressed in Equations (6) and (7).

$$\mu_{1,i} = \mu_{max1} \left( \frac{S_{1,i}}{K_{s1} + S_{1,i}} \right) \tag{6}$$

$$\mu_{2,i} = \mu_{max2} \left( \frac{S_{2,i}}{K_{s2} + S_{2,i} + \frac{S_{2,i}^{2}}{K_{I}}} \right) \tag{7}$$

where $\mu_{max1}$ and $\mu_{max2}$ represent the maximum specific growth rates under standard conditions, $K_{s1}$ and $K_{s2}$ denote the half-saturation constants, and $K_{I}$ is the inhibition constant due to Volatile Fatty Acid (VFA) accumulation.
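To make the difference between the two kinetic laws concrete, the sketch below implements Equations (6) and (7) as plain functions. The function names are hypothetical. Two properties follow directly from the formulas: the Monod rate equals half of its maximum when the substrate equals the half-saturation constant, and the Haldane rate peaks at $S = \sqrt{K_{s} K_{I}}$ and declines beyond that point because of inhibition.

```python
import math

def monod(S, mu_max, Ks):
    """Equation (6): saturating growth rate without inhibition."""
    return mu_max * S / (Ks + S)

def haldane(S, mu_max, Ks, KI):
    """Equation (7): growth rate with substrate (VFA) inhibition."""
    return mu_max * S / (Ks + S + S**2 / KI)

def haldane_peak_substrate(Ks, KI):
    """Substrate concentration at which the Haldane rate is largest."""
    return math.sqrt(Ks * KI)
```

For instance, `monod(6000.0, 1.0, 6000.0)` returns 0.5, and `haldane` evaluated at `haldane_peak_substrate(Ks, KI)` is larger than at half or twice that concentration.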

The lag-based improved model incorporated a seasonal temperature factor to partially simulate real-world environmental exposure and its direct impact on bacterial growth rates, acknowledging that this simplification does not fully reflect real-world conditions. Although the model is simplified, this representation captures the dominant seasonal dynamics while providing a controlled and physically consistent perturbation framework for training and evaluating the surrogate model. The modified kinetic equations are shown in Equations (8) and (9):

$$\mu_{1,i} = (\mu_{max1} \cdot T_{factor}) \left( \frac{S_{1,i}}{K_{s1} + S_{1,i}} \right) \tag{8}$$

$$\mu_{2,i} = (\mu_{max2} \cdot T_{factor}) \left( \frac{S_{2,i}}{K_{s2} + S_{2,i} + \frac{S_{2,i}^{2}}{K_{I}}} \right) \tag{9}$$

To model the annual environmental exposure of rural bio-digesters, $T_{factor}$ was defined by a cyclical model over a 365-day period, as shown in Equation (10):

$$T_{factor}(t) = 1.0 + 0.15 \sin \left( \frac{2 \pi t}{365} \right) \tag{10}$$
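As a quick check of Equation (10), the following sketch evaluates the seasonal factor. The function name is hypothetical, and time is assumed to be expressed in days. The factor stays within 0.85 and 1.15, so the maximum growth rates in Equations (8) and (9) are scaled by at most 15% in either direction.

```python
import numpy as np

def temperature_factor(t_days):
    """Equation (10): seasonal multiplier with a 365-day period."""
    t = np.asarray(t_days, dtype=float)
    return 1.0 + 0.15 * np.sin(2.0 * np.pi * t / 365.0)

# One value per day over a full year.
year = temperature_factor(np.arange(365))
```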

Finally, the total daily methane production ($CH_{4}$), representing the system's target energy yield, was calculated using Equation (5), where $V_{sub}$ is the volume of each individual sub-reactor ($V_{total} / N$) and $K_{4}$ is the stoichiometric yield coefficient for methane.

While the majority of the kinetic and stoichiometric parameters were adopted directly from the validated baseline framework established by Cardona [16], three specific parameters ($K_{s1}$, $K_{I}$, and $K_{4}$) required empirical calibration to reflect the specific biological stability and operational reality of the target swine farm bio-digester. The calibration process was performed iteratively to ensure the ODE system maintained a stable oscillatory steady-state without succumbing to premature mathematical collapse (acid crash). Specifically, the organic matter saturation constant ($K_{s1}$) was reduced from 20,000 mg/L to 6000 mg/L to balance the aggressive bacterial consumption rate characteristic of the swine substrate, preventing rapid and unrealistic system acidification. Simultaneously, the Haldane inhibition constant ($K_{I}$) was increased from 50 mmol/L to 150 mmol/L to reflect the enhanced resilience of the adapted methanogenic consortia against Volatile Fatty Acid (VFA) toxicity. Finally, the methane yield coefficient ($K_{4}$) was scaled from 0.11 to 0.18 mmol/g to align the simulated energy output with the physically observed average methane production of 4.6 m³/d. These targeted assumptions and stability considerations ensure that the synthetic data generation mirrors the macroscopic thermodynamic boundaries of the real physical system, with all finalized parameters summarized in Table 2.

2.4. Synthetic Data Generation

To train the XGBoost model, extended operational periods were simulated by numerically solving the ODE system using the scipy.integrate.odeint function in Python. To reflect real operation, the inputs were drawn from bounded normal distributions whose ranges were taken from the records of the real bio-digester (empirical operation ranges). The numerical integration also required defining an initial condition vector (at $t = 0$) for each sub-reactor, representing the starting biomass and substrate concentrations inside the digester. Across all sub-reactors, the initial state was set to $X_{1} = 1.8$ g/L, $X_{2} = 0.8$ g/L, $S_{1} = 1500.0$ mg/L, and $S_{2} = 10.0$ mmol/L. The operational inputs were modeled as follows:

Input flow rate ($Q_{in}$): Modeled as $\mathcal{N}(4.5, 1.0)$ and physically constrained to the range $[1.0, 8.0]$ m³/d.
Organic substrate loading ($S_{1in}$): Modeled as $\mathcal{N}(2500, 500)$ and physically constrained to the range $[1000, 4000]$ mg/L.
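The two bounded input distributions above can be sampled as in the sketch below. Two assumptions are made here because the text does not settle them: the second parameter of each distribution is read as a standard deviation, and the bounds are enforced by clipping (rejection sampling would be an equally plausible reading). The helper name, the seed, and the number of days are illustrative.

```python
import numpy as np

def sample_bounded_normal(rng, mean, std, low, high, size):
    """Draw normal samples and clip them to [low, high] (assumed bounding rule)."""
    return np.clip(rng.normal(mean, std, size), low, high)

rng = np.random.default_rng(seed=0)  # seed fixed only for reproducibility
n_days = 1000                        # illustrative length, not the paper's

# Input flow rate in m^3/d and organic substrate loading in mg/L.
Q_in = sample_bounded_normal(rng, 4.5, 1.0, 1.0, 8.0, n_days)
S1_in = sample_bounded_normal(rng, 2500.0, 500.0, 1000.0, 4000.0, n_days)
```

Clipping piles a small amount of probability mass onto the bounds, whereas rejection sampling would not; with bounds this far from the mean the difference is minor.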

These specific constraints were derived directly from the empirical operational ranges of the physical swine bio-digester. They were strictly enforced to prevent computationally induced anomalous states (e.g., negative flows or toxic overloads) and to ensure that the generated synthetic scenarios remained entirely within the hydrodynamic design and biological limits of the real rural reactor. However, it must be explicitly acknowledged that while these bounded Gaussian distributions effectively simulate average operational baselines, they are a simplified mathematical approximation. They do not fully capture the complex dynamics of real-world feeding schedules, such as statistical skewness, inherent temporal autocorrelation, or abrupt operational disturbances (e.g., shock loading or sudden system upsets). Consequently, this synthetic generation represents an idealized, baseline operational environment. Through this computational approach, two distinct datasets were generated and exported for the machine learning pipeline:

Base model: A dataset of 30,000 records was generated. The exported feature space consisted of Simulation Time (Time), Input Flow Rate (Q_in_m3_d), Organic Substrate Loading (S1_in_mg_L), and the target variable Methane Biogas Production (CH4_Prod_L_d).
Lag-based improved model: A reduced dataset of 10,000 records was generated. This reduction in sample size was a deliberate trade-off to accommodate the significantly higher computational cost of numerically integrating the dynamic thermodynamic perturbations, as well as of processing the expanded 40-dimensional feature space. The predictive improvement is attributed to the engineered lag features rather than to the dataset size: as shown in the subsequent sensitivity analysis (Figure 1), reducing the dataset size limits predictive capacity, so the gain of about 0.29 in $R^{2}$ achieved by this model, despite using only one-third of the training data, supports the value of incorporating temporal memory. The exported feature space included Simulation Time (Time), Input Flow Rate (Q_in_m3_d), Organic Substrate Loading (S1_in_mg_L), the Seasonal Temperature Factor (Temp_Factor), and the target variable Methane Biogas Production (CH4_Prod_L_d).
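A base-model dataset with the exported columns listed above could be produced along the following lines. This is a hedged sketch, not the authors' script: only $K_{s1}$, $K_{I}$, $K_{4}$, $N$, the initial state, and the input distributions come from the text, while the reactor volume, the remaining kinetic and stoichiometric values, the VFA feed concentration, the daily piecewise-constant inputs, and the clipping rule are placeholders chosen for illustration. The unit conversion that yields L/d is not given in the text, so the methane column is left in the units implied by $K_{4}$.

```python
import numpy as np
from scipy.integrate import odeint

N = 3            # sub-reactors in series (Section 2.2.1)
V_TOTAL = 90.0   # m^3; placeholder, the reactor volume is not given in the text
# Only Ks1, KI and K4 are quoted in the text; every other value is a placeholder.
P = dict(mu_max1=1.2, mu_max2=0.74, Ks1=6000.0, Ks2=9.0, KI=150.0,
         K1=1000.0, K2=100.0, K3=250.0, K4=0.18, S2_in=10.0)

def growth_rates(S1, S2):
    mu1 = P["mu_max1"] * S1 / (P["Ks1"] + S1)                     # Monod, Eq. (6)
    mu2 = P["mu_max2"] * S2 / (P["Ks2"] + S2 + S2**2 / P["KI"])   # Haldane, Eq. (7)
    return mu1, mu2

def rhs(y, t, Q_in, S1_in):
    """Equations (1)-(4); y stacks X1, X2, S1, S2 for the N sub-reactors."""
    X1, X2, S1, S2 = y.reshape(4, N)
    mu1, mu2 = growth_rates(S1, S2)
    D_sub = Q_in / V_TOTAL * N

    def upstream(v, inlet):
        return np.concatenate(([inlet], v[:-1]))

    return np.concatenate((
        D_sub * (upstream(X1, 0.0) - X1) + mu1 * X1,
        D_sub * (upstream(X2, 0.0) - X2) + mu2 * X2,
        D_sub * (upstream(S1, S1_in) - S1) - P["K1"] * mu1 * X1,
        D_sub * (upstream(S2, P["S2_in"]) - S2)
        + P["K2"] * mu1 * X1 - P["K3"] * mu2 * X2,
    ))

def simulate(n_days, seed=0):
    """Integrate one day at a time with inputs held constant within each day."""
    rng = np.random.default_rng(seed)
    Q = np.clip(rng.normal(4.5, 1.0, n_days), 1.0, 8.0)
    S1_in = np.clip(rng.normal(2500.0, 500.0, n_days), 1000.0, 4000.0)
    y = np.repeat([1.8, 0.8, 1500.0, 10.0], N)   # same initial state in every tank
    ch4 = np.empty(n_days)
    for d in range(n_days):
        y = odeint(rhs, y, [0.0, 1.0], args=(Q[d], S1_in[d]))[-1]
        X1, X2, S1, S2 = y.reshape(4, N)
        _, mu2 = growth_rates(S1, S2)
        ch4[d] = np.sum(P["K4"] * mu2 * X2 * V_TOTAL / N)   # Eq. (5), unconverted units
    return {"Time": np.arange(1, n_days + 1), "Q_in_m3_d": Q,
            "S1_in_mg_L": S1_in, "CH4_Prod_L_d": ch4}
```

The returned dictionary can be passed directly to a table library for export. The improved-model variant would additionally multiply both maximum growth rates by the seasonal factor of Equation (10) and store it as `Temp_Factor`.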

2.5. Feature Engineering and Machine Learning Architecture

To ensure model stability and eliminate the influence of the initial unsteady-state start-up, a transient period was discarded from the synthetic datasets prior to the feature engineering phase (the first 50 days for the base model and the first 100 days for the lag-based improved model). To capture the system's temporal dynamics and provide the predictive models with historical memory, systematic lag vectors were engineered, consistent with the information available in an operational bio-digester. The lag windows (spanning up to 30 days) were chosen to encompass the average hydraulic retention time (HRT) of the system. This design allows the feature space to capture both the immediate short-term daily fluctuations and the extended biological inertia of the microbial population. Lag intervals of $\tau = \{1, 5, 10, 15, 20, 25, 30\}$ days and $\tau = \{1, 2, 3, 5, 10, 15, 20\}$ days were incorporated into the feature space for the base model and the lag-based improved model, respectively.

For the base model scenario, the input feature vector $x_{t}$ constructed for any given day $t$ resulted in a 16-dimensional array, organized as shown in Equation (11):

$$x_{t} = [S_{1in}(t), Q_{in}(t), S_{1in}(t - \tau_{1}), \dots, S_{1in}(t - \tau_{7}), Q_{in}(t - \tau_{1}), \dots, Q_{in}(t - \tau_{7})]^{T} \tag{11}$$

where the lag set is defined as $\tau \in \{1, 5, 10, 15, 20, 25, 30\}$ days.

For the lag-based improved model, however, $T_{factor}$ and its physical interactions with the mass flows were also considered, together with the records of historical biogas production ($CH_{4}$). The inclusion of lagged methane production variables is consistent with real-world operational scenarios, where historical biogas measurements are continuously monitored and readily available. These variables allow the model to exploit the inherent temporal dependencies of the process, enhancing short-term predictive capability within a realistic digital twin framework.

The thermodynamic interaction variables were defined as $I_{S1}(t) = T_{factor}(t) \cdot S_{1in}(t)$ and $I_{Q}(t) = T_{factor}(t) \cdot Q_{in}(t)$. Based on the refined lag set $\tau \in \{1, 2, 3, 5, 10, 15, 20\}$ days, the augmented input feature vector $x_{t}$ expanded into a 40-dimensional array, organized as shown in Equation (12):

$$\begin{aligned} x_{t} = [ & S_{1in}(t), Q_{in}(t), T_{factor}(t), I_{S1}(t), I_{Q}(t), \\ & S_{1in}(t - \tau_{1}), \dots, S_{1in}(t - \tau_{7}), \\ & Q_{in}(t - \tau_{1}), \dots, Q_{in}(t - \tau_{7}), \\ & T_{factor}(t - \tau_{1}), \dots, T_{factor}(t - \tau_{7}), \\ & I_{S1}(t - \tau_{1}), \dots, I_{S1}(t - \tau_{7}), \\ & CH_{4}(t - \tau_{1}), \dots, CH_{4}(t - \tau_{7})]^{T} \end{aligned} \tag{12}$$
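The feature vectors of Equations (11) and (12) can be assembled from a daily table as sketched below. The column names `Q_in_m3_d`, `S1_in_mg_L`, `Temp_Factor`, and `CH4_Prod_L_d` follow the exported feature space described in Section 2.4; the names `I_S1` and `I_Q`, the helper functions, and the assumption of one row per day are illustrative. Rows whose longest lag reaches before the start of the series are dropped.

```python
import numpy as np
import pandas as pd

BASE_LAGS = [1, 5, 10, 15, 20, 25, 30]
IMPROVED_LAGS = [1, 2, 3, 5, 10, 15, 20]
TARGET = "CH4_Prod_L_d"

def add_lags(df, columns, lags):
    """Append one shifted copy of each column per lag and drop incomplete rows."""
    out = df.copy()
    for col in columns:
        for k in lags:
            out[f"{col}_lag{k}"] = out[col].shift(k)
    return out.dropna().reset_index(drop=True)

def base_features(df):
    """Equation (11): current and lagged inputs, 2 + 2 * 7 = 16 features."""
    inputs = ["S1_in_mg_L", "Q_in_m3_d"]
    lagged = add_lags(df[inputs + [TARGET]], inputs, BASE_LAGS)
    return lagged.drop(columns=[TARGET]), lagged[TARGET]

def improved_features(df):
    """Equation (12): 5 current values plus 5 lagged groups of 7, 40 features."""
    work = df[["S1_in_mg_L", "Q_in_m3_d", "Temp_Factor", TARGET]].copy()
    work["I_S1"] = work["Temp_Factor"] * work["S1_in_mg_L"]
    work["I_Q"] = work["Temp_Factor"] * work["Q_in_m3_d"]
    lag_cols = ["S1_in_mg_L", "Q_in_m3_d", "Temp_Factor", "I_S1", TARGET]
    lagged = add_lags(work, lag_cols, IMPROVED_LAGS)
    # The current-day target is the label, so it is removed from the inputs.
    return lagged.drop(columns=[TARGET]), lagged[TARGET]
```

Only past values of methane production enter the improved feature matrix, which is what keeps the auto-regressive terms usable at prediction time.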

It is important to emphasize that the primary contribution of this methodology relies on how lag-based temporal memory enhances biogas prediction, rather than an exhaustive benchmarking of machine learning algorithms. To demonstrate this, the XGBoost algorithm was selected as the core regression engine because it represents the optimal trade-off between handling non-linearities and providing transparent model interpretability. While traditional time series models like ARIMA or SARIMA are computationally lightweight, they rely on linear assumptions and struggle with the complex thermodynamic perturbations and microbial inhibitions (e.g., Haldane kinetics) inherent to anaerobic digestion. Conversely, Deep Learning architectures such as Long Short-Term Memory (LSTM) networks excel at sequential dynamics but function largely as “black boxes,” obscuring the specific physical relevance of the input variables. Furthermore, while other tree-based ensembles (such as Random Forest, LightGBM, or standard Decision Trees) could theoretically model the data, XGBoost was selected over them for its superior handling of sparse lagged matrices and its strict regularization mechanisms. Because the core objective was to evaluate the impact of historical process memory, interpretability was as critical as predictive accuracy. XGBoost successfully captures these non-linear temporal dependencies while providing explicit feature importance metrics. This transparent architecture allowed for the quantitative validation of the engineered lag features, explicitly revealing the predictive weight of the system’s biological memory without the opacity of deep learning approaches.

2.6. Model Training, Evaluation Metrics, and Data Leakage Prevention

To explicitly prevent data leakage and rigorously evaluate predictive performance, a multi-layered prevention strategy was implemented. First, physical data leakage was strictly avoided during the feature engineering phase. Variables corresponding to system effluents or physical “out” parameters were explicitly excluded from the predictive feature space, as these measurements are practically unobtainable prior to actual biogas generation in a real-world operational setting. Including them would introduce an artificial, non-causal advantage that would compromise the model’s reliability as an operational tool.

Second, to prevent temporal data leakage, a primary sequential chronological split was implemented for both experiments. The first 80% of the temporal registers were used for model training, while the remaining 20% were strictly reserved as an unseen testing set. Unlike randomized splitting, which can introduce severe data leakage in time series modeling by training on future events to predict the past, this strict chronological division preserves the natural temporal evolution of the biochemical process.

Crucially, to prevent implicit information leakage during hyperparameter tuning, the 80% development set was internally subdivided. A chronological validation subset was extracted from this internal data specifically to monitor validation loss and trigger early stopping, ensuring the 20% test set was never exposed to the model during the learning or selection phase.

Furthermore, to ensure robustness across different temporal subsets and definitively rule out temporal overfitting, supplementary validation schemes were integrated into the methodology. These included a time series-aware 5-fold cross-validation strategy—sequentially expanding the training window to respect temporal dependencies—and a more restrictive 60/40 chronological split.
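The splitting schemes described above can be expressed with index arithmetic alone, as in the sketch below. The share of the development block held out for early stopping is not stated in the text, so the 20% used here is an assumption, and the expanding-window generator is a simple variant written for illustration rather than the exact procedure of [18].

```python
import numpy as np

def chronological_split(n_samples, train_frac=0.8, val_frac=0.2):
    """Return (fit, val, test) index arrays in time order.

    val_frac is the share of the development block reserved for early
    stopping; its value is an assumption, not taken from the text.
    """
    n_dev = int(round(n_samples * train_frac))
    n_val = int(round(n_dev * val_frac))
    idx = np.arange(n_samples)
    return idx[:n_dev - n_val], idx[n_dev - n_val:n_dev], idx[n_dev:]

def expanding_window_folds(n_samples, n_splits=5):
    """Yield (train, test) indices where each fold trains on all earlier data."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        yield np.arange(fold * k), np.arange(fold * k, fold * (k + 1))

fit_idx, val_idx, test_idx = chronological_split(1000)                # 80/20 scheme
_, _, test_idx_60_40 = chronological_split(1000, train_frac=0.6)      # 60/40 scheme
```

In every case the largest training index is smaller than the smallest validation or test index, which is the property that rules out training on future observations.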

The hyperparameter tuning workflow for the XGBoost regressor relied on an iterative evaluation of validation loss, systematically adjusting tree complexity, learning rate, and explicit regularization parameters (such as subsampling) to mitigate overfitting risks, thereby balancing learning capacity and generalization:

Base model configuration: The model used 1000 boosting rounds (estimators), a learning rate of 0.05, and a maximum tree depth of 5. Subsampling and column sampling by tree were both set to 0.80. Early stopping was triggered if the validation loss did not improve for 50 consecutive rounds.
Lag-based improved model configuration: To capture the complex non-linearities of the expanded 40-dimensional feature space, the maximum tree depth was increased to 12. To balance this high capacity and ensure proper generalization, the learning rate was reduced to a conservative 0.01, paired with an extended allowance of 5000 maximum estimators. Furthermore, explicit regularization was enforced by restricting subsampling and column sampling to 0.70, injecting stochasticity into the tree-building process. Finally, the early stopping criterion was configured with a patience of 100 rounds. This stopping mechanism was driven exclusively by the isolated chronological validation subset, halting training before memorization could occur (a regularization effect that is examined empirically in the estimator sensitivity analysis presented in Section 3.3.3).
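The two configurations listed above map onto the scikit-learn style interface of XGBoost roughly as follows. This is a sketch under assumptions: it presumes a recent XGBoost release in which `early_stopping_rounds` is accepted by the `XGBRegressor` constructor, the dictionary and function names are hypothetical, and any setting not mentioned in the text is left at the library default.

```python
# Hyperparameters quoted in the text for each experiment.
BASE_PARAMS = dict(
    n_estimators=1000, learning_rate=0.05, max_depth=5,
    subsample=0.80, colsample_bytree=0.80, early_stopping_rounds=50,
)
IMPROVED_PARAMS = dict(
    n_estimators=5000, learning_rate=0.01, max_depth=12,
    subsample=0.70, colsample_bytree=0.70, early_stopping_rounds=100,
)

def fit_regressor(params, X_fit, y_fit, X_val, y_val):
    """Train with early stopping monitored on the chronological validation subset."""
    # Imported here so the parameter dictionaries can be inspected
    # even when xgboost is not installed.
    from xgboost import XGBRegressor

    model = XGBRegressor(**params)
    model.fit(X_fit, y_fit, eval_set=[(X_val, y_val)], verbose=False)
    return model
```

Passing only the validation subset in `eval_set` keeps the final test block out of the stopping decision, in line with the leakage-prevention strategy of this section.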

The performance of the model was assessed using the coefficient of determination ($R^{2}$), the Root Mean Square Error (RMSE), and the Mean Absolute Error (MAE), which are defined in Equations (13), (14), and (15), respectively:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} \tag{13}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}} \tag{14}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} | y_{i} - \hat{y}_{i} | \tag{15}$$

where $n$ represents the total number of observations in the testing set, $y_{i}$ is the ground-truth biogas production generated by the phenomenological model, $\hat{y}_{i}$ is the biogas production predicted by the XGBoost model, and $\bar{y}$ is the mean of the ground-truth values.
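Equations (13)-(15) translate directly into a few array operations. The sketch below is a plain NumPy version with a hypothetical function name; it is intended as a reference for the definitions rather than as the evaluation code used in the study.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE as defined in Equations (13)-(15)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals**2)                    # numerator of Eq. (13)
    ss_tot = np.sum((y_true - y_true.mean())**2)     # denominator of Eq. (13)
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(residuals**2))),
        "MAE": float(np.mean(np.abs(residuals))),
    }
```

As a small worked case, true values (1, 2, 3) and predictions (1, 2, 4) give a residual sum of squares of 1 and a total sum of squares of 2, hence $R^{2} = 0.5$, $RMSE = \sqrt{1/3}$, and $MAE = 1/3$.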

3. Results

3.1. Synthetic Data Generation

The dataset generation described in Section 2 was consistent with the kinetic and stoichiometric parameters of the real swine bio-digester, based on the mathematical approximation previously developed by Cardona [16]. Table 2 summarizes the parameters used for the simulation, highlighting the specific adjustments made to the baseline parameters to prevent computational instability and accurately reflect real-world biogas production; such adjustments are often necessary according to the state of the art [14,17].

3.2. Performance of the Base Model

The first XGBoost model was trained using the 16 features described in Equation (11), which contains current and lagged values of $Q_{in}$ and $S_{1in}$. After the training and testing procedures, the model achieved a Coefficient of Determination ($R^{2}$) of 0.6875, a Root Mean Square Error (RMSE) of 480.02 L/d, and a Mean Absolute Error (MAE) of 381.19 L/d.

The developed surrogate model approximates the synthetic biogas production, oscillating stably between 4000 and 10,000 L/d. While this behavior captures the general macroscopic trend of the real physical system, which reported averages of 4.6 m³/d with peaks from 6 to 8 m³/d as shown in Figure 2 and Figure 3, it does so with only moderate explanatory power. Specifically, an $R^{2}$ of 0.6875 indicates that approximately 31% of the system's variance remains unexplained when relying solely on the current and lagged operational inputs. This limitation motivated the subsequent lag-based architecture, which adds lagged methane production and temperature-related features to capture the missing temporal dynamics.

3.3. Performance of the Lag-Based Improved Model

To improve the prediction, the lag-based model included the historical records of the target variable $CH_{4}$ among its 40 features, as detailed in Section 2. After training and testing, the model showed a significant improvement in predictive accuracy, achieving an $R^{2}$ score of 0.9788, an RMSE of 131.80 L/d, and an MAE of 85.48 L/d. The temporal tracking performance of the model over a 100-day window is shown in Figure 4. Furthermore, a deeper analysis of the residual behavior is illustrated in the parity plot (Figure 5). The scatter distribution shows a tight, symmetrical clustering along the ideal fit line ($y = x$). Crucially, the residuals display a uniform dispersion across the entire operational range (from baseline production up to peaks of 10,000 L/d). This behavior confirms that the model maintains consistent predictive accuracy without exhibiting proportional bias, predicting both standard operations and extreme biogas generation spikes with the same relative precision.

The model’s ability to maintain high fidelity tracking of methane production under dynamic seasonal perturbations demonstrates that the lag components effectively act as a “microbial memory” making it highly suitable for biogas prediction in rural anaerobic digestion processes. A comprehensive comparison of the predictive performance metrics for both models (base model and lag-based model) is summarized in Table 3.

3.3.1. Feature Importance and Microbial Memory

To understand the contribution of the every predictor variable, a feature importance analysis was conducted as shown in Figure 6 (Only top 15 feature were plotted for clarity). The influence of each variable was evaluated using the Normalized Gain metric, which quantifies the relative contribution of each feature to the reduction of the model’s prediction error across all decision trees in the XGBoost.

The relative contributions of the top six features are collected in the vector $G$ (in percent) for clearer visualization, each entry mapped directly to its associated feature in the vector $x_{t}$, as summarized in Equation (16).

$$
G = \begin{bmatrix} 44.36 \\ 11.68 \\ 6.90 \\ 6.15 \\ 4.04 \\ 4.04 \end{bmatrix} \% \quad \text{corresponding to} \quad x_{t} = \begin{bmatrix} CH_{4}(t-1) \\ CH_{4}(t-2) \\ S_{1in}(t) \\ I_{S1}(t) \\ S_{1in}(t-1) \\ I_{Q}(t) \end{bmatrix} \tag{16}
$$
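A percentage vector such as $G$ can be obtained by normalizing per-feature gain values so that they sum to 100. The sketch below assumes that this is what "Normalized Gain" means here. The raw gain numbers are hypothetical placeholders, of the kind that could come from XGBoost's `Booster.get_score` with a gain-type importance, and the function names are illustrative.

```python
def normalized_gain(raw_gain):
    """Convert raw per-feature gain into percentage shares that sum to 100."""
    total = sum(raw_gain.values())
    if total <= 0:
        raise ValueError("total gain must be positive")
    return {name: 100.0 * gain / total for name, gain in raw_gain.items()}


def top_k(shares, k):
    """Return the k features with the largest share, in descending order."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical raw gains (not the study's values).
raw = {"CH4(t-1)": 8.0, "CH4(t-2)": 2.0, "S1in(t)": 6.0, "IQ(t)": 4.0}
shares = normalized_gain(raw)
best_two = top_k(shares, 2)
```

Normalized shares describe how the trees' error reduction is split among features. They are not a direct measure of how much accuracy would be lost if a feature were removed.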

3.3.2. Five-Fold Cross-Validation

To validate the lag-based improved model more rigorously, a 5-fold cross-validation adapted for time series [18] was performed. It confirmed the high predictive accuracy, with an average $R^{2}$ = 0.9740 and an average RMSE = 140.03 L/d, consistent with the standard 80/20 split validation ($R^{2}$ = 0.9788 and RMSE = 131.80 L/d). Table 4 details the predictive performance for each fold.
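The exact fold construction is not spelled out in this section, so the sketch below assumes one common time-series scheme: an expanding training window in which every test block lies strictly after its training data. The function name is hypothetical. The second part recomputes the averages of Table 4 from the per-fold values, using the population standard deviation, which matches the reported ± values.

```python
import statistics


def expanding_window_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices); training always precedes testing."""
    block = n_samples // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for fold in range(1, n_folds + 1):
        train_end = fold * block
        # The last fold absorbs any remainder so that no sample is discarded.
        test_end = n_samples if fold == n_folds else train_end + block
        yield range(0, train_end), range(train_end, test_end)


splits = list(expanding_window_splits(12, 5))

# Per-fold values copied from Table 4.
fold_r2 = [0.9550, 0.9774, 0.9797, 0.9796, 0.9784]
fold_rmse = [190.90, 130.79, 122.00, 121.08, 135.36]
mean_r2, sd_r2 = statistics.mean(fold_r2), statistics.pstdev(fold_r2)
mean_rmse, sd_rmse = statistics.mean(fold_rmse), statistics.pstdev(fold_rmse)
```

Under this assumed scheme, the first fold trains on the least data, which would be consistent with fold 1 showing the weakest metrics in Table 4.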

3.3.3. Robustness Checks and Sensitivity Analysis

To address potential concerns regarding model overfitting—particularly given the high complexity of the lag-based improved model (5000 estimators, depth = 12)—and to ensure generalization beyond the single 80/20 chronological split, additional robustness checks were conducted.

First, the model’s predictive stability was evaluated under a more aggressive 60/40 temporal split (60% for training, 40% for testing). As shown in Table 5, predicting further into the future with significantly less training data resulted in a negligible variation in performance. Specifically, $R^{2}$ shifted only from 0.9788 to 0.9780, the RMSE improved slightly (from 131.80 to 130.83 L/d), and the MAE rose marginally (from 85.48 to 88.51 L/d). This stability indicates that the model captures the underlying process dynamics rather than merely overfitting or memorizing the training sequence.
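A chronological split differs from a random one only in that the order of the rows is preserved, so every test day lies after every training day. The helper below is a hypothetical illustration of the 80/20 and 60/40 variants, not the study's code.

```python
def chronological_split(rows, train_fraction):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Ten consecutive days stand in for the full synthetic series.
days = list(range(10))
train_80, test_80 = chronological_split(days, 0.8)
train_60, test_60 = chronological_split(days, 0.6)
```

With the 60/40 variant the test period starts earlier and lasts longer, so the model has to extrapolate over a longer horizon from a shorter history.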

Second, a sensitivity analysis of the XGBoost estimators was performed to evaluate the learning curve and validate the early stopping mechanism. Table 6 illustrates how the model’s performance on the unseen test set varies as the maximum number of estimators is restricted. The results confirm that the regularization parameters (learning rate and subsampling) successfully prevent the model from overfitting, as the performance plateaus rather than degrading when approaching higher estimator counts.
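The plateau in Table 6 can be made explicit by asking for the smallest estimator count whose test RMSE lies within a chosen tolerance of the best one. The tolerance values and the function name below are assumptions made for illustration. The RMSE figures are copied from Table 6.

```python
def smallest_sufficient(results, tolerance):
    """Return the smallest estimator count within `tolerance` of the best RMSE.

    results: list of (n_estimators, test_rmse) pairs.
    """
    best = min(rmse for _, rmse in results)
    for n_estimators, rmse in sorted(results):
        if rmse - best <= tolerance:
            return n_estimators
    raise ValueError("results must not be empty")


# (number of estimators, test RMSE in L/d) from Table 6.
table6 = [(1000, 132.41), (2000, 131.81), (3000, 131.81), (5000, 131.80)]
tight = smallest_sufficient(table6, tolerance=0.1)
loose = smallest_sufficient(table6, tolerance=1.0)
```

Under a 0.1 L/d tolerance, 2000 estimators already match the 5000-estimator result, which is the plateau the text describes.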

4. Discussion

The presented results demonstrate the efficacy of a machine learning-based surrogate model for predicting biogas production in anaerobic digesters. In the first analysis, the predictive performance of the base model in terms of $R^{2}$ remained limited even when the number of registers used for training was increased (100,000 registers, $R^{2}$ = 0.6949), as shown in the sensitivity analysis in Figure 1. By contrast, the lag-based model achieved an $R^{2}$ = 0.9788 with only 10,000 registers and an $R^{2}$ = 0.9939 when trained with 100,000 samples.

To evaluate the model’s reliability under unpredictable real-world weather, bounded stochastic perturbations were added to the temperature profile (representing daily variations of up to 20%). While this approach evaluates the model’s resilience against daily micro-climatic fluctuations, it is acknowledged as a limitation that these purely random perturbations lack complex temporal autocorrelation and do not represent extreme, isolated climatic events (e.g., severe storms). Despite these limitations, the lag-based model demonstrated high resilience. As shown in Table 7, under the most severe noise condition (20%) the $R^{2}$ score decreased from 0.9788 to 0.9599 and the RMSE increased from 131.80 to 180.90 L/d. This moderate degradation indicates that the model does not merely fit the dominant seasonal trends but maintains bounded accuracy across noisy operating conditions, reflecting the natural thermal inertia of real biogas plants.
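One way to picture a bounded stochastic perturbation is a sinusoidal seasonal baseline multiplied by a uniformly distributed factor. Everything specific in the sketch below is an assumption made for illustration: the multiplicative form, the uniform distribution, the 365-day period, and the mean and amplitude values. None of these is stated as the study's exact formulation.

```python
import math
import random


def temperature_profile(n_days, mean_c, amplitude_c, noise_fraction, seed=0):
    """Return daily temperatures: sinusoidal baseline with bounded noise.

    noise_fraction=0.2 lets each day deviate by at most 20% from its baseline.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    profile = []
    for day in range(n_days):
        baseline = mean_c + amplitude_c * math.sin(2.0 * math.pi * day / 365.0)
        factor = 1.0 + rng.uniform(-noise_fraction, noise_fraction)
        profile.append(baseline * factor)
    return profile


# Hypothetical climate: 25 degC mean with a 5 degC seasonal swing.
clean = temperature_profile(365, 25.0, 5.0, 0.0)
noisy = temperature_profile(365, 25.0, 5.0, 0.2)
```

Each day's factor is drawn independently, so the noise has no day-to-day autocorrelation. That is exactly the limitation acknowledged in the paragraph above.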

This contribution is highly valuable as it addresses a current challenge in the application of artificial intelligence to anaerobic digestion: the lack of open-access training databases, as recently highlighted in the state of the art [14]. By generating and utilizing the two synthetic datasets described in this study, our methodological approach effectively addressed this limitation.

Furthermore, this methodology has the potential to optimize biogas production by allowing the computational tuning of operational parameters prior to physical implementation, which translates into significant cost reductions. Such predictive and data-driven optimization frameworks are becoming increasingly essential across various environmental engineering applications, similar to the use of response surface methodologies for scaling up complex remediation systems [19]. Another key contribution is the model’s applicability to physically implemented bio-digesters. By integrating real-world sensor data to create a surrogate model (digital twin), operators can evaluate the system’s response to dynamic variations, such as fluctuating organic load inputs or substrate concentrations, without risking the biological stability of the physical system.

However, a fundamental limitation regarding the synthetic data generation must be explicitly acknowledged. Because the machine learning architecture was trained on datasets generated by a mechanistic framework, it inherits the structural assumptions and theoretical biases embedded in the underlying ODEs. Consequently, the current evaluation predominantly reflects ODE-consistent behavior rather than an independent physical validation against highly unstructured real-world variability. Practical phenomena such as real sensor noise, instrument drift, and unmodeled operational disturbances are not fully captured by the synthetic datasets. Accordingly, this work is framed as an ODE emulation and surrogate modeling strategy. The value of this hybrid approach lies in its ability to overcome the numerical diffusion and instability limits of classical models, providing a lag-aware core that serves as an efficient intermediate step toward practical digital twin implementations.

While the initial base model performed acceptably, the lag-based architecture demonstrated superior reliability and predictive accuracy by incorporating historical biogas production data, a variable that is typically monitored and readily available in field operations. This architectural choice represents a concrete modeling advancement over purely mechanistic approaches. As demonstrated in the spatial discretization analysis (Table 1), relying solely on the phenomenological ODEs established by Cardona [16] leads to a drastic collapse in predictive correlation ($R^{2}$) when the number of spatial nodes is increased to capture plug-flow dynamics (dropping to $R^{2}$ = 0.2933 for $N = 5$), culminating in a total simulated system failure (acid crash) at $N = 7$ with our kinetic, stoichiometric, and operational parameters (Table 2). In contrast, the proposed lag-based model overcomes these numerical diffusion limitations, recovering the correlation to $R^{2}$ = 0.9428 at $N = 5$ and stabilizing at 0.9788 for the optimal $N = 3$ configuration.

This superiority was further confirmed by the feature importance analysis (Figure 6), where the methane production of the immediately preceding day, $CH_{4}(t-1)$, alone accounts for 44.36% of the normalized gain, with the remaining percentage distributed among the other 39 predictors. This result highlights the relevance of recent historical measurements in capturing the temporal dynamics of anaerobic digestion processes, particularly for short-term forecasting scenarios in an operational environment. Finally, the additional validation through 5-fold time series cross-validation confirmed this predictive stability, with an average $R^{2}$ = 0.9740 (Table 4), close to the $R^{2}$ = 0.9788 obtained in the primary temporal 80/20 split.

Limitations of the Study

While the proposed lag-based surrogate model demonstrates significant predictive enhancements and effectively bypasses the numerical instability of purely mechanistic approaches, several limitations within the current framework must be explicitly acknowledged:

Synthetic Data Bias: Because the machine learning architecture was trained on datasets generated by a mechanistic framework, it inherently reproduces and partially inherits the structural assumptions and theoretical biases embedded in the underlying ordinary differential equations.
Lack of Real-World Validation: The current evaluation predominantly reflects ODE-consistent behavior. Practical phenomena such as real sensor noise, instrument drift, and unmodeled operational disturbances are not fully captured. Independent physical validation using real-world SCADA datasets remains a necessary next step.
Simplified Temperature Modeling: Environmental exposure was simulated using a bounded sinusoidal baseline with stochastic perturbations. While this evaluates basic robustness, it lacks the complex temporal autocorrelation and extreme climatic events characteristic of real weather patterns.
Limited Biological Complexity: The foundational phenomenological model relies on simplified macroscopic kinetics (Monod and Haldane). It does not explicitly account for the full spectrum of complex biological dynamics, such as deep syntrophic dependencies, pH micro-environments, or dynamic shifts in microbial community structures.
Absence of Multi-Model Benchmarking: The core objective of this study was to isolate and quantify the predictive value of lag-based temporal memory. Consequently, an exhaustive algorithmic benchmark against other architectures (e.g., Random Forest, LSTM, SARIMA) was intentionally excluded to maintain scope, remaining an open avenue for future computational research.

5. Conclusions

Biogas estimation is a challenging problem for two reasons. On the one hand, the dynamics of the biochemical process involve complex models. On the other hand, data on biogas generation processes are rarely available in sufficient quality and quantity, which makes it difficult to train and validate machine learning models that are useful in most applications. In the first stage of our proposal, differential equation models are used to generate synthetic data based on the kinetic, stoichiometric, and operational parameters of a real-world bio-digester reported in the literature, addressing the lack of high-quality open-access datasets. In the second stage, two machine learning-based models that reflect the operational behavior of the physical system (in terms of biogas production) were obtained. While the base model performed acceptably, an improvement of about 29 percentage points in $R^{2}$ (from 0.6875 to 0.9788) was achieved by including a historical memory through specific lag vectors built from the available bio-digester operational data, which is the main contribution of this paper. A sensitivity analysis with respect to data quantity and an uncertainty analysis with respect to noise in the temperature profile were performed, supporting the robustness of our findings. Both machine learning models were developed in terms of operational parameters such as the organic substrate loading ($S_{1in}$) and the input flow rate ($Q_{in}$). This approach facilitates the implementation of a digital twin, allowing operators to troubleshoot the system virtually before applying changes to the physical bio-digester.

To bridge the gap between simulation-based accuracy and practical reliability, our primary direction for future work includes a concrete validation roadmap. This pathway encompasses the integration of the surrogate model with real-time SCADA data, its calibration using real-world operational datasets (leveraging transfer learning techniques), and the development of hybrid physics–machine learning correction layers designed to independently filter and adapt to unstructured environmental noise. Additional future efforts will incorporate advanced stochastic processes (e.g., ARIMA-based perturbations) to model operational mass flows and temperature profiles, moving beyond simple Gaussian or sinusoidal assumptions to reflect system upsets, skewness, and temporal autocorrelation more accurately. We will also extend the approach to other biogas production processes in wastewater treatment plants to validate its implementation feasibility, including the modeling of more complex microbiological dynamics.

Author Contributions

Conceptualization, M.E.M.-C. and R.d.J.P.-V.; methodology, I.A.B.-C. and R.d.J.P.-V.; software, I.A.B.-C.; validation, I.A.B.-C., M.E.M.-C. and L.F.M.-U.; investigation, R.d.J.P.-V., M.Á.H.-P. and M.E.M.-C.; resources, P.J.G.-R. and M.E.M.-C.; data curation, M.E.M.-C., M.Á.H.-P. and I.A.B.-C.; writing—original draft preparation, I.A.B.-C., M.E.M.-C., R.d.J.P.-V., L.F.M.-U., M.Á.H.-P. and P.J.G.-R.; writing—review and editing, P.J.G.-R., R.d.J.P.-V., M.Á.H.-P. and L.F.M.-U.; visualization, I.A.B.-C. and L.F.M.-U.; supervision, P.J.G.-R., M.E.M.-C. and R.d.J.P.-V.; project administration, M.E.M.-C., M.Á.H.-P. and P.J.G.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data generated in this study are available at https://github.com/mecatronico-consultor/biogas-prediction (accessed on 1 January 2026).

Acknowledgments

The second author acknowledges support from SECIHTI-Mexico through scholarship number 4030171.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tabatabaei, M.; Ghanavati, H. Biogas; Springer International Publishing: Cham, Switzerland, 2018.
2. Ashokkumar, V.; Kumar, G.; Lakshmanan, H.; Chandramughi, V.; Flora, G.; Kothari, R.; Piechota, G. A critical review of biogas production and upgrading from organic wastes: Recent advances, challenges and opportunities. Biomass Bioenergy 2025, 194, 107566.
3. Nayeri, D.; Mohammadi, P.; Bashardoust, P.; Eshtiaghi, N. A comprehensive review on the recent development of anaerobic sludge digestions: Performance, mechanism, operational factors, and future challenges. Results Eng. 2024, 22, 102292.
4. Wang, Z.; Liu, Y.; Zhang, A.; Liu, Z.; Gai, H. A review of process development, mechanistic insights, and enhancement technologies for anaerobic digestion in industrial wastewater treatment. J. Environ. Chem. Eng. 2025, 13, 118217.
5. Simeonov, I.; Chorukova, E.; Kabaivanova, L. Two-stage anaerobic digestion for green energy production: A review. Processes 2025, 13, 294.
6. Benalia, A.; Baatache, O.; Derbal, K.; Khalfaoui, A.; Atime, L.; Pizzi, A.; Trancone, G.; Panico, A. The Effect of a Cactus-Based Natural Coagulant on the Physical–Chemical and Bacteriological Quality of Drinking Water: Batch and Continuous Mode Studies. Water 2026, 18, 138.
7. Gavala, H.N.; Angelidaki, I.; Ahring, B.K. Kinetics and modeling of anaerobic digestion process. In Biomethanation I; Springer: Berlin/Heidelberg, Germany, 2003; pp. 57–93.
8. Farid, M.U.; Olbert, I.A.; Bück, A.; Ghafoor, A.; Wu, G. CFD modelling and simulation of anaerobic digestion reactors for energy generation from organic wastes: A comprehensive review. Heliyon 2025, 11, e41911.
9. Lucas, D.; Oliveira, P.; Bessa, A.; Marcondes, F.S.; Rodrigues, M. Towards Efficient Biogas Production: Deep Learning-Based Methane Forecasting in Anaerobic Digesters of Wastewater Treatment Plants. In Highlights in Practical Applications of Agents, Multi-Agent Systems and Computational Social Science. The PAAMS Collection, Proceedings of the International Conference on Practical Applications of Agents and Multi-Agent Systems, Lille, France, 25–27 June 2025; Springer: Cham, Switzerland, 2025; pp. 154–165.
10. Shamshad, J.; Rehman, R.U. Innovative approaches to sustainable wastewater treatment: A comprehensive exploration of conventional and emerging technologies. Environ. Sci. Adv. 2025, 4, 189–222.
11. Ling, J.Y.X.; Chan, Y.J.; Chen, J.W.; Chong, D.J.S.; Tan, A.L.L.; Arumugasamy, S.K.; Lau, P.L. Machine learning methods for the modelling and optimisation of biogas production from anaerobic digestion: A review. Environ. Sci. Pollut. Res. 2024, 31, 19085–19104.
12. Kumar, S.; Kumar, S.; Kumar, D.R.; Sharma, D.; Wipulanusat, W. Machine learning-based optimization of biogas and methane yields in UASB reactors for treating domestic wastewater. Biodegradation 2025, 36, 55.
13. Schultz, J.; Scherzinger, M.; Elbanhawy, A.Y.; Kaltschmitt, M. Long-term continuous anaerobic co-digestion of residual biomass—model validation and model-based Investigation of different carbon-to-nitrogen ratios. BioEnergy Res. 2025, 18, 58.
14. Marycz, M.; Turowska, I.; Glazik, S.; Jasiński, P. Artificial Intelligence in Anaerobic Digestion: A Review of Sensors, Modeling Approaches, and Optimization Strategies. Sensors 2025, 25, 6961.
15. Kim, M.; Ghobadi, F.; Tayerani Charmchi, A.S.; Lee, M.; Lee, J. Digital Twins for Clean Energy Systems: A State-of-the-Art Review of Applications, Integrated Technologies, and Key Challenges. Sustainability 2025, 18, 43.
16. Cardona Acuña, L.D. Development of an Approximate Mathematical Model for Rural Biodigesters (Desarrollo de un Modelo Matemático Aproximado para Biodigestores Rurales). Master’s Thesis, Universidad de Ibagué, Ibagué, Colombia, 2021. Available online: https://hdl.handle.net/20.500.12313/3983 (accessed on 2 March 2026). (In Spanish)
17. Cheng, M.; Zhao, X.; Dhimish, M.; Qiu, W.; Niu, S. A Review of Data-Driven Surrogate Models for Design Optimization of Electric Motors. IEEE Trans. Transp. Electrif. 2024, 10, 8413–8431.
18. Bergmeir, C.; Hyndman, R.J.; Koo, B. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Comput. Stat. Data Anal. 2018, 120, 70–83.
19. Muscetta, M.; Bianco, F.; Trancone, G.; Race, M.; Siciliano, A.; D’Agostino, F.; Sprovieri, M.; Clarizia, L. Washing Bottom Sediment for the Removal of Arsenic from Contaminated Italian Coast. Processes 2023, 11, 902.

Figure 1. Performance of the models ($R^{2}$) vs. number of registers.

Figure 2. Performance of the base XGBoost model for 100 days. Comparison between the synthetic data generated by the ODEs and the surrogate model’s prediction ($R^{2} = 0.6875$).

Figure 3. Actual methane production vs. predicted value (Base Model). Comparison between the actual methane production and the model’s prediction ($R^{2} = 0.6875$).

Figure 4. Performance of the XGBoost model (Lag-based Improved Model). Comparison between the synthetic data generated by the ODEs and the surrogate model’s prediction ($R^{2} = 0.9788$).

Figure 5. Actual methane production vs. predicted value (Lag-based Improved Model). Comparison between the actual methane production and the model’s prediction ($R^{2} = 0.9788$).

Figure 6. Top 15 Feature Importance for the Lag-based Improved Model (XGBoost) evaluated by Normalized Gain.

Table 1. Predictive correlation ($R^{2}$) comparison between Cardona’s mechanistic framework and the proposed lag-based model across different spatial discretizations.

Number of Reactors (N)	$R^{2}$ Mechanistic Model [16]	$R^{2}$ Proposed Lag-Based Model
3	0.9719	0.9788
5	0.2933	0.9428
7	0.2001	0.0000 *

* The correlation of 0.0000 reflects a complete mathematical collapse of the ODE system (simulated acid crash) at high discretization levels.

Table 2. Kinetic, stoichiometric, and operational parameters utilized for the ODE numerical simulation.

Parameter	Description	Value	Reference
$K_{1}$	Organic matter consumption yield	1.30 g/g	[16]
$K_{2}$	VFA generation yield	1.06 mmol/g	[16]
$K_{3}$	VFA consumption yield	6.30 mmol/g	[16]
$μ_{m a x 1}$	Maximum specific growth rate (acidogenic)	1.70 d⁻¹	[16]
$μ_{m a x 2}$	Maximum specific growth rate (methanogenic)	0.84 d⁻¹	[16]
$K_{s 2}$	Saturation constant for VFA	12.0 mmol/L	[16]
$V_{t o t a l}$	Total bio-digester volume	61.0 m³	[16]
$N_{r e a c t o r s}$	Number of CSTRs in series	3	[16]
$K_{4}$	Methane yield coefficient	0.18 mmol/g	Calibrated ¹
$K_{s 1}$	Saturation constant for organic matter	6000 mg/L	Calibrated ²
$K_{I}$	Haldane inhibition constant	150 mmol/L	Calibrated ³

¹ Increased from the baseline 0.11 to reflect the real-world operational average of 4600 L/d.
² Adjusted from 20,000 mg/L to balance bacterial consumption and prevent rapid system acidification.
³ Increased from 50 mmol/L to enhance methanogenic resilience against VFA toxicity, ensuring stable oscillatory production.

Table 3. Comparison of the evaluation metrics demonstrating the specific impact of incorporating historical $CH_{4}$ lags into the surrogate model.

Analysis	Feature Configuration	$R^{2}$	RMSE (L/d)	MAE (L/d)
Base Model	XGBoost without $C H_{4}$ lags (16 features)	0.6875	480.02	381.19
Lag-based Model	XGBoost with $C H_{4}$ lags (40 features)	0.9788	131.80	85.48

Table 4. Time Series Cross-Validation (5-Fold) performance metrics for the Lag-based Improved Model.

Fold	$R^{2}$	RMSE (L/d)
1	0.9550	190.90
2	0.9774	130.79
3	0.9797	122.00
4	0.9796	121.08
5	0.9784	135.36
Average	0.9740 ± 0.0095	140.03 ± 26.00

Table 5. Performance comparison between temporal split variations for the Lag-based Improved Model.

Temporal Split	$R^{2}$	RMSE (L/d)	MAE (L/d)
80% Train/20% Test	0.9788	131.80	85.48
60% Train/40% Test	0.9780	130.83	88.51

Table 6. Sensitivity analysis of predictive performance varying the maximum number of estimators (Lag-based Improved Model).

Number of Estimators	Test $R^{2}$	Test RMSE (L/d)	Test MAE (L/d)
1000	0.9786	132.41	85.83
2000	0.9788	131.81	85.49
3000	0.9788	131.81	85.48
5000	0.9788	131.80	85.48

Table 7. Evaluation of the lag-based model’s robustness under stochastic temperature variations.

Temperature Noise Level	$R^{2}$	RMSE (L/d)
0% (Deterministic control)	0.9788	131.80
10% Stochastic noise	0.9682	169.50
20% Stochastic noise	0.9599	180.90

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

