1. Introduction
The biogas generation process is a natural phenomenon that occurs in a variety of anaerobic environments [
1]. These include marine and freshwater environments, sediments, wastewater treatment plant sludge, etc. Interest in the process stems primarily from the following reasons. On the one hand, a high degree of organic matter reduction is achieved with a small increase—compared to aerobic processes—in bacterial biomass. On the other hand, biogas production can be used to generate various forms of energy (heat and electricity) or processed as automotive fuel. Nevertheless, biogas has a lower calorific value than natural gas, and in specific applications, such as automotive fuel, treatment is necessary to improve its quality [
2].
Energy production from anaerobic digestion has been used worldwide for over 30 years [
3]. Its viability and profitability depend not only on the amount of biogas produced, the available technology, and the efficiency of the wastewater treatment operation, but also on external parameters such as the local cost of energy production and available energy resources [
4]. Besides the economic advantages, biogas has also yielded environmental benefits [
5]. This aligns with broader sustainable wastewater management strategies that increasingly prioritize environmentally friendly alternatives to mitigate ecological impacts [
6]. Anaerobic digestion represents a mature yet still evolving technology for renewable energy production. Predicting methane yield remains challenging due to strong nonlinearities and environmental coupling. Anaerobic digestion is a biological process in which organic carbon is converted, through successive oxidation and reduction reactions, into its most oxidized state (carbon dioxide, $\mathrm{CO_2}$) and its most reduced state (methane, $\mathrm{CH_4}$). A wide range of microorganisms catalyzes the process in the absence of oxygen. Nitrogen, hydrogen, ammonia, and hydrogen sulfide are also generated in smaller quantities (typically less than 1% of the total gas volume). The mixture of gaseous products is called biogas, and the anaerobic degradation process is often also referred to as the biogas process [
7]. Estimating biogas production in an anaerobic digester is a complicated problem because the system is neither purely chemical nor purely biological, but a microbial ecosystem tightly coupled to biochemical reactions, mass transfer, and variable physicochemical conditions [
8]. Current societal requirements demand high-efficiency biogas production in the anaerobic digesters of wastewater treatment plants [
9]. Therefore, innovative solutions must be developed in the short term [
10].
Recent advances in machine learning, particularly deep learning architectures, have shown promising results in modeling temporal dynamics [
11]. Nevertheless, some open problems are still challenging. In [
12], the optimization of biogas production for treating domestic wastewater is studied using different machine learning models (XGBoost and PSO), pointing out that the limited volume of training data influenced the performance of the model predictions. J. Schulz et al. [
13] investigated the features of the carbon-to-nitrogen ratio of the substrates that can be used for long-term continuous anaerobic co-digestion. In general, sensor calibration and fouling, modeling and optimization approaches, and the lack of high-quality standardized datasets are current topics in the literature [
14]. Some solutions using frontier technologies such as digital twins [
15] might help to shorten the implementation curve provided AI-based models are accurate and simple enough to be implemented in real-time biogas production processes.
Herein we aim to improve biogas estimation using lag-based training vectors. To overcome the lack of sufficient and accurate datasets, we propose a biogas estimation surrogate model based on machine learning. Unlike prior studies, our approach explicitly addresses data scarcity through synthetic data generation, temporal dependencies via lag-based modeling, and mechanistic consistency through ODE constraints. The model was trained using data obtained from the solution of a set of differential equations with experimentally validated parameters. Compared to a baseline approach relying solely on immediate operational parameters, the incorporation of lag-based temporal memory and thermodynamic perturbations significantly improves short-term predictive accuracy, enabling operational forecasting capabilities suitable for digital twin environments. Our analysis demonstrates a concrete quantitative improvement over the baseline model, raising the coefficient of determination ($R^2$) from 0.6875 to 0.9788 (an absolute gain of about 0.29, reported here as an approximate 29% improvement) and reducing the Root Mean Square Error (RMSE) from 480.02 L/d to 131.80 L/d. It is important to emphasize that this framework is presented as a digital twin-oriented proof-of-concept. At this stage, the model serves as an intermediate step toward real-world deployment rather than a fully validated predictive tool for immediate industrial application. The rest of the work is structured as follows.
Section 2 describes the proposed methodology while
Section 3 depicts the corresponding results to validate our approach.
Section 4 presents a brief discussion and, finally
Section 5 closes the paper with conclusions and future work.
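As a quick check of the headline figures quoted in the Introduction, the arithmetic can be written out explicitly. The subscripts "base" and "lag" are labels introduced here for the two models; only the four reported values are taken from the text.

```latex
\begin{aligned}
\Delta R^2 &= R^2_{\text{lag}} - R^2_{\text{base}} = 0.9788 - 0.6875 = 0.2913 \\
\frac{\Delta R^2}{R^2_{\text{base}}} &= \frac{0.2913}{0.6875} \approx 0.424 \\
\frac{\mathrm{RMSE}_{\text{base}} - \mathrm{RMSE}_{\text{lag}}}{\mathrm{RMSE}_{\text{base}}}
  &= \frac{480.02 - 131.80}{480.02} \approx 0.725
\end{aligned}
```

Read this way, the approximate 29% figure corresponds to the absolute difference in $R^2$, while the RMSE falls by roughly 72.5% relative to the baseline.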
2. Methodology
2.1. Experimental Design
To systematically evaluate the predictive capacity of the proposed surrogate model for the swine farm bio-digester, where methane is the biogas of interest, the methodological framework was structured into two progressive experiments:
Base model: 30,000 registers were generated by solving the ODEs (1)–(4) with different stoichiometric and kinetic parameters based on the real bio-digester. The features were the inflow and the organic substrate concentration, and the biogas production was calculated through Equation (5). An Extreme Gradient Boosting (XGBoost) model was trained and tested using a sequential 80/20 split, respectively.
Lag-based improved model: 10,000 registers were generated incorporating a dynamic seasonal temperature factor that perturbs the microbial kinetic rates and simulates real-world environmental exposure. To capture this physical complexity, thermodynamic interaction variables and historical lags of the biogas production were integrated into the feature space. Including this auto-regressive feature is valid because the information is available in real-world applications; here it is calculated through Equation (5). An XGBoost model was trained and tested using a sequential 80/20 split as well as an adapted time series cross-validation to evaluate the system’s inertial memory and prevent data leakage.
2.2. Mathematical Modeling and Mass Balance
The physical and biochemical dynamics of the anaerobic digestion process (acidogenic and methanogenic phases) were simulated using the system of ODEs (1)–(4).
2.2.1. Spatial Discretization Justification
Given that the real rural bio-digester operates with a plug-flow hydrodynamic regime, the total reactor volume was spatially discretized into a series of Continuous Stirred-Tank Reactors (CSTRs). For this study, the system was divided into
interconnected sub-reactors, following the validated mathematical framework established by Cardona [
16].
This configuration was selected based on a targeted sensitivity evaluation comparing the proposed lag-based model against Cardona’s purely mechanistic framework. As shown in
Table 1, while the purely mechanistic approach suffers from severe numerical instability when the spatial resolution is increased (with $R^2$ dropping to 0.2933 for
), the proposed lag-based model maintains high predictive fidelity ($R^2$ = 0.9428). Therefore, the selected number of sub-reactors represents the optimal balance between physical consistency and numerical stability for the hybrid framework.
2.2.2. Mass Balance Equations
The general mass balance for any given state variable within the i-th sub-reactor is consistent with the fundamental principle of accumulation (Accumulation = Inflow − Outflow + Net Reaction) which is described by the following system of ODEs:
where $X_{1,i}$ and $X_{2,i}$, for $i = 1, \dots, N$, denote the acidogenic and methanogenic biomass concentrations, respectively, for the i-th sub-reactor. $S_{1,i}$ represents the organic substrate concentration measured as Chemical Oxygen Demand (COD), and $S_{2,i}$ defines the Volatile Fatty Acid (VFA) concentration. $D_i$ is the local dilution rate for each sub-reactor, calculated from the global dilution rate $D$ (defined as the flow rate divided by the total volume) and the number of sub-reactors. This formulation implies a reduction in the effective residence time within each sub-reactor, consistent with the representation of reactors in series.
For the first sub-reactor (i = 1), the feed concentrations correspond to the raw inputs, that is, the influent organic substrate and VFA concentrations, assuming a biomass-free feed stream (zero acidogenic and methanogenic biomass in the feed).
Methane production is calculated using Equation (
5)
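For orientation only, one way to write balances of the kind described above is sketched below. It follows the stated principle (Accumulation = Inflow − Outflow + Net Reaction) and a generic two-step acidogenesis/methanogenesis structure. The yield coefficients $k_1$, $k_2$, $k_3$, $k_{CH_4}$, the symbols $V$ and $Q_{CH_4}$, the equal-volume assumption $V_i = V/N$ (hence $D_i = N\,D$), and the index $i-1 = 0$ for the feed are all assumptions introduced here; they are not the paper's Equations (1)–(5).

```latex
\begin{aligned}
\frac{dX_{1,i}}{dt} &= D_i\,(X_{1,i-1} - X_{1,i}) + \mu_1(S_{1,i})\,X_{1,i} \\
\frac{dX_{2,i}}{dt} &= D_i\,(X_{2,i-1} - X_{2,i}) + \mu_2(S_{2,i})\,X_{2,i} \\
\frac{dS_{1,i}}{dt} &= D_i\,(S_{1,i-1} - S_{1,i}) - k_1\,\mu_1(S_{1,i})\,X_{1,i} \\
\frac{dS_{2,i}}{dt} &= D_i\,(S_{2,i-1} - S_{2,i}) + k_2\,\mu_1(S_{1,i})\,X_{1,i} - k_3\,\mu_2(S_{2,i})\,X_{2,i} \\
Q_{CH_4} &= \sum_{i=1}^{N} k_{CH_4}\,\mu_2(S_{2,i})\,X_{2,i}\,V_i
\end{aligned}
```

In this sketch the outflow of sub-reactor $i-1$ is the inflow of sub-reactor $i$, which is what makes the chain of CSTRs approximate plug flow as $N$ grows.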
2.3. Microbial Kinetics and Thermodynamic Perturbation
Microbial growth kinetics were modeled using the Monod equation for the acidogenic phase (Equation (6)) and the Haldane inhibition model for methanogenesis (Equation (7)). To improve the performance of the surrogate model, two scenarios were formulated. The base model analysis considered standard kinetic rates without environmental perturbations, as expressed in Equations (6) and (7).
where $\mu_{max,j}$ represents the maximum specific growth rates under standard conditions, $K_{S_j}$ for $j = 1, 2$ denotes the half-saturation constants, and $K_I$ is the inhibition constant due to Volatile Fatty Acid (VFA) accumulation.
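The two rate laws named above can be sketched in a few lines of Python. This is a minimal illustration using the textbook forms of the Monod and Haldane expressions, which may differ in detail from the paper's Equations (6) and (7). The values of `K_S1` and `K_I` are the calibrated values quoted later in the text; `MU1_MAX`, `MU2_MAX`, and `K_S2` are hypothetical placeholders.

```python
import math


def monod(s, mu_max, k_s):
    """Monod growth rate: saturates at mu_max, with no inhibition term."""
    return mu_max * s / (k_s + s)


def haldane(s, mu_max, k_s, k_i):
    """Haldane growth rate: the s**2 / k_i term lowers the rate at high s."""
    return mu_max * s / (k_s + s + s * s / k_i)


K_S1 = 6000.0   # mg/L, calibrated value stated in the text
K_I = 150.0     # mmol/L, calibrated value stated in the text
MU1_MAX = 1.2   # 1/d, hypothetical placeholder
MU2_MAX = 0.74  # 1/d, hypothetical placeholder
K_S2 = 9.28     # mmol/L, hypothetical placeholder

# For this form of the Haldane law the rate peaks at s = sqrt(k_s * k_i)
# and declines beyond it, which is how VFA inhibition enters the model.
S2_PEAK = math.sqrt(K_S2 * K_I)
```

The declining branch of the Haldane curve is the mechanism behind the "acid crash" discussed in the text: once VFA accumulates past the peak, methanogenic growth slows and VFA accumulates further.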
The lag-based improved model incorporated a seasonal temperature factor to partially simulate real-world environmental exposure and its direct impact on bacterial growth rates, acknowledging that this simplification does not fully reflect real-world conditions. Although the model is simplified, this representation captures the dominant seasonal dynamics while providing a controlled and physically consistent perturbation framework for training and evaluating the surrogate model. The modified kinetic equations are shown in Equations (
8) and (
9):
To model the annual environmental exposure of rural bio-digesters, the seasonal temperature factor was defined by a cyclical model over a 365-day period, as shown in Equation (10):
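The explicit form of the cyclical model is not reproduced in this text, so the following is only a sketch of what a 365-day seasonal factor could look like. It assumes a sinusoid centered on 1; the amplitude, the phase, and the multiplicative way the factor scales a growth rate are all hypothetical choices, not the paper's Equations (8)–(10).

```python
import math


def seasonal_temp_factor(day, amplitude=0.2, phase_days=0.0, period_days=365.0):
    """Dimensionless factor oscillating around 1 with a yearly period.

    amplitude and phase_days are hypothetical; the source does not give them here.
    """
    angle = 2.0 * math.pi * (day - phase_days) / period_days
    return 1.0 + amplitude * math.sin(angle)


def perturbed_rate(mu, day):
    """Scale a growth rate by the seasonal factor (multiplicative form assumed)."""
    return seasonal_temp_factor(day) * mu
```

With these defaults the factor stays within 0.8 and 1.2 and repeats every 365 days, so a growth rate passed through `perturbed_rate` is raised in the warm half of the cycle and lowered in the cold half.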
Finally, the total daily methane production, representing the system’s target energy yield, was calculated using Equation (5) from the volume of each individual sub-reactor and the stoichiometric yield coefficient for methane.
While the majority of the kinetic and stoichiometric parameters were adopted directly from the validated baseline framework established by Cardona [
16], three specific parameters (the organic matter saturation constant, the Haldane inhibition constant, and the methane yield coefficient) required empirical calibration to reflect the specific biological stability and operational reality of the target swine farm bio-digester. The calibration was performed iteratively to ensure that the ODE system maintained a stable oscillatory steady state without premature mathematical collapse (acid crash). Specifically, the organic matter saturation constant was reduced from 20,000 mg/L to 6000 mg/L to balance the aggressive bacterial consumption rate characteristic of the swine substrate, preventing rapid and unrealistic system acidification. Simultaneously, the Haldane inhibition constant was increased from 50 mmol/L to 150 mmol/L to reflect the enhanced resilience of the adapted methanogenic consortia against Volatile Fatty Acid (VFA) toxicity. Finally, the methane yield coefficient was scaled from 0.11 to 0.18 mmol/g to align the simulated energy output with the physically observed average methane production of 4.6 m³/d. These targeted assumptions and stability considerations ensure that the synthetic data generation mirrors the macroscopic thermodynamic boundaries of the real physical system, with all finalized parameters summarized in
Table 2.
2.4. Synthetic Data Generation
To train the XGBoost model, extended operational periods were simulated by numerically solving the ODE system using the scipy.integrate.odeint function in Python. The numerical integration required defining an initial condition vector for each sub-reactor, representing the starting biomass and substrate concentrations inside the digester. Across all sub-reactors, the initial state specified the acidogenic and methanogenic biomass concentrations (in g/L), the organic substrate concentration (in mg/L), and the VFA concentration (in mmol/L). To reflect real operation, the inputs were drawn from bounded normal distributions based on the records of the real bio-digester (empirical operation ranges):
Input flow rate: modeled as a normal distribution and physically constrained to an empirical range (in m³/d).
Organic substrate loading: modeled as a normal distribution and physically constrained to an empirical range (in mg/L).
These specific constraints were derived directly from the empirical operational ranges of the physical swine bio-digester. They were strictly enforced to prevent computationally induced anomalous states (e.g., negative flows or toxic overloads) and to ensure that the generated synthetic scenarios remained entirely within the hydrodynamic design and biological limits of the real rural reactor.
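The bounded sampling of the two inputs can be illustrated as follows. This sketch assumes that "bounded" means clipping a normal draw to the allowed range; rejection sampling from a truncated normal would be an equally plausible reading. All means, standard deviations, and ranges below are hypothetical placeholders, since the empirical values are not reproduced in this text.

```python
import random

# Hypothetical placeholders (mean, std, low, high); not the paper's values.
Q_IN_SPEC = (5.0, 1.0, 2.0, 8.0)                     # m3/d
S1_IN_SPEC = (20000.0, 4000.0, 10000.0, 30000.0)     # mg/L


def bounded_normal(rng, mean, std, low, high):
    """Draw one normal sample and clip it to the interval [low, high]."""
    return min(max(rng.gauss(mean, std), low), high)


def sample_daily_inputs(n_days, seed=0):
    """Return a list of (q_in, s1_in) pairs, one per simulated day."""
    rng = random.Random(seed)
    return [
        (bounded_normal(rng, *Q_IN_SPEC), bounded_normal(rng, *S1_IN_SPEC))
        for _ in range(n_days)
    ]
```

Clipping keeps every draw inside the design limits but piles probability mass onto the two bounds; if that matters for the downstream model, a truncated normal is the gentler alternative.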
However, it must be explicitly acknowledged that while these bounded Gaussian distributions effectively simulate average operational baselines, they are a simplified mathematical approximation. They do not fully capture the complex dynamics of real-world feeding schedules, such as statistical skewness, inherent temporal autocorrelation, or abrupt operational disturbances (e.g., shock loading or sudden system upsets). Consequently, this synthetic generation represents an idealized, baseline operational environment. Through this computational approach, two distinct datasets were generated and exported for the machine learning pipeline:
Base model: A dataset of 30,000 records was generated. The exported feature space consisted of Simulation Time (Time), Input Flow Rate (Q_in_m3_d), Organic Substrate Loading (S1_in_mg_L), and the target variable Methane Biogas Production (CH4_Prod_L_d).
Lag-based improved model: A reduced dataset of 10,000 records was generated. This reduction in sample size was a deliberate operational trade-off to accommodate the significantly higher computational complexity of numerically integrating the dynamic thermodynamic perturbations, as well as processing the expanded 40-dimensional feature space. It is critical to state that the substantial predictive improvement is strictly attributed to the engineered lag features, not the dataset size. As demonstrated in the subsequent sensitivity analysis (
Figure 1), reducing the dataset size inherently limits predictive capacity; therefore, the 29% accuracy enhancement achieved by this model, despite utilizing only one-third of the training data, confirms the value of incorporating temporal memory. The exported feature space included Simulation Time (Time), Input Flow Rate (Q_in_m3_d), Organic Substrate Loading (S1_in_mg_L), the Seasonal Temperature Factor (Temp_Factor), and the target variable Methane Biogas Production (CH4_Prod_L_d).
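To make the data-generation loop concrete, the sketch below simulates a short run of daily records for a chain of CSTRs. It is not the authors' code: the source integrates with scipy.integrate.odeint, whereas a fixed-step classical Runge–Kutta (RK4) integrator is swapped in here to keep the example dependency-free. The balance equations follow a generic two-step structure, and every constant not marked "stated" is a hypothetical placeholder, including the sub-reactor count, the volume, the initial state, and the input distributions. Unit conversion of the methane rate to L/d is omitted.

```python
import random

# "stated" = value quoted in the text; everything else is a hypothetical placeholder.
MU1_MAX, MU2_MAX = 1.2, 0.74         # 1/d, hypothetical
K_S1, K_I = 6000.0, 150.0            # mg/L and mmol/L, stated
K_S2 = 9.28                          # mmol/L, hypothetical
K1, K2, K3 = 42140.0, 116.5, 268.0   # yield coefficients, hypothetical
K_CH4 = 0.18                         # methane yield coefficient, stated
N, V_TOTAL = 2, 100.0                # sub-reactor count and total volume (m3), hypothetical
S2_IN = 20.0                         # influent VFA (mmol/L), hypothetical


def growth_rates(s1, s2):
    """Monod rate for acidogens, Haldane rate for methanogens."""
    mu1 = MU1_MAX * s1 / (K_S1 + s1)
    mu2 = MU2_MAX * s2 / (K_S2 + s2 + s2 * s2 / K_I)
    return mu1, mu2


def rhs(state, q_in, s1_in):
    """Mass balances for N CSTRs in series; state holds [X1, X2, S1, S2] per reactor."""
    d_i = N * q_in / V_TOTAL            # local dilution rate, equal volumes assumed
    feed = (0.0, 0.0, s1_in, S2_IN)     # biomass-free feed to the first sub-reactor
    out = []
    for i in range(N):
        x1, x2, s1, s2 = state[4 * i:4 * i + 4]
        mu1, mu2 = growth_rates(s1, s2)
        out.extend([
            d_i * (feed[0] - x1) + mu1 * x1,
            d_i * (feed[1] - x2) + mu2 * x2,
            d_i * (feed[2] - s1) - K1 * mu1 * x1,
            d_i * (feed[3] - s2) + K2 * mu1 * x1 - K3 * mu2 * x2,
        ])
        feed = (x1, x2, s1, s2)         # outflow feeds the next sub-reactor
    return out


def rk4_step(state, dt, q_in, s1_in):
    """One classical Runge-Kutta step with inputs held constant over the step."""
    def shifted(slope, h):
        return [s + h * k for s, k in zip(state, slope)]
    a = rhs(state, q_in, s1_in)
    b = rhs(shifted(a, dt / 2), q_in, s1_in)
    c = rhs(shifted(b, dt / 2), q_in, s1_in)
    d = rhs(shifted(c, dt), q_in, s1_in)
    return [s + dt / 6 * (ka + 2 * kb + 2 * kc + kd)
            for s, ka, kb, kc, kd in zip(state, a, b, c, d)]


def methane_rate(state):
    """Sum of k * mu2 * X2 * V_i over sub-reactors (unit conversion omitted)."""
    v_i = V_TOTAL / N
    total = 0.0
    for i in range(N):
        _, x2, s1, s2 = state[4 * i:4 * i + 4]
        total += K_CH4 * growth_rates(s1, s2)[1] * x2 * v_i
    return total


def simulate(n_days, steps_per_day=20, seed=0):
    """Return rows of (day, q_in, s1_in, methane_rate) with daily-constant inputs."""
    rng = random.Random(seed)
    state = [0.5, 0.3, 2000.0, 5.0] * N   # hypothetical initial condition
    rows = []
    for day in range(n_days):
        q_in = min(max(rng.gauss(5.0, 1.0), 2.0), 8.0)
        s1_in = min(max(rng.gauss(20000.0, 4000.0), 10000.0), 30000.0)
        for _ in range(steps_per_day):
            state = rk4_step(state, 1.0 / steps_per_day, q_in, s1_in)
        rows.append((day, q_in, s1_in, methane_rate(state)))
    return rows
```

Each row mirrors the exported base-model columns (Time, Q_in_m3_d, S1_in_mg_L, CH4_Prod_L_d). A fixed-step integrator is adequate for these mild placeholder parameters; an adaptive solver such as odeint is the safer choice when the kinetics become stiff.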
2.5. Feature Engineering and Machine Learning Architecture
To ensure model stability and eliminate the influence of the initial unsteady-state mathematical start-up, a transient period was discarded from the synthetic datasets prior to the feature engineering phase (the first 50 days for the base model and the first 100 days for the lag-based improved model). To capture the system’s temporal dynamics and provide the predictive models with historical memory, systematic lag vectors were engineered, consistent with the information available in an operational bio-digester. The lag windows (spanning up to 30 days) were chosen to encompass the average hydraulic retention time (HRT) of the system. This design ensures that the feature space captures both the immediate short-term daily fluctuations and the extended biological inertia of the microbial population. A separate set of lag intervals, expressed in days, was incorporated into the feature space for the base model and for the lag-based improved model.
For the base model scenario, the input feature vector constructed for any given day t resulted in a 16-dimensional array, organized as shown in Equation (11):
where the lags set is defined as
days.
However, for the lag-based improved model, the seasonal temperature factor and its physical interactions with the mass flows were considered. Additionally, the records of historical biogas production were included. The inclusion of lagged methane production variables is consistent with real-world operational scenarios, where historical biogas measurements are continuously monitored and readily available. These variables therefore allow the model to exploit the inherent temporal dependencies of the process, enhancing short-term predictive capability within a realistic digital twin framework.
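The thermodynamic interaction variables were defined by combining the seasonal temperature factor with each of the two mass-flow inputs (input flow rate and organic substrate loading). Based on the refined lag set, the augmented input feature vector expanded into a 40-dimensional array, organized as shown in Equation (12):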
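The lag construction described in this subsection can be sketched with plain Python lists. The lag set, the interaction column names (`Q_x_Temp`, `S1_x_Temp`), and the choice of which columns are lagged are hypothetical: the actual compositions are given by Equations (11) and (12), which are not reproduced here. The example arrangements were picked only because they happen to reach the stated 16 and 40 dimensions.

```python
def build_lagged_rows(series, current_cols, lagged_cols, lags, target):
    """Build feature rows from same-day columns plus lagged copies of other columns.

    series maps a column name to a chronological list. The target column may
    appear in lagged_cols (auto-regressive lags) but never in current_cols,
    so the value being predicted is not leaked into its own features.
    """
    n = len(series[target])
    start = max(lags)                      # first day with a full lag history
    names = list(current_cols) + [f"{c}_lag{k}" for c in lagged_cols for k in lags]
    features, targets = [], []
    for t in range(start, n):
        row = [series[c][t] for c in current_cols]
        row += [series[c][t - k] for c in lagged_cols for k in lags]
        features.append(row)
        targets.append(series[target][t])
    return features, targets, names


LAGS = (1, 2, 3, 5, 7, 14, 30)             # hypothetical lag set, in days

days = range(60)
demo = {
    "Q_in_m3_d": [5.0 + 0.01 * d for d in days],
    "S1_in_mg_L": [20000.0 + 10.0 * d for d in days],
    "Temp_Factor": [1.0 + 0.001 * d for d in days],
    "CH4_Prod_L_d": [4000.0 + 2.0 * d for d in days],
}
# Interaction terms taken as simple products (assumption).
demo["Q_x_Temp"] = [q * f for q, f in zip(demo["Q_in_m3_d"], demo["Temp_Factor"])]
demo["S1_x_Temp"] = [s * f for s, f in zip(demo["S1_in_mg_L"], demo["Temp_Factor"])]

base_inputs = ["Q_in_m3_d", "S1_in_mg_L"]
base_X, base_y, base_names = build_lagged_rows(
    demo, base_inputs, base_inputs, LAGS, "CH4_Prod_L_d")      # 2 + 2*7 columns

lag_current = base_inputs + ["Temp_Factor", "Q_x_Temp", "S1_x_Temp"]
lag_lagged = base_inputs + ["Q_x_Temp", "S1_x_Temp", "CH4_Prod_L_d"]
lag_X, lag_y, lag_names = build_lagged_rows(
    demo, lag_current, lag_lagged, LAGS, "CH4_Prod_L_d")       # 5 + 5*7 columns
```

Rows start at the largest lag so that every feature is defined, which is one reason an initial stretch of each dataset cannot be used for training.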
It is important to emphasize that the primary contribution of this methodology relies on how lag-based temporal memory enhances biogas prediction, rather than an exhaustive benchmarking of machine learning algorithms. To demonstrate this, the XGBoost algorithm was selected as the core regression engine because it represents the optimal trade-off between handling non-linearities and providing transparent model interpretability. While traditional time series models like ARIMA or SARIMA are computationally lightweight, they rely on linear assumptions and struggle with the complex thermodynamic perturbations and microbial inhibitions (e.g., Haldane kinetics) inherent to anaerobic digestion. Conversely, Deep Learning architectures such as Long Short-Term Memory (LSTM) networks excel at sequential dynamics but function largely as “black boxes,” obscuring the specific physical relevance of the input variables. Furthermore, while other tree-based ensembles (such as Random Forest, LightGBM, or standard Decision Trees) could theoretically model the data, XGBoost was selected over them for its superior handling of sparse lagged matrices and its strict regularization mechanisms. Because the core objective was to evaluate the impact of historical process memory, interpretability was as critical as predictive accuracy. XGBoost successfully captures these non-linear temporal dependencies while providing explicit feature importance metrics. This transparent architecture allowed for the quantitative validation of the engineered lag features, explicitly revealing the predictive weight of the system’s biological memory without the opacity of deep learning approaches.
2.6. Model Training, Evaluation Metrics, and Data Leakage Prevention
To explicitly prevent data leakage and rigorously evaluate predictive performance, a multi-layered prevention strategy was implemented. First, physical data leakage was strictly avoided during the feature engineering phase. Variables corresponding to system effluents or physical “out” parameters were explicitly excluded from the predictive feature space, as these measurements are practically unobtainable prior to actual biogas generation in a real-world operational setting. Including them would introduce an artificial, non-causal advantage that would compromise the model’s reliability as an operational tool.
Second, to prevent temporal data leakage, a primary sequential chronological split was implemented for both experiments. The first 80% of the temporal registers were used for model training, while the remaining 20% were strictly reserved as an unseen testing set.
Unlike randomized splitting, which can introduce severe data leakage in time series modeling by training on future events to predict the past, this strict chronological division preserves the natural temporal evolution of the biochemical process.
Crucially, to prevent implicit information leakage during hyperparameter tuning, the 80% development set was internally subdivided. A chronological validation subset was extracted from this internal data specifically to monitor validation loss and trigger early stopping, ensuring the 20% test set was never exposed to the model during the learning or selection phase.
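The nested chronological split described above can be sketched as follows. The 80/20 development/test proportion comes from the text; the share of the development set held out for validation is not stated, so the 20% used here is a hypothetical default.

```python
def chronological_split(rows, dev_pct=80, val_pct_of_dev=20):
    """Split time-ordered rows into train, validation, and test without shuffling.

    The validation block is the most recent part of the development set, so
    early stopping never sees data from the test period.
    """
    n_dev = len(rows) * dev_pct // 100
    n_val = n_dev * val_pct_of_dev // 100
    train = rows[:n_dev - n_val]
    val = rows[n_dev - n_val:n_dev]
    test = rows[n_dev:]
    return train, val, test
```

For 100 time-ordered rows this yields 64 training rows, 16 validation rows, and 20 test rows, with every training index preceding every validation index, and every validation index preceding every test index.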
Furthermore, to ensure robustness across different temporal subsets and definitively rule out temporal overfitting, supplementary validation schemes were integrated into the methodology. These included a time series-aware 5-fold cross-validation strategy—sequentially expanding the training window to respect temporal dependencies—and a more restrictive 60/40 chronological split.
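An expanding-window scheme of the kind mentioned above can be written as a small generator. This is a sketch that mirrors the common convention of equally sized test blocks at the end of the series; the source does not say which implementation was used, so the fold sizing rule is an assumption.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with a growing training window.

    Each test block has n_samples // (n_splits + 1) rows and always lies
    strictly after its training window.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        yield range(0, start), range(start, start + test_size)
```

With 60 samples and 5 folds, the first fold trains on indices 0–9 and tests on 10–19, and the last fold trains on 0–49 and tests on 50–59, so no fold ever trains on data from its own future.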
The hyperparameter tuning workflow for the XGBoost regressor relied on an iterative evaluation of the validation loss, systematically adjusting tree complexity, learning rate, and explicit regularization parameters (such as subsampling) to mitigate overfitting risks, thereby balancing learning capacity and generalization:
Base model configuration: The model used 1000 boosting rounds (estimators), a learning rate of 0.05, and a maximum tree depth of 5. Subsampling and column sampling by tree were both set to 0.80. Early stopping was triggered if the validation loss did not improve for 50 consecutive rounds.
Lag-based improved model configuration: To capture the complex non-linearities of the expanded 40-dimensional feature space without overfitting, the maximum tree depth was significantly increased to 12. To balance this high capacity and ensure proper generalization, the learning rate was reduced to a conservative 0.01, paired with an extended allowance of 5000 maximum estimators. Furthermore, explicit regularization was enforced by restricting subsampling and column sampling to 0.70, injecting stochasticity into the tree-building process. Finally, the early stopping criterion was configured with a patience of 100 rounds. This stopping mechanism was strictly driven by the isolated chronological validation subset, effectively halting training before memorization could occur (a regularization success that is empirically demonstrated in the estimator sensitivity analysis presented in
Section 3.3.3).
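For reference, the two configurations listed above can be collected as keyword dictionaries. The numeric values are those stated in the text; the mapping onto the parameter names of the xgboost scikit-learn wrapper (`n_estimators`, `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `early_stopping_rounds`) is an assumption about how they would be passed, and any setting not mentioned in the text is left at the library default.

```python
BASE_PARAMS = {
    "n_estimators": 1000,          # boosting rounds
    "learning_rate": 0.05,
    "max_depth": 5,
    "subsample": 0.80,
    "colsample_bytree": 0.80,
    "early_stopping_rounds": 50,   # patience on the validation loss
}

LAG_PARAMS = {
    "n_estimators": 5000,          # upper bound; early stopping usually ends sooner
    "learning_rate": 0.01,
    "max_depth": 12,
    "subsample": 0.70,
    "colsample_bytree": 0.70,
    "early_stopping_rounds": 100,
}

# Possible usage with recent xgboost releases (not executed here):
#   model = xgboost.XGBRegressor(**LAG_PARAMS)
#   model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
# where (X_val, y_val) is the chronological validation subset, never the test set.
```

Keeping the two settings side by side makes the trade-off visible: the lag-based model uses deeper trees, offset by a five-times smaller learning rate and stronger row and column subsampling.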
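The performance of the model was assessed using the coefficient of determination ($R^2$), the Root Mean Square Error (RMSE), and the Mean Absolute Error (MAE), which are defined in Equations (13), (14), and (15), respectively:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (13)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (14)$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \quad (15)$$
where n represents the total number of observations in the testing set, $y_i$ is the ground-truth biogas production generated by the phenomenological model, $\hat{y}_i$ is the predicted biogas production by the XGBoost model, and $\bar{y}$ is the mean of the actual observed values.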
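The three metrics can be computed directly from their standard definitions. The sketch below uses only the standard library; the function name and the dictionary keys are illustrative choices.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for two equally long sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)           # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
    }
```

Because RMSE and MAE carry the units of the target (L/d here) while $R^2$ is dimensionless, reporting all three gives both a relative and an absolute view of the error.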
4. Discussion
The presented results demonstrate the efficacy of a machine learning-based surrogate model for predicting biogas production in anaerobic digesters. In the first analysis, the predictive performance of the base model in terms of $R^2$ remained limited even when the number of training registers was increased (100,000 registers, $R^2$ = 0.6949), as shown in the sensitivity analysis in Figure 1. This confirms the higher performance of the lag-based model, which achieved $R^2$ = 0.9788 with only 10,000 registers and $R^2$ = 0.9939 when trained with 100,000 samples.
To evaluate the model’s reliability under unpredictable real-world weather, bounded stochastic perturbations were added to the temperature profile (representing daily variations of up to 20%). While this approach successfully evaluates the model’s resilience against daily micro-climatic fluctuations, it is acknowledged as a limitation that these purely random perturbations lack complex temporal autocorrelation and do not represent extreme, isolated climatic events (e.g., severe storms). Despite these limitations, the lag-based model demonstrated high resilience. As shown in
Table 7, beyond the $R^2$ score of 0.9599, the predictive robustness was confirmed through absolute error metrics, with the RMSE remaining at 180.90 L/d under the most severe noise condition. This limited drop in performance indicates that the model does not merely fit the dominant seasonal trends, but maintains bounded accuracy across noisy operating conditions, reflecting the natural thermal inertia of real biogas plants.
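The bounded perturbation of the temperature profile can be illustrated with a short helper. The 20% bound is the figure quoted above; the uniform distribution and the multiplicative form of the noise are assumptions, since the text only states that the perturbations are bounded and purely random.

```python
import random


def perturb_profile(profile, max_rel_noise=0.20, seed=0):
    """Apply independent bounded multiplicative noise to each daily value.

    Each value is scaled by a factor drawn uniformly from
    [1 - max_rel_noise, 1 + max_rel_noise] (distribution assumed).
    """
    rng = random.Random(seed)
    return [v * (1.0 + rng.uniform(-max_rel_noise, max_rel_noise)) for v in profile]
```

Because every day is perturbed independently, the noise has no temporal autocorrelation, which is exactly the limitation acknowledged in the paragraph above.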
This contribution is highly valuable as it addresses a current challenge in the application of artificial intelligence to anaerobic digestion: the lack of open-access training databases, as recently highlighted in the state of the art [
14]. By generating and utilizing the two synthetic datasets described in this study, our methodological approach effectively addressed this limitation.
Furthermore, this methodology has the potential to optimize biogas production by allowing the computational tuning of operational parameters prior to physical implementation, which translates into significant cost reductions. Such predictive and data-driven optimization frameworks are becoming increasingly essential across various environmental engineering applications, similar to the use of response surface methodologies for scaling up complex remediation systems [
19]. Another key contribution is the model’s applicability to physically implemented bio-digesters. By integrating real-world sensor data to create a surrogate model (Digital Twin), operators can evaluate the system’s response to dynamic variations such as fluctuating organic load inputs or substrate concentrations without risking the biological stability of the physical system. However, a fundamental limitation regarding the synthetic data generation must be explicitly acknowledged. Because the machine learning architecture was trained on datasets generated by a mechanistic framework, it inherently reproduces and partially inherits the structural assumptions and theoretical biases embedded in the underlying ODEs. Consequently, the current evaluation predominantly reflects ODE-consistent behavior rather than an independent physical validation against highly unstructured real-world variability. Practical phenomena such as real sensor noise, instrument drift, and unmodeled operational disturbances are not fully captured by the synthetic datasets. Rather, this work is framed as an advanced ODE emulation and surrogate modeling strategy. The value of this hybrid approach lies in its ability to overcome the numerical diffusion and instability limits of classical models, providing a mathematically robust, lag-aware core that serves as an efficient intermediate step toward practical digital twin implementations.
Finally, while the initial base model performed acceptably, the lag-based architecture demonstrated superior reliability and predictive accuracy by incorporating historical biogas production data, a variable that is typically monitored and readily available in field operations. This architectural choice represents a concrete modeling advancement over purely mechanistic approaches. As demonstrated during the spatial discretization analysis (
Table 1), relying solely on the phenomenological ODEs established by Cardona [
16] leads to a drastic collapse in predictive correlation in terms of $R^2$ when spatial nodes are increased to capture plug-flow dynamics (dropping to $R^2$ = 0.2933 for
), culminating in a total simulated system failure (acid crash) at
based on our kinetic, stoichiometric, and operational parameters (
Table 2). In contrast, the proposed lag-based model overcomes these numerical diffusion limitations, demonstrating remarkable resilience by recovering the correlation to $R^2$ = 0.9428 at
and stabilizing at a highly reliable 0.9788 for the optimal
configuration.
This superiority was further confirmed by the feature importance analysis (
Figure 6), where the biogas production of the immediately preceding day alone accounts for 44.36% of the model’s total feature importance, with the remaining share distributed among the other 39 predictors. This result highlights the relevance of recent historical measurements in capturing the temporal dynamics of anaerobic digestion processes, particularly for short-term forecasting scenarios in an operational environment. Finally, the additional validation scheme through time series cross-validation (5-fold) confirmed this high predictive stability, maintaining the $R^2$ (Table 4) obtained in the primary temporal 80/20 split.
Limitations of the Study
While the proposed lag-based surrogate model demonstrates significant predictive enhancements and effectively bypasses the numerical instability of purely mechanistic approaches, several limitations within the current framework must be explicitly acknowledged:
Synthetic Data Bias: Because the machine learning architecture was trained on datasets generated by a mechanistic framework, it inherently reproduces and partially inherits the structural assumptions and theoretical biases embedded in the underlying ordinary differential equations.
Lack of Real-World Validation: The current evaluation predominantly reflects ODE-consistent behavior. Practical phenomena such as real sensor noise, instrument drift, and unmodeled operational disturbances are not fully captured. Independent physical validation using real-world SCADA datasets remains a necessary next step.
Simplified Temperature Modeling: Environmental exposure was simulated using a bounded sinusoidal baseline with stochastic perturbations. While this evaluates basic robustness, it lacks the complex temporal autocorrelation and extreme climatic events characteristic of real weather patterns.
Limited Biological Complexity: The foundational phenomenological model relies on simplified macroscopic kinetics (Monod and Haldane). It does not explicitly account for the full spectrum of complex biological dynamics, such as deep syntrophic dependencies, pH micro-environments, or dynamic shifts in microbial community structures.
Absence of Multi-Model Benchmarking: The core objective of this study was to isolate and quantify the predictive value of lag-based temporal memory. Consequently, an exhaustive algorithmic benchmark against other architectures (e.g., Random Forest, LSTM, SARIMA) was intentionally excluded to maintain scope, remaining an open avenue for future computational research.
5. Conclusions
On the one hand, biogas estimation is a challenging problem due to the complex models involved in the dynamics of the biochemical process. On the other hand, the lack of data of sufficient quality and quantity for biogas generation processes makes it difficult to train and validate machine learning models that are useful in most applications. In the first stage of our proposal, the methodology uses differential equation models to generate synthetic data based on the kinetic, stoichiometric, and operational parameters of a state-of-the-art, real-world bio-digester, addressing the lack of high-quality open-access datasets. In the second stage, two machine learning-based models that reflect the operational behavior of the physical system (in terms of biogas production) were obtained. While the base model performed acceptably, a 29% performance improvement was achieved by including historical memory through specific lag vectors built from the available bio-digester operational data, which represents the main contribution of this paper. A sensitivity analysis with respect to data quantity and an uncertainty analysis with respect to noise in the temperature profile were performed, supporting the robustness of our findings. Both machine learning models were developed in terms of operational parameters such as the organic substrate loading and the input flow rate. This approach facilitates the implementation of a digital twin, allowing operators to troubleshoot the system virtually before applying changes to the physical biodigester. To bridge the gap between simulation-based accuracy and practical reliability, our primary direction for future work includes a concrete validation roadmap.
This pathway encompasses: the integration of the surrogate model with real-time SCADA data, its calibration using real-world operational datasets—leveraging transfer learning techniques—and the development of hybrid physics–machine learning correction layers designed to independently filter and adapt to unstructured environmental noise. Additional future efforts will incorporate advanced stochastic processes (e.g., ARIMA-based perturbations) to model operational mass flows and temperature profiles, moving beyond simple Gaussian or sinusoidal assumptions to accurately reflect system upsets, skewness, and temporal autocorrelation. We will also expand the approach to other biogas production processes in wastewater treatment plants to validate its implementation feasibility, including the modeling of more complex microbiological dynamics.