Methane Concentration Prediction in Anaerobic Codigestion Using Multiple Linear Regression with Integrated Microbial and Operational Data

Ostos, Iván; Ruiz, Iván; Cruz, Diego; Flórez-Pardo, Luz Marina

doi:10.3390/bioengineering12111133


by Iván Ostos ¹,*, Iván Ruiz ¹, Diego Cruz ² and Luz Marina Flórez-Pardo ³

¹ Grupo de Investigación en Ingeniería Electrónica, Industrial, Ambiental, Metrología GIEIAM, Universidad Santiago de Cali, Cali 760036, Colombia

² Grupo de Interacciones Microbianas GIM, Universidad de Salamanca, 37008 Salamanca, Spain

³ Grupo de Investigación en Modelado, Análisis y Simulación de Procesos Ambientales e Industriales PAI+, Universidad Autónoma de Occidente, Cali 760030, Colombia

* Author to whom correspondence should be addressed.

Bioengineering 2025, 12(11), 1133; https://doi.org/10.3390/bioengineering12111133

Submission received: 23 August 2025 / Revised: 18 September 2025 / Accepted: 22 September 2025 / Published: 22 October 2025

(This article belongs to the Special Issue Anaerobic Digestion Advances in Biomass and Waste Treatment)


Abstract

Anaerobic codigestion of organic residues is a proven strategy for enhancing methane recovery. However, the complexity of microbial interactions and variability in operational conditions make it difficult to estimate methane concentration in real time, particularly in rural contexts. This study developed a multiple linear regression model to predict methane concentration using operational data and microbial community profiles derived from 16S rRNA gene sequencing. The system involved the codigestion of cassava by-product and pig manure in a two-phase anaerobic reactor. Predictor variables were selected through a hybrid approach combining statistical correlation with microbial functional relevance. The final model, trained on 70% of the dataset, demonstrated satisfactory generalization capability on the remaining 30% held out as a test set, achieving a coefficient of determination (R²) of 0.92 and a mean relative error (MRE) of 6.50%. Requiring only a limited set of inputs and minimal computational resources, the model offers a practical and accessible solution for estimating methane levels in decentralized systems. The integration of microbial community data represents a meaningful innovation, improving prediction by capturing biological variation not reflected in operational parameters alone. This approach can support local decision making and contribute to Sustainable Development Goal 7 by promoting reliable and affordable technologies for clean energy generation in rural and resource-constrained settings.

Keywords:

codigestion; metagenomics; biogas; MLR


1. Introduction

The Valle del Cauca region is one of Colombia’s most active agro-industrial areas, combining high agricultural productivity with unique ecological richness. The territory is sustained by ecosystems that range from coastal plains to montane forests, which support both biological diversity and productive capacity. It ranks as the third-largest producer and consumer of pork in the country, with a reported output of 88,105 tons in 2023, equivalent to 15.6% of national production, and an average pig population exceeding 396,000 animals [1]. This sector generates large volumes of pig manure (PM) that require appropriate handling to prevent environmental and public health risks. Another productive activity with growing regional relevance is cassava cultivation, which covered approximately 564 hectares in 2020, yielding a total of 9888 tons of fresh roots [2]. During starch extraction, each kilogram of cassava generates about 0.2 kg of starch, 0.65 kg of fibrous residue (cassava dregs (CD)), and between five and seven liters of wastewater [3,4]. Based on these ratios, the estimated annual generation of by-products in the region reaches nearly 6427 tons, most of which are not currently valorized.

To address the increasing accumulation of organic residues from pig farming and cassava processing, anaerobic digestion (AD) has been promoted in rural areas of Valle del Cauca as a strategy for energy recovery and waste management. In these settings, one-phase tubular biodigesters are commonly employed due to their affordable construction, ease of installation and minimal infrastructure requirements, making them particularly attractive to smallholder producers [5].

AD is a biologically mediated process capable of metabolizing up to 95% of the organic matter [6]. It proceeds through four main sequential stages, each driven by specific microbial groups. In the hydrolysis phase, hydrolytic bacteria degrade complex macromolecules such as carbohydrates, proteins and lipids into soluble monomers. During acidogenesis, these compounds are converted by fermentative bacteria into volatile fatty acids (VFAs), alcohols, hydrogen and carbon dioxide. In acetogenesis, acetogenic microorganisms convert these intermediates into acetate, along with additional hydrogen and carbon dioxide. Finally, in methanogenesis, archaea utilize acetate, hydrogen and carbon dioxide to generate methane as the principal end product [7]. While each stage performs a distinct role, the overall efficiency of this multistep pathway depends on the synchronized activity of these microbial groups, where syntrophic cross-feeding and interspecies H₂/formate transfer channel intermediates toward effective substrate valorization and stable methane formation [6,8,9]. Beyond its biological complexity, AD offers important advantages, including the provision of reliable baseload renewable energy that is independent of weather conditions, the achievement of high energy yields per unit area once stabilized, and the generation of multiple energy outputs such as biomethane, hydromethane, electricity, heat, and biohydrogen [7].

For the process to remain stable and efficient, environmental conditions such as pH and temperature must be kept within optimal ranges, typically between 6.5 and 7.5 for pH and 30 to 38 °C under mesophilic conditions [10,11]. In addition, maintaining a C:N ratio between 20:1 and 30:1 is considered ideal for AD, as it ensures sufficient nitrogen for microbial growth without leading to ammonia inhibition or carbon limitation [12,13]. However, most rural systems lack monitoring tools and operate through empirical practices, without clear understanding of internal conditions or microbial dynamics [5,14]. This limitation frequently leads to process imbalance, reduced performance and early system failure.

To overcome the performance limitations of conventional digesters, several strategies have been developed to improve substrate biodegradability and enhance biogas production. Among them, mechanical pre-treatments, codigestion, and multiphase configurations have proven to be particularly effective in increasing system efficiency [15,16,17]. Mechanical pre-treatments have proven effective in enhancing the hydrolysis of lignocellulosic substrates by reducing particle size and fiber crystallinity, thus increasing surface area and enzymatic accessibility [15,18]. Depending on specific conditions, methane production improvements of 16% to 99% have been reported with mechanical treatments [19]. These results highlight the potential of simple mechanical treatments to enhance biodegradability and biogas productivity, especially during the hydrolysis and acidogenesis phases, which are often rate-limited in solid waste digestion.

Codigestion has emerged as a robust strategy to address the nutrient imbalances and low biodegradability often associated with single-substrate digestion. By combining complementary feedstocks, this approach improves the carbon to nitrogen (C:N) ratio, dilutes inhibitors, and stimulates microbial activity, allowing for higher energy yields [16]. For instance, it has been reported that mixtures containing 66% PM, 16% cassava pulp, and 16% bagasse achieve higher methane yields than those with high bagasse content alone, which led to pH imbalances and process failure [20]. Likewise, biogas production efficiency and system stability have been reported for the codigestion of food waste and corn straw at a hydraulic retention time (HRT) of 25 days, showing that codigestion notably enhanced the efficiency of the hydrolysis and acidogenesis stages, with the highest anaerobic biodegradability (85.7%) obtained when the food waste content was set at 60% [21]. These improvements are attributed to enhanced microbial synergy and substrate availability, which accelerate volatile solids degradation.

Multiphase AD systems have been developed to address the limitations of single-stage configurations by creating distinct operational environments for each metabolic phase [17]. In two-phase systems, the acidogenic and methanogenic stages are physically separated, which enables more efficient substrate conversion, greater resilience to organic shocks, and better pH control [4,17]. This structural decoupling has led to increases in methane yields, improved volatile solids removal, and significant reductions in HRT without compromising performance [4]. Although three-phase systems further refine process compartmentalization by isolating hydrolysis, acidogenesis, and methanogenesis, they often entail higher operational complexity, energy consumption, and maintenance requirements [22]. These drawbacks have limited their scalability, particularly in low-resource contexts. Consequently, two-phase systems represent a practical balance between performance enhancement and technical feasibility, making them a more accessible alternative for decentralized applications.

Among the strategies developed to improve AD performance, the integration of real-time monitoring systems has become increasingly relevant for enhancing process oversight and operational efficiency [23]. Basic and key variables such as pH, temperature, and methane concentration can be considered to infer the internal state of the reactor and anticipate potential imbalances. The use of cost-effective IoT platforms such as ESP32 microcontrollers coupled with sensors has proven suitable for real-time tracking, achieving deviations below 2% for CH₄ and 1.7% for pH when compared to laboratory-grade methods [23,24]. Systems incorporating the MQ-4 sensor (200–10,000 ppm CH₄) and platforms like ThingSpeak facilitate continuous data acquisition, cloud visualization, and automatic alerts, offering a practical solution to reduce manual intervention and increase system reliability [23,25,26].

In parallel, greater attention should be given to the microbial community (MC) involved in AD, as it is rarely considered in routine operation despite being responsible for driving the entire process. Recent studies have highlighted that variations in microbial structure are strongly influenced by substrate type, operational parameters such as temperature and organic loading rate (OLR), and reactor configuration. However, most operational strategies still rely exclusively on physicochemical parameters, overlooking microbial signals that often precede system imbalances [7]. With this in mind, evidence from studies on hunger stress has demonstrated that shifts in microbial communities under adverse conditions provide valuable insights into process behavior and system dynamics, underscoring the importance of integrating microbial data into process understanding to clarify how structural and functional changes within the community influence methane levels [27].

Despite the central role of MC in AD, multiple linear regression (MLR) models have traditionally been developed using operational variables that capture external system conditions (parameters that are directly measurable or predefined during setup), while MC have often been treated as secondary inputs or excluded altogether. For instance, recent studies have used MLR to predict specific methane production from dry AD of the organic fraction of municipal solid waste in pilot-scale plug-flow reactors. Six significant, mostly operational predictors were prioritized (VS, OLR, HRT, C/N ratio, lignin content, and VFA) via Pearson correlation and PCA. Simple regression showed low performance (R² = 0.3), while the full MLR reached R² = 0.91. A reduced model with four uncorrelated variables (VS, OLR, C/N ratio, lignin content) maintained strong accuracy (R² = 0.87) with fewer inputs [28]. Similarly, MLR has been applied to predict VFA concentrations in AD of primary and secondary sludge using operational and physicochemical inputs. The model achieved R² values above 0.85 in several scenarios, offering high interpretability and low computational demand. Although less accurate than leading ensemble methods, MLR remains suitable for applications that require clear interpretation of variable influence [29].

Unlike models based solely on operational parameters, recent full-scale work in thermophilic dry methane systems showed that MC remained stable, with Methanoculleus and syntrophic acetate oxidizers dominating throughout the process. This stability enabled the development of an adjusted MLR model which achieved high predictive accuracy (R² = 0.97) and outperformed gradient boosting approaches, highlighting the importance of linking microbial consistency with operational data for reliable large-scale biogas prediction [30].

Building on emerging evidence supporting the integration of microbial data into statistical modeling, this study aims to develop a predictive model for methane concentration based on a set of measurable variables, including VFAs, microbial populations, and operational parameters. It evaluates the potential of MLR to predict methane concentrations in a low-cost, two-phase anaerobic digester treating PM and CD at laboratory scale. This work aligns with Sustainable Development Goal 7 by promoting accessible tools for energy generation from organic waste.

The article is structured into four main sections. The Introduction outlines the context of AD in the Valle del Cauca region, highlighting environmental and operational challenges from agro-industrial organic waste, reviewing strategies to improve biogas systems, and emphasizing the need to integrate microbial data into predictive models. The Materials and Methods section details the system setup, monitoring, sequencing, and the MLR approach used for variable selection and model construction. The Results and Discussion section presents the modeling outcomes, identifies relevant predictors, and interprets their contribution to system behavior. The Conclusions section summarizes the key findings and future perspectives for incorporating microbiota into data-driven frameworks for sustainable energy transitions.

2. Materials and Methods

This section first describes the dataset and the preprocessing steps undertaken. Subsequently, it details the initial linear modeling approach, followed by a feature selection process based on variable weighting to derive a simplified, yet robust, model. Finally, it presents the development of an adaptive predictive model using a moving window technique combined with a regularization method to prevent overfitting.

2.1. Substrate Selection

The substrates used in this study were fresh PM and CD. The inoculum, obtained from the same source as the manure, was included to ensure microbial compatibility with the feedstock. Both were collected at a small-scale pig farm located in the municipality of Florida, Valle del Cauca, where approximately 20 pigs are kept under semi-intensive conditions. Animal pens are washed twice daily, and the resulting wastewater, rich in organic matter, drains into a static open-air tank that served as the inoculum source. Fresh manure was manually collected after excretion using sanitized tools. CD were obtained from a medium-sized cassava starch-processing facility located in the rural area of Mandiba, Santander de Quilichao, Cauca. Processing nearly eight tons of cassava per day, the plant generates over two tons of lignocellulosic residue each week. This material was delivered in dry, milled form.

All samples were stored at 4 °C until physicochemical characterization, which included proximate analysis by gravimetric methods and determination of the carbon-to-nitrogen (C:N) ratio via high-temperature combustion. These procedures followed the Standard Methods for the Examination of Water and Wastewater (APHA, AWWA, WEF), ensuring analytical consistency as summarized in Table 1 [31,32,33].

2.2. Experimental Setup

The experimental setup consisted of a two-phase laboratory-scale anaerobic digester designed to operate without integrated control systems (Figure 1). The system was constructed using 110 mm sanitary-grade PVC tubing due to its low cost, durability, and ease of assembly. Phase 1 (D1F1) (3 L) was expected to perform hydrolysis and acidogenesis, while phase 2 (D1F2) (4 L) was expected to support acetogenesis and methanogenesis. Each chamber was operated at 80% of its total volume, 2.4 L in phase 1 and 3.2 L in phase 2, leaving the remaining headspace for biogas accumulation. To enable real-time monitoring, a low-cost IoT module was incorporated into the digester, integrating an Arduino UNO microcontroller with sensors for pH, temperature, and methane concentration. Data were transmitted through a mobile network to the ThingSpeak platform for remote visualization [26]. This setup allowed continuous monitoring without the need for sophisticated instrumentation.

2.3. Operational Parameters

To establish an active MC, both phases were fed inoculum for five days, until reaching the working volume. The inoculum had a C:N ratio of 10.3 and 2.2% TS. During start-up, the OLR, estimated with a five-day HRT, was 8.37 gVS/L·day. Thereafter, feeding used a 73:27 blend of PM and CD. The daily feed was 35 g fresh PM and 13 g CD, plus 166 g water to achieve 10% TS (214 g/day total). The theoretical C:N ratio was 21.55. With the defined working volumes, HRTs were 12 days for D1F1 and 15 days for D1F2. Corresponding OLRs were 7.7 and 5.7 gVS/L·day. VS inputs were 18.46 g/day (D1F1) and 18.45 g/day (D1F2). Daily manual feeding with graduated containers and isolation valves ensured accurate dosing and anaerobiosis.
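The feed arithmetic in this paragraph can be retraced with a short sketch. It assumes the usual definition of OLR as daily VS input divided by working volume; the function and variable names are illustrative and not taken from the study. Under that assumption the values come out at roughly 7.7 and 5.8 gVS/L·day, close to the reported 7.7 and 5.7.

```python
# Daily feed recipe reported in the text (grams per day).
pm_g, cd_g, water_g = 35.0, 13.0, 166.0
total_feed_g = pm_g + cd_g + water_g

# Share of each substrate in the PM:CD blend (water excluded).
pm_share = pm_g / (pm_g + cd_g)
cd_share = cd_g / (pm_g + cd_g)


def olr(vs_g_per_day: float, working_volume_l: float) -> float:
    """Organic loading rate in gVS/L·day (assumed: VS input / working volume)."""
    return vs_g_per_day / working_volume_l


# VS inputs and working volumes as stated for each phase.
olr_d1f1 = olr(18.46, 2.4)
olr_d1f2 = olr(18.45, 3.2)

print(f"total feed: {total_feed_g:.0f} g/day")
print(f"PM:CD blend: {100 * pm_share:.0f}:{100 * cd_share:.0f}")
print(f"OLR D1F1: {olr_d1f1:.2f} gVS/L·day, OLR D1F2: {olr_d1f2:.2f} gVS/L·day")
```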

The IoT-instrumented digester (D1) enabled incremental, data-driven feed adjustments in both phases (D1F1, D1F2) using real-time pH, temperature, and methane concentration. These signals guided when to lower the OLR and TS and when to apply temporary pH control, moving the reactors toward consistent operating conditions. Five feed formulations were implemented (Table 2). In D1F1, pH was briefly corrected with lime and then NaOH to keep it within 6.5–7.5; by mixture 5, recirculated digestate from D1F2 maintained pH without further chemicals. Mixture 4 used inoculum from an anaerobic digester at a university in Colombia treating food waste. Across mixtures, TS was reduced from 10% to 8–9%, OLR decreased from 12.4 gVS/L·day (inoculum step) to 5–6 gVS/L·day, and the C:N ratio increased in the final mixture due to recirculation while the contributions of PM and CD were reduced.

2.4. Steady State

Identifying steady-state periods was essential to build a reliable dataset, define representative operating conditions, and guide downstream variable prioritization and modeling. pH, temperature, and methane concentration were monitored continuously for 161 days (24/7). The IoT system logged three readings per minute for each variable and was routinely cross-checked against bench measurements to validate operational reliability.

Data volume was substantial: D1F1 recorded 694,110 samples per variable and D1F2 573,215. Processing followed six steps: (1) splitting timestamp into date and time; (2) validity filtering (e.g., pH 3–12; 10–45 °C; CH₄ within instrument bounds) with out-of-range values set to blank; (3) multivariate imputation by chained equations (MICE) to preserve temporal continuity [34]; (4) resampling to hourly means (2893 rows in D1F1; 2389 in D1F2) and (5) to daily means (152 and 147, respectively), retaining trends while reducing computational load as shown in Table 3.
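A minimal sketch of the filtering and resampling steps is given below. It is not the study's pipeline: the column names, the CH₄ bounds, and the synthetic input are hypothetical, and time-based interpolation is used as a simple stand-in for MICE, which is a different technique. Daily means are computed here from the hourly series, which is one possible reading of steps (4) and (5).

```python
import numpy as np
import pandas as pd

# pH and temperature bounds come from the text; the CH4 bounds are a
# hypothetical placeholder because the instrument range is not restated here.
BOUNDS = {"pH": (3.0, 12.0), "temp_C": (10.0, 45.0), "ch4": (0.0, 100.0)}


def preprocess(raw: pd.DataFrame):
    """Blank out-of-range readings, fill gaps, and resample to hourly and daily means."""
    df = raw.copy()
    for col, (lo, hi) in BOUNDS.items():
        df[col] = df[col].where(df[col].between(lo, hi))  # out of range -> NaN
    # Stand-in for MICE (assumption): time-weighted interpolation of the gaps.
    df = df.interpolate(method="time").ffill().bfill()
    hourly = df.resample(pd.Timedelta(hours=1)).mean()
    daily = hourly.resample(pd.Timedelta(days=1)).mean()
    return hourly, daily


# Synthetic two-day log at three readings per minute (one every 20 s).
idx = pd.date_range("2024-01-01", periods=2 * 24 * 180, freq=pd.Timedelta(seconds=20))
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {
        "pH": 7.0 + rng.normal(0.0, 0.1, idx.size),
        "temp_C": 35.0 + rng.normal(0.0, 0.5, idx.size),
        "ch4": 60.0 + rng.normal(0.0, 2.0, idx.size),
    },
    index=idx,
)
raw.iloc[10, raw.columns.get_loc("pH")] = 14.0  # injected sensor glitch

hourly, daily = preprocess(raw)
print(hourly.shape, daily.shape)
```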

Stable windows were then identified via rolling windows using relative standard deviation thresholds (<15%) around moving means for pH, temperature, and methane concentration, with a minimum continuous duration and compliance with predefined operating limits [35]. D1 showed extended steady windows, typically with pH 6.5–7.5, facilitated by high-frequency data and the ability to adjust operating conditions in real time.
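The rolling-window criterion can be sketched as follows. Only the 15% relative standard deviation threshold comes from the text; the window length, the minimum run length, and the synthetic series are assumptions chosen for illustration, and the check against operating limits is omitted.

```python
import numpy as np
import pandas as pd


def steady_mask(series: pd.Series, window: int = 24, rsd_max: float = 15.0,
                min_run: int = 48) -> pd.Series:
    """Flag samples inside sufficiently long runs where the rolling RSD is below rsd_max."""
    roll = series.rolling(window, min_periods=window)
    rsd = 100.0 * roll.std() / roll.mean().abs()
    ok = rsd < rsd_max  # NaN (incomplete windows) compares as False
    # Label consecutive runs of identical flags, then measure each run's length.
    run_id = ok.ne(ok.shift(fill_value=False)).cumsum()
    run_len = ok.groupby(run_id).transform("count")
    return ok & (run_len >= min_run)


# Synthetic hourly methane signal: 100 oscillating samples, then 200 quiet ones.
rng = np.random.default_rng(0)
unstable = np.tile([10.0, 60.0], 50)
stable = 60.0 + rng.normal(0.0, 0.5, 200)
ch4 = pd.Series(np.concatenate([unstable, stable]))

mask = steady_mask(ch4)
print(f"steady samples: {int(mask.sum())} of {mask.size}")
```

In practice the same mask would be computed for pH, temperature, and methane concentration and combined with a logical AND, so that a window counts as steady only when all three signals satisfy the criterion.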

2.5. VFA Quantification

Samples were collected every three days in 5 mL Eppendorf tubes and stored at −20 °C until analysis. The final selection of samples for analysis was made considering the periods of system stabilization under IoT monitoring and budgetary constraints, prioritizing those most representative of the overall process behavior. Sampling was carried out during the active operation of the digester.

The quantification of VFAs was performed by gas chromatography, following the procedure described in section 5560D of the Standard Methods for the Examination of Water and Wastewater (APHA) [36], in the laboratory of the Department of Chemical Engineering and Analytical Chemistry at the University of Barcelona. Prior to chromatographic analysis, the samples were centrifuged and filtered through 0.45 µm nylon membranes to remove suspended solids. Each analysis vial contained 1 mL of sample, diluted or not depending on the estimated concentration level, along with 0.1 mL of 15% orthophosphoric acid containing a known concentration of 2-ethylbutyric acid (~500 mg/L) as an internal standard. This compound allowed verification of injection consistency and facilitated calibration of the equipment through the ratio of analyte to standard peak areas.

Analyses were carried out on a Shimadzu GC-2010 Plus (Shimadzu Corporation, Kyoto, Japan) gas chromatograph with a flame ionization detector, using a DB-FFAP capillary column, 30 m × 0.25 mm × 0.25 µm (Agilent Technologies, Santa Clara, CA, USA). The oven temperature program started at 60 °C with a two-minute hold, followed by an increase of 20 °C/min up to 240 °C, maintained for an additional two minutes. The total analysis time was 13 min. The injector (SPL-1) operated at 220 °C in split mode, with a split ratio of 50:1. Helium was used as the carrier gas at a pressure of 42.6 kPa, with a total flow of 233.4 mL/min, a column flow of 8.86 mL/min, and a linear velocity of 60 cm/s. The purge flow was set at 3 mL/min, and the makeup gas flow (nitrogen) at the detector was 10 mL/min. The injection volume was 2 mL, using helium, air, hydrogen, and nitrogen as auxiliary gases.
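As a quick consistency check on the oven program, the two holds and the ramp can be summed using only the values stated above:

$$t_{\text{total}} = 2\ \text{min} + \frac{(240 - 60)\ ^{\circ}\text{C}}{20\ ^{\circ}\text{C/min}} + 2\ \text{min} = 2 + 9 + 2 = 13\ \text{min}$$

which agrees with the reported total analysis time.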

For equipment calibration, a commercial VFA standard (Volatile Free Acid Mix, CRM46975, Supelco/MilliporeSigma [37]) containing defined concentrations of acetic, propionic, isobutyric, butyric, isovaleric, valeric, isocaproic, caproic, hexanoic, and heptanoic acids was used. Serial dilutions were prepared in 1:1, 1:2, 1:4, 1:8, 1:16, and 1:32 ratios, to which orthophosphoric acid and the internal standard were also added. For alcohol analysis (ethanol, propanol, and butanol), defined-concentration standard solutions were prepared, applying the same dilutions and analytical conditions. This procedure allowed precise and reproducible determination of VFAs in the samples, essential for evaluating the performance of the AD system and its relationship with operating conditions and microbiota.
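Internal-standard calibration of this kind is commonly implemented as a linear fit of the analyte-to-standard peak-area ratio against concentration. The sketch below illustrates that idea for a single acid. All numbers are synthetic, the stock concentration is hypothetical, and reading a 1:n dilution as a concentration divided by n is an assumption.

```python
import numpy as np

# Hypothetical stock concentration of one acid in the standard mix (mg/L).
stock_mg_l = 1000.0
dilution_factors = np.array([1, 2, 4, 8, 16, 32], dtype=float)
conc = stock_mg_l / dilution_factors  # assumed meaning of 1:1 ... 1:32

# Synthetic peak-area ratios (analyte area / internal-standard area).
area_ratio = 0.002 * conc + 0.01

# Calibration line: area_ratio = slope * conc + intercept.
slope, intercept = np.polyfit(conc, area_ratio, 1)


def quantify(sample_ratio: float, sample_dilution: float = 1.0) -> float:
    """Convert a measured area ratio to mg/L, correcting for sample dilution."""
    return (sample_ratio - intercept) / slope * sample_dilution


unknown = quantify(0.51)
print(f"estimated concentration: {unknown:.1f} mg/L")
```

Because the internal standard is added at the same concentration to every vial, dividing by its peak area cancels small differences in injected volume between runs.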

2.6. Metagenomic Analysis

Samples for metagenomic analysis were collected directly from the operating biodigester using 50 mL Falcon tubes. Sampling was performed every three days throughout the process, following the same prioritization criteria used for the quantification of VFAs, focusing on periods of greatest microbiological representativeness and considering the availability of resources. Once collected, samples were immediately frozen at −20 °C and stored until further processing.

To analyze the MC, the Falcon tubes were sent to Omega Bioservices (Norcross, GA, USA) for DNA extraction with the E.Z.N.A.® Universal Pathogen Kit, library preparation, and sequencing of the V3–V4 hypervariable region of the 16S rRNA gene using the primers 341F (CCTACGGGNGGCWGCAG) and 806R (GACTACHVGGGTATCTAATCC). Sequencing was conducted on an Illumina MiSeq platform (Illumina, San Diego, CA, USA) with 300 bp paired-end reads. Illumina reads were then analyzed using the BaseSpace app (version 1.1.3) [38]. Raw sequence data were demultiplexed and then quality filtered, denoised, merged, and chimera-filtered using DADA2 [39] to generate amplicon sequence variants (ASVs). Taxonomic assignment was conducted using the SILVA database (version 138.2) [40].

To structure the analysis of microbial interactions, a subset of phyla of interest was defined from the general metagenomic dataset, considering the sequencing reads obtained for each taxonomic group. The selection was based on two main criteria. First, the sustained presence of each phylum throughout the monitoring period was evaluated, excluding those with very low or intermittent representation, as their variability would hinder the detection of consistent associations in the relational analysis. Second, functional relevance reported in previous studies on anaerobic digestion was reviewed, prioritizing phyla whose involvement in fermentative, acetogenic, or methanogenic pathways has been extensively documented in similar systems [7,41].
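The first criterion (sustained presence) lends itself to a simple numerical filter; the second (functional relevance) is a literature-based judgment and is not encoded here. In the sketch below, the thresholds, phylum labels, and read counts are all hypothetical and serve only to illustrate the idea.

```python
import numpy as np


def select_phyla(counts, names, min_prevalence=0.8, min_mean_rel=0.01):
    """Keep phyla detected in most samples and with non-negligible mean relative abundance.

    counts: reads per sample (rows) and phylum (columns).
    Both thresholds are illustrative assumptions, not values from the study.
    """
    counts = np.asarray(counts, dtype=float)
    rel = counts / counts.sum(axis=1, keepdims=True)  # relative abundance per sample
    prevalence = (counts > 0).mean(axis=0)            # fraction of samples with reads
    keep = (prevalence >= min_prevalence) & (rel.mean(axis=0) >= min_mean_rel)
    return [name for name, flag in zip(names, keep) if flag]


# Five samples, three placeholder phyla: abundant, intermittent, and rare.
names = ["phylum_A", "phylum_B", "phylum_C"]
counts = [
    [900, 95, 5],
    [850, 0, 5],
    [950, 0, 5],
    [800, 195, 5],
    [990, 0, 5],
]
selected = select_phyla(counts, names)
print(selected)
```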

Once the representative periods were defined, the results from VFA quantification and metagenomic analysis were integrated, extending the characterization to the biochemical and microbiological components of the system. In several cases, the observed patterns were consistent with those reported in the specialized literature, which supported the robustness of the approach. The dataset included operational, biochemical, and microbiological variables [42,43,44,45,46].

Since the biochemical and microbiological measurements were less frequent than the operational records, imputation techniques were applied within the selected periods to expand the dataset without distorting the relationships among variables. Methods such as KNN imputation, iterative imputation, and MICE were employed [43,47]. The analysis focused on the period between days 97 and 154, which, although not representing a fully stabilized phase, shows a trend toward stabilization and coincides with the selected VFA and microbiological samples. This ensured consistency between the experimental data and the operational conditions.

2.7. Preprocessing and Unified Database

Once the representative periods were defined, the results from VFA quantification and metagenomic analysis were incorporated to extend the characterization of the system to its biochemical and microbiological dimensions. The patterns obtained aligned with those reported in specialized literature, reinforcing the validity of the approach [42,43,44,45,46]. The unified dataset combined operational, biochemical, and microbiological variables. Because biochemical and microbiological measurements were less frequent than operational records, imputation method MICE was applied to harmonize the dataset without altering the underlying relationships among variables [43,47].

The analysis focused on the period between days 97 and 154, which, while not fully stabilized, displayed a clear trend toward steady performance and coincided with the VFA and microbiological samples selected. This ensured coherence between experimental observations and operational conditions. The resulting dataset comprised daily averages over 58 days, which were further refined through linear interpolation to increase temporal resolution. This process expanded the series to 1000 points, enabling the application of moving window analyses, as illustrated in Figure 2.

The interpolation was validated for all variables, yielding R² values close to 1 and mean relative error (MRE) values around 0.1%, confirming a high-fidelity representation of the original data.
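The densification step can be sketched with `numpy.interp`. The validation shown here, reading the dense series back at the original days and comparing it with the daily means, is only one plausible way to obtain R² and MRE values of this kind; the exact procedure used in the study is not described in this section, and the methane series below is synthetic.

```python
import numpy as np


def interpolate_series(days, values, n_points=1000):
    """Linearly interpolate a daily series onto an evenly spaced grid of n_points."""
    grid = np.linspace(days[0], days[-1], n_points)
    return grid, np.interp(grid, days, values)


def r2_and_mre(y_true, y_pred):
    """Coefficient of determination and mean relative error (in percent)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    mre = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
    return 1.0 - ss_res / ss_tot, mre


days = np.arange(97, 155, dtype=float)  # 58 daily means, days 97 to 154
rng = np.random.default_rng(0)
ch4 = 55.0 + 5.0 * np.sin(days / 8.0) + rng.normal(0.0, 0.5, days.size)  # synthetic

grid, dense = interpolate_series(days, ch4)

# Read the dense series back at the original days and compare with the daily means.
recovered = np.interp(days, grid, dense)
r2, mre = r2_and_mre(ch4, recovered)
print(f"points: {grid.size}, R2: {r2:.5f}, MRE: {mre:.3f}%")
```

Linear interpolation adds no new information: the 1000 points remain determined by the 58 daily means, so they should not be treated as independent observations when assessing model performance.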

2.8. Linear Modeling

To simplify the proposed equations and procedure, the suffixes associated with each VFA (Table 4), microorganism (Table 5), and operating condition (Table 6) are shown below.

Equation (1), which linearly approximates the $CH_4$ concentration as a function of the microorganisms, fatty acids, and operating conditions, was proposed in the following linear form, based on the suffixes from Table 4, Table 5 and Table 6.

$$(CH_4)_{\mathrm{approx}} = C_1 (a_1) + C_2 (a_2) + \dots + C_{25} (m_{25}) + C_{26} (m_{26}) + \dots + C_{30} (p_{30}) + C_{31} (p_{31}) \tag{1}$$

In its matrix form (matrix $\underline{A}$), Equation (1) can be expressed as follows (Equation (2)):

$$\begin{bmatrix} (CH_4)_1 \\ (CH_4)_2 \\ \vdots \\ (CH_4)_n \end{bmatrix} = \begin{bmatrix} (a_1)_1 & (a_2)_1 & \dots & (m_{26})_1 & \dots & (p_{31})_1 \\ (a_1)_2 & (a_2)_2 & \dots & (m_{26})_2 & \dots & (p_{31})_2 \\ \vdots & \vdots & & \vdots & & \vdots \\ (a_1)_n & (a_2)_n & \dots & (m_{26})_n & \dots & (p_{31})_n \end{bmatrix} \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_{31} \end{bmatrix} \tag{2}$$

where the constants $C_i$ are the approximation coefficients. This matrix form from Equation (2) can be written more compactly as shown in Equation (3):

$$\underline{CH_4} = \underline{A}\,\underline{x} \tag{3}$$

The matrix $\underline{A}$ contains data collected from fatty acids, microorganisms, and operating conditions, the vector $\underline{CH_4}$ represents the collected methane production data, while the vector $\underline{x}$ contains the approximation coefficients that must be determined to formulate the model. The vector $\underline{x}$ can be solved by rearranging Equation (3) as follows:

$$\underline{x} = [\underline{A}^{T} \underline{A}]^{-1} \underline{A}^{T}\, \underline{CH_4} \tag{4}$$

where $\underline{A}^{T}$ is the transpose of matrix $\underline{A}$.
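Equation (4) is the normal-equation solution of the least-squares problem and can be evaluated in a few lines. The sketch below uses synthetic data with five predictors instead of 31; the function name and all values are illustrative assumptions. As in Equation (1), no intercept term is included.

```python
import numpy as np


def fit_normal_equations(A, ch4):
    """Evaluate Equation (4), x = (A^T A)^{-1} A^T CH4, without forming the inverse."""
    A = np.asarray(A, dtype=float)
    ch4 = np.asarray(ch4, dtype=float)
    return np.linalg.solve(A.T @ A, A.T @ ch4)


# Synthetic example: n observations of p predictors with known coefficients.
rng = np.random.default_rng(1)
n, p = 200, 5
A = rng.normal(size=(n, p))
x_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
ch4 = A @ x_true + rng.normal(scale=0.01, size=n)

x_hat = fit_normal_equations(A, ch4)

# Cross-check against NumPy's least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, ch4, rcond=None)
print(np.round(x_hat, 3))
```

When predictors are strongly correlated, $\underline{A}^{T} \underline{A}$ becomes ill-conditioned, and an SVD-based solver such as `numpy.linalg.lstsq` is numerically safer than solving the normal equations directly.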

2.8.1. Assessing Variable Importance

To determine the relative importance of each variable in the approximation, and subsequently define a smaller, more practical subset (as working with all 31 variables can be impractical and costly in terms of laboratory testing), a variable weighting method was used. Therefore, Equation (5) appears as a modification of Equation (3) considering the minimal error $\epsilon$.

$$\underline{CH_4} = \underline{A}\,\underline{x} + \epsilon \tag{5}$$

To quantify how much each variable “contributes” to the $CH_4$ production within the approximation, it is necessary to measure the relevance of each variable in the linear model. Since each variable may be measured on a different scale (e.g., microorganism abundance vs. fatty acid concentration in mg/L), directly comparing the raw coefficients $C_i$ in vector $\underline{x}$ can be misleading. Therefore, it is necessary to standardize the input data. In the same way, to compare the relative importance of each variable, the coefficients $C_i$ that form the vector $\underline{x}$ were standardized (as z-scores). The standardized coefficient $C_i^{*}$ for each variable $C_i$ was calculated as:

$$C_i^{*} = C_i \frac{\sigma_{C_i}}{\sigma_{CH_4}} \tag{6}$$

where $\sigma_{C_i}$ and $\sigma_{CH_4}$ are the standard deviations of the approximation coefficient $C_i$ and the response variable $CH_4$, respectively.
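The weighting in Equation (6) can be illustrated with a short sketch. It assumes the conventional standardized-coefficient reading, in which the numerator is the sample standard deviation of the predictor column multiplied by $C_i$; this interpretation is an assumption, as are the function name and the synthetic two-predictor data.

```python
import numpy as np


def standardized_coefficients(A, ch4, coefs):
    """Scale each coefficient by sd(predictor) / sd(response).

    Assumption: the numerator in Equation (6) is read as the standard
    deviation of the predictor column associated with each coefficient.
    """
    sd_x = np.std(A, axis=0, ddof=1)
    sd_y = np.std(ch4, ddof=1)
    return coefs * sd_x / sd_y


# Two synthetic predictors on very different scales.
rng = np.random.default_rng(2)
A = np.column_stack([
    rng.normal(0.0, 1.0, 500),     # small-scale variable (e.g., relative abundance)
    rng.normal(0.0, 1000.0, 500),  # large-scale variable (e.g., mg/L)
])
coefs = np.array([1.0, 0.005])
ch4 = A @ coefs

beta = standardized_coefficients(A, ch4, coefs)
print("raw:", coefs, "standardized:", np.round(beta, 3))
```

In this example the raw coefficient of the first predictor is 200 times larger, yet the second predictor explains most of the variation in the response once scale is accounted for, which is why ranking variables by raw coefficients can be misleading.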

2.8.2. Predictive Model

To capture the evolutionary nature of the anaerobic digestion process, a dynamic predictive model was developed based on a moving-window approach. The model operates iteratively. At each time step $t$, a linear regression model is trained using a window containing the last $k$ observations (in this case, $k = 10$ was chosen). This model is then used to make a one-step-ahead prediction of $\underline{CH_{4}}$ (denoted as $\underline{\hat{CH_{4}}}$), as a function of the weighted variables previously described. However, the use of small data windows can lead to overfitting. To address this problem and improve the model’s generalization capability, Ridge Regression was used instead of ordinary least squares. This regression introduces a penalty term into the least squares cost function. For each time window, the objective is to find the coefficient vector $\underline{x}$ that minimizes the following function:

\min_{\underline{x}} \left( \lVert \underline{CH_{4}} - \underline{A}\,\underline{x} \rVert_{2}^{2} + λ \lVert \underline{x} \rVert_{2}^{2} \right)

(7)

where $\lVert \underline{CH_{4}} - \underline{A}\,\underline{x} \rVert_{2}^{2}$ is the sum of squared errors (the data fit term at time $t$), $λ \lVert \underline{x} \rVert_{2}^{2}$ is the regularization term applied at time $t$, $λ$ is the regularization hyperparameter that controls the balance between the data fit and the model simplicity, and $\underline{x}$ contains the coefficients $C_{i}$. The hyperparameter $λ$ was selected to improve the model’s predictive performance ($λ = 2$). Thus, Equation (4) is rewritten to obtain the predictive parameters ($\underline{\hat{x}}$) by solving the following equation:

(\underline{\hat{x}})_{t + 1} = \left( [\underline{A}^{T} \underline{A} + λ \underline{I}]^{-1} \underline{A}^{T} \underline{CH_{4}} \right)_{t}

(8)

where $\underline{I}$ is the identity matrix. The goal of Equation (8) is to find the value of the coefficients in vector $\underline{x}$ at time $t + 1$ using the data available at time $t$. Using Equation (8), it is possible to find the $\underline{\hat{CH_{4}}}$ values for a subsequent window, given a defined window size of $k = 10$. In this way, Equation (3) becomes a prediction equation as follows:

(\underline{\hat{CH_{4}}})_{t + 1} = (\underline{A})_{t + 1} (\underline{\hat{x}})_{t + 1}

(9)
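The moving-window procedure of Equations (7)–(9) can be sketched in dependency-free Python. This is an illustrative sketch rather than the authors' implementation. The function names (`solve`, `ridge_fit`, `rolling_forecast`) and the toy data are hypothetical, and only $k = 10$ and $λ = 2$ come from the text. It also assumes that the predictor values for the target step are available when the forecast is made, as Equation (9) implies.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ridge_fit(A, y, lam):
    """Closed-form ridge solution x = (A^T A + lam I)^-1 A^T y (Equation (8))."""
    p = len(A[0])
    gram = [[sum(row[i] * row[j] for row in A) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    moment = [sum(row[i] * yi for row, yi in zip(A, y)) for i in range(p)]
    return solve(gram, moment)


def rolling_forecast(A, y, k=10, lam=2.0):
    """One-step-ahead forecasts: fit on rows t-k..t-1, predict row t (Equation (9))."""
    predictions = []
    for t in range(k, len(y)):
        x_hat = ridge_fit(A[t - k:t], y[t - k:t], lam)
        predictions.append(sum(a * c for a, c in zip(A[t], x_hat)))
    return predictions


# Hypothetical two-predictor series with an exact linear relation.
A = [[1.0 + 0.1 * i, 2.0 - 0.05 * i * i] for i in range(15)]
y = [3.0 * a + 0.5 * b for a, b in A]
preds = rolling_forecast(A, y, k=10, lam=0.0)  # lam = 0 reduces to least squares
```

With $λ = 0$ the fit reduces to ordinary least squares, so the noise-free toy series is reproduced almost exactly. Increasing $λ$ shrinks the coefficients toward zero, which is the mechanism that stabilizes fits on windows as short as ten observations.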

2.8.3. Model Performance Evaluation

The precision of the predictive model was quantified using three standard statistical metrics. These metrics evaluate the divergence between the observed values of $\underline{CH_{4}}$ and the values predicted by the model, $\underline{\hat{CH_{4}}}$. First, the coefficient of determination ($R^{2}$) indicates the proportion of the variance in methane production that is predictable from the independent variables. A value close to 1 indicates an almost perfect fit. Equation (10) shows how it was calculated for this case.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} [(CH_{4})_{i} - (\hat{CH_{4}})_{i}]^{2}}{\sum_{i = 1}^{n} [(CH_{4})_{i} - \overline{CH_{4}}]^{2}}

(10)

Next, the Root Mean Square Error (RMSE) represents the standard deviation of the prediction residuals. It measures the average error of the model in the same units as the response variable (ppm of $CH_{4}$), which facilitates its interpretation, and is expressed in Equation (11).

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} [(CH_{4})_{i} - (\hat{CH_{4}})_{i}]^{2}}

(11)

Finally, the mean relative error (MRE), defined in Equation (12), measures the average error in relative or percentage terms with respect to the real value. The absolute value was used to prevent positive and negative errors from canceling each other out.

MRE = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{(CH_{4})_{i} - (\hat{CH_{4}})_{i}}{(CH_{4})_{i}} \right|

(12)

In Equations (10)–(12), $(CH_{4})_{i}$ is the real value of the i-th observation, $(\hat{CH_{4}})_{i}$ is the value predicted by the model for the i-th observation, $\overline{CH_{4}}$ is the mean value of all real values, and n is the total number of observations used for the evaluation.
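The three metrics translate directly into code. The following Python sketch follows Equations (10)–(12) term by term. The function names and the sample values are hypothetical, and the MRE is returned as a fraction (multiply by 100 to express it as a percentage).

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination, Equation (10)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error, Equation (11), in the units of the response."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mre(actual, predicted):
    """Mean relative error, Equation (12), as a fraction of the real value."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical methane concentrations in ppm.
actual = [8000.0, 8200.0, 8500.0, 8100.0]
predicted = [7900.0, 8300.0, 8400.0, 8150.0]

scores = (r_squared(actual, predicted), rmse(actual, predicted), mre(actual, predicted))
```

Note that Equation (12) is undefined when an observed value is zero, so a practical implementation would need to decide how to treat such observations.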

3. Results and Discussion

3.1. Digester Performance

During the implementation of the two-phase biodigester, one of the main technical challenges was controlling gas leaks and internal pressure, which required multiple structural adjustments and caused delays in the early stages of operation. The installed manometers failed, likely due to H₂S-induced corrosion, while the valves progressively stiffened with use, in some cases requiring replacement. In addition, certain PVC weld joints developed fractures, compromising system integrity. The manual stirring mechanism did not improve process performance but was associated with gas leaks, leading to its deactivation and reinforcement of seals. After several corrective interventions, continuous and functional operation was achieved for the duration of the experiment.

3.1.1. IoT Monitoring Advantages

Although low-cost systems might be unsuited for long-term use or high-precision data collection, these biodigesters provide a practical alternative for experimental applications at laboratory scale when resources are limited. The estimated cost of assembling a two-phase biodigester without IoT monitoring was 300 USD, while the addition of a digital monitoring system increased the total to 420 USD per unit. Although some sensors required replacement during operation, the IoT system performed reliably, providing consistent readings comparable to manual instruments. Its implementation enabled real-time, continuous data acquisition, which was essential for detecting operational variations, making timely adjustments, and improving process understanding.

Figure 3 (D1F1) and Figure 4 (D1F2) show the 24 h profiles of pH, temperature, and CH₄ concentration during different operational stages. The days were randomly selected from both phases to provide representative snapshots of system behavior under varying conditions. In all cases a consistent inverse relationship was observed, where pH increased during early morning hours as ambient temperature declined and then decreased progressively as temperature rose throughout the day. In D1F1 this trend was stronger and more reproducible, with correlation coefficients between −0.84 and −0.94, while in D1F2 the association was weaker (between −0.49 and −0.66) and accompanied by larger fluctuations in methane concentrations. These results highlight the direct effect of ambient thermal oscillations on microbial activity, especially on pH dynamics.
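A correlation coefficient of the kind reported above can be computed from one day of hourly readings as sketched below. The text does not state which coefficient was used, so Pearson's r is an assumption here, and the 24 synthetic readings are invented purely to exercise the function.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Synthetic 24 h profile: temperature peaks in the afternoon, pH moves inversely.
temperature_c = [26.0 + 4.0 * math.sin(math.pi * (h - 6) / 12) for h in range(24)]
ph = [7.0 - 0.05 * (t - 26.0) + 0.01 * ((h % 3) - 1)
      for h, t in enumerate(temperature_c)]

r = pearson(temperature_c, ph)  # strongly negative for this synthetic profile
```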

This behavior may be linked to phases of microbial adaptation or to the accumulation of internal self-regulation mechanisms. As temperature dropped during the night and metabolic activity slowed, nitrogenous compounds likely continued decomposing and releasing ammonia (NH₃). This ammonia could react with dissolved CO₂, which is more soluble at low temperatures, to form ammonium bicarbonate (NH₄HCO₃). The resulting increase in alkalinity buffered pH variations, preventing excessive acidification and contributing to system resilience [48,49].

Such dynamics are rarely captured in conventional laboratory-scale digesters where data are typically restricted to discrete measurements. In this case the use of IoT-based continuous monitoring provided hourly resolution and made it possible to identify fine-scale responses such as the rise in pH at lower nighttime temperatures that would otherwise remain unnoticed. This approach delivers a more realistic picture of system performance under environmental conditions and emphasizes the value of continuous monitoring strategies for interpreting anaerobic digestion behavior beyond the limits of punctual sampling.

3.1.2. Stabilization of Anaerobic Codigestion

Achieving a steady-state is a critical milestone in AD, as it reflects the convergence of operational and microbial conditions that support sustained methanogenic activity [50,51]. In system D1, specific time segments were identified where pH, temperature, and methane concentration aligned within the functional ranges expected for AD. Figure 5 integrates these three variables across the full operational period, providing a comprehensive view of the transitions from unstable to stabilized phases. This visualization not only highlights the progression of the system under different operating conditions but also illustrates how corrective measures and phase-specific dynamics gradually steered the reactor toward a functional equilibrium.

In Figure 5a, a shift becomes evident after day 120, when pH consistently remained above 6.5 while methane rose steadily, surpassing 8000 ppm by day 132. These conditions coincided with stable temperatures between 30 and 32 °C, an optimal mesophilic range that favors methanogenic activity [35,42]. The segmentation into D1F1 and D1F2 reveals the influence of each phase under a combined HRT of 27 days. During early mixtures (Mix1–Mix2), high OLR and the absence of adapted inoculum produced irregular methane signals dominated by acidogenesis. From Mix3 onwards, corrective measures such as alkalinization promoted higher CH₄ concentrations, although fluctuations beyond ±15% prevented these periods from being classified as steady. Toward Mix4 and Mix5, adjustments including higher inoculum input and recirculation from D1F2 likely increased microbial density and functional diversity, progressively creating conditions more favorable to methanogenesis [52,53].

Figure 5b highlights two segments where the system approached steady-state behavior. The first, between days 126 and 136, was characterized by pH values between 6.5 and 7.5, methane above 8000 ppm, and stable temperatures (30–32 °C), all within functional ranges and with fluctuations below ±15%. A temporary decline in methane around day 137 disrupted this stage, but from day 139 onwards the system recovered, initiating a second steady segment that persisted until the end of the experiment.
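The steady-state criteria quoted above can be written as a simple check. This is a hedged sketch: the ranges (pH 6.5 to 7.5, methane above 8000 ppm, 30 to 32 °C) are taken from the text, but the interpretation of the ±15% limit as a deviation from the segment mean is an assumption, and the function name and sample values are hypothetical.

```python
def is_steady(ph, ch4_ppm, temp_c, max_fluctuation=0.15):
    """Return True when a segment meets the ranges and fluctuation limit in the text."""

    def within_band(values):
        # Assumption: fluctuation is measured relative to the segment mean.
        mean = sum(values) / len(values)
        return all(abs(v - mean) <= max_fluctuation * abs(mean) for v in values)

    in_range = (
        all(6.5 <= v <= 7.5 for v in ph)
        and all(v > 8000 for v in ch4_ppm)
        and all(30.0 <= v <= 32.0 for v in temp_c)
    )
    return in_range and all(within_band(s) for s in (ph, ch4_ppm, temp_c))


# Hypothetical daily values for two short segments.
steady = is_steady([6.8, 7.0, 7.1], [8200, 8600, 8400], [30.5, 31.0, 31.4])
dip = is_steady([6.8, 7.0, 7.1], [8200, 6000, 8400], [30.5, 31.0, 31.4])
```

In this example the first segment passes, while the second fails because of the methane dip, mirroring the interruption described around day 137.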

Overall, the convergence of pH, CH₄, and temperature demonstrates that the steady state achieved in D1 was not the result of a single correction but the cumulative effect of progressive adjustments. This sequence of changes allowed the system to transition from acidogenic predominance to a consolidated methanogenic phase, representing a functional stabilization consistent with the goals of two-phase AD [53].

3.2. Volatile Fatty Acids (VFAs) and Metagenomic Analysis

Once stabilization was established from IoT-monitored variables, VFAs and microbiota were analyzed during the transition toward optimal operation (days 97–154, Mix 4 and Mix 5). Thirteen samples were taken, and missing data were imputed through MICE to ensure continuity.

Figure 6 shows that VFA concentrations dropped sharply after day 100, from above 14,000 mg/L to 5800 mg/L, before oscillating between 5000 and 7000 mg/L. This decline reflects the mitigation of acidogenic pressure and the progressive adjustment of the microbial community, setting conditions increasingly suitable for methanogenesis and linking metabolite dynamics with microbial responses in the path toward functional balance [54,55].

The individual analysis of VFAs confirmed that the steep decline after day 100 was largely driven by the reduction in acetic and butyric acids, both tied to early fermentative pathways [54]. Between days 103 and 118, however, propionic acid and medium-chain carboxylates (C5–C8), including caproic, heptanoic, and valeric, increased notably, reaching averages of 140 mg/L, 218 mg/L, and 820 mg/L, respectively [56]. These less common metabolites are typically linked to secondary fermentation processes or to transitional phases of temporary accumulation [57,58,59]. Their persistence, together with measurable levels of propanol (134 mg/L) and the absence of ethanol, suggests a fermentative stage dominated by chain-elongation routes, potentially hindered by propanol’s inhibitory effect on methanogenic consortia [57,60,61]. After day 119, these acids gradually declined (e.g., caproic down to 109 mg/L, valeric to 758 mg/L), while ethanol reappeared (20 mg/L) and propanol rose to 191 mg/L. This pattern may indicate that, despite higher alcohol concentrations, microbes capable of degrading medium-chain acids regained activity, backing a functional shift toward methanogenesis [62,63].

Regarding microbial analysis, a total of 1,815,465 high-quality reads were obtained, with an average of 201,718 ± 103,945 reads per sample. Rarefaction analysis based on the Shannon index showed that sequencing depth was adequate to capture most of the bacterial diversity across samples, as shown in Figure 7.

The microbial dynamics derived from phylum-level aggregate counts (Supplementary Table S1), as depicted in Figure 8, may offer insights into the associations between dominant phyla, methane concentrations, and the VFA profiles presented in Figure 5 and Figure 6, respectively. Between days 97 and 154, Firmicutes remained the prevailing group, averaging 49,235 reads (64.2% of the total), underscoring its central role in the early stages of the process, particularly in hydrolysis and acidogenesis [7]. This activity likely promoted the production of fermentative precursors, consistent with the elevated concentrations of acetic, propionic, and butyric acids recorded at the beginning of this interval [54]. Along with Firmicutes, Bacteroidetes (15%) and Actinobacteria (7.5%) contributed to medium-chain fatty acids such as valeric and caproic during the accumulation phase (days 103–118) [57,61]. This functional diversity points to a bacterial consortium engaged in degrading complex polymers and extending fermentative pathways, buffering intermediates before methanogenic activity resumed [64,65]. Toward the end, Firmicutes declined while Euryarchaeota increased to 1.5%, coinciding with reduced VFAs and steadier methane, suggesting activation of acetoclastic and hydrogenotrophic routes [7,66].

Minor groups might have played complementary roles, with Planctomycetes (3.9%) likely coupling sulfide oxidation to methanogenesis, Proteobacteria (1.9%) contributing to propionate and acetate turnover, and Synergistetes (1.9%) participating in syntrophic H₂ transfer [7,67]. Unexpectedly, Verrucomicrobia appeared in the community profile, a phylum typically restricted to volcanic habitats dominated by acidophilic methanotrophs [68]. These bacteria can oxidize methane as their main substrate and, to a lesser extent, hydrogen, carbon dioxide, ammonium and hydrogen sulfide, functioning as natural biofilters in extreme ecosystems. Their presence in a mesophilic anaerobic digester is unusual and may reflect residual inoculum or localized microredox niches rather than an active role in methanogenesis [68,69]. Alongside other low-abundance phyla such as Lentisphaerae, Candidatus Saccharibacteria, and Parcubacteria, their detection expands the taxonomic spectrum and raises questions about potential ecological roles still unexplored in anaerobic bioenergy systems [70]. Many of these groups remain unresolved at the species level, even after advanced genomic assembly, forming part of the so-called microbial dark matter. This hidden fraction highlights one of the major challenges in deciphering the functional complexity of anaerobic microbiomes [71,72].

3.3. Multiple Linear Regression (MLR)

The modeling phase was based on Supplementary Table S2. Table 7 shows the coefficients $C_{i}$ obtained by solving for the vector $\underline{x}$ with Equation (4).

This method of finding the approximation coefficients is known by some authors as inverse modeling and can be considered a multilinear regression. In fact, this inverse modeling approach is equivalent to the least squares method applied to multiple vectors. Once the vector $\underline{x}$ was obtained, Equation (1) was applied to generate the approximation curve. Figure 9 shows the resulting approximation.

3.3.1. Data Prioritization

While Figure 9 shows the overall approximation result, it is important to determine the contribution of each fatty acid, microorganism, or operating condition to the $CH_{4}$ production. As previously mentioned, a direct comparison of the $C_{i}$ coefficients can be misleading, so it is important to perform a variable weighting process using Equation (6). The values obtained from this process are shown in Table 8 and plotted in Figure 10.

To define how many variables are needed to achieve a good fit without a significant loss of precision, two limits were established, as can be seen in Figure 11: an $R^{2}$ coefficient greater than 0.9 and an MRE lower than 15%. Based on Table 8 and Figure 10, the 12 variables with the highest values, or greatest impact on $CH_{4}$ production, were selected. They were $C_{10}^{*}$, $C_{12}^{*}$, $C_{17}^{*}$, $C_{13}^{*}$, $C_{15}^{*}$, $C_{11}^{*}$, $C_{20}^{*}$, $C_{14}^{*}$, $C_{9}^{*}$, $C_{16}^{*}$, $C_{19}^{*}$, and $C_{18}^{*}$. According to Table 5, these correspond to Bacteroidetes, Proteobacteria, Verrucomicrobia, Planctomycetes, Spirochaetes, Actinobacteria, Armatimonadetes, Synergistetes, Firmicutes, Euryarchaeota, Tenericutes, and Cloacimonetes, respectively.
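The selection step amounts to ranking the standardized coefficients and keeping the largest ones. The sketch below assumes that the ranking uses the absolute value of each standardized coefficient, which the text does not state explicitly. The function name and the example values are hypothetical and are not the entries of Table 8.

```python
def top_k_variables(standardized, k=12):
    """Return the k variable names with the largest absolute standardized coefficient."""
    ranked = sorted(standardized, key=lambda name: abs(standardized[name]), reverse=True)
    return ranked[:k]


# Made-up standardized coefficients, keyed by coefficient label.
example = {"C9": 0.8, "C10": -2.4, "C1": 0.05, "C12": 1.9, "C3": -0.1}
selected = top_k_variables(example, k=3)
```

In practice the value of k would be chosen, as described above, as the smallest subset that still keeps $R^{2}$ above 0.9 and the MRE below 15%.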

The predictive weight that the model assigned to the identified phyla is substantiated by their established functional roles in AD. Key bacterial groups such as Bacteroidetes, Proteobacteria, Actinobacteria, and Spirochaetes are recognized for their indispensable roles in the hydrolysis and acidogenesis stages. These phyla collaborate to break down complex organic matter into direct precursors for methanogenesis, including acetate and $H_{2}$ [44,73]. As expected, the phylum Euryarchaeota, which encompasses all known methanogens, was a fundamental predictor, being responsible for the final conversion to methane [74,75]. Notably, the model also highlighted the importance of Verrucomicrobia, a finding supported by research demonstrating that members of this phylum can actively degrade complex polysaccharides like xylan under anaerobic conditions, positioning them as specialized primary degraders [76]. Altogether, these results suggest that the model captures the complex microbial network that drives methane production, from the initial decomposers to the terminal methanogens.

Thus, Equations (1) and (2) can be rewritten in terms of the 12 variables that were found to be most important according to the weighting performed. This leads to Equation (13):

\begin{aligned} (CH_{4})_{\mathrm{WeightedApprox}} = {} & C_{10} (m_{10}) + C_{12} (m_{12}) + C_{17} (m_{17}) + C_{13} (m_{13}) + C_{15} (m_{15}) \\ & + C_{11} (m_{11}) + C_{20} (m_{20}) + C_{14} (m_{14}) + C_{9} (m_{9}) + C_{16} (m_{16}) \\ & + C_{19} (m_{19}) + C_{18} (m_{18}) \end{aligned}

(13)

In its matrix form, Equation (13) can be expressed as follows:

\begin{bmatrix} (CH_{4})_{1} \\ (CH_{4})_{2} \\ \vdots \\ (CH_{4})_{n} \end{bmatrix} = \begin{bmatrix} (m_{10})_{1} & (m_{12})_{1} & \dots & (m_{18})_{1} \\ (m_{10})_{2} & (m_{12})_{2} & \dots & (m_{18})_{2} \\ \vdots & \vdots & \ddots & \vdots \\ (m_{10})_{n} & (m_{12})_{n} & \dots & (m_{18})_{n} \end{bmatrix} \begin{bmatrix} C_{10} \\ C_{12} \\ \vdots \\ C_{18} \end{bmatrix}

(14)

Table 9 shows the new values found for the approximation coefficients, calculated using Equation (13) in conjunction with Equation (4). These yield the weighted approximation shown in Figure 12.

Meanwhile, the MRE for the approximation with Equation (1) is around 12.59%, while for the weighted approximation with Equation (13), it is 14.94%. Finally, the errors evaluated by the RMSE are below 450 ppm, which, considering the scale of Figure 5, is within an acceptable range. This approach yielded a simplified, dynamic model for predicting methane ($CH_{4}$) production in an anaerobic digestion process. The key achievement was the reduction of a complex system of 31 variables to a robust predictive model based on only the 12 most influential factors, without a significant loss of precision.

Table 10 compares the metrics for the two fits discussed previously. While the general behavior is replicated by both curves, the $R^{2}$ coefficient for the approximation curve using Equation (1) is 0.989, whereas for the weighted approximation from Equation (13) it is 0.979.

The primary finding of this work is the overwhelming importance of microbial populations as indicators of $CH_{4}$ concentration compared to VFAs and operational parameters. The variable weighting analysis revealed that the 12 most significant variables were exclusively microorganisms, with groups like Bacteroidetes and Proteobacteria showing the highest importance scores. This suggests that, within the context of this study, the state of the microbial community is a more direct and powerful predictor of methanogenic activity than the concentration of intermediate substrates (VFAs) or the operational conditions measured. While VFAs are essential for methanogenesis, their concentrations can be transient. In contrast, the abundance of specific microbial groups likely represents the metabolic potential of the system, making them more robust indicators for modeling purposes.

3.3.2. Predictive Model Development

The simplification of the model from 31 variables to 12 demonstrates the practical value of the feature selection process. The weighted approximation model, using only the selected microorganisms, achieved an $R^{2}$ of 0.979, a negligible decrease from the 0.989 $R^{2}$ of the full model. By focusing only on the most critical microbial indicators, laboratory testing and data analysis efforts can be substantially reduced while still maintaining a high degree of accuracy. The slight increase in MRE and RMSE is an acceptable trade-off for the considerable reduction in model complexity.

In this context, using Equation (13), it was possible to develop a model that predicts the components of the vector $\underline{\hat{x}}$ and, subsequently, the behavior of the $\underline{\hat{CH_{4}}}$ production by applying Equations (8) and (9), respectively. This was performed by using 70% of the dataset to train the model and the remaining 30% to test its performance. Thus, Figure 13 shows how, by applying Equation (8), it is possible to obtain a prediction for the behavior of each of the coefficients for the weighted variables $\underline{\hat{x}}$ (originally listed in Table 9).
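For a time-ordered process, a 70/30 partition is commonly made chronologically so that the test portion always lies after the training portion. The text does not say how the split was drawn, so the chronological cut below is an assumption, and the function name and day numbers are hypothetical.

```python
def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows into an earlier training part and a later test part."""
    cut = int(round(len(rows) * train_fraction))
    return rows[:cut], rows[cut:]


# Hypothetical example: 20 consecutive daily records.
days = list(range(97, 117))
train, test = chronological_split(days)
```

Keeping the order intact avoids training on observations that occur after the ones being predicted, which matters for a one-step-ahead model.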

By applying Equation (9), the prediction for $\hat{CH_{4}}$ was obtained, as shown in Figure 14. This was derived from the coefficients $\underline{\hat{x}}$ obtained via Equation (8). The analysis of the data from the training and prediction curves is recorded in Table 11. The $R^{2}$ fit values show a strong correlation (greater than 0.9), and the MRE values are considerably low (less than 7%), with a comparatively low RMSE for the prediction.

The development of a dynamic predictive model using a moving-window approach combined with Ridge Regression proved to be highly effective. This strategy was designed to capture the evolutionary nature of the biological process and to prevent the overfitting that can occur with small data windows [77]. On the test data, the final predictive model achieved a coefficient of determination ($R^{2}$) of 0.920 and an MRE of 6.50%. This result indicates that the model not only fits the training data well but also generalizes effectively to make accurate short-term predictions on unseen data. The use of Ridge Regression ($λ = 2$) was crucial in stabilizing the coefficients and ensuring the model’s robustness.

Several studies have also attempted to predict CH₄ concentration in AD systems using different modeling techniques (Table 12). These works vary notably in terms of model complexity, data requirements, and analytical focus.

For instance, some approaches achieve high accuracy with computationally intensive “black-box” models, such as a Multilayer Perceptron (MLP) neural network optimized with metaheuristics like the Evaporation-Rate Water Cycle Algorithm (ERWCA) [50]. In contrast, other studies leverage massive, high-frequency Supervisory Control and Data Acquisition (SCADA) data [78], concluding that microbial inputs are unnecessary for their predictions. While research using algorithms like Random Forest (RF) confirms that combining genomic and operational data improves accuracy [44], the present study demonstrates that the microbial community structure alone can be the primary predictive driver. The MLR model developed here distinguishes itself by prioritizing a reduced set of key microbial predictors within a simpler, more interpretable framework.

This comparison underscores the value of an approach that provides a practical and computationally efficient solution, particularly for systems where microbial dynamics, rather than extensive operational data, are the main drivers of performance. The strength of this proposed model lies in its focus on microbial data. While other approaches successfully use operational data alone such as high-frequency SCADA feeds or MSW loads [30], this study demonstrates that a biological signature can be effectively employed to predict methane concentration patterns in AD. This highlights the crucial role of metagenomics in uncovering the core drivers of anaerobic digestion, enabling accurate predictions by focusing directly on the process biology.

To provide a clear summary of the study outcomes, Figure 15 visually compares the performance of the developed model variations.

The initial fitting models, the Full Approx using all 31 variables and the Weighted Approx using the 12 most important phyla, both achieved an outstanding fit on the training data ($R^{2} > 0.97$). This confirms that the variable reduction step successfully produced a simpler model without a significant loss of explanatory power.

More importantly, the predictive model generalized well when evaluated on the unseen test dataset. Although its $R^{2}$ of 0.92 is slightly lower than that of the fitting models, it achieved an MRE of only 6.50%, roughly half the 12.59% error of the full approximation and less than half the 14.94% error of the weighted approximation. This combination of a high $R^{2}$ and a substantially lower MRE on test data supports the robustness of the model and confirms the effectiveness of the proposed approach.

The novelty of this work lies in integrating MC data into an MLR framework, a significant departure from conventional modeling strategies. While previous MLR models have successfully predicted biogas production, they have predominantly relied on operational and physicochemical variables. Furthermore, studies that do incorporate genomic data often turn to complex, computationally intensive “black-box” models like Random Forests or neural networks.

This research bridges that gap by demonstrating that a simple, interpretable MLR model can achieve high predictive accuracy by prioritizing MC data, specifically the abundance of key phyla, over traditional inputs. This approach was validated on a scarcely explored substrate mixture of PM and CD, which are significant agro-industrial residues in regions like Valle del Cauca, Colombia. The model’s utility is further enhanced by its synergy with low-cost IoT monitoring systems, which offer a practical solution for real-time data acquisition in decentralized settings. By enabling accurate methane prediction with reasonable computational demand, this work presents an accessible and robust tool for optimizing biogas management in the rural and resource-limited contexts where it is most needed.

Considering the analysis was conducted over a limited timeframe within a single AD process, the outcomes provide a solid foundation for evaluating the predictive relationship between microbial composition and methane concentration. Expanding the evaluation to longer periods and incorporating diverse feedstocks, reactor designs, and operational conditions would support a more comprehensive assessment of the model’s versatility.

In parallel, investigating the metabolic contributions of the identified microorganisms may yield additional insight that enriches the statistical perspective. Integrating the model with real-time monitoring of microbial populations, volatile fatty acid profiles, and key process variables could facilitate adaptive management strategies and enhance the efficiency of AD systems.

4. Conclusions

This study successfully developed an MLR model to predict methane concentration in anaerobic codigestion using integrated microbial and operational data. The model demonstrated high predictive accuracy (R² = 0.92, MRE = 6.50%) while requiring only 12 key predictors, substantially reducing complexity compared to the initial 31-variable set. Among the relevant findings, the identification of Verrucomicrobia as a significant predictor was particularly noteworthy, as this phylum is typically associated with extreme environments rather than mesophilic digesters, suggesting previously unrecognized ecological adaptations. The overwhelming dominance of microbial indicators over conventional process parameters highlights the critical importance of community dynamics in driving methanogenic performance. Furthermore, the moving-window approach with Ridge regularization effectively captured the system’s biological evolution while maintaining robustness against overfitting. This modeling approach demonstrates significant potential for practical implementation in rural and resource-limited settings, offering a viable method for methane prediction without sophisticated computational requirements.

Future work should focus on validating this model across diverse reactor configurations and feedstock types to assess its generalizability. Additionally, developing cost-effective molecular monitoring tools for the identified key microbial groups could enable real-time implementation of this predictive approach in practical applications.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/bioengineering12111133/s1, Table S1: Phylum level aggregate counts with no imputed data; Table S2: Cleaned database for initial modeling.

Author Contributions

Conceptualization, I.O. and I.R.; methodology, I.O., I.R. and D.C.; software, I.O., I.R. and D.C.; validation, I.O., I.R. and L.M.F.-P.; formal analysis, I.O., I.R. and D.C.; investigation, I.O. and I.R.; resources, I.O. and I.R.; data curation, I.O., I.R. and D.C.; writing—original draft preparation, I.O. and I.R.; writing—review and editing, I.O., I.R. and L.M.F.-P.; visualization, I.O., I.R. and D.C.; supervision, L.M.F.-P.; project administration, I.O. and L.M.F.-P.; funding acquisition, I.O., I.R. and L.M.F.-P. All authors have read and agreed to the published version of the manuscript.

Funding

The author(s) declare that financial support was received for the research, authorship and publication of this article. This research has been funded by Dirección General de Investigaciones of Universidad Santiago de Cali under call No. DGI-01-2025.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author(s).

Acknowledgments

The authors thank the Universidad Autónoma de Occidente and the Universidad Santiago de Cali for their invaluable support. This research has been funded by Dirección General de Investigaciones of Universidad Santiago de Cali under call No. DGI-01-2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Asociación Porkcolombia. Crecimiento Real, Estable y Continuo, Distintivo de La Porcicultura Colombiana. Available online: https://porkcolombia.co/wp-content/uploads/2024/06/REVISTA-PORKCOLOMBIA-ED-275-DIGITAL.pdf (accessed on 21 January 2025).
2. Sidartha, Z.; Mendoza, J.C.; Gonzalez, L.S.; Kaiser, F.L.; Gebauer, A. Guía de Biogás Para El Sector Porcícola En Colombia. J. Chem. Inf. Model. 2020, 53, 1689–1699.
3. Jiang, H.; Qin, Y.; Gadow, S.I.; Li, Y.Y. The Performance and Kinetic Characterization of the Three Metabolic Reactions in the Thermophilic Hydrogen and Acidic Fermentation of Cassava Residue. Int. J. Hydrogen Energy 2017, 42, 2868–2877.
4. Fasheun, D.O.; Sant’Ana da Silva, A.; Teixeira, R.S.S.; Santana Ferreira-Leitão, V. Enhancing Methane Production from Cassava Starch: The Potential of Extrusion Pretreatment in Single-Stage and Two-Stage Anaerobic Digestion. Fuel 2024, 366, 131406.
5. Tavera-Ruiz, C.; Martí-Herrero, J.; Mendieta, O.; Jaimes-Estévez, J.; Gauthier-Maradei, P.; Azimov, U.; Escalante, H.; Castro, L. Current Understanding and Perspectives on Anaerobic Digestion in Developing Countries: Colombia Case Study. Renew. Sustain. Energy Rev. 2023, 173, 113097.
6. Patel, S.K.S.; Das, D.; Kim, S.C.; Cho, B.K.; Kalia, V.C.; Lee, J.K. Integrating Strategies for Sustainable Conversion of Waste Biomass into Dark-Fermentative Hydrogen and Value-Added Products. Renew. Sustain. Energy Rev. 2021, 150, 111491.
7. Ostos, I.; Flórez-Pardo, L.M.; Camargo, C. A Metagenomic Approach to Demystify the Anaerobic Digestion Black Box and Achieve Higher Biogas Yield: A Review. Front. Microbiol. 2024, 15, 1437098.
8. Yan, M.; Shi, Z.; Zhang, X.; Lin, X.; Sun, Y.; Cheng, X.; Tian, H.; Li, Y. Decipher Syntrophies and Adaptive Response towards Enhancing Conversion of Propionate to Methane under Psychrophilic Condition. Water Res. 2025, 274, 123143.
9. Su, K.; Li, L.; Wang, Q.; Cao, R. A Review on the Interspecies Electron Transfer of Methane Production in Anaerobic Digestion System. Fermentation 2023, 9, 467.
10. Uddin, M.M.; Wright, M.M. Anaerobic Digestion Fundamentals, Challenges, and Technological Advances. Phys. Sci. Rev. 2023, 8, 2819–2837.
11. Akindolire, M.A.; Rama, H.; Roopnarain, A. Psychrophilic Anaerobic Digestion: A Critical Evaluation of Microorganisms and Enzymes to Drive the Process. Renew. Sustain. Energy Rev. 2022, 161, 112394.
12. Rajlakshmi; Jadhav, D.A.; Dutta, S.; Sherpa, K.C.; Jayaswal, K.; Saravanabhupathy, S.; Mohanty, K.T.; Banerjee, R.; Kumar, J.; Rajak, R.C. Co-Digestion Processes of Waste: Status and Perspective. In Bio-Based Materials and Waste for Energy Generation and Resource Management; Elsevier: Amsterdam, The Netherlands, 2023; pp. 207–241.
13. TG, I.; Haq, I.; Kalamdhad, A.S. Factors Affecting Anaerobic Digestion for Biogas Production: A Review. In Advanced Organic Waste Management; Elsevier: Amsterdam, The Netherlands, 2022; pp. 223–233.
14. IICA; RedBioLAC. Estado Actual de la Biodigestión en América Latina y El Caribe; RedBioLAC: San José, Costa Rica, 2023; ISBN 978-92-9273-030-7.
15. Liu, Y.; Watanabe, R.; Li, Q.; Luo, Y.; Tsuzuki, N.; Qin, Y.; Li, Y.Y. Changes in Methanogenic Performance and Microbial Community during Gradual Transition from Co-Digestion with Food Waste to Mono-Digestion of Rice Straw. Bioresour. Technol. 2025, 419, 132072.
16. Enokida, C.H.; Tapparo, D.C.; Antes, F.G.; Radis Steinmetz, R.L.; Magrini, F.E.; Sophiatti, I.V.M.; Paesi, S.; Kunz, A. Anaerobic Codigestion of Livestock Manure and Agro-Industrial Waste in a CSTR Reactor: Operational Aspects, Digestate Characteristics, and Microbial Community Dynamics. Renew. Energy 2025, 238, 121865.
17. Kegl, T.; Torres Jiménez, E.; Kegl, B.; Kovač Kralj, A.; Kegl, M. Modeling and Optimization of Anaerobic Digestion Technology: Current Status and Future Outlook. Prog. Energy Combust. Sci. 2025, 106, 101199.
18. Orlando, M.Q.; Borja, V.M. Pretreatment of Animal Manure Biomass to Improve Biogas Production: A Review. Energies 2020, 13, 3573.
19. Rahmani, A.M.; Gahlot, P.; Moustakas, K.; Kazmi, A.; Ojha, C.S.P.; Tyagi, V.K. Pretreatment Methods to Enhance Solubilization and Anaerobic Biodegradability of Lignocellulosic Biomass (Wheat Straw): Progress and Challenges. Fuel 2022, 319, 123726.
20. Martins, R.M. Avaliação Da Co-Digestão Anaeróbia Como Alternativa para Valorização de Resíduos do Processamento de Mandioca e Esterco de Gado Leiteiro; Universidade Federal de Ouro Preto: Ouro Preto, Brazil, 2022.
21. Chen, X.; He, H.; Zhu, N.; Jia, P.; Tian, J.; Song, W.; Cui, Z.; Yuan, X. Food Waste Impact on Dry Anaerobic Digestion of Straw in a Novel Reactor: Biogas Yield, Stability, and Hydrolysis-Methanogenesis Processes. Bioresour. Technol. 2024, 406, 131023.
22. Elsayed, A.; Laqa Kakar, F.; Mustafa Abdelrahman, A.; Ahmed, N.; AlSayed, A.; Sherif Zagloul, M.; Muller, C.; Bell, K.Y.; Santoro, D.; Norton, J.; et al. Enhancing Anaerobic Digestion Efficiency: A Comprehensive Review on Innovative Intensification Technologies. Energy Convers. Manag. 2024, 320, 118979.
23. Kalamaras, S.D.; Tsitsimpikou, M.A.; Tzenos, C.A.; Lithourgidis, A.A.; Pitsikoglou, D.S.; Kotsopoulos, T.A. A Low-Cost IoT System Based on the ESP32 Microcontroller for Efficient Monitoring of a Pilot Anaerobic Biogas Reactor. Appl. Sci. 2025, 15, 34.
24. Mabrouki, J.; Azrour, M.; Fattah, G.; Dhiba, D.; El Hajjaji, S. Intelligent Monitoring System for Biogas Detection Based on the Internet of Things: Mohammedia, Morocco City Landfill Case. Big Data Min. Anal. 2021, 4, 10–17.
25. Gupta, A. Making Biogas SMART using Internet of Things (IOT). In Proceedings of the 2020 4th International Conference on Electronics, Materials Engineering & Nano-Technology (IEMENTech), Kolkata, India, 2–4 October 2020; pp. 1–4.
26. The MathWorks. ThingSpeak for Students and Educators. Available online: https://thingspeak.mathworks.com/pages/education (accessed on 2 March 2025).
27. Song, X.; Jiao, Q.; Lian, Y.; Xie, B.; Liu, J. Unveiling the Mechanism of Hunger Stress Enhancing Methane Production from Low-Strength Wastewater: Insight from Organic Conversion and Microbial Metabolism Perspective. Chem. Eng. J. 2025, 511, 162170.
28. Rossi, E.; Pecorini, I.; Iannelli, R. Multilinear Regression Model for Biogas Production Prediction from Dry Anaerobic Digestion of OFMSW. Sustainability 2022, 14, 4393.
29. Abubakar, U.A.; Lemar, G.S.; Bello, A.-A.D.; Ishaq, A.; Dandajeh, A.A.; Jagun, Z.T.; Houmsi, M.R. Evaluation of traditional and machine learning approaches for modeling volatile fatty acid concentrations in anaerobic digestion of sludge: Potential and challenges. Environ. Sci. Pollut. Res. 2024, 1–14.
30. Sato, Y.; Hasemi, K.; Machikawa, K.; Kinjo, H.; Yashiro, N.; Iimura, Y.; Aoki, H.; Habe, H. Assessing Microbial Stability and Predicting Biogas Production in Full-Scale Thermophilic Dry Methane Fermentation of Municipal Solid Waste. Bioresour. Technol. 2024, 402, 130766.
31. APHA; AWWA; WEF. 4500-N NITROGEN. Standard Methods For the Examination of Water and Wastewater. Available online: https://www.standardmethods.org/doi/10.2105/SMWW.2882.086 (accessed on 2 April 2025).
32. APHA; AWWA; WEF. 2540 SOLIDS. Standard Methods for the Examination of Water and Wastewater. Available online: https://www.standardmethods.org/doi/10.2105/SMWW.2882.030 (accessed on 2 April 2025).
33. APHA; AWWA; WEF. 5310 TOTAL ORGANIC CARBON. Standard Methods for the Examination of Water and Wastewater. Available online: https://www.standardmethods.org/doi/10.2105/SMWW.2882.104 (accessed on 2 April 2025).
34. Wilson, S. The MICE Algorithm. Available online: https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html (accessed on 4 May 2025).
35. Gaspari, M.; Ghiotto, G.; Centurion, V.B.; Kotsopoulos, T.; Santinello, D.; Campanaro, S.; Treu, L.; Kougias, P.G. Decoding Microbial Responses to Ammonia Shock Loads in Biogas Reactors through Metagenomics and Metatranscriptomics. Environ. Sci. Technol. 2024, 58, 591–602.
36. APHA; AWWA; WEF. 5560 ORGANIC AND VOLATILE ACIDS. Standard Methods for the Examination of Water and Wastewater; Braun-Howland, E., Ed.; APHA Press: Washington, DC, USA, 2023.
37. Sigma-Aldrich. VOLATILE FREE ACID MIX. CRM46975. Available online: https://www.sigmaaldrich.com/CO/en/sds/supelco/crm46975?userType=undefined (accessed on 21 September 2025).
38. Omega Bio-tek, Inc. E.Z.N.A.® Universal Pathogen DNA Kit. Available online: https://omegabiotek.com/product/pathogen-dna-rna-extraction-e-z-n-a-universal-pathogen/?cn-reloaded=1 (accessed on 8 April 2025).
39. Callahan, B.J.; McMurdie, P.J.; Rosen, M.J.; Han, A.W.; Johnson, A.J.A.; Holmes, S.P. DADA2: High-Resolution Sample Inference from Illumina Amplicon Data. Nat. Methods 2016, 13, 581–583.
40. Pruesse, E.; Quast, C.; Knittel, K.; Fuchs, B.M.; Ludwig, W.; Peplies, J.; Glockner, F.O. SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB. Nucleic Acids Res. 2007, 35, 7188–7196.
41. Zhang, L.; Loh, K.C.; Lim, J.W.; Zhang, J. Bioinformatics Analysis of Metagenomics Data of Biogas-Producing Microbial Communities in Anaerobic Digesters: A Review. Renew. Sustain. Energy Rev. 2019, 100, 110–126.
42. Navarro-Díaz, M.; Aparicio-Trejo, V.; Valdez-Vazquez, I.; Carrillo-Reyes, J.; Avitia, M.; Escalante, A.E. Levels of Microbial Diversity Affect the Stability and Function of Dark Fermentation Bioreactors. Front. Ind. Microbiol. 2024, 2, 1386726.
43. Xu, R.Z.; Cao, J.S.; Wu, Y.; Wang, S.N.; Luo, J.Y.; Chen, X.; Fang, F. An Integrated Approach Based on Virtual Data Augmentation and Deep Neural Networks Modeling for VFA Production Prediction in Anaerobic Fermentation Process. Water Res. 2020, 184, 116103.
44. Long, F.; Wang, L.; Cai, W.; Lesnik, K.; Liu, H. Predicting the Performance of Anaerobic Digestion Using Machine Learning Algorithms and Genomic Data. Water Res. 2021, 199, 117182.
45. Santinello, D.; Zampieri, G.; Agostini, S.; Müller, B.; Favaro, L.; Treu, L.; Campanaro, S. Process Stability in Anaerobic Digestion: Unveiling Microbial Signatures of Full-Scale Reactor Performance. Chem. Eng. J. 2024, 497, 154962.
46. Lu, D.; Li, M.; Nie, E.; Guo, R.; Fu, S. Microbial Volatile Organic Compounds Produced during the Anaerobic Digestion Process Can Serve as Potential Indicators of Microbial Community Stability. Water Res. 2025, 277, 123286.
47. Mumuni, A.; Mumuni, F. Data Augmentation: A Comprehensive Survey of Modern Approaches. Array 2022, 16, 100258.
48. Schnürer, A.; Jarvis, A. Microbiology of the Biogas Process; Swedish University of Agricultural Sciences: Uppsala, Sweden, 2018; ISBN 9789157695468.
49. Velásquez, M.E.; Rincón, J.M. Estimación Del Potencial de Conversión a Biogás de La Biomasa En Colombia y Su Aprovechamiento; Informe Unidad de Planeación Minero Energética UPME—Universidad Nacional de Colombia: Bogota, Colombia, 2018.
50. Sun, Y.; Dai, H.L.; Moayedi, H.; Nguyen Le, B.; Muhammad Adnan, R. Predicting Steady-State Biogas Production from Waste Using Advanced Machine Learning-Metaheuristic Approaches. Fuel 2024, 355, 129493.
51. De Jonge, N.; Moset, V.; Møller, H.B.; Nielsen, J.L. Microbial Population Dynamics in Continuous Anaerobic Digester Systems during Start up, Stable Conditions and Recovery after Starvation. Bioresour. Technol. 2017, 232, 313–320.
52. Chen, H.; Zhang, W.; Wu, J.; Chen, X.; Liu, R.; Han, Y.; Xiao, B.; Yu, Z.; Peng, Y. Improving Two-Stage Thermophilic-Mesophilic Anaerobic Co-Digestion of Swine Manure and Rice Straw by Digestate Recirculation. Chemosphere 2021, 274, 129787.
53. Wu, C.; Huang, Q.; Yu, M.; Ren, Y.; Wang, Q.; Sakai, K. Effects of Digestate Recirculation on a Two-Stage Anaerobic Digestion System, Particularly Focusing on Metabolite Correlation Analysis. Bioresour. Technol. 2018, 251, 40–48.
54. Ma, G.; Chen, Y.; Ndegwa, P. Association between Methane Yield and Microbiota Abundance in the Anaerobic Digestion Process: A Meta-Regression. Renew. Sustain. Energy Rev. 2021, 135, 110212.
55. Raja Ram, N.; Nikhil, G.N. A Critical Review on Sustainable Biogas Production with Focus on Microbial-Substrate Interactions: Bottlenecks and Breakthroughs. Bioresour. Technol. Rep. 2022, 19, 101170.
56. Kim, H.; Jeon, B.S.; Sang, B.-I. An Efficient New Process for the Selective Production of Odd-Chain Carboxylic Acids by Simple Carbon Elongation Using Megasphaera Hexanoica. Sci. Rep. 2019, 9, 11999.
57. Sun, J.; Zhang, L.; Loh, K.C. Review and Perspectives of Enhanced Volatile Fatty Acids Production from Acidogenic Fermentation of Lignocellulosic Biomass Wastes. Bioresour. Bioprocess 2021, 8, 68.
58. Harirchi, S.; Wainaina, S.; Sar, T.; Nojoumi, S.A.; Parchami, M.; Parchami, M.; Varjani, S.; Khanal, S.K.; Wong, J.; Awasthi, M.K.; et al. Microbiological Insights into Anaerobic Digestion for Biogas, Hydrogen or Volatile Fatty Acids (VFAs): A Review. Bioengineered 2022, 13, 6521–6557.
59. Franke-Whittle, I.H.; Walter, A.; Ebner, C.; Insam, H. Investigation into the Effect of High Concentrations of Volatile Fatty Acids in Anaerobic Digestion on Methanogenic Communities. Waste Manag. 2014, 34, 2080–2089.
60. Lonkar, S.; Fu, Z.; Holtzapple, M. Optimum Alcohol Concentration for Chain Elongation in Mixed-culture Fermentation of Cellulosic Substrate. Biotechnol. Bioeng. 2016, 113, 2597–2604.
61. Duber, A.; Zagrodnik, R.; Gutowska, N.; Brodowski, F.; Dąbrowski, T.; Dąbrowski, S.; Łężyk, M.; Oleskowicz-Popiel, P. Single- vs. Two-Stage Fermentation of an Organic Fraction of Municipal Solid Waste for an Enhanced Medium Chain Carboxylic Acids Production—The Impact of Different PH and Temperature. Bioresour. Technol. 2025, 415, 131697.
62. Zakaria, B.S.; Guo, H.; Kim, Y.; Dhar, B.R. Molecular Biology and Modeling Analysis Reveal Functional Roles of Propionate to Acetate Ratios on Microbial Syntrophy and Competition in Electro-Assisted Anaerobic Digestion. Water Res. 2022, 216, 118335.
63. Jabłoński, S.J.; Łukaszewicz, M. Mathematical Modelling of Methanogenic Reactor Start-up: Importance of Volatile Fatty Acids Degrading Population. Bioresour. Technol. 2014, 174, 74–80.
64. Basile, A.; Campanaro, S.; Kovalovszki, A.; Zampieri, G.; Rossi, A.; Angelidaki, I.; Valle, G.; Treu, L. Revealing Metabolic Mechanisms of Interaction in the Anaerobic Digestion Microbiome by Flux Balance Analysis. Metab. Eng. 2020, 62, 138–149.
65. Iglesias-Iglesias, R.; Campanaro, S.; Treu, L.; Kennes, C.; Veiga, M.C. Valorization of Sewage Sludge for Volatile Fatty Acids Production and Role of Microbiome on Acidogenic Fermentation. Bioresour. Technol. 2019, 291, 121817.
66. Yun, Y.M.; Sung, S.; Kang, S.; Kim, M.S.; Kim, D.H. Enrichment of Hydrogenotrophic Methanogens by Means of Gas Recycle and Its Application in Biogas Upgrading. Energy 2017, 135, 294–302.
67. Niya, B.; Yaakoubi, K.; Beraich, F.Z.; Arouch, M.; Meftah Kadmiri, I. Current Status and Future Developments of Assessing Microbiome Composition and Dynamics in Anaerobic Digestion Systems Using Metagenomic Approaches. Heliyon 2024, 10, e28221.
68. Schmitz, R.A.; Peeters, S.H.; Versantvoort, W.; Picone, N.; Pol, A.; Jetten, M.S.M.; Op Den Camp, H.J.M. Verrucomicrobial Methanotrophs: Ecophysiology of Metabolically Versatile Acidophiles. FEMS Microbiol. Rev. 2021, 45, fuab007.
69. Dunfield, P.F.; Yuryev, A.; Senin, P.; Smirnova, A.V.; Stott, M.B.; Hou, S.; Ly, B.; Saw, J.H.; Zhou, Z.; Ren, Y.; et al. Methane Oxidation by an Extremely Acidophilic Bacterium of the Phylum Verrucomicrobia. Nature 2007, 450, 879–882.
70. Basile, A.; Zampieri, G.; Kovalovszki, A.; Karkaria, B.; Treu, L.; Patil, K.R.; Campanaro, S. Modelling of Microbial Interactions in Anaerobic Digestion: From Black to Glass Box. Curr. Opin. Microbiol. 2023, 75, 102363.
71. Zhang, L.; Loh, K.C.; Lim, J.W.; Zhang, J. Uncovering Microbiome Adaptations in a Full-Scale Biogas Plant: Insights from MAG-Centric Metagenomics and Metaproteomics. Microorganisms 2023, 11, 2412.
72. Zhang, X.; Wang, Y.; Jiao, P.; Zhang, M.; Deng, Y.; Jiang, C.; Liu, X.-W.; Lou, L.; Li, Y.; Zhang, X.-X.; et al. Microbiome-Functionality in Anaerobic Digesters: A Critical Review. Water Res. 2023, 249, 120891.
73. Venkiteshwaran, K.; Bocher, B.; Maki, J.; Zitomer, D. Relating Anaerobic Digestion Microbial Community and Process Function: Supplementary Issue: Water Microbiology. Microbiol. Insights 2015, 8, 37–44.
74. Giongo, A.; Granada, C.E.; Borges, L.G.A.; Pereira, L.M.; Trindade, F.J.; Mattiello, S.P.; Oliveira, R.R.; Shubeita, F.M.; Lovato, A.; Marcon, C.; et al. Microbial communities in anaerobic digesters change over time and sampling depth. Braz. J. Microbiol. 2020, 51, 1177–1190.
75. Volmer, J.G.; McRae, H.; Morrison, M. The Evolving Role of Methanogenic Archaea in Mammalian Microbiomes. Front. Microbiol. 2023, 14, 1268451.
76. Rakitin, A.L.; Kulichevskaya, I.S.; Beletsky, A.V.; Mardanov, A.V.; Dedysh, S.N.; Ravin, N.V. Verrucomicrobia of the Family Chthoniobacteraceae Participate in Xylan Degradation in Boreal Peat Soils. Microorganisms 2024, 12, 2271.
77. Radočaj, D.; Jurišić, M. Comparative Evaluation of Ensemble Machine Learning Models for Methane Production from Anaerobic Digestion. Fermentation 2025, 11, 130.
78. Schroer, H.W.; Just, C.L. Feature Engineering and Supervised Machine Learning to Forecast Biogas Production during Municipal Anaerobic Co-Digestion. ACS ES T Eng. 2024, 4, 660–672.

Figure 1. Two-phase digester made of PVC.

Figure 2. Original (58 points) vs. interpolated (1000 points) time-series data for CH₄ concentration.

Figure 3. D1F1 operation during 24 h on random days.

Figure 4. D1F2 operation during 24 h on random days.

Figure 5. (a) Mixture feeding over time; (b) steady-state identification.

Figure 6. VFA and alcohol concentration in D1.

Figure 7. Alpha rarefaction curves of the alpha diversity (Shannon) of the 16S rRNA gene.

Figure 8. Temporal dynamics of microbial phyla in D1.

Figure 9. Approximation result.

Figure 10. Relative variable importance with defined cut-off.

Figure 11. (a) R², (b) MRE and (c) RMSE vs. Number of included variables in the approximation.

Figure 12. Weighted approximation.

Figure 13. Behavior of the coefficients associated with the weighted variables.

Figure 14. Methane prediction and training data.

Figure 15. Performance comparison of the developed models.
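
Figure 2 contrasts the 58 measured CH₄ points with a 1000-point interpolated series. The interpolation scheme is not restated in the caption, so the following is only a minimal sketch that assumes piecewise-linear interpolation onto a uniform time grid. The function name `resample_linear` and the sample values are hypothetical and are not taken from the study.

```python
from bisect import bisect_right


def resample_linear(t, y, n_points):
    """Piecewise-linear resampling of (t, y) onto n_points evenly spaced times.

    Assumes t is strictly increasing and n_points >= 2.
    """
    if len(t) != len(y) or len(t) < 2:
        raise ValueError("need at least two (t, y) pairs of equal length")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    step = (t[-1] - t[0]) / (n_points - 1)
    grid = [t[0] + k * step for k in range(n_points)]
    grid[-1] = t[-1]  # avoid floating-point overshoot at the right edge
    dense = []
    for g in grid:
        # Index of the segment [t[j], t[j + 1]] that contains g.
        j = min(max(bisect_right(t, g) - 1, 0), len(t) - 2)
        frac = (g - t[j]) / (t[j + 1] - t[j])
        dense.append(y[j] + frac * (y[j + 1] - y[j]))
    return grid, dense


# Hypothetical sparse series (day, CH4 in ppm); not data from the study.
days = [0, 10, 20, 40]
ch4 = [100.0, 300.0, 200.0, 600.0]
grid, dense = resample_linear(days, ch4, 1000)
```

Linear interpolation never leaves the range of the measured values, which makes it a conservative choice for illustration; a spline or another scheme would produce a smoother but different curve.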

Table 1. Substrate characterization.

Substrate	C (%)	N (%)	C:N	Moisture (%)	TS (%)	VS (%)	VS/TS (%)	FS (%)
I	32.9	3.2	10.3	98.0	2.2	1.3	60.8	0.9
CD	43.1	0.9	45.9	11.0	89.0	85.4	96.0	3.6
PM	12.9	1.9	7.0	72.0	28.0	21.0	75.0	7.0

I: inoculum; CD: cassava dregs; PM: pig manure; C: carbon content; N: nitrogen content; TS: total solids; VS: volatile solids; FS: fixed solids.

Table 2. Reactor feeding regime.

D1F1
Mix	OLR (gVS/L·day)	Mix Load (g)	I (g)	PM (g)	CD (g)	H₂O Added (g)	Daily Load (g)	%TS	C:N	%I	%PM	%CD	HRT (day)	Period (day)	pH Treatment
I	12.40	329	329			152	481	10%	10.3	100%	0%	0%	5	0–5
1	7.67	48		35	13	166	214	10%	21.6	0%	73%	27%	11	6–27	Lime
2	7.66	54		43	11	164	218	10%	18.3	0%	80%	20%	11	28–49
3	6.60	48		39	9	162	210	9%	17.5	0%	81%	19%	11	50–89	NaOH
4	5.90	46	14	25	7	165	211	8%	15.7	30%	54%	15%	11	90–118	NaOH
5	5.87	71	35	26	10	136	207	8%	20.7	50%	36%	14%	11	119–161
D1F2
I	12.40	439	439			202	641	10%	10.3	100%	0%	0%	5	0–5
1	5.75	48		35	13	166	214	10%	21.6	0%	73%	27%	15	6–38
2	5.74	54		43	11	164	218	10%	18.3	0%	80%	20%	15	39–60
3	4.95	48		39	9	162	210	9%	17.5	0%	81%	19%	15	61–100
4	4.41	46	14	25	7	165	211	8%	15.7	30%	54%	15%	15	101–127
5	4.40	71	35	26	10	136	207	8%	20.7	50%	36%	14%	15	128–161

I = inoculum; PM = pig manure; CD = cassava dregs; OLR = organic loading rate; TS = total solids; HRT = hydraulic retention time.
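
The C:N and %TS columns of Table 2 can be approximated from the substrate characterization in Table 1. The sketch below assumes that the carbon and nitrogen percentages apply to the dry (TS) fraction of each substrate and that the added water contributes no solids. This is an assumption about how the values were derived, not a calculation stated by the authors, and the helper name `mix_properties` is hypothetical.

```python
# Substrate properties from Table 1: carbon %, nitrogen %, total solids %.
SUBSTRATES = {
    "I": {"C": 32.9, "N": 3.2, "TS": 2.2},
    "CD": {"C": 43.1, "N": 0.9, "TS": 89.0},
    "PM": {"C": 12.9, "N": 1.9, "TS": 28.0},
}


def mix_properties(wet_grams, water_grams=0.0):
    """Return (C:N ratio, %TS of the diluted daily load) for a wet-mass recipe.

    Assumption: C and N percentages refer to the dry (TS) fraction.
    """
    dry = {k: g * SUBSTRATES[k]["TS"] / 100 for k, g in wet_grams.items()}
    carbon = sum(d * SUBSTRATES[k]["C"] / 100 for k, d in dry.items())
    nitrogen = sum(d * SUBSTRATES[k]["N"] / 100 for k, d in dry.items())
    total_wet = sum(wet_grams.values()) + water_grams
    ts_percent = 100 * sum(dry.values()) / total_wet
    return carbon / nitrogen, ts_percent


# Mix 1 of Table 2: 35 g PM + 13 g CD + 166 g water.
cn, ts = mix_properties({"PM": 35, "CD": 13}, water_grams=166)
print(f"C:N = {cn:.1f}, TS = {ts:.1f}%")
```

For mix 1 this gives a C:N close to 21.5 and about 10% TS, near the tabulated 21.6 and 10%. Mixes 2, 3 and 5 land within about 0.15 of the tabulated C:N under the same assumption, while mix 4 does not reproduce, so the authors' exact calculation may differ.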

Table 3. Data processing.

D1F1
Steps	pH	T (°C)	CH₄	Total	%
All data	694,110	694,110	694,110	2,082,330	100%
Day-hour	694,110	694,110	694,110	2,082,330	100%
Filters	635,953	635,953	635,953	1,907,859	92%
MICE	694,110	694,110	694,110	2,082,330	100%
Data per hour	2893	2893	2893	8679	0.42%
Data per day	152	152	152	456	0.02%
D1F2
All data	573,215	573,215	573,215	1,719,645	100%
Day-hour	573,215	573,215	573,215	1,719,645	100%
Filters	518,897	518,897	518,897	1,556,691	91%
MICE	573,215	573,215	573,215	1,719,645	100%
Data per hour	2389	2389	2389	7167	0.42%
Data per day	147	147	147	441	0.03%
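
Table 3 follows the record counts through filtering, MICE imputation, and aggregation to hourly and daily values. The sketch below illustrates only the filtering and aggregation steps, assuming a simple plausible-range filter and arithmetic-mean aggregation. Both are assumptions, the MICE imputation step is not reproduced, and the function names and readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def filter_range(records, low, high):
    """Keep (timestamp, value) pairs whose value lies in [low, high]."""
    return [(ts, v) for ts, v in records if low <= v <= high]


def aggregate(records, fmt):
    """Average values over buckets named by the strftime format `fmt`."""
    buckets = defaultdict(list)
    for ts, v in records:
        buckets[ts.strftime(fmt)].append(v)
    return {key: mean(vals) for key, vals in sorted(buckets.items())}


# Hypothetical pH readings; 14.9 stands in for a sensor glitch.
raw = [
    (datetime(2024, 1, 1, 0, 0, 5), 6.8),
    (datetime(2024, 1, 1, 0, 0, 10), 7.0),
    (datetime(2024, 1, 1, 0, 0, 15), 14.9),
    (datetime(2024, 1, 1, 1, 0, 0), 7.2),
    (datetime(2024, 1, 2, 0, 0, 0), 7.4),
]
clean = filter_range(raw, low=0.0, high=14.0)
hourly = aggregate(clean, "%Y-%m-%d %H")  # one value per hour
daily = aggregate(clean, "%Y-%m-%d")      # one value per day
retained = 100 * len(clean) / len(raw)    # share of records kept by the filter
```

The `retained` figure corresponds to the "%" column of the Filters row; for D1F1, 635,953 of 694,110 records per variable is about 92%.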

Table 4. List of associated suffixes for VFA (a).

Acetic	1	Caproic	5
Propionic	2	Heptanoic	6
Butyric	3	Ethanol	7
Valeric	4	Propanol	8

Table 5. List of associated suffixes for microorganisms (m).

Firmicutes	9	Tenericutes	19
Bacteroidetes	10	Armatimonadetes	20
Actinobacteria	11	Cyanobacteria Chloroplast	21
Proteobacteria	12	Acidobacteria	22
Planctomycetes	13	Lentisphaerae	23
Synergistetes	14	BRC1	24
Spirochaetes	15	Candidatus Saccharibacteria	25
Euryarchaeota	16	Parcubacteria	26
Verrucomicrobia	17	Chloroflexi	27
Cloacimonetes	18

Table 6. List of associated suffixes for operating conditions (p).

phi	28	pho	30
Ti	29	To	31

Table 7. Approximation coefficients.

$C_{1}$	−0.25	$C_{17}$	152.94
$C_{2}$	0.72	$C_{18}$	−154.01
$C_{3}$	4.86	$C_{19}$	−112.07
$C_{4}$	18.20	$C_{20}$	165.65
$C_{5}$	−61.33	$C_{21}$	60.94
$C_{6}$	−22.60	$C_{22}$	15.47
$C_{7}$	−12.08	$C_{23}$	−302.56
$C_{8}$	14.33	$C_{24}$	159.10
$C_{9}$	7.39	$C_{25}$	−450.57
$C_{10}$	−18.99	$C_{26}$	81.69
$C_{11}$	−21.21	$C_{27}$	116.88
$C_{12}$	−161.69	$C_{28}$	−1562.95
$C_{13}$	−58.79	$C_{29}$	6.72
$C_{14}$	67.72	$C_{30}$	68.50
$C_{15}$	123.17	$C_{31}$	242.54
$C_{16}$	84.58

Table 8. Weighting parameters $C_{i}^{*}$.

$C_{10}^{*}$	66.48	$C_{27}^{*}$	1.79
$C_{12}^{*}$	50.51	$C_{3}^{*}$	1.53
$C_{17}^{*}$	38.48	$C_{4}^{*}$	1.43
$C_{13}^{*}$	36.87	$C_{26}^{*}$	0.88
$C_{15}^{*}$	36.06	$C_{8}^{*}$	0.67
$C_{11}^{*}$	34.35	$C_{6}^{*}$	0.57
$C_{20}^{*}$	30.92	$C_{22}^{*}$	0.37
$C_{14}^{*}$	28.37	$C_{21}^{*}$	0.34
$C_{9}^{*}$	15.61	$C_{1}^{*}$	0.17
$C_{16}^{*}$	14.37	$C_{28}^{*}$	0.15
$C_{19}^{*}$	12.25	$C_{2}^{*}$	0.13
$C_{18}^{*}$	7.87	$C_{31}^{*}$	0.11
$C_{23}^{*}$	7.63	$C_{7}^{*}$	0.11
$C_{25}^{*}$	4.04	$C_{30}^{*}$	0.01
$C_{24}^{*}$	2.30	$C_{29}^{*}$	0.01
$C_{5}^{*}$	2.21

Table 9. New weighted approximation coefficients.

$C_{10}$	−4.64	$C_{20}$	63.94
$C_{12}$	−42.40	$C_{14}$	16.20
$C_{17}$	−94.76	$C_{9}$	1.70
$C_{13}$	−7.66	$C_{16}$	−7.10
$C_{15}$	64.88	$C_{19}$	−22.13
$C_{11}$	−4.02	$C_{18}$	251.39
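
The twelve coefficients in Table 9 correspond to the twelve largest weighting parameters in Table 8, all of which carry microbial suffixes from Table 5. As a minimal sketch of that selection, the code below ranks the Table 8 values and keeps the top twelve. The cut-off of twelve is inferred here from the length of Table 9, and the definition of $C_{i}^{*}$ itself is not reproduced; the function name `top_k` is hypothetical.

```python
# Weighting parameters C_i* transcribed from Table 8 (suffix -> value).
WEIGHTS = {
    10: 66.48, 12: 50.51, 17: 38.48, 13: 36.87, 15: 36.06, 11: 34.35,
    20: 30.92, 14: 28.37, 9: 15.61, 16: 14.37, 19: 12.25, 18: 7.87,
    23: 7.63, 25: 4.04, 24: 2.30, 5: 2.21, 27: 1.79, 3: 1.53, 4: 1.43,
    26: 0.88, 8: 0.67, 6: 0.57, 22: 0.37, 21: 0.34, 1: 0.17, 28: 0.15,
    2: 0.13, 31: 0.11, 7: 0.11, 30: 0.01, 29: 0.01,
}

# Phylum names for suffixes 9-20, from Table 5.
PHYLA = {
    9: "Firmicutes", 10: "Bacteroidetes", 11: "Actinobacteria",
    12: "Proteobacteria", 13: "Planctomycetes", 14: "Synergistetes",
    15: "Spirochaetes", 16: "Euryarchaeota", 17: "Verrucomicrobia",
    18: "Cloacimonetes", 19: "Tenericutes", 20: "Armatimonadetes",
}

# Suffixes that appear in Table 9.
TABLE_9 = {10, 12, 17, 13, 15, 11, 20, 14, 9, 16, 19, 18}


def top_k(weights, k):
    """Return the k suffixes with the largest weighting parameter."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


selected = top_k(WEIGHTS, 12)
for suffix in selected:
    print(suffix, PHYLA[suffix], WEIGHTS[suffix])
```

Every selected suffix falls in the microbial range 9 to 20, which is consistent with the "12 key microbial phyla" listed for the present work in Table 12.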

Table 10. Precision metrics for approximations and real data.

	$R^{2}$	MRE [%]	RMSE [ppm]
All variables	0.989	12.59	319.94
Weighted approximation	0.979	14.94	435.82

Table 11. Evaluation of the training model and predictive model.

	$R^{2}$	MRE [%]	RMSE [ppm]
Training Fit	0.999	0.35	20.77
Prediction	0.920	6.50	139.84
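
The three metrics reported in Tables 10 and 11 can be computed as in the sketch below. It uses the common textbook definitions: R² as one minus the ratio of residual to total sum of squares, MRE as the mean of absolute errors relative to the observed value, and RMSE as the root of the mean squared error. These definitions are assumed here and may differ in detail from those used in the study; the sample values are hypothetical.

```python
from math import sqrt


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def mean_relative_error(y_true, y_pred):
    """MRE in percent; assumes no observed value is zero."""
    total = sum(abs(a - b) / abs(a) for a, b in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the observations."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Hypothetical CH4 readings in ppm; not data from the study.
observed = [1000.0, 2000.0, 4000.0]
predicted = [1100.0, 1900.0, 4000.0]
print(r_squared(observed, predicted))
print(mean_relative_error(observed, predicted))
print(rmse(observed, predicted))
```

Because MRE divides by the observed value, it weights errors at low methane concentrations more heavily than RMSE does, which is one reason the two metrics can rank models differently.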

Table 12. Performance comparison of predictive models in anaerobic digestion.

Reference	Model Type	Key Predictors	Performance (R² or Error)	Key Finding/Novelty
[44]	ML Algorithms (RF, NNET, etc.)	Operational and genomic data.	RF Accuracy = 0.82	Combines data types but requires more complex models for slightly lower performance.
[78]	Multilayer Perceptron (MLP)	High-resolution SCADA data.	MLP (SCADA): Adj. $R^{2}$ = 0.78	Relies on massive, high-frequency operational data, concluding that microbial input is unnecessary.
[50]	MLP optimized with metaheuristics.	Operational parameters.	ERWCA-MLP (test): $R^{2}$ = 0.93	Achieves high accuracy but requires highly complex, computationally intensive “black-box” models.
[30]	Temperature-adjusted MLR.	MSW load and air temperature.	Adjusted MLR: $R^{2}$ = 0.975	Also uses a simple MLR but succeeds by excluding microbial data due to its stability in that specific system.
Present Work	Dynamic Multiple Linear Regression (MLR).	12 key microbial phyla.	MLR (test): $R^{2}$ = 0.920	Simple, interpretable model demonstrating that microbial data can be a critical predictor.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ostos, I.; Ruiz, I.; Cruz, D.; Flórez-Pardo, L.M. Methane Concentration Prediction in Anaerobic Codigestion Using Multiple Linear Regression with Integrated Microbial and Operational Data. Bioengineering 2025, 12, 1133. https://doi.org/10.3390/bioengineering12111133

