1. Introduction
The Valle del Cauca region is one of Colombia’s most active agro-industrial areas, combining high agricultural productivity with unique ecological richness. The territory is sustained by ecosystems that range from coastal plains to montane forests, which support both biological diversity and productive capacity. It ranks as the third-largest producer and consumer of pork in the country, with a reported output of 88,105 tons in 2023, equivalent to 15.6% of national production, and an average pig population exceeding 396,000 animals [
1]. This sector generates large volumes of pig manure (PM) that require appropriate handling to prevent environmental and public health risks. Another productive activity with growing regional relevance is cassava cultivation, which covered approximately 564 hectares in 2020, yielding a total of 9888 tons of fresh roots [
2]. During starch extraction, each kilogram of cassava generates about 0.2 kg of starch, 0.65 kg of fibrous residue (cassava dregs (CD)), and between five and seven liters of wastewater [
3,
4]. Based on these ratios, the estimated annual generation of by-products in the region reaches nearly 6427 tons, most of which are not currently valorized.
To address the increasing accumulation of organic residues from pig farming and cassava processing, anaerobic digestion (AD) has been promoted in rural areas of Valle del Cauca as a strategy for energy recovery and waste management. In these settings, one-phase tubular biodigesters are commonly employed due to their affordable construction, ease of installation and minimal infrastructure requirements, making them particularly attractive to smallholder producers [
5].
AD is a biologically mediated process capable of metabolizing up to 95% of the organic matter [
6]. It proceeds through four main sequential stages, each driven by specific microbial groups. In the hydrolysis phase, hydrolytic bacteria degrade complex macromolecules such as carbohydrates, proteins and lipids into soluble monomers. During acidogenesis, these compounds are converted by fermentative bacteria into volatile fatty acids (VFAs), alcohols, hydrogen and carbon dioxide. In acetogenesis, acetogenic microorganisms convert these intermediates into acetate, along with additional hydrogen and carbon dioxide. Finally, in methanogenesis, archaea utilize acetate, hydrogen and carbon dioxide to generate methane as the principal end product [
7]. While each stage performs a distinct role, the overall efficiency of this multistep pathway depends on the synchronized activity of these microbial groups, where syntrophic cross-feeding and interspecies H₂/formate transfer channel intermediates toward effective substrate valorization and stable methane formation [
6,
8,
9]. Beyond its biological complexity, AD offers important advantages, including the provision of reliable baseload renewable energy that is independent of weather conditions, the achievement of high energy yields per unit area once stabilized, and the generation of multiple energy outputs such as biomethane, hydromethane, electricity, heat, and biohydrogen [
7].
For the process to remain stable and efficient, environmental conditions such as pH and temperature must be kept within optimal ranges, typically between 6.5 and 7.5 for pH and 30 to 38 °C under mesophilic conditions [
10,
11]. In addition, maintaining a C:N ratio between 20:1 and 30:1 is considered ideal for AD, as it ensures sufficient nitrogen for microbial growth without leading to ammonia inhibition or carbon limitation [
12,
13]. However, most rural systems lack monitoring tools and operate through empirical practices, without clear understanding of internal conditions or microbial dynamics [
5,
14]. This limitation frequently leads to process imbalance, reduced performance and early system failure.
To overcome the performance limitations of conventional digesters, several strategies have been developed to improve substrate biodegradability and enhance biogas production. Among them, mechanical pre-treatments, codigestion, and multiphase configurations have proven to be particularly effective in increasing system efficiency [
15,
16,
17]. Mechanical pre-treatments have proven effective in enhancing the hydrolysis of lignocellulosic substrates by reducing particle size and fiber crystallinity, thus increasing surface area and enzymatic accessibility [
15,
18]. Depending on specific conditions, methane production improvements of 16% to 99% have been reported with mechanical treatments [
19]. These results highlight the potential of simple mechanical treatments to enhance biodegradability and biogas productivity, especially during the hydrolysis and acidogenesis phases, which are often rate-limited in solid waste digestion.
Codigestion has emerged as a robust strategy to address the nutrient imbalances and low biodegradability often associated with single-substrate digestion. By combining complementary feedstocks, this approach improves the carbon to nitrogen (C:N) ratio, dilutes inhibitors, and stimulates microbial activity, allowing for higher energy yields [
16]. For instance, mixtures containing 66% PM, 16% cassava pulp, and 16% bagasse have been reported to achieve higher methane yields than mixtures with a high bagasse content, which led to pH imbalances and process failure [
20]. Likewise, biogas production efficiency and system stability have been reported for the codigestion of food waste and corn straw at a hydraulic retention time (HRT) of 25 days, showing that codigestion notably enhanced the efficiency of the hydrolysis and acidogenesis stages, with the highest anaerobic biodegradability (85.7%) obtained when the food waste content was set at 60% [
21]. These improvements are attributed to enhanced microbial synergy and substrate availability, which accelerate volatile solids degradation.
Multiphase AD systems have been developed to address the limitations of single-stage configurations by creating distinct operational environments for each metabolic phase [
17]. In two-phase systems, the acidogenic and methanogenic stages are physically separated, which enables more efficient substrate conversion, greater resilience to organic shocks, and better pH control [
4,
17]. This structural decoupling has led to increases in methane yields, improved volatile solids removal, and significant reductions in HRT without compromising performance [
4]. Although three-phase systems further refine process compartmentalization by isolating hydrolysis, acidogenesis, and methanogenesis, they often entail higher operational complexity, energy consumption, and maintenance requirements [
22]. These drawbacks have limited their scalability, particularly in low-resource contexts. Consequently, two-phase systems represent a practical balance between performance enhancement and technical feasibility, making them a more accessible alternative for decentralized applications.
Among the strategies developed to improve AD performance, the integration of real-time monitoring systems has become increasingly relevant for enhancing process oversight and operational efficiency [
23]. Key variables such as pH, temperature, and methane concentration can be used to infer the internal state of the reactor and anticipate potential imbalances. The use of cost-effective IoT platforms such as ESP32 microcontrollers coupled with sensors has proven suitable for real-time tracking, achieving deviations below 2% for CH₄ and 1.7% for pH when compared to laboratory-grade methods [
23,
24]. Systems incorporating the MQ-4 sensor (200–10,000 ppm CH₄) and platforms like ThingSpeak facilitate continuous data acquisition, cloud visualization, and automatic alerts, offering a practical solution to reduce manual intervention and increase system reliability [
23,
25,
26].
In parallel, greater attention should be given to the microbial community (MC) involved in AD, as it is rarely considered in routine operation despite being responsible for driving the entire process. Recent studies have highlighted that variations in microbial structure are strongly influenced by substrate type, operational parameters such as temperature and organic loading rates (OLR), and reactor configuration. However, most operational strategies still rely exclusively on physicochemical parameters, overlooking microbial signals that often precede system imbalances [
7]. With this in mind, evidence from studies on hunger stress has demonstrated that shifts in microbial communities under adverse conditions provide valuable insights into process behavior and system dynamics, underscoring the importance of integrating microbial data into process understanding to clarify how structural and functional changes within the community influence methane levels [
27].
Despite the central role of MC in AD, multiple linear regression (MLR) models have traditionally been developed using operational variables that capture external system conditions, that is, parameters that are directly measurable or predefined during setup, while MC have often been treated as secondary inputs or excluded altogether. For instance, recent studies have used MLR to predict specific methane production from dry AD of the organic fraction of municipal solid waste in pilot-scale plug-flow reactors. Six significant, mostly operational predictors (volatile solids (VS), OLR, HRT, C/N ratio, lignin content, and VFA) were prioritized via Pearson correlation and principal component analysis (PCA). Simple regression showed low performance (R² = 0.3), while the full MLR reached R² = 0.91. A reduced model with four uncorrelated variables (VS, OLR, C/N ratio, lignin content) maintained strong accuracy (R² = 0.87) with fewer inputs [28]. Similarly, MLR has been applied to predict VFA concentrations in AD of primary and secondary sludge using operational and physicochemical inputs. The model achieved R² values above 0.85 in several scenarios, offering high interpretability and low computational demand. Although less accurate than leading ensemble methods, MLR remains suitable for applications that require clear interpretation of variable influence [29].
Unlike models based solely on operational parameters, recent full-scale work in thermophilic dry methane systems showed that MC remained stable, with Methanoculleus and syntrophic acetate oxidizers dominating throughout the process. This stability enabled the development of an adjusted MLR model, which achieved high predictive accuracy (R² = 0.97) and outperformed gradient boosting approaches, highlighting the importance of linking microbial consistency with operational data for reliable large-scale biogas prediction [30].
Building on emerging evidence supporting the integration of microbial data into statistical modeling, this study aims to develop a predictive model for methane concentration based on a set of measurable variables, including VFAs, microbial populations, and operational parameters. It evaluates the potential of MLR to predict methane concentrations in a low-cost, two-phase anaerobic digester treating PM and CD at laboratory scale. This work aligns with Sustainable Development Goal 7 by promoting accessible tools for energy generation from organic waste.
The article is structured into four main sections. The Introduction outlines the context of AD in the Valle del Cauca region, highlighting environmental and operational challenges from agro-industrial organic waste, reviewing strategies to improve biogas systems, and emphasizing the need to integrate microbial data into predictive models. The Materials and Methods section details the system setup, monitoring, sequencing, and the MLR approach used for variable selection and model construction. The Results and Discussion section presents the modeling outcomes, identifies relevant predictors, and interprets their contribution to system behavior. The Conclusions section summarizes the key findings and future perspectives for incorporating microbiota into data-driven frameworks for sustainable energy transitions.
2. Materials and Methods
This section first describes the dataset and the preprocessing steps undertaken. Subsequently, it details the initial linear modeling approach, followed by a feature selection process based on variable weighting to derive a simplified, yet robust, model. Finally, it presents the development of an adaptive predictive model using a moving window technique combined with a regularization method to prevent overfitting.
2.1. Substrate Selection
The substrates used in this study were fresh PM and CD. The inoculum, obtained from the same source as the manure, was included to ensure microbial compatibility with the feedstock. Both were collected at a small-scale pig farm located in the municipality of Florida, Valle del Cauca, where approximately 20 pigs are kept under semi-intensive conditions. Animal pens are washed twice daily, and the resulting wastewater, rich in organic matter, drains into a static open-air tank that served as the inoculum source. Fresh manure was manually collected after excretion using sanitized tools. CD were obtained from a medium-sized cassava starch-processing facility located in the rural area of Mandiba, Santander de Quilichao, Cauca. Processing nearly eight tons of cassava per day, the plant generates over two tons of lignocellulosic residue each week. This material was delivered in dry, milled form.
All samples were stored at 4 °C until physicochemical characterization, which included proximate analysis by gravimetric methods and determination of the carbon-to-nitrogen (C:N) ratio via high-temperature combustion. These procedures followed the Standard Methods for the Examination of Water and Wastewater (APHA, AWWA, WEF), ensuring analytical consistency as summarized in
Table 1 [
31,
32,
33].
2.2. Experimental Setup
The experimental setup consisted of a two-phase laboratory-scale anaerobic digester designed to operate without integrated control systems (Figure 1). The system was constructed using 110 mm sanitary-grade PVC tubing due to its low cost, durability, and ease of assembly. Phase 1 (D1F1) (3 L) was expected to perform hydrolysis and acidogenesis, while phase 2 (D1F2) (4 L) was expected to support acetogenesis and methanogenesis. Each chamber was operated at 80% of its total volume, 2.4 L in phase 1 and 3.2 L in phase 2, leaving the remaining headspace for biogas accumulation. To enable real-time monitoring, a low-cost IoT module was incorporated into the digester, integrating an Arduino UNO microcontroller with sensors for pH, temperature, and methane concentration. Data were transmitted through a mobile network to the ThingSpeak platform for remote visualization [
26]. This setup allowed continuous monitoring without the need for sophisticated instrumentation.
2.3. Operational Parameters
To establish an active MC, both phases were fed inoculum for five days, until reaching the working volume. The inoculum had a C:N ratio of 10.3 and 2.2% total solids (TS). During start-up, the OLR, estimated with a five-day HRT, was 8.37 gVS/L·day. Thereafter, feeding used a 73:27 blend of PM and CD. The daily feed was 35 g fresh PM and 13 g CD, plus 166 g water to achieve 10% TS (214 g/day total). The theoretical C:N ratio was 21.55. With the defined working volumes, HRTs were 12 days for D1F1 and 15 days for D1F2. Corresponding OLRs were 7.7 and 5.7 gVS/L·day. VS inputs were 18.46 g/day (D1F1) and 18.45 g/day (D1F2). Daily manual feeding with graduated containers and isolation valves ensured accurate dosing and anaerobiosis.
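As a rough cross-check of these operating figures, the sketch below recomputes HRT as working volume divided by daily feed volume and OLR as daily VS input divided by working volume. The slurry density of 1 g/mL and the helper names are assumptions made for illustration; because density and rounding conventions are not stated in the text, the results should only be expected to approximate the reported values.

```python
# Feed composition, working volumes, and VS inputs as reported in the text.
feed_g_per_day = 35 + 13 + 166  # fresh PM + CD + water
working_volume_l = {"D1F1": 2.4, "D1F2": 3.2}
vs_input_g_per_day = {"D1F1": 18.46, "D1F2": 18.45}

# Assumption (not from the source): slurry density of 1 g/mL.
feed_l_per_day = feed_g_per_day / 1000.0


def hrt_days(volume_l, flow_l_per_day):
    """Hydraulic retention time: working volume over daily feed volume."""
    return volume_l / flow_l_per_day


def olr_gvs_per_l_day(vs_g_per_day, volume_l):
    """Organic loading rate: daily volatile solids fed per litre of reactor."""
    return vs_g_per_day / volume_l


for phase, volume in working_volume_l.items():
    hrt = hrt_days(volume, feed_l_per_day)
    olr = olr_gvs_per_l_day(vs_input_g_per_day[phase], volume)
    print(f"{phase}: HRT ~ {hrt:.1f} d, OLR ~ {olr:.2f} gVS/L·day")
```

The same two helpers can be reused to see how a change in water addition or working volume would shift HRT and OLR before a new feed formulation is tried.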
The IoT-instrumented digester (D1) enabled incremental, data-driven feed adjustments in both phases (D1F1, D1F2) using real-time pH, temperature, and methane concentration. These signals guided when to lower the OLR and TS and when to apply temporary pH control, moving the reactors toward consistent operating conditions. Five feed formulations were implemented (
Table 2). In D1F1, pH was briefly corrected with lime and then NaOH to keep it within 6.5–7.5; by mixture 5, recirculated digestate from D1F2 maintained pH without further chemicals. Mixture 4 used inoculum from an anaerobic digester at a university in Colombia treating food waste. Across mixtures, TS was reduced from 10% to 8–9%, OLR decreased from 12.4 gVS/L·day (inoculum step) to 5–6 gVS/L·day, and the C:N ratio increased in the final mixture due to recirculation while the contributions of PM and CD were reduced.
2.4. Steady State
Identifying steady-state periods was essential to build a reliable dataset, define representative operating conditions, and guide downstream variable prioritization and modeling. pH, temperature, and methane concentration were monitored continuously for 161 days (24/7). The IoT system logged three readings per minute for each variable and was routinely cross-checked against bench measurements to validate operational reliability.
Data volume was substantial: D1F1 recorded 694,110 samples per variable and D1F2 recorded 573,215. Processing followed these steps: (1) splitting the timestamp into date and time; (2) validity filtering (e.g., pH 3–12; 10–45 °C; CH₄ within instrument bounds), with out-of-range values set to blank; (3) multivariate imputation by chained equations (MICE) to preserve temporal continuity [34]; (4) resampling to hourly means (2893 rows in D1F1; 2389 in D1F2); and (5) resampling to daily means (152 and 147, respectively), retaining trends while reducing computational load, as shown in Table 3.
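The filtering and resampling steps can be illustrated with a minimal standard-library sketch. The pH and temperature bounds come from the text; the CH₄ bounds are an assumption taken from the MQ-4 range quoted in the Introduction, the records are invented, and the MICE imputation step is deliberately left out (blank values are simply skipped when averaging).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# pH and temperature bounds follow the text; CH4 bounds are an assumption.
BOUNDS = {"ph": (3.0, 12.0), "temp_c": (10.0, 45.0), "ch4_ppm": (200.0, 10000.0)}


def filter_valid(record):
    """Return a copy of the record with out-of-range values set to None."""
    cleaned = {"timestamp": record["timestamp"]}
    for name, (low, high) in BOUNDS.items():
        value = record.get(name)
        cleaned[name] = value if value is not None and low <= value <= high else None
    return cleaned


def resample_mean(records, fmt):
    """Average each variable over buckets defined by a strftime format."""
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        key = rec["timestamp"].strftime(fmt)
        for name in BOUNDS:
            if rec[name] is not None:
                buckets[key][name].append(rec[name])
    return {
        key: {name: (mean(vals[name]) if vals[name] else None) for name in BOUNDS}
        for key, vals in sorted(buckets.items())
    }


# Hypothetical readings (dates and values are illustrative only).
raw = [
    {"timestamp": datetime(2024, 1, 1, 0, 0, 20), "ph": 6.8, "temp_c": 31.0, "ch4_ppm": 900.0},
    {"timestamp": datetime(2024, 1, 1, 0, 0, 40), "ph": 14.2, "temp_c": 31.2, "ch4_ppm": 950.0},
    {"timestamp": datetime(2024, 1, 1, 1, 0, 0), "ph": 7.0, "temp_c": 30.8, "ch4_ppm": 1000.0},
]
clean = [filter_valid(r) for r in raw]
hourly = resample_mean(clean, "%Y-%m-%d %H")
daily = resample_mean(clean, "%Y-%m-%d")
print(hourly)
print(daily)
```

Keeping the filter and the aggregation as separate functions makes it straightforward to slot an imputation step between them.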
Stable windows were then identified via rolling windows using relative standard deviation thresholds (<15%) around moving means for pH, temperature, and methane concentration, with a minimum continuous duration and compliance with predefined operating limits [
35]. D1 showed extended steady windows, typically with pH 6.5–7.5, facilitated by high-frequency data and the ability to adjust operating conditions in real time.
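The rolling-window criterion described above can be sketched as follows. The 15% relative standard deviation limit and the pH 6.5–7.5 range come from the text; the window length, the minimum run length, and the example series are hypothetical choices made only to show the mechanics.

```python
from statistics import mean, pstdev


def steady_mask(values, window, rsd_limit=0.15, low=None, high=None):
    """Flag points whose trailing window has RSD below rsd_limit and stays within [low, high]."""
    mask = []
    for i in range(len(values)):
        if i + 1 < window:
            mask.append(False)  # not enough history yet
            continue
        chunk = values[i + 1 - window : i + 1]
        mu = mean(chunk)
        rsd = pstdev(chunk) / abs(mu) if mu else float("inf")
        in_limits = all(
            (low is None or v >= low) and (high is None or v <= high) for v in chunk
        )
        mask.append(rsd < rsd_limit and in_limits)
    return mask


def steady_runs(mask, min_len):
    """Return (start, end) index pairs of consecutive True runs lasting at least min_len points."""
    runs, start = [], None
    for i, flag in enumerate(mask + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs


# Hypothetical daily pH means.
ph = [5.0, 6.9, 7.0, 7.1, 7.0, 6.9, 9.5, 7.0]
mask = steady_mask(ph, window=3, low=6.5, high=7.5)
runs = steady_runs(mask, min_len=3)
print(runs)
```

For a multivariable criterion, one mask per variable (pH, temperature, methane) can be combined with a logical AND before extracting the runs.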
2.5. VFA Quantification
Samples were collected every three days in 5 mL Eppendorf tubes and stored at −20 °C until analysis. The final selection of samples for analysis was made considering the periods of system stabilization under IoT monitoring and budgetary constraints, prioritizing those most representative of the overall process behavior. Sampling was carried out during the active operation of the digester.
The quantification of VFAs was performed by gas chromatography, following the procedure described in section 5560D of the Standard Methods for the Examination of Water and Wastewater (APHA) [
36], in the laboratory of the Department of Chemical Engineering and Analytical Chemistry at the University of Barcelona. Prior to chromatographic analysis, the samples were centrifuged and filtered through 0.45 µm nylon membranes to remove suspended solids. Each analysis vial contained 1 mL of sample, diluted or not depending on the estimated concentration level, along with 0.1 mL of 15% orthophosphoric acid containing a known concentration of 2-ethylbutyric acid (~500 mg/L) as an internal standard. This compound allowed verification of injection consistency and facilitated calibration of the equipment through the ratio of analyte to standard peak areas.
Analyses were carried out on a Shimadzu GC-2010 Plus (Shimadzu Corporation, Kyoto, Japan) gas chromatograph with a flame ionization detector, using a DB-FFAP capillary column, 30 m × 0.25 mm × 0.25 µm (Agilent Technologies, Santa Clara, CA, USA). The oven temperature program started at 60 °C with a two-minute hold, followed by an increase of 20 °C/min up to 240 °C, maintained for an additional two minutes. The total analysis time was 13 min. The injector (SPL-1) operated at 220 °C in split mode, with a split ratio of 50:1. Helium was used as the carrier gas at a pressure of 42.6 kPa, with a total flow of 233.4 mL/min, a column flow of 8.86 mL/min, and a linear velocity of 60 cm/s. The purge flow was set at 3 mL/min, and the makeup gas flow (nitrogen) at the detector was 10 mL/min. The injection volume was 2 µL, using helium, air, hydrogen, and nitrogen as auxiliary gases.
For equipment calibration, a commercial VFA standard (Volatile Free Acid Mix, CRM46975, Supelco/MilliporeSigma [37]) containing defined concentrations of acetic, propionic, isobutyric, butyric, isovaleric, valeric, isocaproic, caproic (hexanoic), and heptanoic acids was used. Serial dilutions were prepared in 1:1, 1:2, 1:4, 1:8, 1:16, and 1:32 ratios, to which orthophosphoric acid and the internal standard were also added. For alcohol analysis (ethanol, propanol, and butanol), defined-concentration standard solutions were prepared, applying the same dilutions and analytical conditions. This procedure allowed precise and reproducible determination of VFAs in the samples, essential for evaluating the performance of the AD system and its relationship with operating conditions and microbiota.
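The internal-standard calibration described in this section can be illustrated with a short sketch: a straight line is fitted to the analyte-to-internal-standard peak-area ratio across the dilution series, and that line is then inverted to quantify a sample. The stock concentration, the peak areas, and the function names are hypothetical and serve only to show the arithmetic.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def quantify(area_analyte, area_istd, slope, intercept, dilution=1.0):
    """Convert a peak-area ratio into a concentration, correcting for sample dilution."""
    ratio = area_analyte / area_istd
    return (ratio - intercept) / slope * dilution


# Hypothetical halving series (mg/L) mirroring the 1:1 to 1:32 dilutions.
stock_mg_l = 1000.0
conc = [stock_mg_l / 2**k for k in range(6)]
# Hypothetical area ratios; a perfectly linear detector response is assumed.
area_ratio = [0.002 * c for c in conc]

slope, intercept = fit_line(conc, area_ratio)
sample_mg_l = quantify(area_analyte=1200.0, area_istd=1000.0,
                       slope=slope, intercept=intercept, dilution=2.0)
print(f"slope={slope:.4f}, intercept={intercept:.2e}, sample={sample_mg_l:.1f} mg/L")
```

One calibration line per acid would be fitted in practice, since each compound has its own detector response.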
2.6. Metagenomic Analysis
Samples for metagenomic analysis were collected directly from the operating biodigester using 50 mL Falcon tubes. Sampling was performed every three days throughout the process, following the same prioritization criteria used for the quantification of VFAs, focusing on periods of greatest microbiological representativeness and considering the availability of resources. Once collected, samples were immediately frozen at −20 °C and stored until further processing.
To analyze the MC, the Falcon tubes were sent to Omega Bioservices (Norcross, GA, USA) for DNA extraction using the E.Z.N.A.® Universal Pathogen Kit, library preparation, and sequencing of the V3–V4 hypervariable region of the 16S rRNA gene using the primers 341F (CCTACGGGNGGCWGCAG) and 806R (GACTACHVGGGTATCTAATCC). Sequencing was conducted on an Illumina MiSeq platform (Illumina, San Diego, CA, USA) with 300 bp paired-end reads. Illumina reads were then analyzed using the BaseSpace app (version 1.1.3) [38]. Raw sequence data were demultiplexed and then quality filtered, denoised, merged, and cleared of chimeras using DADA2 [39] to generate amplicon sequence variants (ASVs). Taxonomic assignment was conducted using the SILVA database (version 138.2) [40].
To structure the analysis of microbial interactions, a subset of phyla of interest was defined from the general metagenomic dataset, considering the sequencing reads obtained for each taxonomic group. The selection was based on two main criteria. First, the sustained presence of each phylum throughout the monitoring period was evaluated, excluding those with very low or intermittent representation, as their variability would hinder the detection of consistent associations in the relational analysis. Second, functional relevance reported in previous studies on anaerobic digestion was reviewed, prioritizing phyla whose involvement in fermentative, acetogenic, or methanogenic pathways has been extensively documented in similar systems [
7,
41].
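The first selection criterion (sustained presence across the monitoring period) can be expressed as a simple prevalence filter. The sketch below is one possible reading of that criterion: the relative-abundance and prevalence thresholds, the sample labels, and the read counts are all hypothetical, and the second criterion (functional relevance from the literature) is not something code can decide.

```python
def select_phyla(read_counts, min_rel_abundance=0.01, min_prevalence=0.8):
    """Keep phyla reaching min_rel_abundance in at least min_prevalence of the samples.

    read_counts maps sample_id -> {phylum: reads}.
    """
    samples = list(read_counts.values())
    phyla = {p for counts in samples for p in counts}
    selected = []
    for phylum in sorted(phyla):
        hits = 0
        for counts in samples:
            total = sum(counts.values())
            if total and counts.get(phylum, 0) / total >= min_rel_abundance:
                hits += 1
        if hits / len(samples) >= min_prevalence:
            selected.append(phylum)
    return selected


# Invented read counts for three sampling days (not data from the study).
reads = {
    "day_097": {"Firmicutes": 500, "Bacteroidota": 300, "Euryarchaeota": 50, "RarePhylum": 2},
    "day_100": {"Firmicutes": 450, "Bacteroidota": 350, "Euryarchaeota": 40},
    "day_103": {"Firmicutes": 480, "Bacteroidota": 310, "Euryarchaeota": 60, "RarePhylum": 100},
}
kept = select_phyla(reads)
print(kept)
```

Here the intermittent phylum is dropped because it passes the abundance threshold in only one of three samples.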
Once the representative periods were defined, the results from VFA quantification and metagenomic analysis were integrated, extending the characterization to the biochemical and microbiological components of the system. In several cases, the observed patterns were consistent with those reported in the specialized literature, which supported the robustness of the approach. The dataset included operational, biochemical, and microbiological variables [
42,
43,
44,
45,
46].
Since the biochemical and microbiological measurements were less frequent than the operational records, imputation techniques were applied within the selected periods to expand the dataset without distorting the relationships among variables. Methods such as k-nearest neighbors (KNN) imputation, iterative imputation, and MICE were employed [
43,
47]. The analysis focused on the period between days 97 and 154, which, although not representing a fully stabilized phase, shows a trend toward stabilization and coincides with the selected VFA and microbiological samples. This ensured consistency between the experimental data and the operational conditions.
2.7. Preprocessing and Unified Database
Once the representative periods were defined, the results from VFA quantification and metagenomic analysis were incorporated to extend the characterization of the system to its biochemical and microbiological dimensions. The patterns obtained aligned with those reported in specialized literature, reinforcing the validity of the approach [
42,
43,
44,
45,
46]. The unified dataset combined operational, biochemical, and microbiological variables. Because biochemical and microbiological measurements were less frequent than operational records, the MICE imputation method was applied to harmonize the dataset without altering the underlying relationships among variables [
43,
47].
The analysis focused on the period between days 97 and 154, which, while not fully stabilized, displayed a clear trend toward steady performance and coincided with the VFA and microbiological samples selected. This ensured coherence between experimental observations and operational conditions. The resulting dataset comprised daily averages over 58 days, which were further refined through linear interpolation to increase temporal resolution. This process expanded the series to 1000 points, enabling the application of moving window analyses, as illustrated in
Figure 2.
The interpolation was validated for all variables, yielding R² values close to 1 and mean relative error (MRE) values around 0.1%, confirming a high-fidelity representation of the original data.
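A minimal sketch of this densification step is given below, assuming NumPy is available. It interpolates a 58-point daily series onto 1000 evenly spaced points and then checks fidelity by re-sampling the dense series at the original days. The validation procedure shown is only one plausible way to obtain R² and MRE for an interpolated series, and the methane values are synthetic.

```python
import numpy as np


def upsample_linear(days, values, n_points=1000):
    """Linearly interpolate a daily series onto n_points evenly spaced times."""
    dense_days = np.linspace(days[0], days[-1], n_points)
    return dense_days, np.interp(dense_days, days, values)


def fidelity(days, values, dense_days, dense_values):
    """Re-sample the dense series at the original days; return (R2, MRE in %)."""
    recovered = np.interp(days, dense_days, dense_values)
    ss_res = np.sum((values - recovered) ** 2)
    ss_tot = np.sum((values - values.mean()) ** 2)
    mre = 100.0 * np.mean(np.abs((values - recovered) / values))
    return 1.0 - ss_res / ss_tot, mre


# Synthetic daily series for days 97-154 (58 points); not data from the study.
rng = np.random.default_rng(0)
days = np.arange(97, 155, dtype=float)
ch4_ppm = 1500 + 200 * np.sin(days / 9.0) + rng.normal(scale=20, size=days.size)

dense_days, dense_ch4 = upsample_linear(days, ch4_ppm)
r2, mre = fidelity(days, ch4_ppm, dense_days, dense_ch4)
print(f"points: {dense_ch4.size}, R2: {r2:.4f}, MRE: {mre:.3f}%")
```

Note that linear interpolation adds temporal resolution but no new information, so consecutive points in the dense series are strongly correlated.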
2.8. Linear Modeling
To simplify the proposed equations and procedure, the suffixes associated with each VFA (Table 4), microorganism (Table 5), and operating condition (Table 6) are shown below.
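Equation (1), which linearly approximates the $\mathrm{CH_4}$ concentration as a function of the microorganisms, fatty acids, and operating conditions, was proposed in the following linear form, based on the suffixes from Table 4, Table 5 and Table 6, with $x_j$ denoting the $j$-th of the $n$ variables listed in these tables:

$$\mathrm{CH_4} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \tag{1}$$

In its matrix form (matrix $X$), Equation (1) can be expressed for $m$ observations as follows (Equation (2)):

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \tag{2}$$

where the constants $a_1, \dots, a_n$ are the approximation coefficients. This matrix form from Equation (2) can be written more compactly as shown in Equation (3):

$$\mathbf{y} = X\mathbf{a} \tag{3}$$

The matrix $X$ contains data collected from fatty acids, microorganisms, and operating conditions, the vector $\mathbf{y}$ represents the collected methane production data, while the vector $\mathbf{a}$ contains the approximation coefficients that must be determined to formulate the model. The vector $\mathbf{a}$ can be solved by rearranging Equation (3) as follows:

$$\mathbf{a} = \left(X^{T}X\right)^{-1}X^{T}\mathbf{y} \tag{4}$$

where $X^{T}$ is the transpose of matrix $X$.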
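The least-squares solution referred to as Equation (4) is the standard normal-equations estimate, with the coefficient vector equal to the inverse of XᵀX times Xᵀy. The sketch below, assuming NumPy and using synthetic data with hypothetical coefficients, computes it without forming the inverse explicitly and compares it with NumPy's built-in least-squares solver.

```python
import numpy as np

# Synthetic design matrix and response (not data from the study).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
a_true = np.array([1.5, -2.0, 0.0, 0.7, 3.0])  # hypothetical coefficients
y = X @ a_true + rng.normal(scale=0.01, size=200)

# Normal equations: solve (X^T X) a = X^T y instead of inverting X^T X.
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer equivalent based on an SVD factorization.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(a_hat, 3))
print(np.round(a_lstsq, 3))
```

When predictors are strongly correlated, as microbial abundances and VFA concentrations often are, XᵀX becomes ill-conditioned and the `lstsq` route or a regularized estimate is generally the more stable choice.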
2.8.1. Assessing Variable Importance
To determine the relative importance of each variable in the approximation, and subsequently define a smaller, more practical subset (working with all 31 variables can be impractical and costly in terms of laboratory testing), a variable weighting method was used. Equation (5) is a modification of Equation (3) that accounts for the minimal error $\boldsymbol{\varepsilon}$:

$$\mathbf{y} = X\mathbf{a} + \boldsymbol{\varepsilon} \tag{5}$$

To quantify how much each variable “contributes” to the $\mathrm{CH_4}$ production within the approximation, it is necessary to measure the relevance of each variable in the linear model. Since each variable may be measured on a different scale (e.g., microorganism abundance vs. fatty acid concentration in mg/L), directly comparing the raw coefficients $a_j$ in vector $\mathbf{a}$ can be misleading. Therefore, it is necessary to standardize the input data. In the same way, to compare the relative importance of each variable, the coefficients $a_j$ that form the vector $\mathbf{a}$ were standardized (as z-scores). The standardized coefficient $\beta_j$ for each variable $x_j$ was calculated as:

$$\beta_j = a_j\,\frac{\sigma_{x_j}}{\sigma_y} \tag{6}$$

where $\sigma_{x_j}$ and $\sigma_y$ are the standard deviations of the variable $x_j$ associated with the approximation coefficient $a_j$ and of the response variable $y$, respectively.
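To make the effect of standardization concrete, the sketch below applies the conventional definition of a standardized regression coefficient, namely the raw coefficient multiplied by the standard deviation of its predictor and divided by the standard deviation of the response. That definition is assumed here, as are the NumPy dependency, the variable names, and the synthetic, mean-centered data.

```python
import numpy as np


def standardized_coefficients(X, y, a):
    """Scale raw coefficients by std(x_j) / std(y) so they become unit-free."""
    return a * X.std(axis=0, ddof=1) / y.std(ddof=1)


def rank_variables(names, beta):
    """Sort variable names by the absolute value of their standardized weight."""
    order = np.argsort(-np.abs(beta))
    return [(names[i], float(beta[i])) for i in order]


# Synthetic, mean-centered predictors on very different scales (hypothetical names).
rng = np.random.default_rng(1)
names = ["abundance_rel", "acetate_mg_l", "temp_c"]
X = np.column_stack([
    rng.normal(0.0, 1.0, 300),
    rng.normal(0.0, 1000.0, 300),
    rng.normal(0.0, 1.0, 300),
])
y = 0.5 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(scale=0.01, size=300)

a, *_ = np.linalg.lstsq(X, y, rcond=None)
beta = standardized_coefficients(X, y, a)
ranking = rank_variables(names, beta)
print(np.round(a, 4))
print(ranking)
```

In this toy case the raw coefficient of the second predictor is the smallest, yet after standardization it carries the largest weight, which is exactly the scale effect the weighting step is meant to correct.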
2.8.2. Predictive Model
To capture the evolutionary nature of the anaerobic digestion process, a dynamic predictive model was developed based on a moving-window approach. The model operates iteratively. At each time step $t$, a linear regression model is trained using a window containing the last $N$ observations (a fixed window size $N$ was chosen in this case). This model is then used to make a one-step-ahead prediction of $\mathrm{CH_4}$ (denoted as $\hat{y}_{t+1}$), as a function of the weighted variables previously described. However, the use of small data windows can lead to overfitting. To address this problem and improve the model’s generalization capability, Ridge Regression was used instead of ordinary least squares. This regression introduces a penalty term into the least squares cost function. For each time window, the objective is to find the coefficient vector $\mathbf{a}_t$ that minimizes the following function:

$$J(\mathbf{a}_t) = \lVert \mathbf{y}_t - X_t\mathbf{a}_t \rVert_2^2 + \lambda \lVert \mathbf{a}_t \rVert_2^2 \tag{7}$$

where $X_t$ and $\mathbf{y}_t$ hold the $N$ observations in the window ending at time $t$, $\lVert \mathbf{y}_t - X_t\mathbf{a}_t \rVert_2^2$ is the sum of squared errors (the data fit term at time $t$), $\lambda \lVert \mathbf{a}_t \rVert_2^2$ is the regularization term applied at time $t$, $\lambda$ is the regularization hyperparameter that controls the balance between the data fit and the model simplicity, and $\mathbf{a}_t$ contains the coefficients $a_j$. The hyperparameter $\lambda$ was selected to improve the model’s predictive performance. Thus, Equation (4) is rewritten to obtain the predictive parameters ($\mathbf{a}_t$) by solving the following equation:

$$\mathbf{a}_t = \left(X_t^{T}X_t + \lambda I\right)^{-1}X_t^{T}\mathbf{y}_t \tag{8}$$

where $I$ is the identity matrix. The goal of Equation (8) is to find the value of the coefficients in vector $\mathbf{a}_t$ to be applied at time $t+1$ using the data available at time $t$. Using Equation (8), it is possible to find the $\mathbf{a}_t$ values for a subsequent window, given a defined window size of $N$. In this way, Equation (3) becomes a prediction equation as follows:

$$\hat{y}_{t+1} = \mathbf{x}_{t+1}^{T}\mathbf{a}_t \tag{9}$$

where $\mathbf{x}_{t+1}$ is the vector of predictor values at time $t+1$.
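The moving-window ridge procedure can be sketched as below, assuming NumPy. Each step fits the closed-form ridge solution, in which λ times the identity matrix is added to XᵀX before solving, on the previous window of observations and then predicts the next one. The window size, the value of λ, and the drifting synthetic data are illustrative placeholders and are not the values used in the study.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) a = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def moving_window_predict(X, y, window, lam):
    """One-step-ahead predictions: fit on rows [t - window, t), then predict row t."""
    preds = np.full(len(y), np.nan)  # the first `window` rows have no prediction
    for t in range(window, len(y)):
        a_t = ridge_fit(X[t - window : t], y[t - window : t], lam)
        preds[t] = X[t] @ a_t
    return preds


# Synthetic process whose coefficients drift slowly over time (illustration only).
rng = np.random.default_rng(7)
n, p = 300, 4
X = rng.normal(size=(n, p))
drift = np.outer(np.arange(n), [0.001, -0.001, 0.0005, 0.0])
coeffs = np.array([1.0, 2.0, -1.0, 0.5]) + drift
y = np.sum(X * coeffs, axis=1) + rng.normal(scale=0.05, size=n)

window, lam = 50, 1.0  # placeholder hyperparameters
preds = moving_window_predict(X, y, window, lam)
valid = ~np.isnan(preds)
rmse = float(np.sqrt(np.mean((y[valid] - preds[valid]) ** 2)))
print(f"predictions: {valid.sum()}, RMSE: {rmse:.3f}")
```

Because each fit only sees data up to the prediction time, the resulting errors are out-of-sample; a short window tracks drift more closely, while a larger λ damps coefficient swings at the cost of some bias.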
2.8.3. Model Performance Evaluation
The precision of the predictive model was quantified using three standard statistical metrics. These metrics evaluate the divergence between the real observed values of $\mathrm{CH_4}$ and the values predicted by the model, $\hat{y}$. First, the coefficient of determination ($R^2$) indicates the proportion of the variance in methane production that is predictable from the independent variables. A value close to 1 indicates an almost perfect fit. Equation (10) shows how it was calculated for this case.

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{10}$$

Next, the Root Mean Square Error (RMSE) represents the standard deviation of the prediction residuals. It is a measure of the average error of the model in the same units as the response variable (ppm of $\mathrm{CH_4}$), which facilitates its interpretation, and is expressed in Equation (11).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{11}$$

Finally, the MRE, which measures the average error in relative or percentage terms with respect to the real value, is defined in Equation (12). The absolute value was used to prevent positive and negative errors from canceling each other out.

$$\mathrm{MRE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{12}$$

In Equations (10)–(12), $y_i$ is the real value of the i-th observation, $\hat{y}_i$ is the value predicted by the model for the i-th observation, $\bar{y}$ is the mean value of all real values, and $n$ is the total number of observations used for the evaluation.
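The three metrics can be computed with a few lines of standard-library Python. The sketch below uses the usual definitions of R², RMSE, and MRE expressed as a percentage; the observed and predicted methane values are invented solely to show the calculation.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the response variable."""
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true))


def mre_percent(y_true, y_pred):
    """Mean relative error in percent; requires non-zero observed values."""
    return 100.0 * sum(abs((yt - yp) / yt) for yt, yp in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted methane concentrations (ppm).
observed = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3100.0, 3900.0]

print(f"R2   = {r2_score(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.1f} ppm")
print(f"MRE  = {mre_percent(observed, predicted):.2f} %")
```

Reporting RMSE alongside MRE is useful here because the same absolute error of 100 ppm weighs more heavily in relative terms at low methane concentrations.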