Benchmarking Tabular Foundation Models for Total Volatile Fatty Acid Prediction in Anaerobic Digestion

Amangeldy, Bibars; Baigarayeva, Zhanel; Tasmurzayev, Nurdaulet; Boltaboyeva, Assiya; Imanbek, Baglan; Maulenbekov, Marlen; Zhussupbekov, Sarsenbek; Wojcik, Waldemar; Kozhamberdieva, Mergul; Konysbekova, Akzhan

doi:10.3390/a19020127


by Bibars Amangeldy ^1,2; Zhanel Baigarayeva ^1,2,3; Nurdaulet Tasmurzayev ^1,2,*; Assiya Boltaboyeva ^1,2,3; Baglan Imanbek ^1,2,*; Marlen Maulenbekov ^1,3,4; Sarsenbek Zhussupbekov ^5; Waldemar Wojcik ^1,6; Mergul Kozhamberdieva ^1,2 and Akzhan Konysbekova ^7

^1 Joldasbekov Institute of Mechanics and Engineering, Almaty 050010, Kazakhstan

^2 Faculty of Information Technologies and Artificial Intelligence, Al Farabi Kazakh National University, Almaty 050040, Kazakhstan

^3 LLP “Kazakhstan R&D Solutions”, Almaty 050056, Kazakhstan

^4 Institute of Automation and Information Technology, Satbayev University, Almaty 050013, Kazakhstan

^5 Department of Automation and Control, Energo University, Almaty 050013, Kazakhstan

^6 Institute of Electronics and Information Technology, Politechnika Lubelska, 20-618 Lublin, Poland

^7 JSC “Research Institute of Cardiology and Internal Diseases”, Almaty 050000, Kazakhstan

^* Authors to whom correspondence should be addressed.

Algorithms 2026, 19(2), 127; https://doi.org/10.3390/a19020127

Submission received: 30 December 2025 / Revised: 30 January 2026 / Accepted: 3 February 2026 / Published: 5 February 2026

(This article belongs to the Special Issue AI Applications and Modern Industry)


Abstract

Monitoring the concentration of Total Volatile Fatty Acids (TVFA (M)) is critical for ensuring the stability and efficiency of the Anaerobic Digestion (AD) process, but conventional laboratory methods are often time-consuming and hinder real-time control. This study develops soft sensors based on machine learning techniques to predict TVFA (M) levels using readily available parameters such as pH, pCO₂, and Total Ammoniacal Nitrogen (TAN). A primary contribution of this work is a comprehensive benchmark of current State-of-the-Art (SOTA) deep learning and machine learning models, including XGBoost, Random Forest, TorchMLP, and the tabular foundation model RealTabPFN-v2.5. Experimental results demonstrate that RealTabPFN-v2.5 outperforms the other algorithms, achieving the highest accuracy (R² of 0.889) and the lowest error (RMSE of 0.0079). SHAP (SHapley Additive exPlanations) analysis was employed to interpret the model’s predictions, identifying pH as the most influential factor in TVFA (M) prediction and confirming that the model’s decision-making process aligns with established biological principles. These findings highlight the significant potential of integrating SOTA machine learning models into intelligent monitoring systems for the automation and optimization of biogas production processes.

Keywords:

anaerobic digestion; total volatile fatty acids (TVFA (M)); deep learning; neural networks; state-of-the-art (SOTA) models; RealTabPFN; soft sensors; biogas process monitoring; SHAP analysis

1. Introduction

Anaerobic digestion (AD) stands as a cornerstone technology in the global transition toward renewable energy generation and sustainable waste management, offering a dual-benefit solution that simultaneously produces biogas—a versatile energy carrier—and stabilizes organic waste streams [1,2]. The process harnesses complex microbial consortia to convert diverse organic substrates into methane and carbon dioxide through a cascade of biochemical reactions, positioning AD as a critical component in circular economy strategies worldwide. Despite its environmental and economic promise, AD systems exhibit pronounced instability and non-linear dynamics that pose significant operational challenges, particularly in full-scale biogas plants where process disturbances can persist for weeks, reducing biogas yields by 20–30% [3]. These instabilities arise from multiple factors including organic overload, ammonia inhibition, and long-chain fatty acid accumulation, creating a complex optimization landscape where operators must balance maximum biogas production against the risk of process failure [4]. Consequently, most commercial continuous stirred tank reactors operate at suboptimal organic loading rates of 1–4.5 gVS L⁻¹ d⁻¹ as a safety precaution, resulting in substantial capital inefficiency and reduced energy recovery. The inherent complexity of AD microbiomes, which reflect the wide variety of substrates and physicochemical conditions in both natural and technical environments, further compounds these challenges, as the delicate balance between hydrolytic, acidogenic, acetogenic, and methanogenic populations remains vulnerable to environmental fluctuations [5].

Among the myriad parameters influencing AD performance, total volatile fatty acids (TVFA (M)) concentration emerges as the most critical indicator of process health and stability, serving as an early warning signal for impending digester failure [6]. TVFA (M) accumulation directly reflects the metabolic imbalance between acid-producing and acid-consuming microbial communities, with concentrations exceeding 4000 mg/L often heralding significant process disruption [3]. The traditional reliance on “hard sensors” for TVFA (M) measurement—primarily offline laboratory analyses involving gas chromatography or titration—introduces prohibitive constraints for real-time process control [7]. These methods demand specialized equipment, trained personnel, and substantial time delays of hours to days, during which critical process deviations may go undetected and escalate irreversibly. The high capital and operational costs associated with online VFA sensors, coupled with their maintenance requirements and susceptibility to biofouling in harsh digester environments, have limited their widespread industrial adoption. This measurement gap has catalyzed the development of “soft sensors”—computational models that infer difficult-to-measure variables from readily available online measurements—as a pragmatic and cost-effective alternative for continuous AD monitoring [8]. Soft sensors leverage correlations between TVFA (M) and more accessible parameters such as pH, partial pressure of carbon dioxide (pCO₂), and total ammonia nitrogen (TAN) to provide real-time estimates without the logistical burden of direct measurement [9].

Although the application of machine learning (ML) for monitoring anaerobic digestion (AD) has grown significantly, with various studies demonstrating the potential of data-driven models to predict process stability, a critical analysis of the existing literature reveals significant limitations regarding input complexity and model generalization. A prevalent trend in current research is the reliance on a broad spectrum of physicochemical characteristics for model training. Choi et al. [10] utilized a dataset comprising over 40 distinct features, including specific organic components like proteins and lipids, while other recent studies have incorporated heavy computational inputs such as genomic sequencing data [11]. While these approaches can yield high precision, they present practical drawbacks, as parameters like lipid concentration typically require time-consuming offline laboratory analysis that hinders real-time control. To address this, our study focuses on a streamlined set of readily available inputs—pH, partial pressure of CO₂ (pCO₂), and Total Ammoniacal Nitrogen (TAN)—specifically selected to reflect the system’s state without necessitating expensive assays. Furthermore, regarding modeling techniques, the literature is currently dominated by classical algorithms such as XGBoost and Artificial Neural Networks [12,13]. However, these traditional models are often data-hungry, requiring large, labeled datasets to avoid overfitting, and necessitate extensive hyperparameter tuning. This study advances the state-of-the-art by introducing RealTabPFN, a tabular foundation model that, unlike traditional methods training “from scratch” [10,11], leverages prior knowledge of tabular data structures. By moving away from complex feature engineering, our approach addresses the critical gap between high-precision academic models and the robust, low-data requirements of industrial application.

The application of machine learning to AD process control has evolved considerably, with early approaches employing artificial neural networks (ANN) and support vector machines (SVM) to capture the non-linear relationships between operational parameters and process outputs [14]. These methods demonstrated reasonable predictive accuracy but required extensive feature engineering and struggled with the heterogeneous nature of tabular AD data, which typically contains mixed data types, missing values, and uninformative features. Recent years have witnessed a paradigm shift with the emergence of gradient boosting methods, particularly XGBoost and Random Forest, which have established themselves as robust baselines for tabular data prediction tasks [15]. However, the deep learning revolution that transformed computer vision and natural language processing has yielded more ambiguous results for tabular data, with studies demonstrating that deep models often underperform compared to tree-based ensembles despite requiring substantially more computational resources and tuning [16,17]. This performance gap has prompted the development of specialized tabular deep learning architectures, including TabNet, SAINT, and TabTransformer, which incorporate attention mechanisms and feature selection capabilities. Most recently, foundation models for tabular data such as TabPFN and its successor RealTabPFN-v2.5 have challenged conventional wisdom by achieving state-of-the-art performance through in-context learning on synthetic training distributions, though their efficacy on small engineering datasets remains underexplored [18,19]. The ongoing debate between gradient boosting and deep learning for tabular data reflects a fundamental question about the nature of AD datasets: whether their inherent structure favors the axis-aligned splits of decision trees or can benefit from the dense representations learned by neural networks.

This study leverages high-quality experimental data generated by the University of Southampton Water and Environmental Engineering Group, which originally characterized the physicochemical responses of AD systems to induced instability [20]. The dataset captures the dynamic relationships between pH, pCO₂, TAN, and TVFA (M) under controlled perturbations, providing a rich foundation for predictive modeling. Previous analyses of this data focused primarily on stability assessment and mechanistic understanding of failure modes, establishing clear thresholds for safe operation [21]. The experimental protocol involved systematic manipulation of organic loading rates and nitrogen concentrations while continuously monitoring key parameters, producing a temporally resolved dataset that reflects both steady-state and transient digester behavior [22]. This rigorous experimental design ensures that the resulting models capture biologically meaningful relationships rather than spurious correlations, addressing a common limitation of data-driven approaches in process engineering. The dataset’s moderate size—typical of academic research campaigns—presents both challenges and opportunities for machine learning model development, particularly when evaluating the trade-offs between model complexity and generalization performance.

The primary objective of this research is to repurpose the Southampton dataset for TVFA (M) prediction using indirect, easy-to-measure parameters (pH, pCO₂, TAN) as model inputs, thereby transforming a retrospective stability analysis tool into a prospective process control solution. This represents a fundamental shift from the previous study’s focus on understanding failure mechanisms to enabling predictive maintenance and optimization. By developing accurate soft sensors for TVFA (M), operators can implement proactive control strategies that prevent acid accumulation before it reaches inhibitory levels, rather than reacting to laboratory results after the fact [23]. The IoT-enabled framework envisioned in this work integrates these predictive models with sensor networks, creating a closed-loop control system that continuously adapts to changing feedstock characteristics and operational conditions [24]. Such integration addresses the critical need for intelligent automation in AD plants, where manual intervention based on delayed measurements currently limits process efficiency and reliability. The transition from stability characterization to prediction requires careful validation of model robustness across different operational regimes and digester configurations to ensure industrial applicability.

Despite the potential of machine learning for AD process control, industrial adoption remains hindered by the “black-box” nature of many predictive models, which prioritize accuracy at the expense of interpretability. Plant operators and process engineers require transparent models that not only predict TVFA (M) concentrations but also explain the underlying biological rationale, enabling them to trust and act upon model recommendations. The lack of interpretability becomes particularly problematic when models extrapolate beyond their training distribution or encounter anomalous conditions, as operators cannot assess the validity of predictions without understanding the causal relationships driving them. Furthermore, the challenge of applying deep learning to “small data” prevalent in engineering contexts—as opposed to the massive datasets available in internet-scale applications—necessitates rigorous evaluation of model robustness and generalization. While tree-based models like Random Forest and XGBoost offer some inherent interpretability through feature importance measures, their predictions remain opaque at the individual sample level, limiting their utility for diagnostic purposes. This creates a critical gap for AD applications, where understanding parameter interactions and their biological significance is as important as predictive accuracy.

Our study aimed to address the operational challenges of anaerobic digestion process control by exploring the application and interpretation of modern computational methods. We focused on assessing how modern tabular deep learning architectures—including RealTabPFN-v2.5 and optimized versions of TorchMLP, FastaiMLP, and RealMLP—perform when applied to soft sensing in biogas systems. By comparing these models with established benchmarks such as Random Forest and XGBoost using the Southampton dataset, we identified various trade-offs between predictive accuracy, computational demands, and the effort required for hyperparameter tuning in engineering contexts. This analysis helps clarify the practical implications of moving from traditional tree-based ensembles toward more complex neural architectures for monitoring biogas processes.

Additionally, we utilized the SHAP framework to provide interpretability for these models, aiming to connect high-performance predictive outputs with the biological requirements of anaerobic digestion systems. The SHAP analysis allowed us to quantify how parameters like pH, pCO₂, and total ammonia nitrogen contribute to volatile fatty acid predictions, which helped in checking the model logic against known microbiological principles. This approach supports the biological relevance of the results while highlighting interaction patterns, such as the relationship between pH and TAN, which are important for helping operators understand and trust the model’s outputs in industrial facilities.

2. Materials and Methods

2.1. Data Collection

The analytical framework presented here is structured to investigate the viability of forecasting TVFA (M) levels within bioreactor environments through machine learning algorithms trained on essential physicochemical indicators. Functioning as a pilot study, its fundamental objective is to assess the suitability of the chosen input variables, the modeling methodology, and the interpretability approaches before initiating a more extensive and detailed investigation. The proposed architecture adheres to a systematic workflow, depicted in Figure 1, which integrates data collection, data consolidation and preprocessing, feature significance assessment, model training, and performance validation alongside model interpretability analysis.

The data utilized for this research were retrieved from a publicly accessible open-source repository associated with the publication by Zhang et al. (2023) [25]. This source provides physicochemical data recorded over an eight-day operational cycle of CO₂ biomethanization experiments. To assemble a cohesive dataset appropriate for exploratory model construction, raw observations from the entire eight-day duration were consolidated into a single matrix. While the timeframe is relatively short, this aggregation captures various operational phases and provides a sufficient basis for methodological verification within a pilot scope. After the data were integrated, preprocessing routines were executed, and specific variables were categorized as either independent input features or the dependent target output.

The predictive capacity of the model is centered on three primary physicochemical inputs and one specific target variable. Specifically, the framework employs pH, pCO₂, and TAN as the input features, with TVFA (M) serving as the target for prediction. pH acts as a critical measure of system acidity and remains one of the most economical and frequently tracked parameters in anaerobic digestion setups [26]. Despite its straightforward nature, pH levels reflect intricate biochemical shifts and offer indirect insights into the overall stability of the process. pCO₂ indicates the partial pressure of dissolved carbon dioxide, functioning as a marker for microbial metabolic rates and the internal acid–base balance [27]. TAN represents the total concentration of both free ammonia and ammonium ions and performs a complex role in anaerobic digestion; it supports buffering capacity at moderate levels but can lead to inhibitory responses at higher concentrations [28].

TVFA (M) signifies the levels of short-chain volatile fatty acids generated as intermediate metabolites during the anaerobic breakdown of organic substrates. The disproportionate buildup of TVFA (M) is internationally acknowledged as a primary signal of process instability, typically occurring before full system acidification or functional failure [29]. In this preliminary study, the estimation of TVFA (M) is investigated as a proof-of-concept to determine if it can be accurately derived from accessible, low-cost indicators like pH, pCO₂, and TAN. This approach seeks to decrease the necessity for frequent, labor-intensive laboratory chemical testing.

To ensure the integrity of the data restructuring process, the transformation from wide to long format was strictly validated against the primary dataset. This verification step ensures that each reactor-specific observation (e.g., pH and biogas yield) correctly retains its temporal identity and biochemical context. Furthermore, to provide a transparent overview of the data distribution used for model training, the statistical characteristics of the consolidated dataset are summarized in Table 1. The low standard deviation in core parameters such as TAN and pH confirms the stability of the monitored process, while the variance in TVFA (M) reflects the dynamic metabolic shifts targeted by the predictive models.

In this work, the ‘Total VFA’ refers to the cumulative concentration of volatile fatty acids available in the primary dataset, specifically calculated as the sum of the molar concentrations of acetic (HAc), propionic (HPr), butyric (n-HBu and i-HBu), valeric (n-HVa and i-HVa), caproic (Hex), and enanthic (Hep) acids. Due to the specific data collection constraints, this variable serves as the principal operational indicator of the system’s metabolic state, focusing on the most representative components monitored during the process.

To evaluate the significance of the chosen input variables, Mutual Information (MI) analysis was utilized as a preliminary feature evaluation technique. Subsequently, machine learning regression algorithms were trained based on the established input–output structure. The effectiveness of these models was measured using conventional regression metrics, and SHAP analysis was implemented to decode model logic and determine the specific impact of each input parameter on the predicted TVFA (M) outcomes.

2.2. Data Preparation and Feature Engineering

Raw experimental data were gathered continuously over an eight-day period, recording fluctuations in critical physicochemical parameters. To guarantee consistency and facilitate effective machine learning analysis, the eight individual daily datasets were consolidated into a single unified dataset. This integration provided a more comprehensive depiction of the system’s dynamics and enhanced the robustness of the model training phase.

Subsequent to data unification, feature selection was conducted to designate the input and target variables. Specifically, pCO₂, TAN (M), and pH were selected as independent features (X), given that they are standard monitoring parameters representing acid–base equilibrium, metabolic rates, and buffering capacity within the system. TVFA (M) was defined as the dependent variable (y), acting as the prediction target for all downstream modeling tasks. This structured dataset established the clear input–output relationship needed to construct reliable regression models.
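The consolidation and feature/target split described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the column names (`pCO2`, `TAN`, `pH`, `TVFA`), the `day` tag, and the synthetic values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the source repository may label them differently.
FEATURES = ["pCO2", "TAN", "pH"]
TARGET = "TVFA"


def consolidate(daily_frames):
    """Stack per-day tables into one frame, tagging each row with its day index."""
    tagged = [df.assign(day=day) for day, df in enumerate(daily_frames, start=1)]
    return pd.concat(tagged, ignore_index=True)


# Synthetic stand-in for the eight daily tables (values are illustrative only).
rng = np.random.default_rng(0)
daily_frames = [
    pd.DataFrame(
        {
            "pCO2": rng.uniform(0.2, 0.5, size=6),
            "TAN": rng.uniform(0.10, 0.20, size=6),
            "pH": rng.uniform(6.8, 7.8, size=6),
            "TVFA": rng.uniform(0.00, 0.08, size=6),
        }
    )
    for _ in range(8)
]

data = consolidate(daily_frames)
X = data[FEATURES]  # independent features
y = data[TARGET]    # prediction target
```

Keeping the `day` column is optional; it simply preserves the temporal origin of each row after the daily tables are merged.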

To address missing values within the feature space, a multivariate iterative imputation approach based on the MissForest algorithm was implemented [30]. Every feature containing missing entries was iteratively modeled as a function of the remaining features utilizing an ensemble-based regressor. The imputation procedure was initialized with the median of each feature and refined through multiple iterations to capture non-linear dependencies among process variables. To prevent data leakage, the imputer was fitted solely on the training subset before being applied to the validation subset. In this study, log transformation of the target variable was intentionally omitted. While parametric models often require such transformations to satisfy assumptions of homoscedasticity, the chosen tree-based ensemble methods are non-parametric and inherently robust to skewed distributions. These models partition data based on feature thresholds, making them capable of capturing the dynamic range of VFA concentrations without the need for logarithmic scaling.
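A MissForest-style imputer of the kind described above can be approximated with scikit-learn's `IterativeImputer` wrapped around a random forest. The sketch below is an assumption-laden illustration, not the authors' code: the number of trees, the iteration cap, the synthetic data, and the 10% missingness rate are all chosen only for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer


def make_imputer(seed=0):
    """Iterative imputer with a random forest per feature and median initialisation."""
    return IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=seed),  # assumed size
        initial_strategy="median",
        max_iter=5,  # assumed iteration cap
        random_state=seed,
    )


# Synthetic, correlated features with about 10% of entries removed (illustrative only).
rng = np.random.default_rng(0)
X_full = rng.normal(size=(60, 3))
X_full[:, 2] = 0.5 * X_full[:, 0] - X_full[:, 1] + 0.1 * rng.normal(size=60)
X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.1] = np.nan

# Fit on the training subset only, then apply the fitted imputer to the validation subset.
X_train, X_val = X_missing[:48], X_missing[48:]
imputer = make_imputer()
X_train_imp = imputer.fit_transform(X_train)
X_val_imp = imputer.transform(X_val)
```

Because the imputer never sees validation rows during fitting, the imputed validation values depend only on relationships learned from the training subset.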

Following imputation, feature values underwent z-score normalization to ensure numerical stability and uniform feature scaling. This preprocessing step adjusted the data to possess a mean of zero and unit variance, ensuring that all predictors contributed proportionally to model training while improving stability for algorithms sensitive to feature scales. The scaling parameters were derived from the training data and subsequently used to transform both the training and test sets to preclude data leakage. The standardization process is mathematically defined as [31]:

Z = \frac{X - \mu}{\sigma} \qquad (1)

where X denotes the original feature value, and Z signifies the standardized or scaled value. The symbol μ indicates the mean of the feature calculated from the training dataset, while σ represents the standard deviation, also determined using the training data. To ensure the transparency of the modeling process, it is explicitly stated that the target TVFA (M) variable underwent Z-score standardization solely for the training phase to stabilize numerical gradients. Unlike some traditional approaches, no logarithmic transformations were applied, preserving the original variance structure of the biochemical data. All predictive outputs were subsequently back-transformed to the original Molar (M) scale for performance evaluation and visualization.
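As a small illustration of Equation (1), the sketch below fits the scaling parameters on a training split only, reuses them for the test split, and round-trips the target through standardization and back-transformation. The data are synthetic and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for (pCO2, TAN, pH) and the target; values are illustrative only.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.3, 0.15, 7.2], scale=[0.05, 0.02, 0.3], size=(40, 3))
X_test = rng.normal(loc=[0.3, 0.15, 7.2], scale=[0.05, 0.02, 0.3], size=(10, 3))
y_train = rng.uniform(0.0, 0.08, size=40)

# mu and sigma come from the training data only and are reused for the test data.
x_scaler = StandardScaler().fit(X_train)
Z_train = x_scaler.transform(X_train)
Z_test = x_scaler.transform(X_test)

# Target standardised for training, then mapped back to the original molar scale.
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
y_train_z = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_back = y_scaler.inverse_transform(y_train_z.reshape(-1, 1)).ravel()
```

Note that `StandardScaler` uses the population standard deviation (divisor n), so the test features are not guaranteed to have exactly zero mean and unit variance after transformation.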

To assess the predictive relevance of each biochemical variable for the target parameter (TVFA (M)), a Mutual Information (MI) analysis was performed. MI is an information-theoretic measure of statistical dependence that captures both linear and non-linear relationships between variables [32].

The resulting MI matrix is shown in Figure 2. pH shares the highest mutual information with the target variable TVFA (M) (MI = 0.662), suggesting that system acidity is the most informative predictor of volatile fatty acid accumulation in this dataset. TAN (M) exhibited a moderate dependency with TVFA (M) (MI = 0.364), whereas pCO₂ showed the weakest direct association (MI = 0.187).

Additionally, the matrix highlights notable inter-feature dependencies, particularly between pH and pCO₂ (MI = 0.507). This dependency reflects the underlying physicochemical relationship governed by the carbonate buffering system.
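One way to obtain feature-to-target MI scores of this kind is scikit-learn's k-nearest-neighbour estimator, `mutual_info_regression`. The text does not state which estimator was used, so the estimator choice, the `n_neighbors` setting, and the toy data below are assumptions; the toy target is constructed so that pH dominates, and its scores are unrelated to the values reported above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Toy data: the target depends strongly and non-linearly on pH, weakly on TAN,
# and not at all on pCO2 (illustrative only, not the study's dataset).
rng = np.random.default_rng(0)
n = 500
pH = rng.uniform(6.5, 8.0, size=n)
TAN = rng.uniform(0.10, 0.20, size=n)
pCO2 = rng.uniform(0.2, 0.5, size=n)
tvfa = np.exp(-2.0 * (pH - 6.5)) + 0.5 * TAN + 0.02 * rng.normal(size=n)

X = np.column_stack([pH, TAN, pCO2])
mi = mutual_info_regression(X, tvfa, n_neighbors=3, random_state=0)
mi_by_feature = dict(zip(["pH", "TAN", "pCO2"], mi))

# Rank features from most to least informative about the target.
ranking = sorted(mi_by_feature, key=mi_by_feature.get, reverse=True)
```

The same function can be applied with each feature in turn as the "target" to fill in the inter-feature entries of a full MI matrix.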

2.3. Machine Learning Models

To model the target biochemical response under controlled laboratory conditions, we framed the problem as a regression task, where TVFA (M) concentration is predicted based on three process variables: pCO₂, TAN (M), and pH. In this study, a total of eight machine learning models were evaluated to assess predictive performance and robustness. These models include RealTabPFN-v2.5, TorchMLP (tuned), RealMLP (tuned), FastAI MLP (tuned), Random Forest (default), Random Forest (tuned), XGBoost (default), and XGBoost (tuned). This diverse set of models was selected to cover both tree-based and neural network–based learning approaches. To justify the use of these complex ensemble methods, a standard Multiple Linear Regression (MLR) was included as a baseline benchmark. This allows for a clear assessment of whether the added complexity of models like XGBoost and RealTabPFN is necessary to capture the non-linear interactions within the anaerobic process. Model evaluation was conducted using five-fold cross-validation applied to the full dataset. In each fold, 80% of the data was used for training, while the remaining 20% was reserved for validation. This process was repeated five times, with each data subset serving as the validation set exactly once. For every fold, models were trained exclusively on the training portion and evaluated only on the corresponding validation subset. Regarding the validation protocol, we acknowledge that while temporal dependencies are inherent in anaerobic digestion, this study evaluates the model’s capacity for instantaneous mapping of physicochemical states to TVFA (M) concentrations rather than time-series forecasting. Cross-validation was employed to assess the robustness of this mapping across the entire observed biochemical range.

All preprocessing operations were performed within each cross-validation fold to prevent data leakage. Missing values were imputed using a MissForest-inspired iterative imputation strategy, and both input features and target variables were standardized using a StandardScaler fitted exclusively on the training data. The target variable was normalized to zero mean and unit variance prior to model training and inverse-transformed for performance evaluation.
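The fold-wise protocol can be sketched by bundling preprocessing and the regressor into a single estimator that is re-fitted inside every fold. The example below uses the linear baseline on synthetic data; for brevity, a median `SimpleImputer` stands in for the MissForest-inspired imputer, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_mlr():
    """Linear baseline with feature and target scaling learned from training data only."""
    features = make_pipeline(
        SimpleImputer(strategy="median"),  # simplified stand-in for iterative imputation
        StandardScaler(),
        LinearRegression(),
    )
    # The target is standardised for fitting and inverse-transformed at predict time.
    return TransformedTargetRegressor(regressor=features, transformer=StandardScaler())


def cross_validate_rmse_r2(make_model, X, y, n_splits=5, seed=0):
    """Fit a fresh model per fold; score only on that fold's held-out 20%."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rmse, r2 = [], []
    for train_idx, val_idx in kf.split(X):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        rmse.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
        r2.append(r2_score(y[val_idx], pred))
    return np.array(rmse), np.array(r2)


# Synthetic, nearly linear data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 0.04 + 0.01 * X[:, 2] - 0.005 * X[:, 1] + 0.001 * rng.normal(size=100)
rmse, r2 = cross_validate_rmse_r2(make_mlr, X, y)
```

Because every transformer lives inside the estimator, calling `fit` on the training indices is enough to guarantee that no validation statistics leak into imputation or scaling.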

To ensure a fair and systematic comparison, all models were evaluated under a strict cross-validation protocol with explicit safeguards against data leakage. Depending on the modeling paradigm, different hyperparameter optimization strategies were employed, as detailed below.

RealTabPFN-v2.5 was evaluated without explicit hyperparameter tuning. The pretrained model was accessed via the Hugging Face Hub (authentication enabled for secure model retrieval) and used as provided. RealTabPFN is a foundation-style tabular model meta-trained on a large and diverse collection of datasets to generalize across tasks without task-specific tuning. Within each cross-validation fold, the pretrained REAL checkpoint was fitted using only that fold's training data, ensuring fair evaluation and preventing data leakage.

In contrast, for RealMLP (tuned), a fully supervised hyperparameter optimization strategy was employed using nested cross-validation. Within each outer five-fold cross-validation split (≈80% training, 20% validation), model tuning was performed exclusively on the training subset to avoid data leakage. The model pipeline consisted of multivariate imputation, standardization, and an MLPRegressor. Hyperparameters were optimized via grid search with three-fold cross-validation (cv = 3) applied only to the outer-fold training data. The search space included hidden layer configurations {(64), (128), (128, 64)}, L2 regularization strengths {10⁻⁴, 10⁻³, 10⁻²}, initial learning rates {10⁻⁴, 10⁻³}, and activation functions {ReLU, tanh}. Model selection was based on negative mean squared error, and the best-performing configuration from the inner grid search was subsequently evaluated on the held-out outer validation fold. Performance metrics were then aggregated across all outer folds, ensuring an unbiased estimate of generalization performance under a strict no-data-leakage protocol.
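The nested scheme can be sketched with scikit-learn's `GridSearchCV` inside an outer `KFold` loop. The grid below mirrors the search space listed above; everything else (the `max_iter` value, omission of the imputation step and target scaling, the function name) is an assumption made to keep the example short.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Search space as listed in the text (36 combinations).
PARAM_GRID = {
    "mlp__hidden_layer_sizes": [(64,), (128,), (128, 64)],
    "mlp__alpha": [1e-4, 1e-3, 1e-2],            # L2 regularization strength
    "mlp__learning_rate_init": [1e-4, 1e-3],
    "mlp__activation": ["relu", "tanh"],
}


def nested_cv_rmse(X, y, param_grid=PARAM_GRID, outer_splits=5, inner_splits=3,
                   max_iter=300, seed=0):
    """Outer folds estimate generalization; the inner grid search sees training data only."""
    outer = KFold(n_splits=outer_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in outer.split(X):
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("mlp", MLPRegressor(max_iter=max_iter, random_state=seed)),
        ])
        search = GridSearchCV(pipe, param_grid, cv=inner_splits,
                              scoring="neg_mean_squared_error")
        with warnings.catch_warnings():
            # Short training runs may not converge; silence the warning in this sketch.
            warnings.simplefilter("ignore", ConvergenceWarning)
            search.fit(X[train_idx], y[train_idx])
        # GridSearchCV refits the best configuration on the full outer-training split.
        pred = search.predict(X[val_idx])
        scores.append(float(np.sqrt(mean_squared_error(y[val_idx], pred))))
    return scores
```

Calling `nested_cv_rmse(X, y)` on NumPy arrays returns one RMSE per outer fold; with the full grid this means 36 configurations × 3 inner folds for each of the 5 outer folds, so a reduced grid is convenient for quick checks.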

To further explore neural network flexibility, a TorchMLP (tuned) model was implemented using PyTorch (version 2.9.1), allowing full control over the training dynamics and regularization mechanisms. The architecture consisted of fully connected layers with ReLU activations and dropout, followed by a single linear output neuron for regression. Hyperparameter tuning was conducted via a manual grid search within each fold, evaluating combinations of hidden layer configurations {(64), (128), (128, 64)}, learning rates {10⁻³, 5 × 10⁻⁴}, weight decay {0, 10⁻⁴}, and dropout rates {0.0, 0.1}. The model achieving the lowest validation RMSE within each fold was selected and evaluated on the held-out validation data.

Similarly, a FastAI MLP (tuned) was trained using the fastai tabular learning framework. The preprocessed data were converted into fastai TabularPandas objects with explicitly defined training and validation splits. Hyperparameter tuning was conducted via a manual grid search within each fold, exploring different network depths (layer configurations), dropout probabilities, and learning rates. For each fold, the configuration yielding the lowest validation RMSE was selected, and performance metrics were computed on the held-out validation data.

Tree-based ensemble models were included to provide strong non-neural baselines. Two Random Forest regression variants were evaluated to assess the impact of hyperparameter optimization. The non-tuned Random Forest was trained using default model settings and served as a strong baseline, while the tuned Random Forest employed explicit hyperparameter optimization. For the tuned variant, a manual grid search within each fold was performed over the number of trees {300, 600}, maximum tree depth {None, 10, 20}, minimum samples per leaf {1, 2, 4}, and feature subsampling strategies {\sqrt{p}, 0.8}. The optimal configuration was selected based on the lowest validation RMSE within each fold and subsequently evaluated on the held-out validation data.
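A manual grid search of this kind amounts to enumerating the Cartesian product of the listed values and keeping the configuration with the lowest RMSE. The sketch below uses the Random Forest grid from the text; how the selection data were carved out within each fold is not specified, so the inner 75/25 hold-out split here is an assumption, as are the helper names.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Grid as listed in the text; "sqrt" is scikit-learn's spelling of the sqrt(p) rule.
RF_GRID = {
    "n_estimators": [300, 600],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 0.8],
}


def iter_configs(grid):
    """Yield every combination of grid values as a keyword dictionary."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def tune_rf(X_train, y_train, grid=RF_GRID, seed=0):
    """Return the configuration with the lowest RMSE on an inner hold-out split."""
    # Assumed selection scheme: 25% of the fold's training data is held out for selection.
    X_fit, X_sel, y_fit, y_sel = train_test_split(
        X_train, y_train, test_size=0.25, random_state=seed
    )
    best_rmse, best_cfg = np.inf, None
    for cfg in iter_configs(grid):
        model = RandomForestRegressor(random_state=seed, **cfg).fit(X_fit, y_fit)
        rmse = np.sqrt(mean_squared_error(y_sel, model.predict(X_sel)))
        if rmse < best_rmse:
            best_rmse, best_cfg = rmse, cfg
    return best_cfg, float(best_rmse)
```

The same `iter_configs` helper works unchanged for other grids, for example the XGBoost search space, by passing a different dictionary.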

Two XGBoost regression variants were also evaluated to quantify the effect of hyperparameter optimization on gradient-boosted tree performance. The non-tuned XGBoost model was trained using default hyperparameters as a strong baseline. In contrast, the tuned variant employed a manual grid search within each fold, exploring the number of trees {200, 400}, tree depth {3, 5, 7}, learning rates {0.01, 0.05, 0.1}, subsampling ratios {0.8, 1.0}, and column subsampling ratios {0.8, 1.0}. The optimal configuration was selected based on the lowest validation RMSE within each fold and subsequently evaluated on the held-out validation data.

To evaluate model performance, standard metrics were computed: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination (R²):

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}

(2)

\mathrm{RMSE} = \sqrt{\mathrm{MSE}}

(3)

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}|

(4)

R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}

(5)

where y_{i} represents the true values, \hat{y}_{i} the predicted values, \bar{y} the sample mean of the true values, and n the number of test observations.
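As a worked illustration of Equations (2)–(5), the following sketch computes the four metrics in plain Python. The function name `regression_metrics` and the numerical values are illustrative assumptions and are not taken from the study's dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² as defined in Equations (2)-(5)."""
    n = len(y_true)
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # sum of squared residuals
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    y_bar = sum(y_true) / n                           # sample mean of the true values
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)  # total sum of squares
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": 1.0 - ss_res / ss_tot}


# Illustrative values only, expressed in molar units (M).
y_true = [0.02, 0.04, 0.06, 0.08]
y_pred = [0.03, 0.04, 0.05, 0.08]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

For this toy input the residuals are −0.01, 0, 0.01, and 0, which gives MSE = 5 × 10⁻⁵, MAE = 0.005, and R² = 0.9.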

No transformation or normalization was applied to the target variable; TVFA values were modeled and predicted directly in their original molar units. Accordingly, all reported predictions and performance metrics are expressed in Molarity (M) without any inverse or back-transformation.

Maintaining the original physical scale (M) for the target variable ensures that the model’s error behavior—specifically MAE and RMSE—remains directly interpretable in the context of anaerobic digestion monitoring. This approach avoids the complexities and potential biases introduced by inverse-transformation, providing a more transparent assessment of the model’s predictive accuracy in real-world units.

To enhance the interpretability of the black-box models and understand the underlying logic of the predictions, SHAP (SHapley Additive exPlanations) analysis was employed [33]. SHAP is a game-theoretic approach that assigns each feature an importance value for a particular prediction, effectively quantifying the contribution of each physicochemical parameter (pCO₂, TAN, and pH) to the final TVFA (M) estimation. Unlike traditional feature importance methods, SHAP ensures consistency and provides local explanations, allowing for a granular look at how specific fluctuations in input variables drive model outputs. By utilizing SHAP summary plots and dependence analysis, we aimed to bridge the gap between high-performance machine learning and the biochemical interpretability of the anaerobic process, ensuring that the model’s decisions align with established scientific principles.
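The game-theoretic idea behind SHAP can be illustrated with an exact Shapley computation over the three input features. This is a minimal sketch under stated assumptions: `toy_model` is a hypothetical stand-in for a fitted regressor, a single baseline point replaces the background dataset that SHAP implementations typically average over, and all numerical values are invented for illustration.

```python
from itertools import combinations
from math import factorial

FEATURES = ["pCO2", "TAN", "pH"]


def toy_model(x):
    """Hypothetical stand-in for a fitted regressor (not the paper's model)."""
    return 0.20 - 0.02 * x["pH"] + 0.5 * x["TAN"] * (x["pH"] - 7.0) - 0.05 * x["pCO2"]


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside the coalition are set to their baseline value.
    """
    n = len(FEATURES)
    phi = {}
    for i in FEATURES:
        others = [f for f in FEATURES if f != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = {f: (x[f] if f in subset or f == i else baseline[f]) for f in FEATURES}
                without_i = {f: (x[f] if f in subset else baseline[f]) for f in FEATURES}
                total += weight * (model(with_i) - model(without_i))
        phi[i] = total
    return phi


x = {"pCO2": 0.40, "TAN": 0.15, "pH": 7.6}
baseline = {"pCO2": 0.30, "TAN": 0.10, "pH": 7.2}
phi = shapley_values(toy_model, x, baseline)
gap = toy_model(x) - toy_model(baseline)
print(phi, sum(phi.values()), gap)
```

The attributions sum to the difference between the prediction at `x` and at the baseline, which is the additivity property that makes per-feature contributions directly comparable.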

3. Results

3.1. Model Performance Comparison

To identify the most effective predictive algorithm for TVFA (M) concentration, a diverse set of regression models was evaluated, encompassing neural network–based architectures, ensemble tree methods, and modern tabular learning approaches. The investigated models include RealTabPFN-v2.5, TorchMLP, FastaiMLP, RealMLP, Random Forest, and XGBoost, with both tuned and default configurations considered where applicable. Model performance was assessed using R², MSE, RMSE, and MAE, as summarized in Table 2. Because no transformation was applied to the target variable, all predictions and error metrics are expressed directly on the original molar (M) scale, so the reported errors reflect the physical magnitude of the deviations.

Among all evaluated approaches, RealTabPFN-v2.5 demonstrated the strongest overall performance. It achieved the highest coefficient of determination (R² = 0.889) together with the lowest prediction errors (RMSE = 0.0079 ± 0.0007 and MAE = 0.0056 ± 0.0006), indicating superior accuracy and robustness in modeling TVFA (M) dynamics. The consistently low standard deviations across all metrics further confirm its stable generalization across cross-validation folds.

Tuned neural network models showed competitive but slightly lower performance. TorchMLP (tuned) achieved an R² of 0.847, reflecting effective learning of nonlinear relationships following hyperparameter optimization. In contrast, FastaiMLP (tuned) and RealMLP (tuned) exhibited reduced explanatory power and higher variability, suggesting more limited stability for this prediction task.

Tree-based ensemble methods also delivered strong results. Tuned Random Forest and tuned XGBoost outperformed their default counterparts, achieving R² values above 0.82 and lower error metrics, underscoring the importance of hyperparameter tuning. Nevertheless, their performance remained inferior to that of RealTabPFN-v2.5, particularly in terms of RMSE and MAE. The comparison demonstrates that linear approaches fail to account for the intricate biochemical couplings—such as the pH-TAN-TVFA (M) relationship—thereby validating the selection of non-linear architectures for this soft-sensing task.

3.2. In-Depth Analysis of the Optimal (RealTabPFN-v2.5) Model

To further assess the model’s generalization capability and training stability, a detailed analysis of the best-performing RealTabPFN-v2.5 model was carried out. Learning curves were generated by progressively increasing the fraction of training data and monitoring the corresponding training and validation RMSE. This analysis provides insight into how model performance evolves with additional data and allows assessment of potential overfitting or underfitting effects.
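The learning-curve procedure can be sketched as follows. A one-dimensional least-squares line is used purely as a lightweight stand-in for the actual model, and the data are synthetic; `fit_line`, `rmse`, and `learning_curve` are hypothetical helper names, and the fractions shown are an assumption rather than the grid used in the study.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for the real model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmse(xs, ys, model):
    """Root mean squared error of the fitted line on the given points."""
    a, b = model
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))


def learning_curve(x_train, y_train, x_val, y_val, fractions):
    """Train on growing prefixes of the training set and record both RMSEs."""
    curve = []
    for frac in fractions:
        k = max(2, int(round(frac * len(x_train))))
        model = fit_line(x_train[:k], y_train[:k])
        curve.append((frac, rmse(x_train[:k], y_train[:k], model), rmse(x_val, y_val, model)))
    return curve


# Synthetic data for illustration only.
rng = random.Random(0)
xs = [rng.uniform(6.5, 8.0) for _ in range(200)]
ys = [0.30 - 0.035 * x + rng.gauss(0.0, 0.005) for x in xs]
curve = learning_curve(xs[:150], ys[:150], xs[150:], ys[150:], [0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
for frac, train_rmse, val_rmse in curve:
    print(f"{frac:.1f}  train={train_rmse:.4f}  val={val_rmse:.4f}")
```

A persistently small gap between the two columns, together with a flattening validation RMSE, is the pattern interpreted in the text as good generalization and convergence.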

Figure 3 illustrates the evolution of training and validation RMSE for the RealTabPFN-v2.5 model as the fraction of training data increases. Both curves show a pronounced reduction in error during the early stages, indicating efficient learning and a strong ability to capture the underlying relationships in the data even with limited training samples. Notably, a substantial portion of the performance improvement is achieved within the first 30–40% of the training data, after which the rate of error reduction becomes more gradual.

A key observation is the consistently small gap between training and validation RMSE across all training fractions. The close alignment of these curves suggests that the model maintains good generalization and does not exhibit signs of overfitting as additional data are introduced.

Beyond approximately 50–60% of the training data, both training and validation RMSE curves begin to plateau, indicating that the model has effectively converged. Further increases in training data yield only marginal improvements, suggesting that the model has reached a performance ceiling under the current feature representation and data distribution. This stable convergence behavior, combined with low validation error, confirms the robustness of RealTabPFN-v2.5 and supports its superior predictive performance reported in Table 2.

The calibration curve in Figure 4 compares the predicted TVFA (M) values produced by the model with the corresponding observed TVFA (M) measurements. The dashed diagonal line represents perfect calibration, where predictions exactly match observations. The model’s calibration curve closely follows this reference line across the full prediction range, indicating strong agreement between predicted and observed values.

Minor deviations from the ideal line are observed at higher TVFA (M) levels, where the model slightly overestimates the observed values. However, these deviations remain small and systematic rather than random, suggesting well-controlled bias. At lower and moderate TVFA (M) ranges, the predictions align almost perfectly with the ideal calibration, highlighting reliable model behavior in the operating region most relevant for stable process conditions.

The distribution of prediction residuals provides insight into the error characteristics of the proposed TVFA (M) regression model. Figure 5 shows a histogram of residuals that is centered closely around zero, indicating the absence of systematic overestimation or underestimation across the dataset. The symmetric shape of the distribution suggests that positive and negative errors occur with similar frequency.

Most residuals are concentrated within a narrow range around zero, reflecting low prediction error and high precision for the majority of samples. Only a small number of observations appear in the tails of the distribution, indicating that large deviations between predicted and observed TVFA (M) values are rare. This behavior is particularly important for practical deployment, as it implies stable performance under typical operating conditions.

The relationship between model predictions and ground truth TVFA (M) measurements is examined through a parity analysis. Figure 6 presents a scatter plot of predicted versus observed TVFA (M) values, where the dashed diagonal line represents ideal agreement between predictions and observations. The majority of data points are closely clustered around this reference line, indicating strong predictive accuracy across a wide range of TVFA (M) concentrations.

At low to moderate TVFA (M) levels, the predictions show particularly tight alignment with the ideal line, suggesting high reliability in the operating regime most commonly encountered under stable process conditions. As TVFA (M) values increase, a modest increase in dispersion is observed; however, the overall linear trend is preserved, and no systematic bias is evident.

3.3. Feature Importance and Mechanistic Interpretation

A critical step beyond simply evaluating performance is understanding why the model makes its predictions. For this, we employed SHAP (SHapley Additive exPlanations), a state-of-the-art technique that explains the output of any machine learning model. The SHAP analysis provides insights into which features are most important and how they influence predictions.

The global SHAP feature importance summary, presented in Figure 7, ranks the input variables according to their mean absolute SHAP values, reflecting their average impact on the model’s predictions across all samples. Among the considered features, pH emerges as the dominant predictor, exhibiting a substantially higher contribution than all other variables. This indicates that variations in pH play a central role in driving TVFA (M) predictions and strongly influence the model output.

The partial pressure of carbon dioxide (pCO₂) shows a moderate but clearly meaningful contribution, suggesting that gas-phase dynamics also affect TVFA (M) estimation, albeit to a lesser extent than pH. In contrast, total ammoniacal nitrogen (TAN) exhibits a comparatively smaller mean absolute SHAP value, indicating a limited but non-negligible influence on the model predictions under the analyzed operating conditions.

To evaluate the robustness of the SHAP-based feature importance analysis, SHAP values were computed independently within each cross-validation fold. For each fold, the mean absolute SHAP value was calculated for every input feature, and the results were subsequently aggregated across folds. Figure 8 presents the mean absolute SHAP values, with error bars indicating the standard deviation across folds.
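The fold-wise aggregation can be sketched as below. The attribution values are made up for illustration and are not results from the study; the use of the sample standard deviation is an assumption, since the text does not state which estimator was used, and the function names are hypothetical.

```python
from statistics import mean, stdev

FEATURES = ["pH", "pCO2", "TAN"]


def mean_abs_shap(shap_rows):
    """Mean absolute SHAP value per feature for one fold.

    `shap_rows` is a list of per-sample attribution vectors ordered as FEATURES.
    """
    return [mean(abs(row[j]) for row in shap_rows) for j in range(len(FEATURES))]


def aggregate_across_folds(folds):
    """Return {feature: (mean, std)} of the per-fold mean |SHAP| values."""
    per_fold = [mean_abs_shap(rows) for rows in folds]
    return {
        name: (mean(f[j] for f in per_fold), stdev(f[j] for f in per_fold))
        for j, name in enumerate(FEATURES)
    }


# Made-up attributions for three folds (two samples each), for illustration only.
folds = [
    [[0.010, -0.004, 0.002], [-0.012, 0.005, -0.001]],
    [[0.009, -0.005, 0.001], [-0.011, 0.003, -0.003]],
    [[0.013, -0.004, 0.002], [-0.010, 0.006, -0.002]],
]
summary = aggregate_across_folds(folds)
ranking = sorted(summary, key=lambda name: summary[name][0], reverse=True)
print(summary, ranking)
```

The per-feature standard deviation across folds corresponds to the error bars described for Figure 8; a ranking that does not change from fold to fold is what the stability argument relies on.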

The results demonstrate a consistent feature importance ranking, with pH exhibiting the highest contribution to model predictions, followed by pCO₂ and TAN (M). The relatively small inter-fold variability observed for all features indicates that the SHAP attributions are stable with respect to data partitioning and are not driven by a specific cross-validation split. While SHAP values reflect model-based explanations rather than causal relationships, the observed stability supports the robustness of the reported interpretability findings.

To further investigate how individual input variables influence the model predictions, SHAP dependency plots were analyzed to capture both the direction and magnitude of feature effects, as well as potential interactions between variables.

The dependency plot for pCO₂ in Figure 9 shows a clear nonlinear relationship with its SHAP values. At low pCO₂ levels, SHAP values are predominantly positive, indicating that low gas-phase CO₂ concentrations contribute to an increase in predicted TVFA (M). As pCO₂ increases, the SHAP values progressively decrease and become negative, suggesting a transition where higher pCO₂ levels reduce the predicted TVFA (M). This behavior reflects a threshold-like response rather than a simple linear trend.

The dependency plot for TAN (M) reveals a predominantly monotonic increasing effect on the model output, as shown in Figure 10. Lower TAN concentrations are associated with negative SHAP values, implying a suppressive effect on predicted TVFA (M), while increasing TAN levels lead to progressively positive SHAP contributions. The smooth transition across the TAN range suggests that the model captures a consistent and interpretable relationship between ammonia concentration and TVFA (M) dynamics. Coloring by pH highlights that higher pH values amplify the positive contribution of TAN, pointing to a meaningful interaction between nitrogen balance and acid–base conditions.

The combined dependency analysis in Figure 11 further highlights interaction effects between pCO₂ and TAN, where similar pCO₂ values can result in markedly different SHAP contributions depending on TAN concentration. At higher pCO₂ levels, samples with elevated TAN exhibit less negative SHAP values compared to those with lower TAN, indicating a partial compensation effect. This interaction suggests that the model does not treat features independently but instead learns coupled biochemical relationships consistent with anaerobic digestion processes.

3.4. Error Stratification and Reliability Analysis Across TVFA (M) Operating Regimes

To assess the reliability of the virtual sensor under conditions relevant to instability prediction, model errors were analyzed separately for stable and overloaded operating regimes using tuned Random Forest models. As shown in Table 3, prediction errors are lower in the low-TVFA (M) steady-state regime, reflecting the limited variability of the target under stable operating conditions. Importantly, in the high-TVFA (M) regime corresponding to process overload, the model maintains bounded error levels with consistent performance across cross-validation folds. Random Forest was selected for this regime-specific analysis due to its robustness against overfitting through bootstrap aggregation (bagging) [34], which is particularly advantageous when the number of available training samples is reduced.

To further evaluate model reliability under different operating conditions, quantiles of the squared prediction error were computed separately for low and high TVFA (M) regimes, as shown in Table 4. In the low-TVFA (M) steady-state regime, squared errors remain uniformly small across all quantiles, reflecting stable operating conditions and limited target variability. In contrast, the high-TVFA (M) regime exhibits higher error levels, as expected under increased process variability and nonlinear dynamics. Importantly, error growth remains gradual and bounded, with the 90th and 95th percentiles indicating controlled worst-case behavior. These results demonstrate that the proposed virtual sensor maintains reliable predictive performance under high-TVFA (M) conditions.
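A regime-stratified quantile analysis of this kind can be sketched as follows. The threshold separating the low- and high-TVFA regimes, the choice of percentiles, and the data are hypothetical, since the text does not report the cut-off used; the function name is likewise an assumption.

```python
from statistics import quantiles


def stratified_squared_error_quantiles(y_true, y_pred, threshold, probs=(50, 75, 90, 95)):
    """Percentiles of the squared error for samples below and above `threshold`."""
    regimes = {"low": [], "high": []}
    for yt, yp in zip(y_true, y_pred):
        regimes["low" if yt < threshold else "high"].append((yt - yp) ** 2)
    result = {}
    for name, errors in regimes.items():
        # With n=100, cuts[p - 1] is the p-th percentile.
        cuts = quantiles(errors, n=100, method="inclusive")
        result[name] = {p: cuts[p - 1] for p in probs}
    return result


# Illustrative values in molar units (M); not taken from the study.
y_true = [0.010, 0.015, 0.020, 0.025, 0.030, 0.060, 0.070, 0.080, 0.090, 0.100]
y_pred = [0.011, 0.014, 0.021, 0.024, 0.031, 0.065, 0.064, 0.088, 0.081, 0.110]
result = stratified_squared_error_quantiles(y_true, y_pred, threshold=0.05)
for regime, values in result.items():
    print(regime, {p: f"{v:.2e}" for p, v in values.items()})
```

Reporting the upper percentiles alongside the median makes it possible to judge whether worst-case errors in the overloaded regime grow gradually or abruptly.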

4. Discussion

This study addresses the critical “measurement problem” in anaerobic digestion (AD) by validating a non-invasive virtual sensor for Total Volatile Fatty Acids (TVFA (M)). The primary novelty of this work lies in the systematic benchmarking of a modern tabular foundation model, RealTabPFN-v2.5, against traditional strong baselines (XGBoost, Random Forest) and specialized deep learning architectures (TorchMLP, RealMLP, FastaiMLP). While previous studies have employed standard machine learning to predict AD parameters, this research provides the first distinct evidence that in-context learning models (RealTabPFN) [18] can outperform both gradient boosting and traditional neural networks on small-scale engineering datasets. The RealTabPFN-v2.5 model achieved state-of-the-art performance, with an R² of 0.889 and an RMSE of 0.0079, outperforming the tuned XGBoost (R² = 0.83) and tuned Random Forest (R² = 0.83) baselines. This supports the hypothesis that foundation models, pre-trained on diverse tabular distributions, offer a distinct advantage over models requiring extensive training from scratch, particularly when experimental data is expensive and limited.

The results contribute to the ongoing debate on the suitability of deep learning (DL) methods for tabular data within process engineering applications. In line with recent literature, the tuned Random Forest and XGBoost models outperformed standard deep learning approaches such as RealMLP and FastaiMLP, although the tuned TorchMLP (R² = 0.847) remained competitive with both tree-based ensembles (R² = 0.83); this pattern is broadly consistent with the notion that deep learning is not all you need for tabular data [17,35]. Importantly, from an industrial applicability perspective, the data efficiency of the RealTabPFN-v2.5 model emerged as a key advantage. Learning curve analyses revealed that the model's error plateaued after observing only 50–60% of the available training data, indicating that effective virtual sensors can be developed with relatively short calibration campaigns. This substantially lowers the barrier to entry for digitalization efforts in biogas plant operations.

Mechanistic validation through explainable artificial intelligence (XAI) represents a critical step toward overcoming the black-box limitation that often hinders the adoption of AI models in industrial control systems [36]. By integrating SHAP analysis, this study shows that the developed model captures relationships consistent with known process mechanisms, although SHAP attributions describe model behavior and do not by themselves establish causality. The analysis identifies pH as the dominant predictor of total volatile fatty acid (TVFA (M)) accumulation, which is consistent with established anaerobic digestion (AD) microbiology, where pH functions both as a governing variable for enzymatic activity and as a response variable to acid accumulation [37]. Furthermore, the model represents non-linear gas-phase dynamics, particularly the influence of the partial pressure of carbon dioxide (pCO₂). SHAP dependency plots reveal a threshold behavior in which low pCO₂ contributes positively to TVFA (M) prediction, while this effect reverses at higher concentrations, which may reflect shifts in the carbonate system equilibrium during process instability. This capability underscores the advantage of machine learning approaches over linear regression models in capturing such non-linear phenomena. Most notably, the SHAP dependency analysis suggests that the model learned the physicochemical coupling between total ammoniacal nitrogen (TAN) and pH. Specifically, higher pH values amplify the influence of TAN on the model output, in line with the chemical equilibrium shift toward free ammonia (NH₃). These findings support the view that the model’s predictions are consistent with established process mechanisms, enhancing its interpretability and trustworthiness for industrial deployment.
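The pH dependence of the free ammonia fraction referred to above follows from the NH₄⁺/NH₃ acid–base equilibrium and can be illustrated with a short calculation. The function name is hypothetical, and the pKa of 9.25 is the approximate value at 25 °C; it decreases with temperature, so the appropriate value for a given digester is an assumption to be checked.

```python
def free_ammonia_fraction(ph, pka=9.25):
    """Fraction of TAN present as NH3 from the NH4+/NH3 acid-base equilibrium.

    pka=9.25 is the approximate value at 25 degrees C (assumed here); the
    value is lower at typical mesophilic or thermophilic temperatures.
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


for ph in (7.0, 7.5, 8.0):
    print(f"pH {ph:.1f}: NH3 fraction = {free_ammonia_fraction(ph):.4f}")
```

Under this assumption, the free ammonia share rises from roughly 0.6% of TAN at pH 7 to about 5% at pH 8, which illustrates why the same TAN concentration can carry a different weight at higher pH.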

Despite the promising results, several important limitations must be acknowledged to appropriately contextualize the findings and delimit the scope of the conclusions. The proposed models were developed and validated using a single, high-quality dataset obtained from controlled laboratory-scale reactors operated under a specific set of feedstock and operational conditions. No external validation was performed using independent datasets from different feedstocks, reactor configurations, scales, or operating regimes. As a result, the present study does not claim immediate general applicability across anaerobic digestion systems broadly, but rather demonstrates a proof-of-concept for data-driven TVFA (M) soft sensing under well-defined laboratory conditions; the generalizability of the proposed models to different feedstocks, reactor scales, and operational regimes has not yet been established.

In full-scale industrial biogas plants, substantially higher variability is introduced by heterogeneous substrates, fluctuating loading rates, seasonal temperature effects, process disturbances, and increased sensor noise. These factors pose a well-known challenge for model transferability, and the robustness observed in this study should therefore be interpreted as conditional on the underlying data distribution. Transitioning from TRL 4 (laboratory validation) toward industrial deployment will require systematic retraining and external validation using multi-site, full-scale SCADA datasets to evaluate domain shift effects and ensure reliable generalization. Regarding the validation protocol, we acknowledge that while temporal dependencies are inherent in anaerobic digestion, this study evaluates the model’s capacity for instantaneous mapping of physicochemical states to TVFA (M) concentrations rather than time-series forecasting. Furthermore, we recognize the potential for reactor-specific leakage due to the long-format data structure; however, this approach was prioritized to maximize the diversity of metabolic states available for training. By providing an early-warning proxy for TVFA (M), the model serves as a decision-support tool that enables operators to implement proactive interventions—such as reducing organic loading rates or adjusting hydraulic retention times—before the system reaches a state of irreversible acidification. This capability translates the prediction metrics into actionable operational value by potentially preventing costly process failures and ensuring stable methane yields. While we acknowledge that certain long-chain fatty acids may fluctuate during severe instability, the available dataset focuses on the core VFA profile which provides sufficient signal for the predictive models. Therefore, the study prioritizes the existing data features that consistently reflect the digestion trend over attempting to infer missing VFA fractions. 
Moreover, although the study demonstrates the effectiveness of state-of-the-art models under limited data availability, the dataset represents a narrow operational window corresponding to approximately eight days of intensive monitoring. Consequently, the model’s ability to generalize across long-term seasonal dynamics, gradual microbial community adaptation, or fundamentally different substrate compositions remains untested. Addressing these limitations will require future studies that explicitly incorporate cross-plant validation, transfer learning strategies, and domain adaptation techniques. Such efforts are essential to rigorously assess model robustness and to advance machine-learning-based virtual sensing from controlled laboratory environments toward real-world anaerobic digestion applications.
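One way in which future validation could guard against the reactor-specific leakage acknowledged above is to keep all samples from a given reactor in the same fold. The sketch below is a minimal illustration of such a grouped split and is not the protocol used in this study; the reactor labels and the helper `grouped_folds` are hypothetical.

```python
from collections import defaultdict


def grouped_folds(group_ids, n_folds):
    """Assign whole groups (e.g. reactors) to folds so that no group is split.

    Groups are distributed greedily, largest first, to the currently smallest
    fold. Returns a list of (train_indices, val_indices) pairs.
    """
    members = defaultdict(list)
    for idx, group in enumerate(group_ids):
        members[group].append(idx)
    folds = [[] for _ in range(n_folds)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[group])
    all_idx = set(range(len(group_ids)))
    return [(sorted(all_idx - set(val)), sorted(val)) for val in folds]


# Hypothetical reactor labels for ten samples.
reactors = ["R1", "R1", "R1", "R2", "R2", "R3", "R3", "R3", "R4", "R4"]
splits = grouped_folds(reactors, n_folds=2)
for train_idx, val_idx in splits:
    train_groups = {reactors[i] for i in train_idx}
    val_groups = {reactors[i] for i in val_idx}
    print(sorted(val_groups), "overlap:", train_groups & val_groups)
```

Because every reactor appears on only one side of each split, the validation error then measures transfer to unseen reactors rather than interpolation within a reactor.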

Future research will focus on several key areas to enhance the robustness and practical applicability of the proposed TVFA (M) virtual sensor. First, the dataset will be expanded to include longer operational cycles and diverse feedstock compositions, allowing for a more rigorous assessment of model generalizability across different reactor scales. Second, to address the hydraulic and thermal inertia inherent in anaerobic digestion, future modeling efforts will incorporate time-series feature engineering and lagged variables (e.g., t-1, t-n). The integration of advanced signal processing techniques, such as Savitzky–Golay filtering, will also be explored to mitigate sensor noise while preserving critical process trends. Finally, we aim to transition from offline validation toward real-time implementation by integrating the optimized models into industrial SCADA and IoT-based monitoring frameworks. This will enable proactive decision-making and early-warning alerts, further optimizing process stability and methane yields in full-scale biogas plants.
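The lagged-variable construction and Savitzky–Golay smoothing mentioned above can be sketched in a few lines. The lag choices, the window length of five with a quadratic fit, and the readings are illustrative assumptions; the helper names are hypothetical, and a production implementation would more likely rely on an established signal processing library.

```python
def add_lags(series, lags):
    """Build rows [x_t, x_{t-lag1}, x_{t-lag2}, ...] for every t with a full history."""
    start = max(lags)
    return [[series[t]] + [series[t - lag] for lag in lags] for t in range(start, len(series))]


# 5-point quadratic Savitzky-Golay smoothing coefficients (window 5, order 2).
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(series):
    """Smooth interior points with the 5-point filter; edge points are left unchanged."""
    out = list(series)
    for t in range(2, len(series) - 2):
        out[t] = sum(c * series[t + k - 2] for k, c in enumerate(SG5))
    return out


ph = [7.20, 7.18, 7.21, 7.15, 7.10, 7.12, 7.05, 7.01]  # illustrative readings
rows = add_lags(ph, lags=(1, 3))
smoothed = savgol5(ph)
print(rows[0], smoothed)
```

Because the filter fits a local quadratic, it reproduces any exactly quadratic trend at interior points while attenuating point-to-point noise, which is the property that makes it attractive for preserving process trends.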

5. Conclusions

This research demonstrates the effectiveness of state-of-the-art machine learning models, in particular a tabular foundation model, for predicting Total Volatile Fatty Acids (TVFA (M)) in anaerobic digestion processes. By utilizing easily measurable parameters such as pH, pCO₂, and TAN, we developed a high-precision soft sensor that effectively bypasses the limitations of traditional, time-consuming laboratory analyses. A key highlight of this study is the comprehensive benchmarking against various SOTA machine learning and deep learning models, in which RealTabPFN-v2.5, a transformer-based neural network, outperformed other strong approaches such as TorchMLP, Random Forest, and XGBoost. The RealTabPFN-v2.5 model achieved the highest R² of 0.889 and the lowest RMSE of 0.0079, indicating its robustness in handling complex, non-linear biological data. Furthermore, the application of SHAP analysis provided essential interpretability, identifying pH as the primary predictor of TVFA (M) levels and indicating that the model’s logic aligns with the fundamental biochemical principles of anaerobic digestion. These findings suggest that SOTA tabular models can provide a reliable basis for the intelligent monitoring and automated stabilization of biogas production. Future work could integrate these tools into real-time industrial control systems to enhance the operational and economic efficiency of renewable energy generation. The results emphasize that advanced computational frameworks can bridge the gap between complex biochemical monitoring and practical industrial application, offering a path toward more resilient bioenergy systems.
By establishing a rigorous comparison between established machine learning techniques and cutting-edge neural architectures, this work provides a clear roadmap for researchers aiming to deploy artificial intelligence in sustainable waste-to-energy technologies. Ultimately, the transition from manual sampling to automated, neural-network-driven diagnostics represents a transformative step in the modernization of anaerobic digestion facilities, ensuring higher methane yields and reduced risk of process failure across various industrial contexts.

Author Contributions

Conceptualization, B.A., Z.B., N.T., A.B. and B.I.; methodology, B.I., M.M., S.Z., W.W., M.K. and A.K.; software, B.A., Z.B., N.T., A.B. and M.M.; validation, B.I., S.Z., W.W. and M.K.; formal analysis, B.A., Z.B., N.T., A.B., B.I. and A.K.; investigation, S.Z., W.W., M.K. and A.K.; resources, B.A., Z.B., N.T., A.B. and M.M.; data curation, Z.B., A.B., M.M. and A.K.; writing—original draft preparation, B.A., Z.B., N.T., A.B., M.M. and A.K.; writing—review and editing, B.I., M.M., S.Z., W.W. and M.K.; visualization, N.T., M.M. and A.K.; supervision, B.A., Z.B., N.T. and A.B.; project administration, B.I., S.Z., W.W. and M.K.; funding acquisition, B.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. AP26103739).

Data Availability Statement

The dataset supporting the findings of this study is publicly available in the University of Southampton ePrints repository: https://eprints.soton.ac.uk/472916/ (accessed on 15 September 2025).

Conflicts of Interest

Authors Zhanel Baigarayeva, Assiya Boltaboyeva and Marlen Maulenbekov were employed by the company LLP “Kazakhstan R&D Solutions”. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

TVFA (M)	Total Volatile Fatty Acids
AD	Anaerobic Digestion
pH	potential of Hydrogen
pCO₂	partial pressure of carbon dioxide
CO₂	carbon dioxide
TAN	Total Ammoniacal Nitrogen
SOTA	State-of-the-Art
RMSE	Root Mean Squared Error
MSE	Mean Squared Error
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
R²	coefficient of determination
SHAP	SHapley Additive exPlanations
XAI	eXplainable Artificial Intelligence
ANN	Artificial Neural Network(s)
SVM	Support Vector Machine(s)
ML	Machine Learning
IoT	Internet of Things
SCADA	Supervisory Control and Data Acquisition
TRL	Technology Readiness Level
MLR	Multiple Linear Regression
NH₃	ammonia
NH₄⁺	ammonium

References

Sevillano, C.A.; Pesantes, A.A.; Peña Carpio, E.; Martínez, E.J.; Gómez, X. Anaerobic digestion for producing renewable energy: The evolution of this technology in a new uncertain scenario. Entropy 2021, 23, 145. [Google Scholar] [CrossRef] [PubMed]
Archana, K.; Visckram, A.; Kumar, P.S.; Manikandan, S.; Saravanan, A.; Natrayan, L. A review on recent technological breakthroughs in anaerobic digestion of organic biowaste for biogas generation: Challenges towards sustainable development goals. Fuel 2023, 358, 130298. [Google Scholar] [CrossRef]
Wu, D.; Li, L.; Zhao, X.; Peng, Y.; Yang, P.; Peng, X. Anaerobic digestion: A review on process monitoring. Renew. Sustain. Energy Rev. 2019, 103, 1–12. [Google Scholar] [CrossRef]
Rudakiya, D.; Narra, M. Microbial community dynamics in anaerobic digesters for biogas production. In Microbial Rejuvenation of Polluted Environment; Springer: Singapore, 2021; pp. 143–159. [Google Scholar] [CrossRef]
Beschkov, V.N.; Angelov, I.K. Volatile Fatty Acid Production vs. Methane and Hydrogen in Anaerobic Digestion. Fermentation 2025, 11, 172. [Google Scholar] [CrossRef]
Lee, D.J.; Teng, K.H.; Show, K.Y.; Chang, J.S. Effect of volatile fatty acid concentration on anaerobic degradation rate of food waste. J. Environ. Sci. Health A 2015, 50, 1253–1258. [Google Scholar] [CrossRef]
Cruz, I.A.; Chomiak, M.; Cavaleiro, A.J.; Pereira, M.A.; Alves, M.M. An overview of process monitoring for anaerobic digestion. Bioresour. Technol. 2021, 337, 125402. [Google Scholar] [CrossRef]
Kazemi, P.; Steyer, J.-P.; Bengoa, C.; Font, J.; Giralt, J. Robust data-driven soft sensors for online monitoring of volatile fatty acids in anaerobic digestion processes. Processes 2020, 8, 67. [Google Scholar] [CrossRef]
Wang, X.; Rashid, I.; Zhao, Z.; Oladele, M.; Xiang, W.; Huang, Y.; Wazer, E.; McCutcheon, J.; Bollas, G.; Contreras, J.; et al. Machine learning algorithm integrated with real-time in situ sensors and physiochemical principle-driven soft sensors toward an anaerobic digestion data fusion framework. ACS ES&T Water 2023, 3, 1061–1072. [Google Scholar] [CrossRef]
Choi, S.; Kim, S.I.; Yulisa, A.; Aghasa, A.; Hwang, S. Proactive prediction of total volatile fatty acids concentration in multiple full-scale food waste anaerobic digestion systems using substrate characteristics with machine learning and feature analysis. Waste Biomass Valor. 2023, 14, 593–608. [Google Scholar] [CrossRef]
Mahmoodi-Eshkaftaki, M.; Mockaitis, G.; Rafiee, M.R. Dynamic optimization of volatile fatty acids to enrich biohydrogen production using a deep learning neural network. Biomass Conv. Bioref. 2024, 14, 8003–8014. [Google Scholar] [CrossRef]
Kim, H.G.; Yu, S.I.; Shin, S.G.; Cho, K.H. Graph-based deep learning for predictions on changes in microbiomes and biogas production in anaerobic digestion systems. Water Res. 2025, 274, 123144. [Google Scholar] [CrossRef] [PubMed]
Zou, J.; Lü, F.; Chen, L.; Zhang, H.; He, P. Machine learning for enhancing prediction of biogas production and building a VFA/ALK soft sensor in full-scale dry anaerobic digestion of kitchen food waste. J. Environ. Manag. 2024, 371, 123190. [Google Scholar] [CrossRef] [PubMed]
Rutland, H.; You, J.; Liu, H.; Bull, L.; Reynolds, D. A Systematic Review of Machine-Learning Solutions in Anaerobic Digestion. Bioengineering 2023, 10, 1410. [Google Scholar] [CrossRef] [PubMed]
Behera, S.R.; Balasundaram, G. Artificial intelligence in anaerobic digestion: A review of sensors, soft sensors, and machine learning applications. Bioresour. Technol. 2025, 425, 131850. [Google Scholar]
He, L.; Niu, M.; Tiwari, P.; Marttinen, P.; Su, R.; Jiang, J.; Guo, C.; Wang, H.; Ding, S.; Wang, Z.; et al. Deep learning for depression recognition with audiovisual cues: A review. Inf. Fusion 2022, 80, 56–86. [Google Scholar] [CrossRef]
Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 2022, 81, 84–90. [Google Scholar] [CrossRef]
Hollmann, N.; Müller, S.; Eggensperger, K.; Hutter, F. TABPFN: A transformer that solves small tabular classification problems in a second. arXiv 2022, arXiv:2207.01848. [Google Scholar] [CrossRef]
Hollmann, N.; Müller, S.; Purucker, L.; Krishnakumar, A.; Körfer, M.; Hoo, S.B.; Schirrmeister, R.T.; Hutter, F. Accurate predictions on small data with a tabular foundation model. Nature 2025, 637, 319–326.
Ruiz, L.M.; Fernández, M.; Genaro, A.; Martín-Pascual, J.; Zamorano, M. Multi-parametric analysis based on physico-chemical characterization and biochemical methane potential estimation for the selection of industrial wastes as co-substrates in anaerobic digestion. Energies 2023, 16, 5444.
Zhang, W. The Impact of Nitrogen Control Strategies and of Biopackaging Degradation on the Implementation of the Anaerobic Digestion of Selected MSW Fractions. Ph.D. Thesis, University of Southampton, Southampton, UK, 2019.
Rossi, A.; Morlino, M.S.; Gaspari, M.; Campanaro, S.; Basile, A.; Kougias, P.; Treu, L. Analysis of the anaerobic digestion metagenome under environmental stresses stimulating prophage induction. Microbiome 2022, 10, 125.
Ganeshan, P.; Bose, A.; Lee, J.; Barathi, S.; Rajendran, K. Machine learning for high solid anaerobic digestion: Performance prediction and optimization. Bioresour. Technol. 2024, 400, 130665.
Yildirim, O.; Ozkaya, B. Prediction of biogas production of industrial-scale anaerobic digestion plant by machine learning algorithms. Chemosphere 2023, 335, 138976.
Zhang, Y.; Heaven, S.; Banks, C.J. Validation of two theoretically derived equations for predicting pH in CO₂ biomethanisation. Processes 2023, 11, 113.
Madsen, M.; Holm-Nielsen, J.B.; Esbensen, K.H. Monitoring of anaerobic digestion processes: A review perspective. Renew. Sustain. Energy Rev. 2011, 15, 3141–3155.
Boe, K.; Batstone, D.J.; Steyer, J.-P.; Angelidaki, I. State indicators for monitoring the anaerobic digestion process. Water Res. 2010, 44, 5973–5980.
Yang, J.; Zhang, J.; Du, X.; Gao, T.; Cheng, Z.; Fu, W.; Wang, S. Ammonia inhibition in anaerobic digestion of organic waste: A review. Int. J. Environ. Sci. Technol. 2025, 22, 3927–3942.
Ahring, B.K.; Sandberg, M.; Angelidaki, I. Volatile fatty acids as indicators of process imbalance in anaerobic digestors. Appl. Microbiol. Biotechnol. 1995, 43, 559–565.
Stekhoven, D.J.; Bühlmann, P. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118.
Liu, L.; Chen, X.; Petinrin, O.O.; Zhang, W.; Rahaman, S.; Tang, Z.-R.; Wong, K.-C. Machine learning protocols in early cancer detection based on liquid biopsy: A survey. Life 2021, 11, 638.
Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
Givisis, I.; Kalatzis, D.; Christakis, C.; Kiouvrekis, Y. Comparing explainable AI models: SHAP, LIME, and their role in electric field strength prediction over urban areas. Electronics 2025, 14, 4766.
Yang, Y.; Wang, H. Random forest-based machine failure prediction: A performance comparison. Appl. Sci. 2025, 15, 8841.
Grinsztajn, Y.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? arXiv 2022, arXiv:2207.08815.
Cação, J.; Santos, J.; Antunes, M. Explainable AI for industrial fault diagnosis: A systematic review. J. Ind. Inf. Integr. 2025, 47, 100905.
López-Trujillo, J.; Mellado-Bosque, M.; Ascacio-Valdés, J.A.; Prado-Barragán, L.A.; Hernández-Herrera, J.A.; Aguilera-Carbó, A.F. Temperature and pH optimization for protease production fermented by Yarrowia lipolytica from agro-industrial waste. Fermentation 2023, 9, 819.

Figure 1. Architecture of the multi-model benchmarking pipeline employed for robust TVFA (M) estimation.

Figure 2. Mutual information between the input features and the target (TVFA (M)).

Figure 3. Learning curves of the RealTabPFN-v2.5 model.

Figure 4. Calibration curve of the RealTabPFN-v2.5 model for TVFA (M).

Figure 5. Distribution of prediction residuals for the TVFA (M) RealTabPFN-v2.5 model.

Figure 6. Predicted versus observed TVFA (M) values for the RealTabPFN-v2.5 model.

Figure 7. SHAP Feature Importance Summary.

Figure 8. SHAP stability analysis across cross-validation folds.

Figure 9. SHAP dependence of pCO₂ colored by pH.

Figure 10. SHAP dependence of TAN (M) colored by pH.

Figure 11. SHAP dependence of pCO₂ colored by TAN (M).

Table 1. Statistical characteristics of the monitored anaerobic digestion parameters.

Parameter	Mean ± SD
pCO₂	0.358 ± 0.009
TAN (M)	0.1273 ± 0.0005
pH	7.498 ± 0.031
TVFA (M)	0.0079 ± 0.0053

Table 2. Comparative Performance Metrics of TVFA (M) Prediction Models.

Model	R²	MSE (M²)	RMSE (M)	MAE (M)
RealTabPFN-v2.5	0.889008 ± 0.015600	0.000063 ± 0.000011	0.007924 ± 0.000723	0.005586 ± 0.000590
TorchMLP (tuned)	0.846959 ± 0.024775	0.000087 ± 0.000018	0.009292 ± 0.000976	0.006724 ± 0.000519
Random Forest (tuned)	0.828587 ± 0.034933	0.000096 ± 0.000011	0.009781 ± 0.000566	0.006716 ± 0.000586
XGBoost (tuned)	0.827921 ± 0.034601	0.000096 ± 0.000013	0.009805 ± 0.000635	0.006707 ± 0.000464
Random Forest (default)	0.811978 ± 0.045228	0.000105 ± 0.000011	0.010212 ± 0.000555	0.006806 ± 0.000467
FastaiMLP (tuned)	0.809788 ± 0.041055	0.000106 ± 0.000012	0.010296 ± 0.000604	0.007942 ± 0.000434
XGBoost (default)	0.796295 ± 0.042203	0.000114 ± 0.000010	0.010655 ± 0.000464	0.007269 ± 0.000548
RealMLP (tuned)	0.779525 ± 0.037754	0.000125 ± 0.000028	0.011148 ± 0.001221	0.007952 ± 0.001003
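
The four metric columns in Table 2 follow standard definitions. The sketch below shows one way such values could be computed for a single validation fold and then summarized as mean ± SD, assuming the ± values denote spread across cross-validation folds. The function names, the toy numbers, and the use of the sample standard deviation are illustrative assumptions, not the authors' code.

```python
import math
from statistics import mean, stdev


def regression_metrics(y_true, y_pred):
    """Return R², MSE, RMSE and MAE for one set of predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,                                  # units: M²
        "RMSE": math.sqrt(mse),                      # units: M
        "MAE": sum(abs(r) for r in residuals) / n,   # units: M
    }


def summarize_folds(fold_metrics):
    """Aggregate per-fold metric dicts into (mean, sample SD) per metric."""
    summary = {}
    for name in fold_metrics[0]:
        values = [m[name] for m in fold_metrics]
        summary[name] = (mean(values), stdev(values))
    return summary


if __name__ == "__main__":
    # Hypothetical TVFA (M) values for two folds; not data from the study.
    folds = [
        ([0.004, 0.006, 0.010, 0.020], [0.005, 0.006, 0.009, 0.018]),
        ([0.003, 0.007, 0.012, 0.018], [0.004, 0.006, 0.013, 0.016]),
    ]
    per_fold = [regression_metrics(y, y_hat) for y, y_hat in folds]
    for name, (avg, sd) in summarize_folds(per_fold).items():
        print(f"{name}: {avg:.6f} ± {sd:.6f}")
```

Because RMSE is the square root of MSE within each fold, the mean RMSE across folds is generally close to, but not exactly equal to, the square root of the mean MSE.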

Table 3. Error Analysis by Operating Regime.

Operating Regime	MSE (M²)	RMSE (M)
Stable	0.000014 ± 0.000001	0.003798 ± 0.000170
Overload	0.000107 ± 0.000057	0.010033 ± 0.002876

Table 4. Quantiles of the squared prediction error (M²) by operating regime.

Operating Regime	0.50	0.75	0.90	0.95
Stable	0.000036	0.000128	0.000413	0.000791
Overload	0.000012	0.000043	0.000114	0.000227
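
The regime-level statistics in Tables 3 and 4 can be illustrated with a short Python sketch that splits squared errors by operating regime and reports MSE, RMSE, and selected quantiles. The regime threshold, the rule of assigning regimes by observed TVFA, the function names, and the toy values are all hypothetical; the quantile interpolation method is also an assumption and may differ from the one used in the study.

```python
import math
from statistics import quantiles


def stratify_errors(y_true, y_pred, threshold):
    """Group squared errors by regime.

    Assumption: samples with observed TVFA below `threshold` count as
    'Stable', all others as 'Overload'.
    """
    groups = {"Stable": [], "Overload": []}
    for t, p in zip(y_true, y_pred):
        regime = "Stable" if t < threshold else "Overload"
        groups[regime].append((t - p) ** 2)
    return groups


def regime_summary(sq_errors, levels=(0.50, 0.75, 0.90, 0.95)):
    """Return MSE, RMSE and squared-error quantiles for one regime."""
    mse = sum(sq_errors) / len(sq_errors)
    # n=100 yields the 99 percentile cut points; index k-1 is the k-th percentile.
    cuts = quantiles(sq_errors, n=100, method="inclusive")
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "quantiles": {lvl: cuts[round(lvl * 100) - 1] for lvl in levels},
    }


if __name__ == "__main__":
    # Hypothetical observations and predictions in M; not data from the study.
    y_true = [0.002, 0.003, 0.004, 0.005, 0.015, 0.020, 0.025, 0.030]
    y_pred = [0.003, 0.003, 0.005, 0.004, 0.012, 0.024, 0.021, 0.036]
    groups = stratify_errors(y_true, y_pred, threshold=0.010)  # hypothetical cut-off
    for regime, errors in groups.items():
        print(regime, regime_summary(errors))
```

Reporting quantiles alongside the mean helps show whether a regime's MSE is driven by a few large errors or by a uniformly wider error distribution.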

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Amangeldy, B.; Baigarayeva, Z.; Tasmurzayev, N.; Boltaboyeva, A.; Imanbek, B.; Maulenbekov, M.; Zhussupbekov, S.; Wojcik, W.; Kozhamberdieva, M.; Konysbekova, A. Benchmarking Tabular Foundation Models for Total Volatile Fatty Acid Prediction in Anaerobic Digestion. Algorithms 2026, 19, 127. https://doi.org/10.3390/a19020127

