Shortening the Standard Testing Time for Residual Biogas Potential (RBP) Tests Using Biogas Yield Models and Substrate Physicochemical Characteristics

: The residual biogas potential (RBP) test is a procedure to ensure the anaerobic digestion process performance and digestate stability. Standard protocols for RBP require a signiﬁcant time for sample preparation, characterisation and testing of the rig setup followed by batch experiments of a minimum of 28 days. To reduce the experimental time to obtain the RBP result, four biogas kinetic models were evaluated for their strength of ﬁt for biogas production data from RBP tests. It was found that the pseudo-parallel ﬁrst-order model and the ﬁrst-order autoregressive (AR (1)) model provide a high strength of ﬁt and can predict the RBP result with good accuracy (absolute percentage errors < 10%) using experimental biogas production data of 15 days. Multivariate regression with decision trees (DTs) was adopted in this study to predict model parameters for the AR (1) model from substrate physicochemical parameters. The mean absolute percentage error (MAPE) of the predicted AR (1) model coefﬁcients, the constants and the RBP test results at day 28 across DTs with 20 training set samples are 4.76%, 72.04% and 52.13%, respectively. Using ﬁve additional data points to perform the leave-one-out cross-validation method, the MAPEs decreased to 4.31%, 59.29% and 45.62%. This indicates that the prediction accuracy of DTs can be further improved with a larger training dataset. A Gaussian Process Regressor was guided by the DT-predicted AR (1) model to provide probability distribution information for the biogas yield prediction.


Introduction
Digestate is a by-product from the anaerobic digestion (AD) process. Due to its high nutrient value, it can be used as soil improver or fertiliser if the digestate is proved to be valorised and can meet relevant quality standards [1]. The digestate stability can be evaluated with a residual biogas potential (RBP) test. The test typically is required to be carried out under mesophilic conditions for at least 28 days with an appropriate inoculumto-substrate ratio and micro-and macronutrients supplemented to avoid the inhibition of biogas production [2]. The digestate is considered to have consistent quality if the RBP test biogas yield is below 0.25 L/g volatile solids (VS), as recommended in the Publicly Available Specification 110 (PAS110), which is a key element of the UK Government's anaerobic digestion quality protocol [3].
The 28-day continuous monitoring of biogas production in an RBP test is timeconsuming and onerous for commercial AD operators. This limits the adoption of RBP tests and regulated markets for digestate. There have been many attempts to find alternative approaches to RBP that offer rapid tests result. These include assessing acid production after the inhibition of methanogenesis [4][5][6] and assessing the digestion of the organic fraction of the digestate after separation of the microbial cell component [7]. Nevertheless, both approaches are of great complexity and further research is needed [7].
Additionally, some researchers have attempted to relate the RBP test results to digestate physicochemical characteristics. For example, the theoretical biogas potential calculated based on the stoichiometric methane conversion from volatile fatty acid (VFA) concentrations and the soluble chemical oxygen demand (sCOD) in the digestate sample were suggested in PAS110 as preliminary pass/fail indicators for the RBP tests [8]. Although these preliminary indicators provide useful information about digestate stability, they cannot be correlated with RBP values or provide reaction kinetics that can be used to reveal further digestion performance information including inhibition. In a previous work reviewing the application of the RBP test for PAS110 [9], correlations between RBP values and various characteristics, including VFA and sCOD, were investigated. Although some low to moderate levels of correlations were found for total VFA, total solids and volatile solids, which account for 40%, 36% and 29% of variation in RBP values, respectively, none of the indicators are sufficiently reliable to predict the RBP values accurately [9].
Other researchers have evaluated using empirical biogas production models, including first-order kinetic and Gompertz models, and experimental biogas yield data from the initial stage of the RBP tests to fit specific accumulative biogas production data from RBP tests [10][11][12]. This has led to a promising experimental and modelling 'hybrid' approach using experimental data collected from a shorter RBP duration (3-7 days instead of 28 days) to calculate model parameters, and then predict the ultimate biogas production. However, the accuracy of prediction is not sufficient to warrant the replacement of RBP with this hybrid approach [12]; therefore, further improvement of the modelling process is required.
In this research, we evaluated the strength of fit for four biogas yield kinetic models including first-order kinetic, modified Gompertz, pseudo-parallel first-order kinetic and autoregressive (AR) time-series to describe the RBP test biogas production process. The models are then calibrated using experimental data collected from shorter RBP tests (5, 10, 15, 20 and 25 days) to calculate model parameters that are then used to predict ultimate biogas production.
In a previous work [9], although using conventional statistical methods, no significant correlations were found between key physicochemical parameters of digestate samples with RBP results. Due to the potential interplay of these parameters, which can influence the RBP results and reaction kinetics, the correlations may be deeply hidden.
Machine learning techniques including multivariate nonlinear regression analysis with decision trees (DT) were applied to predict the parameters of the biogas production model from the physicochemical characteristics of digestate samples. Compared with other multivariate nonlinear regression methods, the DT method is particularly suitable for a training dataset with limited sample size in this study [13]. The uncertainties of the predicted biogas yield were then assessed using a Gaussian process regressor (GPR).
The data processing framework described in this work can potentially have a wider application for other complex biochemical processes that are influenced by multiple physicochemical parameters of the reaction system.

Digestate Samples and RBP Test
The sampling point of the 25 digestate samples was the outlet of the final tank from which the biogas was collected. The AD plants involved in the study are anonymised and coded as ADP1-25. The inoculum was from the anaerobic digester at Millbrook Wastewater (WW) Treatment Plant at Southampton, UK. The RBP test followed the standard procedure described in the PAS110 [3]. Samples were tested in triplicate against two positive controls and three inoculum-only controls.

Analytical Methods
Twenty physicochemical characteristics were analysed for each digestate, including VFAs (total VFA and acetate), total ammoniacal nitrogen (TAN), total Kjeldahl nitrogen (TKN), alkalinity (total alkalinity (TA), partial alkalinity (PA) and intermediate alkalinity (IA)), TS, VS, pH, COD (total COD and sCOD), trace element (TE) concentrations (cobalt (Co), iron (Fe), molybdenum (Mo) and nickel (Ni)), calorific value (CV) and elemental compositions (C, H, N). TAN/TKN and IA/PA were calculated as two extra metrics. TAN/TKN represents the relative contents of ammonia nitrogen and organically bonded nitrogen and thus how ready the substrate is for microorganism degradation [14]. IA/PA is an indicator of VFA accumulation in the AD process. Ripley et al. (1986) [15] suggest IA/PA < 0.3 indicates a stable state of the anaerobic process.
Determination of VFAs is based on the SCA (1979) [16] method Determination of Volatile Fatty Acids in Sewage Sludge. Supernatant layer from digestate centrifugation with 10% formic acid were quantified in a Shimadzu GC-2010 gas chromatograph with a flame ionisation detector and a capillary column type SGE BP-21. TAN and TKN were determined using a Kjeltech digestion block and steam distillation unit, according to the manufacturer's instructions (Foss Ltd., Warrington, UK). Alkalinity was measured by titration with 0.25 N H 2 SO 4 to endpoints of pH 5.75 and 4.3 in order to allow calculation of TA, PA and IA [15]. TS and VS were determined with Standard Method 2540 G (APHA, 2005). Total COD and sCOD were analysed by adapting the closed reflux titrimetric method of 5220C, APHA [17]. TE concentrations were determined using ICP-MS or ICP-OES at a UKAS-accredited commercial laboratory (Severn Trent Services, Coventry, UK) after in-house hydrochloric-nitric acid digestion [18]. CV was measured with a CAL2k-ECO bomb calorimeter (CAL2k, Digital Data Systems, Gauteng, South Africa). Elemental C, H, and N analysis was performed using a Flash EA-1112 elemental analyser (Thermo Finnigan, Cheshire, UK).

Assessing the Strength of Fit for Biogas Production Kinetic Models
Four biogas-production kinetic models commonly used to estimate the kinetic constants of the AD process were compared for their abilities to fit the biogas production of RBP tests. These models include three empirical models (first-order kinetic, modified Gomperz and pseudo-parallel first-order models) and a time-series model (first-order autoregressive). The strength of a kinetic model to accurately fit the experimental biogas production data was measured using R 2 values, which indicate the percentage of the variance in the responses explained by a model.
Additionally, these models were used to predict the 28-day RBP test results by fitting the models with initial 5, 10, 15, 20 and 25 days' experimental data using the Matlab R2021b Curve Fitting Toolbox. Based on the absolute percentage error (APE) between the experimental data and model-predicted results, the accuracy of prediction and duration of experimental data required to obtain a sufficiently accurate prediction were investigated for the following four models: (1) First-order model (FO): The FO model (Equation (1)) is derived from the assumptions that the substrate degradation is a first-order reaction with hydrolysis as the speedlimiting step and the cumulative biogas yield is proportional to the amount of substrate degraded (Equation (2)) [19,20].
where y(t) is the cumulative biogas yield at time t, k is the first-order rate constant, y m is the maximum cumulative gas production, c is the concentrate of the substrate and c 0 is the initial substrate concentration.
(2) Modified Gomperz model (MG): The modified Gomperz model (Equation (3)) is derived from the Gomperz model, which is used to describe the microbial activity and has a signature sigmoid shape [21,22]. It describes the biogas production in terms of the exponential growth rates and lag phase duration of anaerobic degradation microorganisms [11].
where y(t) is the cumulative gas production at time t, y m is the maximum cumulative gas production (mL CH 4 /gVS), R is the maximum gas production rate (mL CH 4 /gVS/d) and λ is the lag phase period or minimum time to produce biogas (days).
(3) Pseudo-parallel first-order model (PP): The pseudo-parallel first-order model (Equation (4)) is considered to be more suitable for describing the biogas yield of mixtures of substrates with different kinetic rates (rapid and slow) [23].
where y(t) is the cumulative gas production at time t, y m is the maximum cumulative gas production (mL CH 4 /g VS), P is the the proportion of the readily degradable material, k 1 is the first-order rate constant for readily degradable material, and k 2 is the first-order rate constant for less readily degradable material.
(4) First-order autoregressive model (AR (1)): AR (1) is a time-series model that predicts the present timestep based on the observations from previous timesteps. The autocorrelation function (ACF) (autocorrelation between timesteps) plots for all the RBP test biogas yield samples gradually trail off (Figure 1a, using ADP20 as an example). Therefore, the biogas production process is an AR process. Many time-series models that essentially model the randomness of the time series data need the trends in the data to be removed, in other words, to ensure the stationarity of the data. However, the application of the AR model does not intrinsically require transforming the data into stationary data. Thus, the biogas yield data were not converted to a stationary process in this study.
(2) Modified Gomperz model (MG): The modified Gomperz model (Equation (3)) is derived from the Gomperz model, which is used to describe the microbial activity and has a signature sigmoid shape [21,22]. It describes the biogas production in terms of the exponential growth rates and lag phase duration of anaerobic degradation microorganisms [11].
where y(t) is the cumulative gas production at time t, ym is the maximum cumulative gas production (mL CH4/gVS), R is the maximum gas production rate (mL CH4/gVS/d) and λ is the lag phase period or minimum time to produce biogas (days).
(3) Pseudo-parallel first-order model (PP): The pseudo-parallel first-order model (Equation (4)) is considered to be more suitable for describing the biogas yield of mixtures of substrates with different kinetic rates (rapid and slow) [23].
where y(t) is the cumulative gas production at time t, ym is the maximum cumulative gas production (mL CH4/g VS), P is the the proportion of the readily degradable material, k1 is the first-order rate constant for readily degradable material, and k2 is the first-order rate constant for less readily degradable material.
(4) First-order autoregressive model (AR (1)): AR (1) is a time-series model that predicts the present timestep based on the observations from previous timesteps. The autocorrelation function (ACF) (autocorrelation between timesteps) plots for all the RBP test biogas yield samples gradually trail off (Figure 1a, using ADP20 as an example). Therefore, the biogas production process is an AR process. Many time-series models that essentially model the randomness of the time series data need the trends in the data to be removed, in other words, to ensure the stationarity of the data. However, the application of the AR model does not intrinsically require transforming the data into stationary data. Thus, the biogas yield data were not converted to a stationary process in this study. For most of the RBP test data, the cumulative biogas production curves typically exhibit a rapid gas-production stage in the initial three days followed by a noticeable reduction in biogas production rate. This means one set of AR model parameters generally will result in a poor fitting for both the rapid-production stage and the later stage after the initial three days. In this study, to simplify the modelling process, the AR (1) model was For most of the RBP test data, the cumulative biogas production curves typically exhibit a rapid gas-production stage in the initial three days followed by a noticeable reduction in biogas production rate. This means one set of AR model parameters generally will result in a poor fitting for both the rapid-production stage and the later stage after the initial three days. In this study, to simplify the modelling process, the AR (1) model was only used to model biogas production from day 4 of the RBP tests. The partial autocorrelation function (PACF) (partial autocorrelation between timesteps) plots of data from day four for most of the biogas time series samples were cut off after one lag (Figure 1b, taking ADP20 as an example). Therefore, the biogas production from day four was an AR (1) process (Equation (5)).
The upper asymptote of an AR (1) process is determined by its unconditional mean µ = c/(1 − β), which corresponds to the ultimate biogas yield of a substrate.
where X t is the response of the present timestep, the X t−1 is the response of the previous one timestep, β is the coefficient, c is the constant and ε t is the white noise. The best-performing model out of the four kinetic models was further studied using decision tree (DT) multivariate regression analysis. DT is a non-parametric supervised learning technique that can be applied for both regression analysis and classification, providing piecewise constant approximations as prediction results [24].

Prediction of Biogas
DTs were trained using all physicochemical characteristics listed in Tables 1 and 2 as predictors to predict the coefficients and constants of the best-performing biogas production kinetic model. First, RBP results of 20 digestate samples (out of the 25-sample dataset) were used to train the DT model with the five remaining samples as the test dataset for model validation. The training process generated 53,130 groups of different combinations of 20 samples when choosing out of the 25. For practical reasons, a subset of 5000 groups were selected randomly to evaluate the absolute percentage error (APE) and mean absolute percentage error (MAPE) (Equation (6)) of the predicted biogas production model parameters and the APE of the calculated ultimate biogas yield. Then, the DT model was trained and validated using the leave-one-out (LOO) cross-validation method, which apply four additional set of data to the training data to verify if the prediction accuracy would improve.
where y i is the model parameter derived by fitting the model to the entire 28-day RBP test data,ŷ i is the model parameter inferred from the physicochemical characteristics by DT and n is the number of training set samples.

Assessing Prediction Uncertainty Using a Gaussian Process Regressor (GPR)
To quantify the uncertainty in the calculated biogas yield using the kinetic model predicted by DT, a Gaussian process regressor (GPR) method was applied in this study. GPR is a kernel-based Bayesian tool to perform nonlinear regression. The process is specified by a mean function m(x) and a covariance function k (x, x'|θ) which defines the covariances between the responses at any two input locations x and x'. θ represents the hyperparameters of the covariance function and their values are learned from the training data by maximising the log marginal likelihood. Once the hyperparameters are decided, the prediction for new input is performed by computing the marginal posterior distribution conditioning on the dimensions with known inputs [25].
The principle of GPR is to predict one timestep further each time by fitting one more datum provided by the biogas-yield kinetic model and the prediction uncertainty bands were also returned. The predictions of the GPR with zero mean and squared exponential kernel and the GPR with linear basis function and squared exponential kernel were compared.

Digestate Characterisation and RBP Test Results
The 28-day RBP test results for each set of RBP samples together with the physicochemical characteristics of each digestate are shown in Tables 1 and 2. Specific cumulative biogas yield data collected during the 28-day testing period were reported elsewhere [9] and were used in the model fitting and training process in this study.

Assessing Strength of Fit for Biogas Production Models
Within the 28 days of the RBP test, it was noticeable that biogas yields of some samples in this study had reached a plateau, whilst others still were showing an upward trend close to the end of the 28-day test. This is clearly due to the different concentrations of readily degradable materials in the digestate samples. To distinguish these two types of digestate samples, the 25 digestate samples were classified into two types based on the absolute average of the daily biogas production change percentages in the last four days: (1) Type I: less than 0.5%; (2) Type II: more than 0.5% (Table 3).  Table 4 shows the R 2 values for the fits of the three empirical biogas production models to RBP test data. Overall, the modified Gomperz model achieved lower R 2 values across the majority of the samples, indicating poorer fitting performance. In contrast, the PP first-order model could describe the biogas production of almost all samples with an R 2 value between 97-99%. The FO model performed better at describing Type I samples, whereas the PP first-order model was more suitable for substrates mixed with materials with different reaction-rate constants, and therefore performed better with Type II samples. However, it is worth noting that there are usually multiple sets of optimal solutions of the estimates of the PP first-order model's parameters (Y m , P, k 1 and k 2 ). This is because the nonlinear least-square error function of this model when fitting a particular set of biogas yield data is not always convex. Therefore, it was not chosen for the study of training DTs to predict the model parameters from the digestate physicochemical characteristics in the following section. In addition, when fitting the PP first-order model in Matlab using the least-square algorithm, the initial value set for the parameter P should avoid 0.5 and k 1 and k 2 should not be the same. Otherwise, the partial derivatives of the error function with respect to k 1 and k 2 are the same, which means the moving direction of these two dimensions are the same and the nonlinear search for the minimum value of the error function value will settle at a local minimum point.
The fitting of the AR (1) model after the initial 3-day rapid biogas production stage was comparable to the PP first-order model, with R 2 values of 99% for the majority of samples (Table 5), regardless of Type I or Type II data. Figure 2 shows the fits of three empirical models and the AR (1) model to the cumulative biogas yield in RBP tests. not be the same. Otherwise, the partial derivatives of the error function with respect to k1 and k2 are the same, which means the moving direction of these two dimensions are the same and the nonlinear search for the minimum value of the error function value will settle at a local minimum point. The fitting of the AR (1) model after the initial 3-day rapid biogas production stage was comparable to the PP first-order model, with R 2 values of 99% for the majority of samples (Table 5), regardless of Type I or Type II data. Figure 2 shows the fits of three empirical models and the AR (1) model to the cumulative biogas yield in RBP tests.   When using these four models to predict the final RBP results based on experimental data collected from a shorter test duration (5, 10, 15, 20 and 25 days), none of the three empirical models and the AR (1) model could achieve a sufficient level of accuracy to replace full-length RBP tests. This conclusion is in agreement with previous studies [12,26]. When using the first-order model, typically the level of prediction accuracy increases when more experimental data points are provided. FO performed moderately better than the MG model at predicting the RBP test result. This result was in agreement with the result of Nielfa et al. (2015) [26] for biomethane potential test biogas production prediction for the organic fraction of municipal solid waste and biological sludge co-digestion. However, both FO and MG models were proved to be unsuitable to predict the RBP result of Type II data.
Using 15 days of experimental data, the PP model could predict the RBP result with APE < 10% in nearly all the samples (both Type I and II). Therefore, the PP model is a preferred option for RBP test-result-prediction when the experiment is half-way through. For the AR (1) model, RBP data fitting and modelling starts from day 4, thus 10, 15, 20 and 25 days of experimental data were used for model fitting. The prediction ability of the AR (1) model was comparable to that of the PP model. Table 6 shows the predictions of RBP test results when fitting an increasing amount of experimental data to four biogas production models (due to the large sample numbers, only seven samples were randomly selected from each group of Type I and II data).

Biogas Yield Prediction from Digestate Physicochemical Characteristics by DT
DTs were first trained with the physicochemical characteristics of 20 digestate samples as predictors and the fitted coefficients and constants of AR (1) models as responses. A total of 5000 groups of splits between the training set and test set were randomly chosen to evaluate the prediction accuracy. The average MAPEs of the predicted AR (1) model coefficients and constants for among 5000 groups of test sets were 4.58% and 72.04%, respectively. The MAPE of the calculated RBP test result at day 28 from the predicted AR (1) model parameters was 52.125%.
With four more samples provided for the training set, the MAPEs of the predictions for the AR (1) model coefficient and constant with the LOO cross-validation method were 4.31% and 59.29%, respectively (

AR (1) Model Prediction Guide GPR
When training the DTs with LOO cross-validation and ADP3 as the test set, the APE of the predicted biogas yield on day 28 using the inferred AR (1) model was 6.87%. This is used as an example for illustration in Figure 3. By accepting one more input each time from the inferred AR (1) model together with the experimental observations of the first three days, the predictions of every one timestep further of the GPR with zero mean and squared exponential kernel were smaller than AR (1)'s predictions for the first few timesteps and then gradually closer to the predictions of the AR (1) model. This was explained by the increased fitted GPR model length scale and vertical scale when receiving more training data. The smaller the length scale, the curvier the underlying function is, and the smaller the vertical scale, the more concentrated the underlying function is around the mean. In contrast, for the GPR with a linear basis function, the predictions surged for the first few timesteps and then approached the AR (1) model predictions. This corresponded to the decreased slope of the fitted linear mean function. In general, given that only one more timestep is predicted at a time, the selection of zero or linear mean function is not of much concern. The 95% confidence interval of the prediction of GPR narrowed when more data were provided. and the smaller the vertical scale, the more concentrated the underlying function is around the mean. In contrast, for the GPR with a linear basis function, the predictions surged for the first few timesteps and then approached the AR (1) model predictions. This corresponded to the decreased slope of the fitted linear mean function. In general, given that only one more timestep is predicted at a time, the selection of zero or linear mean function is not of much concern. The 95% confidence interval of the prediction of GPR narrowed when more data were provided.

Conclusions
This study has demonstrated that it is possible to use RBP experimental data collected in the initial stage of the test to predict the 28-day RBP result in a kinetic model fitting exercise. By fitting 15 days of experimental data from RBP tests to kinetic models, the PP first-order model and the AR (1) model achieved a promising accuracy with APE < 10%.
Further study demonstrated using the decision tree (DT) method that AR (1) model parameters can be predicted from the physicochemical characteristics of the digestate samples. This provides potential to further reduce the data requirement to four days of RBP experimental data and thereby significantly reduce the resting time of a standard 28day RBP test to around four days. It was observed that when more training data were

Conclusions
This study has demonstrated that it is possible to use RBP experimental data collected in the initial stage of the test to predict the 28-day RBP result in a kinetic model fitting exercise. By fitting 15 days of experimental data from RBP tests to kinetic models, the PP first-order model and the AR (1) model achieved a promising accuracy with APE < 10%.
Further study demonstrated using the decision tree (DT) method that AR (1) model parameters can be predicted from the physicochemical characteristics of the digestate samples. This provides potential to further reduce the data requirement to four days of RBP experimental data and thereby significantly reduce the resting time of a standard 28-day RBP test to around four days. It was observed that when more training data were included in the DT machine learning model (from 20 to 24 samples), the prediction accuracy of the RBP result increased by 12.48%. This indicates that collecting more data to include in the model-training process can further improve the prediction outcome.
The framework of predicting kinetic model parameters from the physicochemical characteristics of the substrate can potentially be applied to the yield prediction of the product from other biochemical reaction processes.