1. Introduction
Plastic production has grown steadily in the last 70 years and is expected to increase even further as global population increases [
1]. This puts a significant stress on the environment for several reasons: first, plastics are oil-derived polymers, which depend on a very extractive activity that is not only non-renewable but also potentially harmful to the environment [
2]. Secondly, many traditional plastics are non-degradable under natural conditions and accumulate in the environment [
3]. These materials have to be either recycled (which requires not only energy but also a lot of workforce required for classification, processing and administration), burned for energy recycling (contributing to CO
2 increase in the atmosphere) or stored (which requires an increasing amount of space) [
4]. As a result, plastics can be found polluting all sort of biomes, the oceans being a particularly concerning one [
5].
Under this scenario, biopolymers have been proposed as a green alternative to traditional plastics: they are both renewable (can be produced from natural feedstocks, such as sugars and organic acids) and biodegradable (they can be decomposed by microorganisms) [
6]. Polyhydroxyalkanoates (PHA) are biopolyesters produced by microorganisms that serve them as carbon and energy source [
7]. Under certain (stressful) conditions, PHA is produced and stored inside the cells to be later recycled when needed. A typical case is nitrogen limitation, where a low concentration of this nutrient in the medium will induce PHA production. Thus, cells growing inside a bioreactor can be manipulated to produce biopolymers that, in turn, can be purified and processed to serve as replacement to synthetic plastics [
8,
9]. However, even though PHA production has been known for many years, very few industrial processes using this technology are in operation. The main reason for this is mostly economic: plastic production from oil and gas derivatives is cheaper than its green alternative. If oil and gas prices remain low, biopolymer production would remain challenging unless production costs are reduced. Even if their production is promoted using tax incentives, the increasing demand for plastics would require any bio-based production to have a high productivity.
During the last decades, great effort has been put to extend the field of Process System Engineering (PSE) to the bioprocess industry [
10]. PSE is a field of research that aims to apply mathematical tools to solve scientific and engineering problems of complex production systems [
11]. Design, optimization and control are key activities in any process operation that may be fundamental to achieve economic feasibility. In fact, the application of PSE is ubiquitous in the chemical and petrochemical industry. In the context of biopolymers production, PSE may be a valuable tool to achieve the same levels of productivity and production costs of their synthetic counterpart. However, biological processes are more complex than chemical ones due to several reasons: cell metabolism involves thousands of chemical reactions with internal regulation; there is a great deal of intrinsic variability in production runs and the number of available measurements is very limited compared to the number of variables involved. Thus, the implementation of PSE principles to biomanufacturing is considerably more challenging in comparison to the traditional chemical industry.
Important advances have been done in this direction [
12,
13]. Mathematical models, which lay at the core of this paradigm, have benefited a lot from the combination of several disciplines, ranging from biotechnology, chemical engineering and data science [
14,
15]. Mechanistic models that describe the metabolic behavior of the microorganism are available for several PHA producing strains [
16,
17,
18,
19,
20]. Unfortunately, these models involve systems of equations with several parameters that may require almost complete knowledge of the metabolic network and complex analytical methods (e.g., omic measurement) to estimate them. Furthermore, the extrapolation from predictions about the cell metabolism to the behavior of the cells inside of a bioreactor may be challenging. Non-structured macroscopic models based only on the available measurements in the bioreactor medium (nutrients, biomass, PHA content, etc.) may be easier to develop and even more robust within the experimental region where they were derived [
21,
22,
23]. However, finding a proper structure (i.e., equations representing the physicochemical and biological principles) that covers a wide range of operating conditions is hard to achieve. On the other hand, data-driven models do not need to assume any principles, since they are purely based on correlations found on data. Thus, they are not biased by prior knowledge [
24]. However, the amount of data needed to build these models is usually larger and, more important, extrapolations of the models to previously unseen operating conditions is usually not accurate. In between these approaches, hybrid models combine the fundamental knowledge of mechanistic models together with the flexibility of data-driven models [
25]. The term “hybrid” is generally used with different meanings, and there is not a unique formulation of hybrid models but several versions. In this work, the type of hybrid models considered are the ones that include a traditional formulation of the mass balances of the several species, replacing one or more terms of the balances with machine learning models trained from experimental data [
26]. This is a promising approach that has been around for several decades, but it gained momentum in the last years due to the developments in Artificial Intelligence (AI), Data Science, and computational power [
27]. Hybrid models for PHA production with different structures have been tested with success for batch and fed-batch systems [
28].
In this work, a hybrid model of a continuous bioreactor for the production of PHA is proposed.
Pseudomonas putida strain GPo1 (ATCC 29347 formerly known as
Pseudomonas oleovorans GPo1) efficiently converts octanoate into PHA under dual nutrient-limited growth conditions: both nitrogen and carbon are consumed up to limiting concentrations in the medium [
9]. This operation mode is very important in order to achieve economic feasibility of the production process but is also very challenging from the modeling side because nutrient concentrations in the dual limitation region are difficult to measure, as they are present only in traces. This complicates the identification of the correct deterministic functions for growth and uptake rates in a wide range of concentrations and operation conditions. However, the mechanistic backbone of the hybrid model coupled with an AI method to learn the growth and uptake kinetics is successful in reproducing the bioreactor operation at several process conditions inside and outside the dual nutrient-limited growth regime.
2. Materials and Methods
2.1. Experimental Setup
Two different datasets available in the literature were used in this work to train and test the proposed hybrid model. The experimental setups are very similar: they consist of a continuous bioreactor with a feeding stream containing a concentrated carbon source (sodium octanoate) and a nitrogen source (ammonium sulfate) being part of the remaining medium components. The carbon to nitrogen ratio in the medium feed (
Cin/Nin) is modulated by setting different flow rates of the two feed pumps. Concomitantly, a harvest stream, where culture broth is continuously removed together with residual nutrients, bacteria and PHA (see
Figure 1) keeps the culture volume in the bioreactor constant. Samples were taken periodically, and pH, temperature, dissolved oxygen and volume were controlled and kept at reference values. The same strain,
P. putida GPo1 ATCC 29347 is used in both cases.
The first dataset (DS1) was taken from a series of chemostat experiments [
29]. Five different dilution rates (
D = 0.05, 0.1, 0.2, 0.3 and 0.4 h
−1), were tested with several values of the feeding composition
Cin/Nin, in order to cover three growth regimes: nitrogen limitation, carbon limitation, and dual limitation. The remaining experimental conditions were identical among all the runs. The working volume of the system was 2 L and it was operated as a chemostat. The temperature and pH setpoints were 30 °C and 7.1, respectively. For each dilution rate, different steady state conditions were established before samples were taken. These samples were collected in ice-cooled 50 mL Falcon tubes and centrifugated. The supernatant was frozen at −20 °C until further analysis, whereas the biomass pellet was freeze-dried for PHA analysis. The cell dry weight was determined with a filter method as described in the original publication [
29]. This method retrieved the total biomass that included both the PHA-free biomass as well as the PHA mass. The cellular PHA content was determined by gas chromatography (GC) according to the method proposed by Lageveen et al. [
30], and the difference between the total biomass and the PHA mass yields was the PHA-free (residual) biomass
Xr. The supernatant was analyzed to determine the concentrations of ammonium, with the indophenol method described by Scheiner [
31], and octanoate using GC [
29].
In the second dataset (DS2) [
32], the bioreactor was operated in transient mode: The dilution rate was fixed (
D = 0.3 h
−1) and the
Cin/Nin ratio of the feeding stream was changed over time without letting the culture achieve steady state. No other experimental conditions were changed between experiments. The working volume of this bioreactor was 1.5 L, the temperature setpoint was 30 °C and the pH setpoint was 7.0 (0.1 units below DS1). The sampling method and determinations were similar to the ones in DS1, but ammonium and octanoate were measured using on-line analyzers based on those techniques.
Details about the setups, protocols and methods can be found in the original publications [
29,
32].
2.2. Process Model
Four main species concentrations are used to describe the state of the system: PHA-free biomass of bacteria Xr, poly(3-hydroxyalkanoate) P, carbon C, and nitrogen N (as the equivalent amount of carbon and nitrogen being part of the fed nutrients). The concentrations are expressed in g/L unless stated otherwise. As it was previously stated, the bioreactor can be operated in different regimes depending on the system state and process conditions
A simplified reaction network for PHA production by
P. putida GPo1 using ammonia and octanoate as nutrients is presented in
Figure 2. The cell takes up carbon and nitrogen from the culture broth in order to produce more
Xr and PHA. Nitrogen uptake
qn is directly related to biomass growth while carbon can undergo three different pathways, each one represented by an uptake rate: growth
qg, accumulation (PHA formation)
qc and maintenance
qm. Finally, the PHA production rate
qp is proportional to the carbon accumulation rate
qc. Based on this network, the mass balances for the 4 species are formulated in Equations (1)–(4):
While some assumptions about the reaction network had to be made to formulate the mass balances, the mathematical expression for the growth, uptake and production rates are yet to be defined. A traditional Monod-type formalism can be used as a first approximation:
The specific growth rates
µg and
µC are the maximal values used for the
Xr and PHA formation, respectively. A differentiation was made between the Monod constants for growth (
KC1) and PHA formation (
KC2), whereas
KN is used only for the nitrogen affinity of PHA-free biomass.
C and
N represent the carbon and nitrogen concentrations in the culture broth, respectively. The term
fn accounts for the inhibition of the accumulation and PHA production in the presence of nitrogen in the culture broth. The PHA inhibition constant
kfn represents the threshold concentration of nitrogen required to trigger PHA accumulation:
The remaining rates can be calculated using the stoichiometric biomass formation yields
Yx/c and
Yx/n, the PHA formation yield
Ypha/c, as well as the maintenance coefficient (
mc) as follows:
The maintenance rate
was considered a constant, and the PHA formation yield
Ypha/c was split into two limiting values,
and
, in order to give
enough flexibility to fit the small PHA production present during the growth phase:
The system of equations presented in Equations (1)–(12) includes a set
θ of 11 parameters that need to be fitted from experimental data. The fitting problem involves the minimization of the Mean Square Error (MSE) function, presented in Equation (12):
where
is the experimental measurement of species
j at time
ti in experiment
n, while
is the corresponding model prediction that depends on the model parametrization. The standard deviation
for the experimental measurements in each experiment is used for normalization.
2.3. Hybrid Model Algorithm
Even when Monod-type kinetic models are good approximations for bioreactor operation in certain conditions, these simple representations do not include the cell’s inner regulatory mechanisms and its complex metabolic reactions. Thus, the model presented in
Section 2.2. is not complex enough to capture the behavior of the system under a wide range of process conditions. Different approaches can be taken, e.g., using a detailed model for the growth and uptake rates. This would require more knowledge about the metabolic network of the system and how the cells interact with the environment. A more pragmatic approach is the hybrid modeling approach. Instead of assuming explicit expressions for the growth, carbon uptake, and production rates, neural networks are able to derive them from data:
Here,
NNk is the
k-th output of a single neural network. The mass balances and the remaining rates are equal to the ones previously presented in
Section 2.2. Thus, the hybrid model can be formulated as:
This model includes a multi-output neural network as well as some constant model parameters. The training of the neural network is not straightforward, as the inputs and outputs are not easily available from data. The values of
C and
N during dual nutrient limitation are below the detection limit of the analytical methods. Furthermore, without these values, the rates cannot be calculated directly using the mass balances. An alternative method like the one presented in Narayanan et al. [
33] can be used to calculate the rates indirectly, training the neural network’s weights and biases explicitly (i.e., by minimizing Equation (13)). However, the number of parameters increases rapidly with the number of layers and neurons, and the method can be computationally too intense and thus expensive if a big architecture is required. An alternative is presented here.
As it was stated in
Section 2.2., the mechanistic model is not able to fit the whole range of experimental conditions with a unique set of parameters
θ (the Monod-type expressions are simple approximations that do not consider complex metabolic phenomena that may arise under a wide range of nutrient concentrations). However, the model performs well when fittings are made only for individual experiments. The regression of that experiment returns not only the specific set of parameters but also the calculated values for the concentrations of all the species, even during dual nutrient limitation. By doing so, the growth, uptake, and production rates can be calculated using Equations (5), (6) and (9) with the specific set of parameters
θn and the calculated concentrations of the nutrients. The remaining parameters can be fitted from the complete dataset. The algorithm is presented in
Table 1.
Since hybrid models rely on learning from experimental data, usually they require a larger dataset than purely mechanistic models. When integrating the differential equations, the system reaches a state for which no data in the training set do exist and, therefore, the neural network may return considerable errors in the predicted rates, which will, in turn, have an effect on the mathematical integration step. If the system deviates further from the expected behavior, it may reach new unseen conditions, continuing in a feedback loop. This is usually not the case for mechanistic models, where the structures of the kinetic expressions ensure a stable integration under any process conditions. The performance of the hybrid model can be significantly improved by data augmentation. Artificial data can be created using the mechanistic model to fill the state space for which there is no experimental data available. This way, if the system reaches one of these states during integration, the neural network will make its predictions based on the artificial data, making the process more robust. The algorithm presented in
Table 2 shows how to combine the individual models for the existing experiments in order to create artificial data for augmentation of the existing dataset.
Here, the augmentation of the dataset was performed by simulating new experiments with different dilution rates to mimic the original experimental design: the dilution rate was kept constant while the Cin/Nin ratio changed. A different augmentation approach can be used to generate more artificial data, as long as it covers the state space efficiently.
2.4. Dual Limitation Region Modeling
The dual limitation region is usually presented in the literature as
Cin/Nin vs.
D plots, indicating the combination of variables that, after achieving steady state, will reach dual nutrient limitation [
34]. In principle, it can be described by both types of mathematical models. It is worth noting that the values of both
N and
C never reach zero (otherwise, the growth rates would have a value of zero as well, and there would be a wash out of the bioreactor), so the limiting values that describe the boundary of the growth regimes have to be fixed arbitrarily at a small residual value. If the culture results in values for
C and
N equal or below both of these limiting values, then culture conditions can be considered as being part of the dual nutrient-limited growth regime.
For the case of a continuous culture operated under steady state (chemostat) conditions, the left terms of Equations (1)–(3) would be equal to zero. For carbon limited growth, the carbon concentration would reach
Clim and by consideration of the feed concentration of nitrogen
Nin, as well as Equations (1)–(3), (7)–(9) and (11) the following expressions are derived:
From there, the carbon limitation regime can be defined in terms boundary
Cin/
Nin values of the process conditions:
To solve Equation (23), Equation (20) should be solved first to obtain the value of N, fixing the carbon concentration as Clim. So far, no expressions have been given for rg and qc. Either Equations (5) and (6) can be chosen for the mechanistic model, or Equations (14) and (15) for the hybrid model. In the case of the mechanistic model an explicit expression for Equation (21) can be found, but for the hybrid it has to be solved implicitly.
The same procedure can be applied to the boundary of the nitrogen-limited growth regime:
In this case, solving Equation (21) yields the value of C, fixing the nitrogen concentration as Nlim. Again, both models can be used for the growth and uptake rates.
Finally, it is worth noting that the value of Nin is assumed as constant in Equations (24) and (25). This is done in concordance with the bibliography, where these curves are shown as two-dimensional plots for experiments where Nin is constant always at the same value. As can be seen from the equations, the relationship of the ratio Cin/Nin with the dilution rate D depends on the nitrogen concentration in the feed Nin.
2.5. Model Implementation
All models and algorithms were implemented in Matlab 2020b in an Intel Core i5-9500 CPU. The system of differential equations was solved using function “ode15s”. Mechanistic models were fitted by minimizing Equation (11) with function “patternsearch”, initializing multiple times from different starting points to avoid local minima (also known as a multiple shooting method). The Neural Networks were fitted using function “feedforwardnet”.
3. Results and Discussion
Following the method proposed in
Section 2, the first dataset DS1 was used to fit the mechanistic model nominal parameters
θ0. The experimental data for all experiments together with the mechanistic model predictions are presented in
Figure 3. As can be seen, the model performs well for most of the runs and species, but it struggles to get the carbon concentration close to the carbon-limited growth regime. This is crucial to describe dual nutrient limitation, which is one of the purposes of the model due to its importance for an optimal bioreactor operation.
The nominal model parameters together with an estimation of their 95% confidence intervals are presented in
Table 3. The parameter distributions are not particularly widespread. However, there are certain experimental conditions that the model fails to simulate. This is likely due to limitation in the flexibility of the model, i.e., it is a problem of the model structure rather than a problem of the accuracy of the parameter estimation.
Next, the hybrid model approach was applied to the same dataset. Each individual experiment was fitted and the unique set of parameters
θn was obtained. The model-based augmentation algorithm (presented in
Table 2) was run to obtain artificial datapoints for dilution rates ranging from 0.025 h
−1 to 0.445 h
−1, with a step increase of 0.001 h
−1. The augmented dataset was then used to train the hybrid model with the algorithm presented in
Table 1. The selected neural network was a feedforward net with 4 layers of 5 nodes. The dataset split for training/validation/test was 90/5/5%. The features of the network were
C,
N and
D, with outputs being
rg,
qc and
qp. The hybrid model predictions are shown in
Figure 4. It clearly outperforms the mechanistic model. Especially, it can be seen that the carbon concentration predictions in the supernatant follow the real data much closer in the case of the hybrid model. This can be seen particularly clear in the second experiment (
D = 0.1 h
−1), presented in
Figure 3b and
Figure 4b.
A comparison of both models is presented in
Table 4. The absolute root mean squared error (RMSE) and the R
2 coefficient are presented for each species and each model. It can be seen that while for
Xr and
N the performances are similar, for
C and
PHA the Hybrid model has smaller RMSE and R
2 closer to 1, which indicates a better performance, in agreement with the visual comparison of the fittings in
Figure 3 and
Figure 4.
The performance of the hybrid model was then tested with an independent dataset, DS2. None of the experiments in this dataset was used to fit the model, so it could be used for external validation. Instead of operating at steady state, the feeding profiles in these sets of experiments change dynamically in time. Furthermore, the nitrogen concentration in the feed was kept at 0.15 g/L, the same as in DS1, but it changes in two of the three experiments of DS2. Despite these differences in the bioreactor operation, the hybrid model manages to predict the outcome of the experiments accurately. The predictions of the model and the experimental data are plotted in
Figure 5.
Even though the model performed well in general, there were some discrepancies between the predictions and the experimental data. In experiments 2 and 3 (
Figure 5b,c), there was an offset between the predicted and the measured free biomass. The reasons may be related to the data used for training the model: The experiments were performed under a unique concentration of nitrogen in the feed. Since the PHA-free biomass concentration strongly depends on the nitrogen fed to the bioreactor (as can be seen for example in Equation (23)), the lack of different feeding conditions in the training set may have hampered the extrapolation capabilities of the model, especially for biomass.
Both the mechanistic and the hybrid model were then used to describe the dual nutrient limitation regime in a
Cin/Nin vs.
D plot with the method described in
Section 2.4. The values of
Nin,
Nlim and
Clim had to be fixed arbitrarily. They were chosen to be
Nin = 0.15 g/L,
Nlim = 0.02 g/L and
Clim = 0.03 g/L in order to be comparable with results shown in [
34] and the analytical methods used therein. The predictions of the mechanistic and the hybrid models for DS1 are shown in
Figure 6a,b, respectively. The dual nutrient limitation regime is presented as the gray area, while the black dots represent the experimental conditions where dual limitation starts and terminates. The mechanistic model predicted smoother curves, with the expected shape for a Monod-type kinetic model. However, it did not perform well for some of the dilution rates (especially
D = 0.1 h
−1), predicting a wider dual
C,
N limitation regime. The hybrid model, on the other hand, performed very well for all the dilution rates, as could be expected since one of its main objectives was to accurately fit the carbon and nitrogen concentrations. However, it is worth noting that the shape of the model for unseen conditions (those that have not been tested experimentally) was found to not be as smooth as with the mechanistic model. This effect was mainly due to the imputation algorithm used to generate artificial data, which linearly combines different parametrizations to simulate the additional experimental conditions. If the amount of data for different conditions increases, the hybrid model will improve its performance, resembling more and more the experimental results. This is an important feature of the hybrid model: it is flexible enough to add new data by retraining the neural network, which would correct the predictions in the regions near the new experiments. The mechanistic model, on the other hand, will have to find a compromise between all the experiments with a unique set of parameters that is already having troubles predicting the existing ones.
4. Conclusions
A hybrid model was proposed for the operation of a bioreactor containing the PHA producing P. putida strain GPo1. The model uses a neural network to account for the growth, uptake, and production rates. Since these quantities cannot be measured directly, Monod-type kinetic expressions are used as support to calculate them. Once the neural network is trained, it is used together with mass balances for the main species in order to describe the dynamics of the process under a wide range of operating conditions. The hybrid model performed very well, especially in describing the carbon and nitrogen concentrations, which is key for the description of the dual nutrient-limited growth regime.
The hybrid model presented in this work includes several interesting features and advantages over traditional mechanistic models. The flexibility of the neural network together with the robustness provided by first principles (mass balances and stoichiometric transformations) allows the hybrid model to fit experiments with different process conditions without overfitting, as is shown by the good predictions in the external dataset. Fundamental knowledge about the system can be embedded through the kinetic models used to calculate the rates in the individual experiments (then generalized by the neural network). Furthermore, the assistance of the individual mechanistic models also allows for the calculation of hidden states that cannot be easily measured, like the nutrient concentrations for the dual-limited growth regime.
The hybrid model used here is focused on the dependency of the growth and uptake rates with regards to the feeding conditions and nutrient concentrations. However, different process variables like temperature, pH, or other medium components can be added to the neural networks to study their impact on growth and PHA production (given that the experimental data contains enough variations in their values). Furthermore, the influence of the type of carbon source (e.g., amount of C atoms per molecule) on the polymer structure or properties can be built into the model, provided that the data is available. Of course, more complex models will be more challenging to develop, but probably the production of the experimental data will be more limiting than the modeling effort.
Finally, it is worth mentioning that other machine learning models can be used instead of Neural Network to learn the rates from data. Neural networks present some very useful features (they are very flexible, allow for multi-output responses, are easy to retrain with new data) and are perhaps the most popular method used in hybrid modeling. However, advances are being done in hybrid models with other methods (such and Gaussian Process or Supported Vector Machines) that may present interesting alternatives that should be investigated.
Process System Engineering principles and methods are expected to play a bigger role in bioprocessing in the near future. Mathematical models like the ones presented in this work are important tools that will help to render environmentally friendly technologies more productive, robust, and economically sustainable.