Hybrid Approach to State Estimation for Bioprocess Control

An improved state estimation technique for bioprocess control applications is proposed in which a hybrid version of the Unscented Kalman Filter (UKF) is employed. The underlying dynamic system model is formulated as a conventional system of ordinary differential equations based on the mass balances of the state variables biomass, substrate, and product. The observation model, which describes the less well-established relationship between the state variables and the measurement quantities, is formulated in a data-driven way by means of a support vector regression (SVR) model. The UKF is applied to a recombinant therapeutic protein production process using Escherichia coli bacteria. Additionally, the state vector was extended by the specific biomass growth rate µ so that this key variable can be estimated as well; it is crucial for the implementation of innovative control algorithms in recombinant therapeutic protein production processes. The state estimates exhibit a sufficiently low noise level, which makes them well suited to various advanced bioprocess control applications.


Introduction
Producers of recombinant therapeutic proteins are increasingly forced to enhance the batch-to-batch reproducibility of their cultivation runs at a high level of productivity. This is not only required to simplify the downstream processing, but it also increases the productivity. Thus, the process must be kept tightly on its predefined optimized track. The important tools for guaranteeing the quality of the cultivation processes are advanced monitoring and feedback control systems. Various bioprocess monitoring and control techniques are described in the literature [1][2][3]. They all suffer from the flaw that the important controlled process variables (biomass, substrate, and product concentrations) are difficult to measure or estimate online with sufficient accuracy. Moreover, current papers on bioprocess control systems usually pay insufficient attention to how reliable data are obtained from the process.
Here we follow the current development in many engineering subjects, for example, navigation (e.g., [4]), economics (e.g., [5]), tracking moving objects, and process state estimation (e.g., [5]), to increase the estimation accuracy by means of model supported estimation algorithms which combine the a priori knowledge about the process under consideration and the actual measurement data from the online measurement devices [6][7][8].
The most often employed techniques for state monitoring and estimation are based on Kalman Filters, which are also used in modern bioprocess engineering. The original Kalman Filter algorithm sfGFP [15] under control of the T7-promoter upon induction with IPTG. This protein becomes active within E. coli's cytoplasm and can be detected within the cells with a spectro-fluorimeter.
Importantly, the specific product formation rate π increases monotonically with the cell's specific growth rate µ. The cultivations were performed in a fed-batch mode at a temperature of 30 °C and pH 7.0 in a stirred tank bioreactor with 15 L maximal working volume.
From all data produced and reported in Schaepe et al. [14], we took the records of the three validation experiments S836 to S838 to demonstrate the process supervision with the proposed hybrid version of the UKF. The corresponding feed rate profiles were determined during tracking experiments, which responded to the changing oxygen uptake capabilities of the cells.
During the cultivations, the UKF uses only the online measured offgas signals, specifically the cumulative oxygen uptake rate and the cumulative carbon dioxide production rate, cOUR and cCPR, respectively, to estimate the biomass and product concentrations as well as the specific biomass growth rate.
These data demonstrate that the estimates, which use only online measured data from the offgas analysis, very closely predict the biomass and product concentrations, which became available as offline measurements only much later.

Process Modeling
Kalman Filters [10] require two models. The first is used to move the elements of the state vector within the state space from one time step tk−1 to the next one tk. The second is the observation model that relates the state vector at time tk to the actual measurement quantities at that time. The Kalman Filter algorithm estimates the current state of the process variables, along with their uncertainties.

State Propagation Model
For the state propagation model, a basic ordinary differential equation system describing the propagation of the initial state with time is used. The conventionally used equation system involves the mass balances around the reactor for the state variables. As such, we consider the biomass, the substrate, and the product. As the process was operated in the fed-batch mode, an additional equation is required that takes into account the change of the working volume W with time.
Here, c = [X; S; P] is the state vector with the concentrations X of biomass, S of the substrate, and P of the product. The specific growth rate µ is taken as an additional state variable, which can be estimated during the state estimation procedure. It is assumed to be practically constant and only changed by some modeling noise given by the corresponding element of the diagonal covariance matrix Vmod. F is the substrate feed rate, and cF is the concentration of the solution fed to the culture. The substrate concentration is the only nonzero element in cF; it was 600 g/L in this case.
The biochemical conversion is described by the volumetric conversion rate, R, which contains the specific conversion rates of the biomass, µ, the substrate, σ, and the product, π. Usually these are modeled by simple or slightly extended Monod expressions. Concretely, the following volumetric conversion rates were taken.
As the specific biomass growth rate µ was taken as a state variable, its value is taken from the current state estimate of the Unscented Kalman Filter.
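For reference, the balances described above can be written in the standard fed-batch form sketched below. This is a reconstruction under assumptions and not necessarily the authors' exact Equation (1). The dilution term F/W follows from a mass balance over the changing working volume W. The expressions for π and σ are only one plausible choice, built from the yields and the maintenance coefficient of the model and from the statement that π increases monotonically with µ.

```latex
\begin{aligned}
\frac{dc}{dt} &= R + \frac{F}{W}\,(c_F - c), \qquad c = [X;\; S;\; P] \\
\frac{d\mu}{dt} &= 0 \quad \text{(changed only by modeling noise)} \\
\frac{dW}{dt} &= F \\
R &= [\mu X;\; -\sigma X;\; \pi X] \\
\pi &= Y_{px}\,\mu, \qquad
\sigma = \frac{\mu}{Y_{xs}} + \frac{\pi}{Y_{ps}} + m_s \quad \text{(assumed forms)}
\end{aligned}
```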
With the initial conditions c0 for c, µ0 for the specific biomass growth rate µ, and W0 for W, as well as the feed rate profile F(t) and the concentrations cF in the feed, the equation can be solved. The feed rate profiles are manipulated variables and could be measured online (Figure 1).
With the feed rate data depicted in Figure 1, the model can be fitted to process data in order to obtain the free parameters of the dynamic process model, the yields Yxs, Yps, Ypx, and the maintenance coefficient ms.
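The state propagation step can be illustrated with a minimal Python sketch. It assumes the standard fed-batch mass balances with a dilution term F/W, and it assumes the rate expressions π = Ypx·µ and σ = µ/Yxs + π/Yps + ms, which are plausible but not confirmed forms. All parameter values except the 600 g/L feed concentration are hypothetical, as are the names `ModelParams`, `rhs`, `rk4_step`, and `simulate`.

```python
from dataclasses import dataclass


@dataclass
class ModelParams:
    """Model parameters; all values except cF_S are hypothetical placeholders."""
    Yxs: float = 0.5     # biomass yield on substrate, g/g (assumed)
    Yps: float = 0.8     # product yield on substrate, g/g (assumed)
    Ypx: float = 0.1     # product formed per biomass formed, g/g (assumed)
    ms: float = 0.02     # maintenance coefficient, g/(g h) (assumed)
    cF_S: float = 600.0  # substrate concentration in the feed, g/L (from the text)


def rhs(state, F, p):
    """Time derivatives of [X, S, P, mu, W] for feed rate F (assumed rate forms)."""
    X, S, P, mu, W = state
    pi = p.Ypx * mu                          # specific product formation rate
    sigma = mu / p.Yxs + pi / p.Yps + p.ms   # specific substrate uptake rate
    D = F / W                                # dilution caused by feeding
    return [
        mu * X - D * X,                  # biomass balance
        -sigma * X + D * (p.cF_S - S),   # substrate balance (only S is fed)
        pi * X - D * P,                  # product balance
        0.0,                             # mu is modeled as constant
        F,                               # working volume grows with the feed
    ]


def rk4_step(state, F, p, dt):
    """One classical Runge-Kutta step with F held constant over the step."""
    def shifted(base, slope, h):
        return [b + h * s for b, s in zip(base, slope)]

    k1 = rhs(state, F, p)
    k2 = rhs(shifted(state, k1, dt / 2), F, p)
    k3 = rhs(shifted(state, k2, dt / 2), F, p)
    k4 = rhs(shifted(state, k3, dt), F, p)
    return [s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def simulate(state0, feed_rate, p, dt, n_steps):
    """Integrate from state0; feed_rate is a function t -> F(t)."""
    state, t = list(state0), 0.0
    for _ in range(n_steps):
        state = rk4_step(state, feed_rate(t), p, dt)
        t += dt
    return state
```

A call such as `simulate([1.0, 5.0, 0.0, 0.2, 5.0], lambda t: 0.05, ModelParams(), 0.01, 100)` propagates a hypothetical initial state over one hour. Under these balances the total biomass X·W grows exponentially with rate µ regardless of the feed, which gives a simple consistency check.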

Observation Model
The state vector c at time tk corresponds to a number of quantities that can be measured during the cultivation process. The obvious first question, namely which of the possible measurement variables carry the most information about the process dynamics, can be answered quite easily by looking at the well-established gross reaction equation that describes the biochemical conversion process. It describes, element by element, the conversion of the significantly changing components with respect to the elements carbon (C), hydrogen (H), oxygen (O), and nitrogen (N). In a typical equation of this kind, the substrate reacts with oxygen (O2) and ammonia (NH3) to form biomass, carbon dioxide (CO2), and water. It is referred to as the stoichiometric equation of the conversion process, and its coefficients a, b, c, d, and e are the stoichiometric coefficients or yields.
As the equation considers only those species whose amounts change significantly during the biochemical conversion process, it directly indicates the quantities that should be measured during the process.
Here we are led to the oxygen consumption (O2), the base consumption (NH3), and the carbon dioxide (CO2) formation. The water formation is not considered here: its amount is negligible compared to the water that is part of the cultivation medium, so the water production rate cannot be measured accurately enough.
In a practical application, the corresponding rates, namely the oxygen uptake rate (OUR), the carbon dioxide production rate (CPR), and the base consumption rate (BCR), are usually measured online during the cultivation process. However, in order to reduce the noise level of the measurement signals, it is advisable to replace the original rate signals by their corresponding cumulative signals (e.g., the cumulative oxygen uptake rate cOUR). This not only reduces the noise level but also reflects the fact that the important state variables, the biomass and the product concentrations, are cumulative quantities as well.
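Forming a cumulative signal from a sampled rate signal amounts to numerical integration over time. The sketch below uses the trapezoidal rule, which is an assumption, as the integration scheme used in the original work is not stated. The function name `cumulative_trapezoid` is hypothetical.

```python
def cumulative_trapezoid(times, rates):
    """Running integral of a sampled rate signal (e.g., OUR -> cOUR).

    times and rates are equally long sequences; the result starts at 0.0
    and has the same length as the inputs.
    """
    if len(times) != len(rates):
        raise ValueError("times and rates must have the same length")
    total = 0.0
    cumulative = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (rates[i] + rates[i - 1]) * dt
        cumulative.append(total)
    return cumulative
```

Because each new sample contributes only a small increment to the running sum, zero-mean noise in the individual rate samples largely averages out in the cumulative signal.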

Hence, we are looking for models that describe the cumulative rates cOUR and cCPR as functions of the state variables c. As these quantities are usually sampled at rates on the order of 1 Hz, while the time constant of the changes in the state variables is on the order of 1 h, the cumulation does not significantly affect the measurement information. These signals follow changes in the biochemical kinetics quickly enough, a fact that has already been shown in many closed-loop control investigations (e.g., [16]).
The classical textbook relationships between the oxygen uptake rates and the biomass concentrations, such as variants of the Luedeking/Piret equation [17], are not accurate enough as an observation model. Hence, it is straightforward to use data driven models for this purpose, in the sense of learning from the experience with measurement data, where mechanistic models are not yet available to a comparable level of accuracy. Various forms of nonlinear regression models (polynomials, feed forward neural networks, etc.) can be used for modelling these relationships [18].
From the many possibilities, we chose the support vector machine approach [19,20], an advanced kernel-based regression technique. Support vector regression (SVR) techniques require less time and expertise to train than artificial neural networks. This is mainly because SVRs are trained with a structured algorithm (quadratic optimization) that has one unique solution and consistently produces the same results when trained with identical data and parameters. Data from new cultivation examples can easily be used to extend and improve an existing SVR model without additional tuning of the model parameters. SVR techniques are also more robust for models with multidimensional inputs.
In our Kalman Filter we need a representation of the measurement quantities cOUR and cCPR as a function of the state variables. We took these data from the recombinant protein cultivation experiments [14] and used general radial basis functions or Gaussian bells as kernels.
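Once trained, an SVR model with Gaussian kernels is evaluated as a weighted sum of kernel functions centered on the support vectors, plus an intercept. The following sketch shows only this evaluation step. The support vectors, coefficients, intercept, and kernel width are hypothetical placeholders that a training library such as LIBSVM would supply, and the function names are illustrative.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (radial basis function) kernel exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, intercept, gamma):
    """Evaluate f(x) = intercept + sum_i coef_i * K(sv_i, x)."""
    return intercept + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
```

In an observation model of this kind, `x` would hold the state variables (for example the biomass concentration) and one such function would be evaluated for each measurement quantity, cOUR and cCPR.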
The observation model describing the cumulative oxygen uptake rate and the cumulative carbon dioxide production rate as nonlinear functions of the biomass concentration is presented as lines in Figure 2. The data points (symbols in Figure 2) were taken from the offline measured biomass concentrations and the cumulative OUR and CPR data measured at the corresponding time instants. As can be seen in Figure 2, all records from the three experiments were used to train the SVR model. A cross validation technique was employed, using 70% of the data points for training and 30% for validation.
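The 70/30 partition of the data points can be sketched as a random holdout split. How the original work drew its partition is not stated, so the shuffling, the fixed seed, and the function name `split_train_validation` are assumptions made for illustration.

```python
import random


def split_train_validation(samples, train_fraction=0.7, seed=0):
    """Randomly split samples into a training and a validation list."""
    indices = list(range(len(samples)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(samples))
    train = [samples[i] for i in indices[:n_train]]
    validation = [samples[i] for i in indices[n_train:]]
    return train, validation
```

Each sample would here be a pair of an offline biomass value and the cumulative offgas signal measured at the same time instant.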

Figure 2. Cumulative oxygen uptake and carbon dioxide production rate signals as a function of the biomass concentration X. The curves show a direct evaluation of the support vector regression (SVR) model trained on the data of the cultivations S836, S837, and S838 [14] using the cross validation technique. cOUR, cumulative oxygen uptake rate; cCPR, cumulative carbon dioxide production rate.

Employing the Unscented Kalman Filter
Like all Kalman Filters, the Unscented Kalman Filter (UKF) is a recursive algorithm that determines the estimate c(tk) at time tk from the previous estimate c(tk−1) [11,12].
It first proposes a state vector ĉ(tk) from the previous estimate c(tk−1) using the nonlinear process model Ψ (in our concrete application, the model is described by Equation (1), where the actual state vector c(t) is [X; S; P; µ]) and computes the corresponding measurement quantities y(tk) from ĉ(tk) using the nonlinear observation model H (in our case, this model is given by the support vector regression equations for cOUR and cCPR). The proposal ĉ(tk) is then corrected to compute the new estimate c(tk) using the difference between the actually measured values y(m)(tk) and the computed values y(tk):
$$ c(t_k) = \hat{c}(t_k) + K \left( y^{(m)}(t_k) - y(t_k) \right) $$
where the matrix K that rules the correction of the proposal ĉ(tk) depends on the uncertainties of the observations and the transfer model [12]. In this application, the covariance matrices were taken as diagonal matrices.
Figure 3 shows an example of a UKF state estimation of the biomass and the product concentration from measurement data of the cumulative oxygen uptake rate cOUR and the cumulative carbon dioxide production rate cCPR, based on data (cultivations S836, S837, and S838) from Schaepe et al. [14]. The symbols shown in Figure 3 are measurement data that were measured offline. They were not used during the estimation of the state variables and serve only to show that the estimates are accurate.
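The propose-and-correct cycle can be illustrated with a deliberately simplified sketch for a scalar state, where the sigma points can be written down without a matrix square root. It is not the implementation used in this work, which handles the four-dimensional state [X; S; P; µ]. The functions `f` and `h` stand in for the process model Ψ and the observation model H. The scaling parameters `alpha`, `beta`, and `kappa` are assumed values, and `Q` and `R` are the assumed process and measurement noise variances.

```python
import math


def sigma_points(mean, var, alpha=1.0, beta=2.0, kappa=2.0):
    """Sigma points and weights of the scaled unscented transform for n = 1."""
    n = 1
    lam = alpha ** 2 * (n + kappa) - n
    spread = math.sqrt((n + lam) * var)
    points = [mean, mean + spread, mean - spread]
    w_mean = [lam / (n + lam), 0.5 / (n + lam), 0.5 / (n + lam)]
    w_cov = list(w_mean)
    w_cov[0] += 1.0 - alpha ** 2 + beta
    return points, w_mean, w_cov


def ukf_step(x, P, y_meas, f, h, Q, R):
    """One scalar UKF cycle: propose with f, then correct with measurement y_meas."""
    # Proposal: propagate the sigma points through the process model.
    pts, wm, wc = sigma_points(x, P)
    propagated = [f(p) for p in pts]
    x_pred = sum(w * p for w, p in zip(wm, propagated))
    P_pred = sum(w * (p - x_pred) ** 2 for w, p in zip(wc, propagated)) + Q

    # Predicted measurement: push new sigma points through the observation model.
    pts, wm, wc = sigma_points(x_pred, P_pred)
    observed = [h(p) for p in pts]
    y_pred = sum(w * o for w, o in zip(wm, observed))
    P_yy = sum(w * (o - y_pred) ** 2 for w, o in zip(wc, observed)) + R
    P_xy = sum(w * (p - x_pred) * (o - y_pred)
               for w, p, o in zip(wc, pts, observed))

    # Correction: the gain K weights the measurement residual.
    K = P_xy / P_yy
    x_new = x_pred + K * (y_meas - y_pred)
    P_new = P_pred - K * P_yy * K
    return x_new, P_new
```

For linear `f` and `h` this step reduces to the ordinary Kalman Filter update, which is a convenient sanity check. Neither model has to be linearized, because only function evaluations at the sigma points are needed.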

The Unscented Kalman Filter software implements the algorithm described by Wan and van der Merwe [12] (Algorithm 3.1 in that work) and was coded in Matlab [21]. Within it, SVR software was used to train and evaluate the observation model. For that purpose, the publicly available LIBSVM software of Chang and Lin [22] was utilized, with radial basis functions as kernel functions.
Even if the measurement values cOUR and cCPR are artificially distorted by random noise, for example, by 2.5% of the measured values, the Unscented Kalman Filter yields nearly the same results for the state variables biomass concentration X and product concentration P, as shown in Figure 4.
The results for the other two cultivation data records are qualitatively the same and are thus not repeated here. As already stated above, the UKF algorithm was also used for estimating the specific growth rate of the biomass. Figure 5 presents a typical estimate of the specific growth rate profile across the cultivation. These online estimates of the biomass concentration, the product concentration, and the specific growth rate can then be used in various inferential data analysis and specific growth rate control schemes, as well as for process optimization tasks.

Conclusions
Process supervision is recommended with Unscented Kalman Filters where the dynamic equations are based on mass balances for the biomass, the substrate, and the product, and formulated by well-established ordinary differential equation systems. As the biomass growth kinetics is not known a priori to the same level of accuracy, the specific biomass growth rate µ was taken as an unknown, which is estimated in the same way as the other state variables. The less well-known relationships between the state variables (biomass, substrate, and product concentrations) and the measurement quantities can be modelled to a sufficient degree of accuracy with modern data-driven methods developed in the machine learning community. The support vector machine technique [19] is one example; advanced neural networks and relevance vector machines [23,24] are alternatives.
The decisive advantage of this type of nonlinear Kalman Filter is that the process and measurement models can be used directly in the estimation algorithm without any change and without the need to linearize the models. The results show that the hybrid UKF method using a support vector regression model as the observation model delivers satisfactory estimates of the state variables, particularly the most important ones, the biomass and the product concentrations, and even the specific biomass growth rate. The example uses real process data in order to show that the estimation technique is not merely an exercise in software concepts, but leads to process data that are more accurate and reliable than the separate simulated and measured data.
These accurate estimates of the state variables are well suited for advanced process monitoring and control tasks.