Subspace Based Model Identification for an Industrial Bioreactor: Handling Infrequent Sampling Using Missing Data Algorithms

Abstract: This manuscript addresses the problem of modeling an industrial (Sartorius) bioreactor using process data. In the context of the Sartorius Bioreactor, it is important to appropriately address the problem of dealing with a large number of variables, which are not always measured or are measured at different sampling rates, without resorting to simpler interpolation- or imputation-based approaches. To this end, a dynamic model for the Sartorius Bioreactor is developed by appropriately adapting a recently presented subspace model identification technique, which in turn uses nonlinear iterative partial least squares (NIPALS) algorithms to gracefully handle the missing data. The other key contribution is evaluating the ability of the identification approach to provide insight into the process by computing interpretable variables such as metabolite rates. The results demonstrate the ability of the proposed approach to model data from the Sartorius Bioreactor.


Introduction
Bioreactors are an important part of many different industries ranging from environmental engineering to bio-pharmaceuticals. Several factors influence the productivity of a bioreactor, including mass transfer, heat transfer and the concentration of the biocatalyst used to produce the final product [1]. One such bio-pharmaceutical product is monoclonal antibodies, which are the focus of the present work. In this process, one of the key objectives is to maximize the volume specific production of the antibody. While the desired product can be characterized in many ways, the Sartorius Bioreactor is designed to produce the specific protein through a careful manipulation of bioreactor properties. The volumetric production of the monoclonal antibodies is influenced by several parameters, such as feed concentrations and cell growth, that must be appropriately modeled and controlled.
Many parameters determine the cell growth, death and protein production dynamics. Glucose is a key nutrient for cell growth, providing the necessary source of energy and biomass formation; however, excess glucose can lead to the production of lactate, leading to increased cell death. Glutamine behaves similarly to glucose, acting as a nutrient promoting cell growth, especially in cases of fast cell growth. Both lactate and ammonia have detrimental effects on cell growth, with the latter being more potent [2]. External factors such as temperature and pH play an important role in maximizing the effects of the various nutrients on cell growth. Increasing the temperature up to a certain threshold increases cell growth due to increased system dynamics, after which a temperature shift midway through the batch is often applied.

One suitable data driven modeling approach for such processes is subspace identification, which identifies a state space model directly from the input and output variables [14][15][16][17]. Another suitable modeling approach is to use an input output representation of the system; the identified input output model is equivalent to the subspace identification model through a transformation. The key difference between the two approaches is that subspace identification allows for an explicit determination of the number of states during the identification procedure. Subspace identification consists of two distinct steps: identifying a state trajectory from historical input and output batch data, and using a least squares solution to determine the system matrices of a Linear Time Invariant (LTI) system. To achieve these outcomes, subspace identification utilizes a range of techniques including canonical variate analysis [19,20], numerical algorithms [21] and multivariate output error state space algorithms [22]. One common technique in subspace identification is the singular value decomposition (SVD) of the data matrices [14,16]. However, SVD requires matrices to be full rank, making it unsuitable for handling batch data with missing observations [23]. 
Thus, subspace identification by itself is unable to handle the metabolite measurements that are missing due to the infrequent sampling rates.
As a result of these considerations, a missing data subspace modeling approach using PCA and PLS steps was recently developed [24]. Specifically, the addition of PCA and PLS steps to the subspace identification approach allows for the missing observations to be accounted for. While the use of PCA and PLS techniques is not novel in model identification (see [25,26]), their key benefit here is that the reduced latent variable space is only marginally affected by missing data. The first step in the approach is to use latent variable methods (PCA followed by PLS) to identify a reduced dimensional space for the variables which accounts for missing observations. The second step replaces SVD with PCA, to handle missing observations, to identify the states of the system, whereupon traditional subspace approaches can be utilized. The approach in [24], however, does not directly handle discrete additions (of glucose), and thus is not directly applicable. Another recent result [27] that explicitly handles discrete additions is not applicable for two reasons: the first is that the results in [27] do not handle missing data, and the second is that a direct application of the approach in [27] coupled with the missing data approach of Patel et al. [24] would lead to batches with almost no data, in turn making the approach inapplicable. Finally, while the results in [24,27] provide a modeling framework, the resultant subspace models do not necessarily provide insight into the process dynamics and could be improved by augmenting them with tools that enable easier access for the practitioner.
Motivated by the above considerations, the present manuscript adapts the missing data approach of Patel et al. [24] to specifically handle the discrete addition nature of the Sartorius Bioreactor along with the missing data in the metabolite measurements, and develops a data driven dynamic model that also predicts variables that can be much better interpreted by the practitioner. The approach is designed to handle batch data of variable batch length without the need for batch alignment techniques. This approach is utilized to identify two LTI models of the system: one for the concentrations and the other, to provide more insight into the process dynamics, for the metabolite rates. The rest of the paper is organized as follows. Section 2 presents the bioreactor process and an overview of the missing data subspace identification approach. Sections 3 and 4 present the application of the proposed approach to the Sartorius Bioreactor, developing the dynamic model and the metabolite rate model, respectively. Finally, concluding remarks are made in Section 5.

Preliminaries
A brief overview of the bioreactor process is presented in this section followed by the missing data subspace identification approach for batch processes.

Bioreactor Process Description
The Sartorius Bioreactor is operated as a fed-batch reactor with nominal or centre point conditions of a pH of 7.1, dissolved oxygen of 60% and a temperature of 36.8 °C. It has a discrete feed input utilized to maintain the glucose concentration in the reactor at 2.5 g/L. After being initialized with a starting cell culture, the process runs for 12 days before the reactor is stopped and the final cell titer is measured. The process has some continuous measurements available such as pH and temperature, but the rest of the measurements are only sampled up to three times a day. The bioreactor has the ability to control temperature, pH and (through discrete additions) the glucose concentrations. The measured outputs are titer, viable cell density (VCD), cell viability, glutamine concentration, lactate concentration, glutamate concentration and ammonia concentration.
The Sartorius Bioreactor utilizes a discrete nutrient feed system. Thus, glucose is added to the system in a series of discrete additions in order to maintain the target glucose concentration. The glucose measurement is utilized to determine the glucose addition time, and, at each addition interval, a feed volume (200 mL) with a high glucose concentration is added to the bioreactor, resulting in a sharp (slight) increase in the volume and a larger increase in the glucose concentration. The eventual objective of the model is to be utilized as part of a control strategy such as model predictive control. The utility of the model therefore is in its ability to predict the final protein titer, using the measured inputs and outputs up to a given time in the batch and candidate input variables for the remainder of the batch. Another key objective is to utilize the model to monitor the rates of metabolite consumption. These rates provide a useful view into the process and enable the practitioner to make more sense of the model, in turn making the model much more accessible.

Dynamic Modeling of the Bioreactor
In this section, first a dynamic model is identified, with the model output being measured variables. In the next subsection, a dynamic model is identified that uses a combination of measured and calculated variables to directly estimate metabolite consumption rates.

Dynamic Model Identification and Validation Using Measured Outputs
The first model examines the daily metabolite concentrations and their impact on cell titer. In this process, the following measurements are available: glucose concentration (mg/L), temperature setpoint (°C), pH setpoint, titer (mg/L), viable cell density (VCD) (10 × 10^5 cells/mL), cell viability (%), glutamine concentration (mg/L), lactate concentration (mg/L), glutamate concentration (mg/L) and ammonia concentration (mg/L). One of the first decisions in developing subspace identification based models is determining a set of input and output variables that allows for model identification and is also in line with the process implementation. Thus, the pH and temperature setpoints are selected as two of the input variables. The controller on the process works reasonably well, and thus the pH and temperature values closely track the setpoints. The objective in this work is to determine the effect of these variables on the metabolites and cell titer, not the effect of the pH and temperature setpoints on the pH and temperature. In essence, the temperature and pH directly influence the cell growth dynamics, and the shifts in the setpoints represent changes in the growth profiles. The other measured variables inside the bioreactor, however, do not cause significant changes in the temperature or pH values, and so the measured pH and temperature values contain more noise than useful information and are consequently omitted. Thus only the titer, viable cell density (VCD), cell viability, glutamine concentration, lactate concentration, glutamate concentration and ammonia concentration are chosen as the seven outputs.
Glucose, on the other hand, poses its own challenge. There are two potential ways to include glucose in the model. The first is to include the glucose addition as an input, and model glucose as an output. In such a scenario, the model would be trying to decipher two effects. The first is the effect of the glucose addition on the glucose concentration, which is a fairly straightforward mole balance. The other is the rate of consumption of glucose in its role as a metabolite. While this is possible in principle, every discrete addition of glucose would cause a jump in the glucose measurement, and would in turn cause the states to 'jump'. Such a discrete addition could be modeled using the subspace identification approach in [18], but it would require splitting the batch into multiple batches, with each batch comprising the time period between discrete additions. While this would be possible in principle (and reasonable for the process considered in [18]), in the present instance this would lead to each of the batches having very sparse measurements, thereby comprising mostly missing data. In this case, the recently developed missing data approach [24] would not be directly applicable.
Glucose is therefore considered an input in the present manuscript. From a practical standpoint, this is reasonable because the glucose concentration can be readily measured and modified, and can thus serve as an input in a controller implementation. The dataset, however, poses an interesting challenge in this regard because the glucose measurements are taken before the glucose addition, but not right after the glucose addition. The first step in handling this is the computation of the glucose concentration right after the glucose addition. This is the more intuitive part and can be computed readily via a mole balance as follows:

C_G^+ = (V^- C_G^- + V_f C_G,f) / (V^- + V_f), with V^+ = V^- + V_f

where V is the volume in L, C_G is the concentration of glucose in mg/L, V_f and C_G,f are the volume and glucose concentration of the feed addition, and the + and − superscripts represent after and before the feed addition, respectively. The other, more important, question is how to utilize the newly computed glucose concentration. Again, there are two alternatives, and here one approach is clearly incorrect. The first alternative is to add an additional data point in the batch: right after the data point before the addition, a new point is added where the value of the glucose measurement is changed, but the values of the other variables are kept the same. While this sounds intuitively right, such a choice would provide the model with false information. In particular, it would suggest to the model that the value of the glucose changed over one sampling time while the others stayed the same. This is counter to what happens in the process, in that the value of the glucose jumps instantaneously. The implementation of this approach is shown in Figure 1, and it clearly shows how the concentration in the reactor 'increases' between sampling intervals. For example, after Days 3, 6, 7, 8, 9 and 11, the glucose concentration seems to increase slowly over time, which is contrary to what we know happens (i.e., glucose gets consumed). The second and correct adaptation, then, is to replace the value of the glucose measurement by the newly calculated value. 
As shown in Figure 2, the concentration increases instantly upon glucose addition and the next measured sample shows that the glucose concentration decreases between sampling instances.
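As a concrete illustration, the post-addition concentration computed by this mole balance can be sketched as follows. The feed glucose concentration used here is a placeholder, not a Sartorius specification:

```python
# Sketch of the mole balance used to replace the pre-addition glucose
# measurement with the post-addition value. C_G_FEED is an assumed
# placeholder value; the actual feed concentration is process-specific.

V_FEED = 0.2        # feed volume added per addition (L); 200 mL per the text
C_G_FEED = 50000.0  # glucose concentration of the feed (mg/L); assumed value

def glucose_after_addition(c_g_minus, v_minus, v_feed=V_FEED, c_g_feed=C_G_FEED):
    """Return (C_G^+, V^+): glucose concentration and volume right after a feed shot."""
    v_plus = v_minus + v_feed
    c_g_plus = (c_g_minus * v_minus + c_g_feed * v_feed) / v_plus
    return c_g_plus, v_plus

# The pre-addition sample is replaced in-place rather than inserted as a
# new time point, so the model never sees a spurious one-sample 'ramp'.
c_plus, v_plus = glucose_after_addition(c_g_minus=1800.0, v_minus=2.0)
```

Replacing the measurement (rather than appending a point) keeps the sampling grid intact while still reflecting the instantaneous concentration jump.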
Having determined the right set of inputs and outputs, the training input sequence from one batch is shown below in Figure 3. Note that the temperature and pH setpoints are only moved from the centre line values to induce variations in the dataset, reflective of the true process, and as such not all batches have these shifts. To identify the model, data from 11 different batch runs were used as training batches. The training batches were chosen to establish the daily operating conditions of the Sartorius Bioreactor, with sufficient variation provided by different temperature and pH setpoint changes, providing a reasonably rich dataset.

Remark 1.
We recognize that the use of 11 training batches does limit the ability to accurately validate the model. In future work, as more data become available, the identification can be redone to include more batches. What is perhaps more important to recognize is that the model is good for the data range it is used for in the training. Thus, in conjunction with existing model monitoring techniques [28], one can readily monitor if the model continues to be valid for the batch under consideration. If the monitoring technique reveals that the model predictions are diverging from the observations, the model can be retrained using the new on-line measurements in order to improve model accuracy.
Having handled the discrete nature of the input addition, the data driven modeling approach [24] was subsequently implemented to identify a system model. A state space model of order 3 was identified by ensuring the best model fit during the training stages. The subspace identification approach removes any dependent relationships between the inputs and the outputs; therefore, it is possible to model all of the outputs with only three states using the relationships defined in Equation (1). Note that, while there are several inputs and outputs, the choice of the number of states is determined by the ability of the model to explain the correlations between the past inputs and outputs and the future outputs, and, if some of the outputs observed in the data are co-linear, it might be possible to capture the observed process dynamics using fewer states.

Remark 2.
With regard to the number of states, while it is understood that the process dynamics are nonlinear, what the identified model captures is the process dynamics observed in the training data. In fact, one of the strengths of the subspace identification approach is the ability to determine the number of states in the system (as observed in the data) by utilizing linear algebra based techniques. The first principles model of Karra et al. [2] is an alternate approach to process modeling, one that uses first principles techniques to set up the model structure. In future work, there is the possibility of using the first principles model as part of a hybrid model structure [29] to better predict the process behavior; such a hybrid model design remains outside the scope of the present work.
The model identification procedure [24] used is a combination of subspace identification with PCA and PLS techniques to handle variable batch lengths and missing data. The identification approach used in this paper identifies an LTI model as follows: given s measurements (where s represents the length of the data) of the input u^(b)[k] ∈ R^m and the output y^(b)[k] ∈ R^l variables from each batch b, a model of order n can be identified using the following equations:

x^(b)[k + 1] = A x^(b)[k] + B u^(b)[k]
y^(b)[k] = C x^(b)[k] + D u^(b)[k]                               (1)

where the objective is to determine the order n, from cross validation, and the system matrices A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{l×n} and D ∈ R^{l×m}. The system matrices are identified in two stages: first a state sequence is identified, and subsequently the system matrices. The subspace identification approach is carried out using a series of PCA and PLS regressions computed with nonlinear iterative partial least squares (NIPALS) algorithms. The first part of subspace identification is to identify a state trajectory. This is done using PCA by projecting the past inputs and outputs perpendicular to the future inputs. We recognize that the future inputs should be completely independent of any past data; however, this step ensures that any potential correlations resulting from insufficient excitation are removed. Additionally, the future outputs are projected perpendicular to the future inputs. Recognizing that the future outputs are a result of the states and the future inputs, removing the correlation with the future inputs leaves a remaining correlation that depends only on the states. In the next step, PLS is carried out between the newly deflated past inputs and outputs and the future outputs. This is done to explain how the past data results in the current states for the future data and to expose the underlying state relationship. Finally, to explicitly identify the state trajectory, where traditional methods [14] utilize singular value decomposition, this approach uses PCA. 
The end result is a state trajectory that can be used to identify the system matrices using regression techniques.
The key use of NIPALS algorithms in these steps gives this approach the ability to handle missing data. For a more detailed explanation of the approach used in this paper, see the work of Patel et al. [24].
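To illustrate why NIPALS-type iterations tolerate missing entries, the following is a minimal sketch of a missing-data NIPALS PCA. It is a simplified stand-in for the PCA steps, not the full identification procedure of [24]:

```python
import numpy as np

def nipals_pca(X, n_comp, tol=1e-8, max_iter=500):
    """NIPALS PCA that skips missing entries (NaN) in the score/loading
    regressions -- the property that lets the subspace steps tolerate
    infrequently sampled outputs. Minimal sketch, not the method of [24]."""
    X = np.array(X, dtype=float)
    mask = ~np.isnan(X)
    Xw = np.where(mask, X, 0.0)  # zeroed entries contribute nothing below
    T = np.zeros((X.shape[0], n_comp))
    P = np.zeros((X.shape[1], n_comp))
    for a in range(n_comp):
        # initialize scores from the most completely observed column
        t = Xw[:, np.argmax(mask.sum(axis=0))].copy()
        for _ in range(max_iter):
            # loadings: regress each column on t over observed entries only
            p = (Xw.T @ t) / (mask.T @ (t * t) + 1e-12)
            p /= np.linalg.norm(p)
            # scores: regress each row on p over observed entries only
            t_new = (Xw @ p) / (mask @ (p * p) + 1e-12)
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, a], P[:, a] = t, p
        Xw = Xw - np.where(mask, np.outer(t, p), 0.0)  # deflate observed entries
    return T, P
```

Because every regression sums only over observed entries, a latent structure consistent with the observed data is recovered without interpolating or imputing the missing values first.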

Remark 3.
Sartorius has developed a good first principles model of the bioreactor; however, the parameter estimation problem continues to be a focus of future work. That said, the proposed data driven approach could readily be utilized with the first principles model. Thus, a data driven approach which leverages the process data can be utilized to develop a hybrid model for improved prediction power [29].

Remark 4.
One of the considerations when modeling a dynamic process such as cell growth is that there are different phases in cell growth that occur over the course of one batch. The metabolic response of cells to their environment is complex and therefore strongly nonlinear. This response is also widely believed by biologists to be non-Markov (i.e., cells have memory of historical conditions). In these different phases, the process may behave differently making a linear time invariant model an unsuitable choice. To handle this situation, it is possible to treat each different growth phase as a separate smaller batch. This differs from the traditional batch problem since the beginning of each smaller batch represents the end of the previous smaller batch. These smaller batches would then be used as part of the model identification allowing the identified subspace model to appropriately capture the behavior in each phase. As can be seen in the Application Section, the present data driven modeling approach works reasonably well. In future work, as more batches need to be modeled (and the model will likely be utilized for feedback control purposes), such phase-based identification approaches will be pursued. Finally, another direction of generalization would be to determine good initial conditions for the states based on the measured observations. Presently, the states are initialized at a value which is the average of the value for the batches used in training. In future work, an approach can be followed where the subspace model is better initialized for quick convergence of the state observer and the resultant ability to predict starting from early on in the batch.
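The phase-splitting idea in this remark can be sketched as follows. The phase boundaries here are illustrative placeholders, since the present work does not actually segment the batches:

```python
def split_into_phases(times, values, phase_boundaries):
    """Split one batch trajectory into growth-phase sub-batches for
    identification, where each sub-batch starts at the last sample of the
    previous one so no dynamics are lost at the boundary. The boundary
    indices are hypothetical, supplied e.g. from known phase-shift times."""
    sub_batches = []
    edges = [0] + list(phase_boundaries) + [len(times) - 1]
    for start, end in zip(edges[:-1], edges[1:]):
        # include the boundary sample in both neighbouring sub-batches
        sub_batches.append((times[start:end + 1], values[start:end + 1]))
    return sub_batches
```

Each resulting sub-batch can then be treated as its own training batch, with the shared boundary sample carrying the state from one phase to the next.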

Remark 5.
Note that one of the advantages of a first principles modeling approach is that it can be more easily extrapolated. Thus, if a first principles modeling approach is used, the resultant rate expressions can be utilized to model the operation of the process in a continuous fashion. On the other hand, a data driven model identified using data from batch operation cannot be directly applied to continuous operation. It can, however, serve as a 'starting point' and be adapted using the monitoring based re-identification approach, and it can be retrained even more quickly if it is utilized as part of a hybrid modeling strategy by leveraging the extrapolation capability of the underlying first principles model.

Dynamic Model Validation
This section illustrates the validation procedure for a new batch. Recall that validation is the key step in model identification, providing a means to evaluate the success of a developed model. Note that one of the inherent features of any state space based model is the requirement of knowledge of the initial states. For a first principles model where not all the state variables are measured (as is often the case), using the model would require an initial state estimation step. In the present instance, the model is a linear state space model, with the states being a realization of the input output dynamics, and thus, by construction, unmeasured. By the same construction, however, the states are observable from the measured outputs, which enables the design of a state observer/estimator. Therefore, for a new dataset, an initial state estimate is first computed before prediction is possible. In the present work, a Luenberger observer is used at the beginning of the batch until the predicted outputs converge with the process outputs. The observer has the following form:

x̂[k + 1] = A x̂[k] + B u[k] + L(y[k] − ŷ[k])
ŷ[k] = C x̂[k] + D u[k]

where L is the observer gain and is chosen to ensure that (A − LC) is stable. The missing data problem has a specific implication in this regard and needs to be adequately accounted for: the above observer cannot be implemented directly when parts of the output are missing. Specifically, when the output measurement is missing, the term used to update the prediction, L(y[k] − ŷ[k]), yields an undefined value. To operate the state observer with missing data, this work uses the linearly interpolated value as the process measurement at time k in order to update the states.

Remark 6.
The use of linear interpolation for state estimation is only one of the possible approaches. In addition to multirate state estimation, which is a well documented problem, it is possible to build a smaller model without missing data for state estimation. 
This approach involves building a separate subspace model using the continuous output observations. This model can then be used to estimate the states of the system until they converge and the full model can be used for validation. This approach is not considered in the present manuscript, primarily because of the observed success of the modeling approach, but, with increased data availability and modeling challenges, it could very well be included in future work.
After the states have converged, the identified model's predictive capabilities are tested in the presence of missing outputs. The remainder of the batch is predicted using the model; however, as each process measurement comes in, the observer uses it to update the state estimate at that specific sampling time.
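The observer-based estimation described above can be sketched as follows, with missing outputs (marked as NaN) filled by linear interpolation before the innovation term is formed. The system matrices and observer gain are assumed to be already available:

```python
import numpy as np

def run_observer(A, B, C, D, L, u_seq, y_seq, x0):
    """Luenberger observer x̂[k+1] = A x̂ + B u + L(y − ŷ).
    Missing outputs (NaN) are replaced by linearly interpolated values
    before the innovation is formed, as described in the text. Sketch only."""
    y_filled = np.array(y_seq, dtype=float)
    for j in range(y_filled.shape[1]):            # interpolate each output channel
        col = y_filled[:, j]
        idx = np.arange(len(col))
        obs = ~np.isnan(col)
        y_filled[:, j] = np.interp(idx, idx[obs], col[obs])
    x_hat = np.array(x0, dtype=float)
    estimates = []
    for k in range(len(u_seq)):
        y_hat = C @ x_hat + D @ u_seq[k]          # predicted output at time k
        x_hat = A @ x_hat + B @ u_seq[k] + L @ (y_filled[k] - y_hat)
        estimates.append(x_hat.copy())
    return np.array(estimates)
```

Once the state estimate has converged, the same model can be run open-loop (dropping the innovation term) to predict the remainder of the batch.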

Remark 7.
While the present illustration utilizes linear interpolation for the state estimator, it does not assume any knowledge of the process, instead taking measurements as they become available. Linear interpolation is only used to allow a good state estimate to be obtained, which is not a part of validating the identified model's predictive capabilities. The model is still identified from a dataset with missing values and can be used to predict when process measurements are not available. Note that the model's predictive capability is not limited to a 'next step prediction'; the model predicts to the end of the batch and updates the trajectory with each available measurement to predict the final quality more accurately.
To show the effectiveness of the missing data approach on the Sartorius Bioreactor case study, this section identifies a dynamic model used to predict the quality variables and a dynamic model to identify the metabolite rates. These models are built on training data from the Sartorius Bioreactor and then validated on a separate batch. The error is calculated as the normalized prediction error between the model predictions and the true process outputs. Note that the error is only calculated at the points where process measurements are available. The error is calculated as follows:

error = Σ_k (y_process[k] − ŷ[k])^2 / predictions

where y_process represents the process outputs, ŷ represents the predicted outputs and predictions represents the number of available measurements. The identified dynamic model is presented below to show how the output profiles are generated.
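A minimal sketch of this error computation, evaluated only at the available measurements (marked as NaN when missing), could look as follows:

```python
import numpy as np

def prediction_error(y_process, y_hat):
    """Squared prediction error normalized by the number of available
    measurements, evaluated only where the process was actually sampled.
    Sketch of the error metric described in the text."""
    y_process = np.asarray(y_process, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    observed = ~np.isnan(y_process)        # skip missing samples entirely
    n_predictions = int(observed.sum())
    residuals = y_process[observed] - y_hat[observed]
    return float(np.sum(residuals ** 2) / n_predictions)
```

Restricting the sum to observed samples means batches with different amounts of missing data remain comparable on the same scale.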
The prediction errors from both the training data and the validation batch are shown in Table 1. As expected, the validation error is slightly larger than the training error since the validation batch was not used in model identification. Figure 4 shows the training results and Figure 5 shows the validation results from the quality model. In Figure 4, there are model predictions despite the lack of a process measurement because the model keeps track of the states internally allowing it to make predictions at every time step. As shown in both sets of figures, the model is able to accurately predict the trends in the metabolites and more importantly the viable cell density which shows the cell concentration at the end of the batch. This is the key parameter Sartorius uses in downstream processes, and, despite the large amounts of missing data, the trend was accurately predicted.

Metabolite Rate Modeling of the Bioreactor
The metabolite rate model is important for Sartorius in order to see the daily trends in the bioreactor. The goal is to be able to control the reactor overnight based on the end of day predictions. Thus, knowing the trends in the metabolite rates is an important factor when considering what additions need to be made before allowing the process to run. In addition to improving the model predictions, knowing the specific metabolite rates is important for ensuring that the data driven model matches the physical properties of the system. As described in Section 2.1, the metabolite concentrations have certain effects on the process that must be represented in the data driven model.

Metabolite Rate Model Identification
The metabolite rates are an important part of the bioreactor process as they determine how the outputs from the dynamic model change with respect to the viable cell density. Analyzing the metabolite rates is an important part of determining the ideal input conditions required for optimal growth in each stage. The specific metabolite rates are calculated as follows:

R_mt[k] = (C_mt[k] − C_mt[k − 1]) / (ix_v[k] − ix_v[k − 1]), with ix_v[k] = ix_v[k − 1] + h (x_v[k] + x_v[k − 1]) / 2

where R_mt denotes the metabolite rate for a metabolite mt with concentration C_mt, x_v represents the viable cell density, ix_v represents the integrated viable cell density and h represents the sampling interval. The modeling approach calculates the metabolite rates using three inputs (glucose concentration, temperature setpoint and pH setpoint) and five outputs (glucose rate, glutamine rate, lactate rate, glutamate rate and ammonia rate), and then builds a model to directly predict the metabolite rates. A metabolite rate model of order 3 was identified based on the training fit results. The LTI model is shown below:
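A sketch of this rate calculation, using a trapezoidal update for the integrated viable cell density, is given below; the exact discretization used in the paper may differ:

```python
import numpy as np

def specific_rates(c_mt, x_v, h):
    """Specific metabolite rate between consecutive samples: the change in
    metabolite concentration divided by the change in integrated viable
    cell density (trapezoidal ix_v increment). Symbols follow the text;
    this is one standard discretization, assumed for illustration."""
    c_mt = np.asarray(c_mt, dtype=float)
    x_v = np.asarray(x_v, dtype=float)
    d_ixv = h * (x_v[1:] + x_v[:-1]) / 2.0   # trapezoidal increment of ix_v
    return (c_mt[1:] - c_mt[:-1]) / d_ixv
```

Negative values then indicate net consumption of the metabolite and positive values net production, which is the sign convention used when interpreting the rate trends.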

Metabolite Rate Model Validation
The training fit and validation error are shown in Table 2 and are similar in magnitude. As shown in Figure 6 for the training data and in Figure 7 for the validation batch, the metabolite rates have a large amount of daily fluctuation. These trends are key to understanding the overnight behavior of the process, and the validation fit in Figure 7 shows how the metabolite rate model is able to accurately model the rates. For comparison, the metabolite rates are calculated using the dynamic model, as shown in Figure 8, and compared to the metabolite rates calculated using the measurements. As seen in this figure, the rate predictions calculated from the predicted measurements do not match well with the rates calculated using the measurements themselves. In essence, the errors in the predictions of the variables are amplified when used in the calculation of the metabolite rates. The calculated rates differ by an order of magnitude in comparison to the rates calculated based on the measurements. The advantage of directly modeling the metabolite rates is clearly demonstrated, as the calculated rate relies on the model predictions of the viable cell density and the metabolites; small errors in these variables compound, resulting in a poor prediction of the metabolite rates. Another drawback of the calculated rates arises with the glucose consumption rate. In the dynamic modeling approach, glucose is utilized as an input [2]; therefore, a glucose rate calculated from the dynamic model predictions is not meaningful when attempting to use this model for control purposes. Given the limitations of using calculated rates and the inability to model glucose, the use of a separate model to identify the metabolite rates is necessary.

Conclusions
In this study, the problem of identifying a dynamic batch model with large amounts of missing data was solved using a modified subspace identification procedure. The Sartorius Bioreactor problem had multi-rate sampling and also had discrete inputs from the glucose feed additions, which were modeled as instantaneous additions. When comparing the directly identified metabolite rate model to rates calculated from the overall dynamic model, the results show that directly modeling the rates using the missing data methods is more accurate. Additionally, the dynamic modeling approach is able to accurately handle the discrete input problem using an instantaneous addition profile. The key is the use of NIPALS algorithms to gracefully handle missing data, allowing for accurate model predictions of the validation batches.