1. Introduction
Controlling and observing industrial biotechnology processes is a challenging task for bioengineers. The main problem is collecting accurate information about the state of the process and its quality. The industry also demands that the process be as productive as possible, which adds to the difficulty. Overcoming these challenges requires high-quality and reliable process data, which make the process easier to control and its results more repeatable. Unfortunately, the industry still lacks accurate real-time measurements, especially for the main focus of almost all industrial cell cultivation processes: the concentration of the synthesized target product. Sampled, time-delayed measurements with additional instruments and time-consuming analyses remain the most common way to determine the product concentration throughout cultivations. In large-scale processes, this problem becomes more acute, with additional hardware costs and an increased possibility of errors. Therefore, the realization and implementation of software sensors that can estimate and predict quantities that are not measured directly, using information collected throughout the process, has become more prominent [1,2,3,4,5].
Target product concentration estimation in specific cultivations uses soft sensors that consist of various mathematical models [6]. These range from traditional mechanistic and empirical models to hybrid models, which have become increasingly prevalent for solving the estimation task. The classical form of a conventional model requires elaboration and tuning of its parameters to achieve satisfactory results [7]. Nevertheless, traditional mathematical models remain the fundamental basis of the software sensor, and in some instances, they are the most appropriate way to estimate process variables [8].
The use of traditional models for product estimation is seen in cultivations of P. chrysogenum for penicillin concentration [9], recombinant E. coli for protein concentration [10,11,12], and yeast fermentations for ethanol concentration [13]. Among the mechanistic unstructured models, the most popular approach is the extended Kalman filter (EKF) [14,15]. However, the accuracy of the EKF depends closely on the accuracy of the underlying mathematical model, and the filter may also suffer from convergence problems [16]. Nonetheless, the EKF is considerably robust to changes in initial process conditions, and has proven successful when applied in S. cerevisiae cultivations [6,17].
Applying traditional mathematical models to nonlinear and multidimensional systems may result in numerous errors due to the low flexibility of simple-structure differential equations. Therefore, researchers frequently choose an empirical model as an alternative approach that does not require a detailed description of the process, but rather quantitative and qualitative data of the bioprocess. Among these data-driven models, the most successful and commonly applied are ANN (artificial neural networks), PLS (partial least squares), and PCA (principal component analysis)-based soft sensors. The latter, combined with spectroscopy, has been proven to provide satisfactory results in product estimation [18,19]. Meanwhile, ANNs have become crucial to hybrid models for product and state estimation [10,20]. The use of ANNs is prominent not only as an alternative for describing complex parts of the processes, but also in combination with additional off-gas analysis or spectroscopy data [21,22]. However, using such supplementary equipment for data gathering increases the process cost while also requiring added algorithms to compensate for possible drifts in the gas sensors or to filter the spectroscopy data. Additionally, the estimation becomes time-delayed when samples are taken periodically. Generally speaking, ANN-based software sensors, compared with traditional mathematical models, achieve more satisfactory results and require less development time [10,23].
A quick overview of the different techniques employed for specific product estimation can be seen in Table 1.
Our study aims to employ and expand the Luedeking–Piret model [25], and present an extension of the protein product estimation model based on gathered offline data. This paper improves the previous functional model by adding cell age and extensive model fitting analysis. The purpose of the proposed mathematical model is not to descriptively define the bioprocess, but instead to identify the correct state variables and their interrelationships that maximize synthesized product content.
Section 2: Materials and Methods describes the test object, processes, and operating conditions.
Section 3: Proposed Extension of Akaike Information Criterion presents the modified Akaike criterion for model fitting with the addition of a tuning coefficient.
Section 4: Combined Model Representing Hypothesis with Multiple Elements reviews previous similar maximal production rate expressions and proposes an improved model for target protein fitting.
Section 5: System Identification and Parameter Estimation presents the model’s parameter identification methods and the use of cell ages.
Section 6: Model Selection Based on Experimental Model Calibration compares the different models presented.
Section 7: Discussion and Conclusions presents final remarks about the results and model fitting.
3. Proposed Extension of Akaike Information Criterion
The classical form of the Akaike information criterion allows for selecting an informative set of parameters with an inevitable trade-off concerning the model’s fitting uncertainty [27]. Let n be the number of observation samples, k the number of model parameters, and MSE the mean squared error of the residuals. Then, the Akaike measure is
$$\mathrm{AIC} = n \ln(\mathrm{MSE}) + 2k. \qquad (1)$$
An alternative is the Bayesian information criterion, or BIC, which contains the variance $\sigma^2$ of errors instead
$$\mathrm{BIC} = n \ln(\sigma^2) + k \ln(n). \qquad (2)$$
One drawback of both BIC and AIC is that neither criterion has a tuning coefficient that would minimize the number of parameters to be used without changing the shape of the likelihood distributions. Another consideration is a tuning coefficient that would involve some theoretical asymptotic maximum number of parameters. In reality, the log-likelihood part of the criterion need not be related to average characteristics; it may also rest on cumulative characteristics based on the sum of squared residuals, RSS. This amount divided by the degree of freedom n recovers MSE and presents the average discrepancy between the reading $y_i$ observed at time $t_i$ and the value $\hat{y}_i$ estimated by the model. Such cumulative discrepancy depends on the number of observations n, and has the form of
$$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2. \qquad (3)$$
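As a minimal sketch of the quantities discussed here, the following Python snippet computes RSS, MSE, and the standard least-squares forms of AIC and BIC for Gaussian residuals. The function names and the sample readings are hypothetical and serve only as an illustration; the snippet does not implement the entropic criteria proposed in this section.

```python
import math

def rss(observed, predicted):
    """Sum of squared residuals between readings and model estimates."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

def aic_least_squares(observed, predicted, k):
    """Least-squares AIC: n * ln(MSE) + 2k, with MSE = RSS / n."""
    n = len(observed)
    mse = rss(observed, predicted) / n
    return n * math.log(mse) + 2 * k

def bic_least_squares(observed, predicted, k):
    """Least-squares BIC: n * ln(MSE) + k * ln(n)."""
    n = len(observed)
    mse = rss(observed, predicted) / n
    return n * math.log(mse) + k * math.log(n)

# Hypothetical readings and model estimates, for illustration only.
y_obs = [1.0, 2.1, 2.9, 4.2, 5.1]
y_fit = [1.1, 2.0, 3.0, 4.0, 5.0]
aic_value = aic_least_squares(y_obs, y_fit, k=2)
bic_value = bic_least_squares(y_obs, y_fit, k=2)
print(aic_value, bic_value)
```

Both criteria share the same goodness-of-fit term and differ only in the penalty: 2k for AIC and k ln(n) for BIC, so BIC penalizes additional parameters more strongly once n exceeds about 7.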
Therefore, we suggest two entropic criteria for prospective model selection, which have a tuning coefficient
, a likelihood
, and a maximum likelihood
, yielding
The other information measure,
S, in the entropic representation, which can serve equally well, is
Then, one can determine
and
, with which
This links to Equations (1) and (2). In other words,
and
The motivation for such tuning is the need to avoid overfitting to experimental data when a user applies the raw AIC or BIC criteria with a likelihood in any probabilistic form. Furthermore, the practical expectation is that the criterion be as generic as possible, and the likelihood’s shape should not require modification. Consequently, an investigator must pick a set of parameters such that minimal effort is required to perform a trial when seeking rational bioprocess optimization. For example, only one or two cultivation protocol changes should be made to noticeably increase the overall total product, i.e., by more than about 10 percent. A biopharmaceutical manufacturer is expected to make as few changes as possible. At the same time, the manufacturer must strive for maximal repeatability and standardization according to EU CE labeling, EU medical device regulation (MDR), and US Food and Drug Administration (FDA) requirements at good manufacturing practice (GMP) or current GMP (cGMP) facilities. This is particularly true when service providers provision a CDMO (contract development and manufacturing organization) technology transfer. Therefore, the upstream developers have one or two protocol adaptations or parameters at their disposal for a single experimental iteration consisting of unique experimental development trials or minor online checks.
In this study, we propose generic forms of Equations (4) and (5) that can be used to select such a minimal set of parameters that both reach (the principle of parsimony [28]) and match (the principle of convex optimization [29]) the extremum state of the measure.
4. Combined Model Representing Hypothesis with Multiple Elements
The previous study [11] introduced an additional protein production yield parameter to extend the Luedeking–Piret model for fed-batch cultivations [25,30,31]. The model relied on the oxygen uptake rate (OUR) for estimation of the biomass X.
The addition of the production yield, which represents the oxygen consumption yield for the protein synthesis rate, supplements the previous oxygen consumption parameters of the cell for biomass growth and maintenance. The expanded model achieved a pseudo-global estimation of synthesized protein and biomass concentration [29,32,33]. Such a procedure corresponds to pseudo-global offline model calibration. It was assumed that protein yield was a function of biomass concentration in a gray box model [34].
As shown in a previous work, protein productivity depends on the IPTG (isopropyl β-D-1-thiogalactopyranoside) and biomass concentrations at the time of induction [29,35]. The latter had a significant impact on the model, such that the product formation parameter
became a function of biomass concentration at time of induction. Then, the final estimator form became
The expression of the product model is based on the assumption of the linear dependency of product synthesis on the specific growth rate (SGR) of biomass [
36]
where
is the specific protein accumulation rate (U/g/h),
µ the specific biomass growth rate (1/h), and
the specific protein activity (U/g), where the protein concentration is normalized by biomass concentration. Even though the previous study assumed that the maximum target protein formation rate was linked to the specific substrate consumption rate, the underlying idea is still the same in this study. Finally, the time constant
was assumed to have a self-inhibiting effect [
37].
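For orientation, the classical Luedeking–Piret relation combines a growth-associated and a non-growth-associated term, dP/dt = α·dX/dt + β·X. The sketch below integrates this classical form with explicit Euler steps under the simplifying assumption of a constant specific growth rate; it is not the extended model of this study, and all parameter names and values are hypothetical.

```python
import math

def simulate_luedeking_piret(x0, p0, mu, alpha, beta, t_end, dt=0.001):
    """Explicit Euler integration of the classical Luedeking-Piret model.

    dX/dt = mu * X                      (biomass, constant specific growth rate)
    dP/dt = alpha * dX/dt + beta * X    (growth- and non-growth-associated product)
    """
    x, p = x0, p0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        dx_dt = mu * x
        dp_dt = alpha * dx_dt + beta * x
        x += dx_dt * dt
        p += dp_dt * dt
    return x, p

# Hypothetical values: 1 g/L biomass, no product, mu = 0.3 1/h, 5 h of growth.
x_end, p_end = simulate_luedeking_piret(
    x0=1.0, p0=0.0, mu=0.3, alpha=2.0, beta=0.1, t_end=5.0
)
print(x_end, p_end)
```

With a constant specific growth rate, the product formed is proportional to the biomass formed, P − P0 = (α + β/µ)(X − X0), which gives a quick consistency check for the integration.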
Over the years, multiple researchers have studied how different process variables and parameters affect the model of
.
Table 3 presents significant historic parametric developments.
Levisauskas et al. expressed the maximal production rate via the concept of active biomass [38,39]. The latter is assumed to be the part of the biomass that is responsible for production of the specific product. The average cell age identifies the active biomass
at any time
throughout the bioprocess. The expression of average cell age, including the initial biomass boundary condition, is
where
is initial biomass at time of inoculation to a bioreactor. If the latter is assumed to be negligible,
takes the following form
Equation (13) is the recovery of a particular case, shown in Equation (12), taken from D. Levisauskas and others’ research [
38,
39]. Assuming that
, the maximal production rate
at time
is
where
is the growth of biomass throughout the
j-th time interval, and m (0 < m < 1) is the relative activity ratio that introduces the linearly increasing and decreasing transient effect of the age. The parameter m is described by a trapezoid time function, which consists of four model parameters presumably related to each culture.
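A trapezoid function with four parameters can be sketched as follows. The sketch assumes that the four parameters are the breakpoints of the trapezoid, that the argument is the cell age in hours, and that the plateau equals 1; none of these choices is stated in the cited works, and the names are hypothetical.

```python
def trapezoid_activity(age_h, rise_start, rise_end, fall_start, fall_end):
    """Relative activity ratio as a trapezoid over cell age (assumed form).

    Zero before rise_start, linear rise to 1 at rise_end, plateau until
    fall_start, linear fall to zero at fall_end.
    """
    if not rise_start < rise_end <= fall_start < fall_end:
        raise ValueError("breakpoints must be in increasing order")
    if age_h <= rise_start or age_h >= fall_end:
        return 0.0
    if age_h < rise_end:
        return (age_h - rise_start) / (rise_end - rise_start)
    if age_h <= fall_start:
        return 1.0
    return (fall_end - age_h) / (fall_end - fall_start)

# Hypothetical breakpoints in hours, for illustration only.
for age in (0.5, 1.5, 3.0, 5.0, 7.0):
    print(age, trapezoid_activity(age, 1.0, 2.0, 4.0, 6.0))
```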
The most recent functional protein model [11] relies on the assumption that the maximal specific product concentration value is asymptotically dependent on SGR. However, the authors identified an apparent effect of IPTG injection on product synthesis through data analysis. Therefore, the functional model was expanded with the addition of biomass at induction time
where
and
are tuning parameters.
Other researchers [
12] tried one more variation of the maximal product formation model
Such an approach was based on a rational assumption of what inhibits the maximal product formation rate. As far as we know, no efforts were made to test the different hypotheses of various methods with the same datasets originating from different sources. We propose a method of model selection using the principles of parsimony and convex optimization in this study. This is based on Equations (7) and (8).
With the combined approach of both product synthesis models, we include an expanded protein function model, where
is the hypothesis of a mixture of linearly dependent competing models
where 24 model coefficients represent the parametric set of
, as defined in
Here,
are the optimization parameters of the model to be established. All of them contain zero values at the start of the convex search. The subset of linear terms represents the linear term of Equation (18), and some of them are the basis of Monod’s formulation theories [
40,
41]. The matches are depicted in
Table 4.
The novelty of this study is the proposed average cell age at induction time. Because the researchers [38,39] did not study a recombinant bioprocess in their work, the effect of IPTG injection has so far not been assessed. Based on the experimental data, we deduced that the average cell age and specific growth rate during the induction time are the most significant parameters to consider when creating a protein formation model.
6. Model Selection Based on Experimental Model Calibration
We analyzed two datasets in this study, derived from different samples from two independent sites. The first repository consisted of 46 independent experiments and, in total, 196 readings. The other dataset, from the second site, contained 24 unique biosyntheses and, in total, 131 protein observations. To use a single
with
in the same model selection routine, we picked a normalized form by reusing two sums of squared residuals (
and
) for each site
This allowed for distributing the average variances of the estimates evenly over both sites’ repositories. After the maximization of Equation (26), a convex search of the data from previous studies gave the results shown in Table 5. To check for errors at the beginning of product synthesis, we added the criterion of mean absolute error (MAE) to the evaluation.
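The two error measures mentioned in this paragraph can be sketched as follows. The MAE is the standard definition. The pooled measure is only one plausible reading of the normalization, assumed here to be the equal-weight average of the per-site mean squared errors; the exact normalized form used in the study is not reproduced, and the function names and numbers are hypothetical.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute deviation between readings and model estimates."""
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / len(observed)

def balanced_mse(rss_site1, n_site1, rss_site2, n_site2):
    """Equal-weight average of per-site MSE values (assumed normalization).

    Each site contributes its own RSS / n, so a site with more readings
    does not dominate the pooled error.
    """
    return 0.5 * (rss_site1 / n_site1 + rss_site2 / n_site2)

# Hypothetical residual sums; the sample counts 196 and 131 come from the text.
pooled = balanced_mse(rss_site1=19.6, n_site1=196, rss_site2=13.1, n_site2=131)
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(pooled, mae)
```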
At first glance, according to the AIC in Table 5, the investigation from 2019 [11] improved on the studies from 1999 [38,39] and 2003 [12]. In turn, the study of 2003 [12] improved upon the AIC of 1999 [38,39]. However, according to the MAE criterion, which is more relevant to product formation, the oldest assumption in the literature [38,39] is more powerful than the newer findings derived up to 20 years later. Moreover, if the AIC were followed literally, overfitting of the overall model would have been favored, as the last row of Table 5 demonstrates. This observation led us to study the product formation model further, and to search for better ways of selecting a model that has fewer parameters and avoids overfitting by design.
First of all, there is a possible value for the maximum number of coefficients (
) that asymptotically makes the entropic criteria work the same way as the original
AIC and
BIC measures. The maximization of correlation between
AIC and
(Equation (4)), and then
(Equation (5)), generates corresponding
values
and
, which are shown in
Table 6.
Similarly, maximizing the linear relationship between
BIC and
, and then
, provides the data for
Table 7. We asymptotically tuned both
AIC and
BIC on the sum of correlations of 33 models, which together comprised a specific subset of Equation (18). We tried more reproductions with different assumptions in this study. However, those 33 representations comprising Equation (18) are the best set, according to our investigation experience. The maximal parametric complexity we tried was
in this study.
Table 6 and Table 7 both show that each entropic measure of S is a more generic quantity that can help restrict the number of expected state variables, thus helping with upstream CDMO development in the biopharmaceutical industry. Typically, two to four coefficients are preferred in optimal control routines, because the degrees of freedom in Hamiltonians intensify computational requirements. The main reason for this is that Hamiltonians are frequently solved numerically or using hybrid approaches, of which arithmetic processing still represents an extensive part. As such, we present experimental findings for a maximal number of model parameters of
, unless specifically stated otherwise.
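The effect of restricting the number of coefficients can be illustrated with a simple stand-in: ranking candidate models by the least-squares AIC while discarding any model above a hard cap on the parameter count. This hard cap is a swapped-in simplification, not the entropic criteria with a tuning coefficient proposed in this study; the candidate names, parameter counts, and RSS values are hypothetical, while n = 327 is the total number of protein samples reported for both sites.

```python
import math

def aic_from_rss(rss, n, k):
    """Least-squares AIC computed from a residual sum of squares."""
    return n * math.log(rss / n) + 2 * k

def select_model(candidates, n, k_max):
    """Return the name of the lowest-AIC candidate with at most k_max parameters.

    candidates: iterable of (name, k, rss) tuples.
    """
    allowed = [
        (aic_from_rss(rss, n, k), name)
        for name, k, rss in candidates
        if k <= k_max
    ]
    if not allowed:
        raise ValueError("no candidate satisfies the parameter cap")
    return min(allowed)[1]

# Hypothetical candidates: (name, number of parameters, RSS).
candidates = [("two-term", 2, 16.0), ("four-term", 4, 15.0), ("ten-term", 10, 13.5)]
print(select_model(candidates, n=327, k_max=24))  # unrestricted choice
print(select_model(candidates, n=327, k_max=4))   # choice under a cap of four
```

In this illustration, the unrestricted AIC favors the model with the most parameters, whereas the cap forces a choice among the compact models that are practical for optimal control routines.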
Before proceeding with model selection, we must check the significance of the tuned model parameters individually. We select
and two other coefficients with state variables and a significant history [
11,
12,
38,
39], which we found to be the best descriptors.
The specific growth rate at the time of induction is the most significant parameter from a singleton analysis perspective, as Table 8 shows. This table offers two insights:
- (a)
There is significant doubt that belongs to the descriptor set;
- (b)
Even if the specific growth rate surpasses the average cell age, the significance of either is still relatively similar. Therefore, there is a high chance that both of them combine in a single nonlinear relationship that is proportional to the maximum product formation rate.
Such thinking led us to construct maximum product expression, as in Equation (18). We will use the maximum number of models assessed during our criterion asymptotic analysis, and set
. The five best model equations that derive from Equation (18) are
Table 9 depicts the parameter values of the models in Equations (29)–(32).
The second additive term, as used in Equations (29)–(32), and the first additive term, as used in Equation (32), are Monod terms whose two coefficients carry a specific physiological meaning: the maximum specific target protein formation rate is the product of these coefficients, and the additive coefficient in the denominator defines the average age at which the product formation rate represented by the term is halved. The ideal average age for induction is somewhere between 1.066 h and 1.3 h, at which point product formation has the highest theoretical rate of acceleration. It remains to be determined whether it is a coincidence that the minimum induction time was 1.14 h for the first site and 1.237 h for the second site.
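The half-rate interpretation of the denominator coefficient can be checked with a generic Monod-type term. The sketch assumes the saturating form rate_max · Age / (half_age + Age); the exact terms of Equations (29)–(32) are not reproduced here, and the parameter names and values are hypothetical.

```python
def monod_term(age_h, rate_max, half_age_h):
    """Monod-type saturation in average cell age (assumed form).

    Returns rate_max * age / (half_age + age); the value equals
    rate_max / 2 when age equals half_age.
    """
    if age_h < 0 or half_age_h <= 0:
        raise ValueError("age must be non-negative and half_age positive")
    return rate_max * age_h / (half_age_h + age_h)

# Hypothetical values: maximum rate 10 U/g/h, half-rate age 1.2 h.
half_rate = monod_term(age_h=1.2, rate_max=10.0, half_age_h=1.2)
late_rate = monod_term(age_h=120.0, rate_max=10.0, half_age_h=1.2)
print(half_rate, late_rate)
```

At an average age equal to the denominator coefficient, the term delivers half of its maximum, and it approaches the maximum as the average age grows.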
As the mean absolute error is the smallest for the model with more variables in Equation (29), other maximal counts of model parameters remain to be verified. The asymptotic analysis using
, which is the maximum number of tested parameters per experiment in this study, suggests the following five alternatives:
Table 10 shows another alternative set of coefficients, which verify that the average age has a more substantial effect at the start of product formation. Thus far, Equation (29) gives the best estimate of the total product.
There is still one model to consider, which can improve
MAE to 0.424
However, this model’s RSS is poor, at 14.826. Further increasing the number of parameters starts to reduce the MAE due to overfitting.
7. Discussion and Conclusions
The results of the model selection and the application of enhanced AIC show two things:
- (a) As regards rational, practical benefits, the proposed entropic measures can help with tuning the maximum count of the model parameters, thus helping devise standardized CDMO procedures for attaining higher product yields from biopharmaceutical efforts;
- (b) Both average age and biomass growth values at the time of induction, or in other words, at the very start of product synthesis, are crucial. Therefore, the combined model employing Monod structures is the best recommendation for maximizing the total product yield.
Similar to the Akaike information criterion, the Bayesian information criterion can also be viewed as a particular asymptotic enhancement of the entropic expansion of AIC. Such an approach avoids altering the likelihood or reorganizing the experiments. Instead, it brings the benefit of adjustability in the maximum number of expected coefficients. Moreover, two entropic values are available for scientists to exploit: relative entropy and Shannon entropy. The experimental model fitting was performed simultaneously on 46 experiments at the first site and 24 fed-batch experiments at the second site. The two locations contributed 196 and 131 protein samples, respectively, giving a total of 327 target product tests using the bioreactor medium.
Regarding the physiological characteristics of any aerobic microbial system, we observed that the average cell age and the inhibition coefficient are both more relevant, and describe the model better, at the very beginning of product biosynthesis. At the same time, the specific growth rate improves upon the latter overall, when considering the total (recombinant target protein) expression at the end of the experiments.