1. Introduction
Yeast fermentation is one of the most important biomanufacturing processes. Ethanol, a clean energy source and an important substitute for fossil fuels, is mainly produced by yeast fermentation. However, the growth of yeast is sensitive to environmental conditions such as temperature, pH, and substrate concentration, and the fermentation process involves a complex biochemical mechanism [1]. Therefore, it is difficult to build an accurate model of the yeast fermentation process.
Roles [2] modeled the yeast fermentation bioreactor based on the kinetics of yeast growth. In addition to the kinetics of yeast fermentation, Nagy [3] built a detailed model that involves heat transfer, the dependence of kinetic parameters on temperature, the mass transfer of oxygen, and the influence of temperature and ionic strength on the mass transfer coefficient. Considering the inhibitory effect of ethanol, a stripping model was proposed to further improve the prediction accuracy of the mechanism model [4]. Further, to recover ethanol from the gas mixture produced by the CO2 stripping method, Rodrigues et al. [5] proposed an improved mechanism model based on the mass balance equation, stripping, absorption kinetics, and gas-liquid balance. Recently, a new modeling technique considered cell cycling, which can reduce yeast consumption and raw material costs [1]. Although the above mechanism models reflect the process well and have good extrapolation characteristics, they require the kinetic structure, the parameters, and the mechanism of the yeast fermentation process to be fully understood, which may take a large number of experiments.
Data-driven methods do not require detailed prior knowledge about the yeast fermentation process and its mechanisms. Instead, the process model is identified from a large amount of process data. Many data-driven methods based on machine learning have been widely used to develop fermentation process models. Ławryńczuk [6] used a back-propagation (BP) neural network to model and control the temperature of the bioreactor. Zhang [7] employed a least squares support vector machine to establish a nonlinear model and to optimize the control of the process. Smuga-Kogut et al. [8] used a modeling method based on the random forest to predict bioethanol concentration. Konishi [9] estimated bioethanol production from volatile components in lignocellulosic biomass hydrolysates by deep learning. The extreme learning machine (ELM) proposed by Huang et al. [10] greatly improves computational speed because it adjusts only the output-layer weights of the network and, unlike a BP neural network, does not tune the network weights by gradient descent. It learns faster than a traditional support vector machine or neural network while achieving similar generalization performance. Sebayang et al. [11] developed an ELM model to predict the performance of engines fueled by bioethanol. For data-driven modeling, the quality and quantity of the data are crucial to the quality of the model. For the yeast fermentation process, however, a large number of high-quality data points may be difficult to obtain because of the limitations of measurement techniques, long experimental cycles, and similar constraints. Informative data are therefore essential for obtaining a satisfactory data-driven model.
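The ELM training idea described above can be illustrated with a minimal sketch. It assumes a single hidden layer with sigmoid activation and solves for the output weights by least squares; the class name `ELMRegressor`, the weight ranges, the hidden-layer size, and the toy sine data are illustrative assumptions, not details taken from [10] or from this paper.

```python
import numpy as np


class ELMRegressor:
    """Minimal ELM sketch: random fixed hidden layer, least-squares output layer."""

    def __init__(self, n_hidden=40, seed=0):
        self.n_hidden = n_hidden                  # illustrative default
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activations of the random hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Hidden weights and biases are drawn once and never trained.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._hidden(X)
        # Only the output weights are fitted, in one least-squares solve
        # (no gradient descent).
        self.beta = np.linalg.lstsq(H, y, rcond=None)[0]
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta


def demo_elm():
    """Fit a toy sine curve and return the RMSE on held-out points."""
    rng = np.random.default_rng(1)
    X = rng.uniform(-3.0, 3.0, size=(400, 1))
    y = np.sin(X[:, 0])
    model = ELMRegressor(n_hidden=40).fit(X[:300], y[:300])
    err = model.predict(X[300:]) - y[300:]
    return float(np.sqrt(np.mean(err ** 2)))
```

Because training reduces to a single linear solve, the cost is dominated by forming the hidden-layer matrix, which is the source of the speed advantage mentioned above.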
Hybrid modeling combines mechanism modeling with data-driven modeling, which not only accounts for the dynamics of a process but also reduces the demand for process data. In essence, hybrid modeling uses data to correct the mismatch of the mechanism model and thereby provides higher modeling accuracy. Hybrid model structures are mainly divided into serial [12,13,14] and parallel [15,16] structures. For hybrid modeling with a parallel structure, the key is to build the residual model (RM), which uses a data-driven method to model the residual between the mechanism model and the process. The role of the RM is to compensate for the mechanism model. At present, most studies in the hybrid modeling framework use machine learning methods to train the RM, and the prediction accuracy of the resulting nonlinear models is higher than that of linear models [17]. Su et al. [15] used a BP neural network to train the RM and verified the excellent prediction accuracy of the resulting hybrid model in continuous polymerization processes. Niu et al. [18] established a hybrid model using the least squares support vector machine to predict the substrate concentration and product concentration of a fed-batch fermentation reactor. Chen et al. [19] applied support vector regression (SVR) and an artificial neural network to the hybrid modeling of continuous pharmaceutical processes. In these studies, the hybrid modeling method effectively solved the model mismatch problem. The advantage of parallel hybrid modeling is that it can build an accurate model without identifying the specific causes of the mismatch in the mechanism model and without performing new experiments; it uses the RM only to describe the dynamics that the mechanism model cannot describe. However, the above studies all rely on supervised learning, which considers only the correlation between the residuals and the process inputs (control variables). Other factors related to the residuals, such as oversimplification of the mechanism model and inaccurate model parameters, are not taken into consideration.
In semi-supervised learning, the learner combines labeled and unlabeled data to improve learning performance [20]. In this paper, semi-supervised learning is used to train the RM, and the mechanism model output is taken as unlabeled data. However, the data must be filtered; otherwise, the training burden of the model increases, and irrelevant data may even reduce the prediction accuracy of the model. Xu et al. [21] proposed a semi-supervised feature selection method (RRPC) based on correlation and redundancy criteria for feature classification, which shows that combining labeled and unlabeled data improves feature selection. To account for unmodeled dynamics, inaccurate model parameters, and measurement uncertainty in the mechanism model, as well as model training speed, this paper adopts a parallel hybrid modeling method that does not require the source of the mismatch in the mechanism model to be identified. The output of the mechanism model is taken as unlabeled data, the appropriate training data for each RM are selected through the RRPC algorithm, and a set of nonlinear residual models is then established by ELM. The established residual models are combined with the mechanism model to form a hybrid model. Finally, the effectiveness of the semi-supervised hybrid modeling method is verified by comparing its modeling accuracy and speed with those of existing modeling methods.
2. Yeast Fermentation Bioreactor
The continuous yeast fermentation process for ethanol production is treated as a simple continuous stirred reactor in which material is continuously fed to the bioreactor and product is continuously removed from it. Biomass (Saccharomyces cerevisiae) and the substrate (glucose) are the two main components of the bioreactor, and ethanol is the main product. Under ideal operating conditions, the contents of the bioreactor are fully mixed, and the stirring speed, feed concentration, pH value, substrate feed flow, and outlet flow are held constant according to the requirements of the process.
The comprehensive model of the yeast fermentation process is as follows [3]:
where the input u of the model is the flow rate of the coolant and the output vector represents the biomass concentration, ethanol concentration, substrate (glucose) concentration, oxygen concentration, reactor temperature, and jacket temperature in the bioreactor, respectively.
Fin, Csin, and Tin are the flow rate, concentration, and temperature of the substrate feed, Fout is the outlet flow rate of the bioreactor, Tinj is the temperature of the coolant,
is the oxygen saturation concentration, and
V and Vj are the volumes of the bioreactor and jacket, respectively. Note that Fin = Fout, so the total volume of reaction medium V remains constant.
In Equation (1), the maximum specific growth rate μx depends on Tr:
In Equation (4), the oxygen mass transfer coefficient is represented by the following temperature function:
The oxygen saturation concentration depends on Tr and pH (the overall effect of ionic strength):
The overall effect of ionic strength is as follows:
In Equations (4) and (5), the rate of oxygen consumption during biomass growth is:
The parameter nomenclature and parameter values used in Equations (1)–(11) can be found in reference [3].
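As a generic illustration of how the mass balances of a continuous, constant-volume bioreactor (Fin = Fout) can be integrated numerically, the sketch below uses a textbook Monod growth law with explicit Euler steps. It is not the model of [3]: it omits ethanol, oxygen, and the temperature dynamics, and every parameter value and name in it (`dilution`, `mu_max`, `k_s`, `yield_xs`, `s_in`) is hypothetical.

```python
def simulate_chemostat(t_end=200.0, dt=0.01, dilution=0.05,
                       mu_max=0.4, k_s=1.0, yield_xs=0.5, s_in=60.0):
    """Euler integration of a generic constant-volume continuous bioreactor.

    All parameter values are hypothetical and chosen only for illustration.
    dilution = F_in / V = F_out / V, because the volume is constant.
    Returns the final biomass and substrate concentrations.
    """
    x, s = 0.1, s_in                      # initial biomass and substrate
    for _ in range(int(t_end / dt)):
        mu = mu_max * s / (k_s + s)       # Monod specific growth rate
        dx = mu * x - dilution * x        # biomass balance: growth - washout
        ds = dilution * (s_in - s) - mu * x / yield_xs  # feed - outflow - uptake
        x += dt * dx
        s += dt * ds
    return x, s
```

With these illustrative values the simulation settles at the steady state where the specific growth rate equals the dilution rate, which is the usual operating point of a continuous fermenter.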
3. Semi-Supervised Hybrid Modeling Structure
The model above describes the yeast fermentation process. Although it can capture the main dynamics of the process, unmodeled dynamics, inaccurate model parameters, and measurement errors may lead to residuals between the mechanism model and the real process. The output vector of the actual process is defined as Yp, and the output vector of the mechanism model is defined as Ym; the residual vector of the process is then e = Yp − Ym, with one component for each of the six outputs. Before building the RMs, we need to select appropriate input data for each RM. The training data set includes the input of the bioreactor and the output of the mechanism model. Each residual model requires different training data. The input vector for the ith RM is selected through RRPC as follows:
where φi represents the input selection function of the ith RM, i = 1, 2, ..., 6, and e(i) denotes the ith residual variable, written in MATLAB indexing notation.
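The idea of selecting inputs that are relevant to a residual but not redundant with each other can be sketched as follows. This is not the RRPC algorithm of [21]; it is a simplified greedy stand-in that scores candidates by absolute Pearson correlation, and the function names, the scoring rule, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def select_inputs(candidates, residual, n_select=2):
    """Greedy relevance-minus-redundancy selection using |Pearson correlation|.

    candidates: array (n_samples, n_features), e.g. columns of [u, Ym].
    residual:   array (n_samples,), the ith residual e(i).
    Returns the indices of the selected columns, in selection order.
    """
    candidates = np.asarray(candidates, dtype=float)
    n_features = candidates.shape[1]
    # Relevance: |corr| between each candidate column and the residual.
    relevance = np.array([abs(np.corrcoef(candidates[:, j], residual)[0, 1])
                          for j in range(n_features)])
    # Redundancy: |corr| between every pair of candidate columns.
    redundancy = np.abs(np.corrcoef(candidates, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(n_select, n_features):
        remaining = [j for j in range(n_features) if j not in selected]
        # Penalize candidates that duplicate what is already selected.
        scores = [relevance[j] - redundancy[j, selected].mean()
                  for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected


def demo_selection():
    """Synthetic check: a near-duplicate column should not be picked twice."""
    rng = np.random.default_rng(0)
    a = rng.normal(size=500)
    b = rng.normal(size=500)
    near_copy_of_a = a + 0.01 * rng.normal(size=500)
    noise = rng.normal(size=500)
    candidates = np.column_stack([a, near_copy_of_a, b, noise])
    residual = a + 0.5 * b
    return select_inputs(candidates, residual, n_select=2)
```

In the synthetic example, the second pick is the independent column `b` rather than the near-duplicate of `a`, which mirrors the motivation for filtering the training data before each RM is trained.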
According to the selected input vector URMi and the residual e(i), ELM is used to construct a group of RMs. The output of the ith RM is:
A group of RMs and the mechanism model are combined into a hybrid model with a parallel structure, as shown in Figure 1.
Yh = Ym + YRM represents the predicted value of the hybrid model, where YRM, the output of the RMs, compensates the mechanism model.
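The parallel structure Yh = Ym + YRM can be illustrated on a toy single-output problem. In this sketch, `true_process` and `mechanism_model` are hypothetical stand-ins, and a degree-5 polynomial replaces the ELM-based RM purely to keep the example short; none of these choices come from the paper.

```python
import numpy as np


def true_process(u):
    # Hypothetical stand-in for the real process output Yp.
    return 2.0 * u + 0.5 * np.sin(3.0 * u)


def mechanism_model(u):
    # Simplified mechanism model Ym that misses the sinusoidal term.
    return 2.0 * u


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


def demo_hybrid():
    """Return (RMSE of mechanism model alone, RMSE of hybrid model)."""
    rng = np.random.default_rng(0)
    u_train = rng.uniform(0.0, 1.0, 200)
    u_test = np.linspace(0.0, 1.0, 50)
    # Residual between process and mechanism model: e = Yp - Ym.
    e_train = true_process(u_train) - mechanism_model(u_train)
    # Residual model: a polynomial fit stands in for the ELM-based RM.
    coeffs = np.polyfit(u_train, e_train, deg=5)
    y_rm = np.polyval(coeffs, u_test)
    y_m = mechanism_model(u_test)
    y_h = y_m + y_rm                      # parallel hybrid: Yh = Ym + YRM
    y_p = true_process(u_test)
    return rmse(y_m, y_p), rmse(y_h, y_p)
```

The RM is trained only on the residual, so the mechanism model is left unchanged and the correction is simply added to its output.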
A residual model with good precision can be trained via supervised learning, but in the real process, unmodeled structure and incorrect parameters of the mechanism model may cause a deviation between the mechanism model and the real process. The aim of this article is therefore to include the output of the mechanism model in the training data of the RMs. The semi-supervised learning method based on RRPC is used to select appropriate inputs and thus establish RMs with better accuracy. Meanwhile, ELM is used to train the RMs, which maintains good generalization performance and reduces the training time of the model. The semi-supervised hybrid modeling framework is shown in Figure 1; it includes the mechanism model, a group of RMs, an input selection module, and an ELM training module. The dashed box represents the offline training process of the ith RM.