Smart Soft Sensor Design with Hierarchical Sampling Strategy of Ensemble Gaussian Process Regression for Fermentation Processes

Accurate, real-time quality prediction that enables optimal process control at a competitive price is an important issue in Industry 4.0. This paper presents a successful engineering application showing how smart soft sensors can be combined with machine learning techniques to save human resources and improve performance under complex industrial conditions. Ensemble learning based soft sensors succeed in capturing the complex nonlinearities, frequent dynamic changes, and time-varying characteristics of industrial processes. However, the local model regions obtained with traditional ensemble modelling methods depend heavily on labeled data samples, so their prediction accuracy may suffer when labeled samples are limited. To overcome this drawback, a novel active learning (AL) framework built on the ensemble Gaussian process regression (GPR) model is proposed for smart soft sensor design. First, the most informative unlabeled samples are iteratively selected for labeling with a hierarchical sampling based AL strategy. The Gaussian mixture model (GMM) technique is then applied to identify operation phases autonomously, local GPR models are constructed without human involvement, and finally the base predictors are integrated with a Bayesian fusion strategy. Comparative studies on the penicillin fermentation process demonstrate the reliability and superiority of the proposed smart soft sensor. The cost of human annotation can be reduced by at least half while the prediction performance remains high.


Introduction
In recent years, artificial intelligence (AI) and machine learning (ML) have contributed greatly to the advancement of Industry 4.0 [1,2], which aims to ensure high-quality control of production-based industries in an increasingly complex environment through increased process automation, more efficient data analysis, lower human effort, safer working environments, and so on. Trillions of objects are connected to the Internet of Things (IoT), generating huge amounts of data. Sensors play an important role in real-time and efficient data collection and processing. In industrial plants, many key variables are closely related to product or process quality but can hardly be measured online with conventional hardware sensors. Soft sensors aim to predict these variables in real time and at low cost by constructing an inference model between the quality variables (usually difficult to measure) and the input variables (usually easily accessible with sensors), and they have attracted growing attention in many industrial applications [3,4]. Soft sensors based on data-driven algorithms are much more advantageous and easier to construct with little process knowledge, [...] select all of them, as they are likely to share the same process information. The hierarchical clustering (HC) method provides a feasible approach for exploring the spatial information between samples and their neighborhoods [28,29]. Compared with partition clustering algorithms, one of its most important advantages is that it clearly shows the clustering of a dataset at different spatial levels [30]. By pruning the HC tree with the AL strategy, this spatial information can be effectively extracted and used to mitigate the data sample selection problem.
Therefore, the motivation of this paper is to design a superior smart soft sensor, referred to as the ensemble GPR model with hierarchical sampling based AL strategy (AL-EGPR), to support real-time data processing and process control in Industry 4.0 environments. Traditional supervised learning based regression methods cannot efficiently utilize large amounts of unlabeled data, which is a nontrivial limitation. A novel AL strategy is therefore proposed and incorporated into the soft sensor modelling method. With the hierarchical sampling strategy, a new unlabeled sample that does not fall into any existing high-density cluster is considered highly informative and representative. A desired number of the most dissimilar unlabeled samples are selected for manual annotation in each learning iteration and then added to the training dataset for the next model construction, until satisfactory accuracy is achieved or all unlabeled samples have been labeled. Ensemble learning based on the GPR model is further introduced for robust soft sensor design, aiming to achieve better generalization than single-model predictors. The GPR model is chosen as the ensemble member because of its probabilistic structure and its strong ability to handle the abrupt changes and nonlinearity of industrial processes. In this method, the newly updated labeled training dataset is first divided into several local data domains by the GMM method, and a local GPR sub-model is built for each sub-dataset. The Bayesian inference strategy is then used to estimate the posterior probability of each query sample with respect to the local sub-models. Finally, the local predictions of the GPR sub-models are integrated into the final prediction by the finite mixture mechanism.
In addition, the Bayesian information criterion (BIC) [18,20] is applied to determine the optimal number of GMM components, in order to reduce the complexity of the soft sensor model and enhance its estimation ability. The proposed soft sensor has been applied to the prediction of penicillin concentration in the penicillin fermentation process, demonstrating that high performance can be achieved at low cost in terms of estimation accuracy and convergence speed.
The remainder of this paper is structured as follows. Section 2 briefly revisits the principles of the GPR model and the GMM method. Section 3 presents the detailed methodology of the AL strategy, including the HC method and the adaptive sampling strategy. Section 4 develops the ensemble GPR model based soft sensing technique with the AL strategy. Section 5 evaluates the effectiveness of the AL-EGPR method through simulation results for an industrial process, and Section 6 concludes the paper.

Gaussian Process Regression
A collection of random variables that have a joint Gaussian distribution can be seen as a Gaussian process (GP), which has been widely applied to define distributions over flexible models in regression and classification [12,13]. Given the training dataset of m-dimensional inputs $X(n \times m) = [x_1, x_2, \cdots, x_n]^T$ and outputs $y(n \times 1) = [y_1, y_2, \cdots, y_n]^T$, the output observations with a zero-mean Gaussian prior distribution can be represented by:
$$y \sim GP(0, K)$$
where $GP(0, K)$ denotes the GP with zero mean and covariance $K$, and the ij-th element of the matrix $K$ is given by the kernel function $k(x_i, x_j)$. In this research, the squared-exponential function is used as the kernel function, which is defined as:
$$k(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2}(x_i - x_j)^T M (x_i - x_j)\right) + \sigma_n^2 \delta_{ij}$$
with the unknown positive hyperparameter set $\Theta = \{l, \sigma_f^2, \sigma_n^2\}$, where $l$ denotes the length-scale, $M = l^{-2} I$, $\sigma_f^2$ and $\sigma_n^2$ represent the signal variance and the noise variance, respectively, and $\delta_{ij}$ is the Kronecker delta, satisfying $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise.
Therefore, the aim of the GPR training process is to estimate the hyperparameter set $\Theta$. The parameters can be determined by maximizing the log-likelihood function, which is represented as follows:
$$\log p(y \mid X, \Theta) = -\frac{1}{2} y^T K^{-1} y - \frac{1}{2} \log |K| - \frac{n}{2} \log(2\pi)$$
Once the optimal hyperparameter set $\Theta^*$ is estimated, the GPR model can estimate the distribution of the quality variable $\hat{y}_t$ for a new test sample $x_t$, which is formulated as:
$$\begin{bmatrix} y \\ \hat{y}_t \end{bmatrix} \sim N\left(0, \begin{bmatrix} K & k_t \\ k_t^T & k(x_t, x_t) \end{bmatrix}\right)$$
The posterior distribution of the GPR output can be expressed by $(\hat{y}_t \mid X, y, x_t) \sim N(\mu(\hat{y}_t), \sigma(\hat{y}_t)^2)$, where $\mu(\hat{y}_t)$ and $\sigma(\hat{y}_t)^2$ denote the posterior mean and variance, respectively. In this case, the estimation results can be described by:
$$\mu(\hat{y}_t) = k_t^T K^{-1} y, \qquad \sigma(\hat{y}_t)^2 = k(x_t, x_t) - k_t^T K^{-1} k_t$$
where $k_t = [k(x_t, x_1), k(x_t, x_2), \cdots, k(x_t, x_n)]^T$ is the covariance vector between the data point $x_t$ and the training points $x_{1:n}$. Finally, the expectation $\mu(\hat{y}_t)$ of the posterior distribution is taken as the estimation result $\hat{y}_t$ of the GPR based predictor.
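The GPR prediction step described above can be illustrated with a minimal NumPy sketch. It assumes the standard squared-exponential kernel with $M = l^{-2} I$ and fixed, hand-picked hyperparameters; the function names (`se_kernel`, `gpr_predict`, `log_marginal_likelihood`) and the default values are illustrative assumptions, not the implementation used in this study. In practice the hyperparameters would be obtained by maximizing the log-likelihood rather than set by hand.

```python
import numpy as np

def se_kernel(A, B, length_scale, sigma_f):
    """Squared-exponential kernel between rows of A and B (noise-free part)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def _factorize(X, y, length_scale, sigma_f, sigma_n):
    """Cholesky factor of K (kernel plus noise on the diagonal) and K^{-1} y."""
    K = se_kernel(X, X, length_scale, sigma_f) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return L, alpha

def gpr_predict(X, y, X_test, length_scale=1.0, sigma_f=1.0, sigma_n=0.1):
    """Posterior mean k_t^T K^{-1} y and latent variance k(x_t, x_t) - k_t^T K^{-1} k_t."""
    L, alpha = _factorize(X, y, length_scale, sigma_f, sigma_n)
    k_t = se_kernel(X, X_test, length_scale, sigma_f)   # shape (n, n_test)
    mean = k_t.T @ alpha
    v = np.linalg.solve(L, k_t)
    var = sigma_f**2 - np.sum(v**2, axis=0)             # noise-free predictive variance
    return mean, var

def log_marginal_likelihood(X, y, length_scale, sigma_f, sigma_n):
    """Log-likelihood that is maximized to estimate the hyperparameters."""
    L, alpha = _factorize(X, y, length_scale, sigma_f, sigma_n)
    n = len(X)
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

# Toy usage on a hypothetical one-dimensional input.
X_demo = np.linspace(0.0, 5.0, 30)[:, None]
y_demo = np.sin(X_demo[:, 0])
mean_demo, var_demo = gpr_predict(X_demo, y_demo, X_demo, sigma_n=0.05)
```

The Cholesky factorization is used instead of an explicit matrix inverse because it is numerically more stable for kernel matrices.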

Gaussian Mixture Model
The GMM is commonly employed as an effective probabilistic modelling tool for approximating a data distribution, under the assumption that the distribution of all data samples can be well approximated by a mixture of multivariate Gaussians [21]. Given the dataset $X(n \times m)$, which is assumed to follow a K-component Gaussian mixture distribution, its probability density function is written as:
$$p(x \mid \Xi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k)$$
where $K$ is the number of Gaussian components, $\pi_k$ denotes the prior probability of the kth component, subject to $\sum_{k=1}^{K} \pi_k = 1$ and $0 < \pi_k < 1$, $\theta_k = \{\pi_k, \mu_k, \Sigma_k\}$ denotes the parameter set of the kth Gaussian component, and $\Xi = \{\theta_1, \cdots, \theta_K\} = \{\pi_1, \mu_1, \Sigma_1, \cdots, \pi_K, \mu_K, \Sigma_K\}$ denotes the vector of GMM parameters. The mean vector $\mu_k$ and the covariance matrix $\Sigma_k$ specify a multivariate Gaussian distribution $p(x_i \mid \theta_k)$, whose probability density function can be formulated as:
$$p(x_i \mid \theta_k) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)$$
The expectation maximization (EM) algorithm, which consists of an E step and an M step, is widely applied to estimate the GMM parameters. The estimation maximizes the log-likelihood function defined as:
$$\log L(X \mid \Xi) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, p(x_i \mid \theta_k) \right)$$
where $L(X \mid \Xi)$ is the likelihood function of $X$. Given an initial parameter set $\Xi^{(1)}$, the EM algorithm produces a sequence of GMM parameters $\Xi^{(1)}, \Xi^{(2)}, \cdots, \Xi^{(s)}, \cdots$ by performing the E step and the M step successively, where $s$ denotes the iteration number. The two steps are iterated until convergence and are carried out as follows [21]:
E step: Calculate the posterior probability of the ith training data point with respect to the kth component $C_k$ in the sth iteration:
$$p^{(s)}(C_k \mid x_i) = \frac{\pi_k^{(s)} \, p(x_i \mid \mu_k^{(s)}, \Sigma_k^{(s)})}{\sum_{j=1}^{K} \pi_j^{(s)} \, p(x_i \mid \mu_j^{(s)}, \Sigma_j^{(s)})}$$
M step: Update $\theta_k = \{\pi_k, \mu_k, \Sigma_k\}$ of the kth component in the (s+1)th iteration by the following equations:
$$\mu_k^{(s+1)} = \frac{\sum_{i=1}^{n} p^{(s)}(C_k \mid x_i) \, x_i}{\sum_{i=1}^{n} p^{(s)}(C_k \mid x_i)}$$
$$\Sigma_k^{(s+1)} = \frac{\sum_{i=1}^{n} p^{(s)}(C_k \mid x_i) \, (x_i - \mu_k^{(s+1)})(x_i - \mu_k^{(s+1)})^T}{\sum_{i=1}^{n} p^{(s)}(C_k \mid x_i)}$$
$$\pi_k^{(s+1)} = \frac{1}{n} \sum_{i=1}^{n} p^{(s)}(C_k \mid x_i)$$
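The E step and M step can be sketched in a few lines of NumPy. The following is a minimal illustration under stated assumptions: the means are initialized with a simple farthest-point rule, a small ridge term `reg` is added to each covariance for numerical stability, and the names `gaussian_pdf` and `fit_gmm` are hypothetical. None of these choices is specified in the text, and the sketch is not meant to reproduce the exact implementation used here.

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate Gaussian density evaluated at every row of X."""
    m = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(cov, diff.T).T
    expo = -0.5 * np.sum(diff * sol, axis=1)
    return np.exp(expo) / np.sqrt((2.0 * np.pi) ** m * np.linalg.det(cov))

def fit_gmm(X, K, n_iter=100, tol=1e-6, reg=1e-6):
    """EM for a K-component GMM; returns pi, mu, cov, responsibilities, log-likelihood."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # Assumed initialization: first sample, then the sample farthest from chosen means.
    mu = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in mu], axis=0)
        mu.append(X[np.argmax(d)])
    mu = np.array(mu, dtype=float)
    base = np.atleast_2d(np.cov(X, rowvar=False)) + reg * np.eye(m)
    cov = np.array([base.copy() for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev = -np.inf
    for _ in range(n_iter):
        # E step: posterior probability of each component for each sample.
        dens = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], cov[k]) for k in range(K)])
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        loglik = float(np.sum(np.log(total)))   # log-likelihood at this E step
        # M step: update priors, means, and covariances.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for k in range(K):
            d = X - mu[k]
            cov[k] = (resp[:, k, None] * d).T @ d / nk[k] + reg * np.eye(m)
        if loglik - prev < tol:
            break
        prev = loglik
    return pi, mu, cov, resp, loglik
```

Each row of the returned responsibility matrix sums to one, so it can be used directly as the posterior probability of a sample belonging to each operating phase.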

Hierarchical Sampling Strategy Based Active Learning Framework
The AL strategy, shown in Figure 1, has been introduced into the traditional sampling procedure in order to reduce the sampling bias that results from the random selection of unlabeled samples. However, traditional AL based soft sensing cannot make full use of the spatial information between process samples. In this section, the HC method and an adaptive sampling strategy are therefore introduced into the AL framework.

Hierarchical Clustering Algorithm
The HC method has proven valuable for data clustering. An HC tree is obtained by calculating the similarity between different clusters. In the clustering tree, the data samples with their different characteristics form the lowest level, and the top level of the binary tree can be seen as the root node of the clustering. The farther apart two samples are on the cluster tree, the less similar they are. The method takes full account of the spatial information between samples during the clustering process.
The first important step of the HC algorithm is to calculate the distances between the data samples. As the most common distance measure, the Euclidean distance is widely used to calculate the absolute distance between data points in a multidimensional space, and is defined by:
$$d(x_1, x_2) = \sqrt{\sum_{j=1}^{m} (x_{1j} - x_{2j})^2}$$
where $x_1 = [x_{11}, x_{12}, \cdots, x_{1m}]$ and $x_2 = [x_{21}, x_{22}, \cdots, x_{2m}]$ represent m-dimensional data points. Another crucial task is the combination of different clusters. In this study, the ward-linkage method, which aims to minimize the total variance of the clusters being merged, is employed for cluster combination [31,32]. At each union step, the pair of clusters whose merging leads to the minimum increase in total variance, or error sum of squares (ESS), is selected.
When considering the dataset $\{x_i\}_{i=1}^{n}$ in one-dimensional space, the variance is expressed as follows:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where $n$ is the number of points. Subsequently, the ESS is usually given by the following functional relation [31]:
$$ESS = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2$$
In ward-linkage clustering, the ESS value of the newly merged cluster is taken as the similarity of two clusters, which can be formulated as:
$$ESS(c_1, c_2) = \sum_{x_i \in c_1 \cup c_2} d(x_i, o_{c_1 \cup c_2})^2$$
where $x_i$ represents any data sample of the two clusters before merging, $c_1$ and $c_2$ are a pair of clusters, $o_{c_1 \cup c_2}$ is the central data point of the new cluster, and $d(x_i, o_{c_1 \cup c_2})$ is the Euclidean distance between each sample $x_i$ and $o_{c_1 \cup c_2}$. In this way, clusters with high similarity in their measured characteristics are merged, and the complete hierarchical structure is obtained by repeating the union process. Figure 2 shows the implementation steps of the HC algorithm in detail.
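The ward-linkage merging described above can be sketched as a naive agglomeration in NumPy. This is a minimal illustration, not the implementation used in this study: at each step it merges the pair of clusters with the smallest increase in ESS and records the ESS of the merged cluster. The function names `ess` and `ward_tree` are hypothetical, and the brute-force search over all pairs is only practical for small datasets.

```python
import numpy as np

def ess(points):
    """Error sum of squares: squared Euclidean distances to the cluster centre."""
    centre = points.mean(axis=0)
    return float(np.sum((points - centre) ** 2))

def ward_tree(X):
    """Build a binary cluster tree; returns a list of (id_a, id_b, merged_ess, size)."""
    X = np.asarray(X, dtype=float)
    clusters = {i: [i] for i in range(len(X))}   # leaf ids are the sample indices
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                union = clusters[a] + clusters[b]
                merged_ess = ess(X[union])
                increase = merged_ess - ess(X[clusters[a]]) - ess(X[clusters[b]])
                if best is None or increase < best[0]:
                    best = (increase, a, b, union, merged_ess)
        _, a, b, union, merged_ess = best
        del clusters[a], clusters[b]
        clusters[next_id] = union                # the new internal node of the tree
        merges.append((a, b, merged_ess, len(union)))
        next_id += 1
    return merges

# Toy usage: two tight pairs of points that are far from each other.
demo_merges = ward_tree([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
```

In the toy example, each tight pair is merged first and the two resulting clusters are joined last, which is the behavior the HC tree relies on: samples that are connected higher in the tree are more dissimilar.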

Figure 2. Implementation steps of the HC algorithm (flowchart: start with the training dataset X = {x1, x2, ..., xn}; calculate the similarity matrix of all data samples using the ESS; set the current number of clusters q = n; calculate the similarity matrix between any two clusters by the ward-linkage method).

Adaptive Sampling Strategy
A binary tree can represent the HC results, and the adaptive sampling based AL strategy is introduced into smart soft sensor modelling. It adaptively removes redundant subtrees composed of nodes that are homogeneous in the HC tree according to a certain criterion; this combining process is denoted as pruning. The strategy aims to find an optimal pruning with minimum classification error and selects the most uncertain and informative samples for model training through an iterative process. Samples that are less similar to the labeled dataset are generally preferred, as they carry more useful information. The sampling probability is thereby reduced in regions of the space that already contain relatively many labeled samples, so the spatial information of the samples is exploited more fully than with random selection (RS) and other sampling strategies.
Given the labeled dataset $X_L \in R^{n_l \times m}$ and the unlabeled dataset $X_U \in R^{n_u \times m}$, where $m$ denotes the number of measured process variables and $n_l$ and $n_u$ denote the numbers of labeled and unlabeled data points, respectively, usually $n_l \ll n_u$ holds. Suppose that the HC tree $T$ has $n_u$ leaves and that the number of data points in the subtree $T_v$ rooted at a node $v \in T$ is $n_v$. The weight of the node is the proportion of the sample points in $T_v$, which can be represented as:
$$w_v = \frac{n_v}{n_u}$$
The probability that the points of node $v$ belong to class $c$ is estimated as:
$$p_{v,c} = \frac{n_{v,c}}{n_v}$$
where $c = 1, 2, \cdots, k$ indexes all possible classes and $n_{v,c}$ is the number of points that belong to class $c$. Generally, the class $c$ with the maximal probability is taken as the classification result of the corresponding node. However, $n_{v,c}$ is sometimes small, which may result in serious classification errors because the obtained classification probability $p_{v,c}$ is not robust enough. Generalization bounds are used to assess the quality of the probability estimates in order to address this issue. At any given time $t$, a confidence interval $[p^{LB}_{v,c}, p^{UB}_{v,c}]$ is associated with each node $v$ and class $c$ to replace $p_{v,c}$ [32]:
$$p^{LB}_{v,c} = \max(p_{v,c} - \Delta_{v,c}, 0), \qquad p^{UB}_{v,c} = \min(p_{v,c} + \Delta_{v,c}, 1)$$
where $\Delta_{v,c} = \frac{1}{n_v(t)} + \sqrt{\frac{p_{v,c}(1 - p_{v,c})}{n_v(t)}}$ and $n_v(t)$ is the number of points sampled from node $v$ up to time $t$. If taking class $c$ as the class of node $v$ incurs at most $\beta$ times as much error as any other class:
$$1 - p^{LB}_{v,c} < \beta \, (1 - p^{UB}_{v,c'}), \qquad \forall c' \neq c$$
we consider class $c$ to be an admissible class for node $v$, which implies that $(v, c)$ is admissible at time $t$. In this study, we set $\beta = 2$, in which case:
$$p^{LB}_{v,c} > 2 \, p^{UB}_{v,c'} - 1, \qquad \forall c' \neq c$$
For any node $v$, several different classes may meet this criterion at time $t$, so it is necessary to determine which of them is the optimal class for node $v$. To this end, the admissibility of all classes of the node is first evaluated, and the class with the greatest probability among all admissible classes is then chosen as the optimal class of the node.
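The confidence interval and the admissibility test for a single node can be sketched as follows. The sketch assumes the commonly used form of the bound, with half-width 1/n + sqrt(p(1 - p)/n) computed from the labeled points sampled in the node; this form, the function names (`class_bounds`, `admissible_classes`, `best_class`), and the handling of an empty node are assumptions made for illustration rather than details confirmed by the text.

```python
import numpy as np

def class_bounds(counts):
    """Return (p, lower bound, upper bound) per class from labeled counts in a node."""
    counts = np.asarray(counts, dtype=float)
    n_v = counts.sum()
    if n_v == 0:
        # No labeled points yet: the interval is the uninformative [0, 1].
        return np.zeros_like(counts), np.zeros_like(counts), np.ones_like(counts)
    p = counts / n_v
    delta = 1.0 / n_v + np.sqrt(p * (1.0 - p) / n_v)   # assumed half-width of the interval
    return p, np.maximum(p - delta, 0.0), np.minimum(p + delta, 1.0)

def admissible_classes(counts, beta=2.0):
    """Classes c with 1 - LB_c < beta * (1 - UB_c') for every other class c'."""
    p, lb, ub = class_bounds(counts)
    admissible = []
    for c in range(len(p)):
        others = [j for j in range(len(p)) if j != c]
        if all(1.0 - lb[c] < beta * (1.0 - ub[j]) for j in others):
            admissible.append(c)
    return admissible

def best_class(counts, beta=2.0):
    """Admissible class with the largest estimated probability, or None."""
    p, _, _ = class_bounds(counts)
    admissible = admissible_classes(counts, beta)
    return max(admissible, key=lambda c: p[c]) if admissible else None

# Toy usage: a node with 18 labeled points of class 0 and 2 of class 1.
demo_class = best_class([18, 2])
```

With only two labeled points split evenly between two classes, the interval is too wide for any class to be admissible, which is the situation in which further querying of that node is most useful.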
The adaptive pruning strategy aims to combine similar subtrees and find an effective pruning that minimizes the classification error, which is directly related to the classification results. The classification error can be defined as:
$$\varepsilon_{v,c} = \begin{cases} 1 - p_{v,c}, & \text{if } (v, c) \text{ is admissible} \\ 1, & \text{otherwise} \end{cases}$$
The pruning starts from the root node and proceeds toward the leaves, evaluating at each node whether its child nodes should replace it. For each node of the HC structure, its class and error are calculated. If it satisfies:
$$\varepsilon_v > \frac{n_{v_p}}{n_v} \varepsilon_{v_p} + \frac{n_{v_q}}{n_v} \varepsilon_{v_q}$$
where nodes $v_p$ and $v_q$ are the child nodes of $v$ and $\varepsilon_v$ is the error of node $v$ under its optimal class, node $v$ is replaced with its child nodes $v_p$ and $v_q$ in order to reduce the overall classification error. Once the optimal pruning is accomplished, a classification result with minimum error is obtained. Informative samples can then be queried to refine $p_{v,c}$ in the iteration procedure, which further reduces the classification error. In this study, the AL strategy is introduced to select the queried samples for labeling. Normally, the node $v$ with the minimal value of $p^{LB}_{v,c}$ is chosen for querying, and one of its child nodes is then chosen according to the node division. These two steps are repeated until a leaf is reached, at which point the informative sample is selected and labeled.
After the iterative sampling process, the most dissimilar samples have been selected, so the human effort and time required for labeling can be greatly reduced. As the labeled dataset is enlarged, the value of $p_{v,c}$ increases and the confidence of the classification improves. Figure 3 presents a schematic illustration of the HC tree and different pruning strategies. Algorithm 1 summarizes the proposed hierarchical sampling strategy under the AL framework.

Algorithm 1. Hierarchical sampling strategy under the AL framework.
Input: an HC tree T of n unlabeled data samples; iteration step $n_s$
Process:
1: Repeat the following steps until the labeled samples are sufficient for high-quality soft sensing or all unlabeled samples are labeled.
2: Choose the node $v \in T$ with the minimal value of the probability $p^{LB}_{v,c}$, and replace node $v$ with its child nodes $v_p$ and $v_q$ if the replacement condition on its child nodes is satisfied.
3: Choose one of the child nodes $z$ in the same way, until there are no child nodes; an informative sample $x$ is then selected.
4: Update $p^{LB}_{u,c}$ of all nodes $u \in T$.
5: Repeat steps 2 to 4 until $n_s$ unlabeled samples are selected.
6: Query the labels of the $n_s$ selected data points, and then configure the selected dataset $x_s$.
7: Update the labeled dataset as $X_L \leftarrow X_L \cup x_s$.

Ensemble GPR Modelling Method Under AL Framework
Traditional soft sensing based on the AL strategy constructs only a global model for quality prediction, as shown in Figure 1, and usually ignores the multiphase and multistage characteristics of complex chemical processes. Therefore, a novel smart soft sensing technique with an AL strategy based on ensemble learning is developed for better prediction performance. To guarantee the prediction capability of each ensemble sub-model, the GMM method is applied to obtain a set of local domains from the updated training samples. Sub-models are then built from the different local datasets, which can considerably enhance the generalization performance of the soft sensing model. In addition, the BIC is chosen to determine the optimal number of Gaussian components, as it tends to yield a suitable structure for the GMM [22]. It can be formulated as:
$$BIC = -2 \log L(X \mid \Xi) + K \log N$$
where $N$ denotes the number of training data, $K$ denotes the number of components, and $\log L(X \mid \Xi)$ represents the maximal value of the log-likelihood function. The BIC balances generalization performance against GMM complexity, and the model with the lowest BIC value is preferred.
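Selecting the number of Gaussian components with the BIC can be sketched as follows. The penalty term here counts only the number of components K, using the symbols N, K, and L defined in the text; this penalty form is an assumption, and a common alternative penalizes the number of free GMM parameters instead. The function names `bic` and `select_components` and the toy log-likelihood values are hypothetical.

```python
import math

def bic(log_likelihood, n_components, n_samples):
    """BIC with an assumed penalty of K * log(N); lower values are better."""
    return -2.0 * log_likelihood + n_components * math.log(n_samples)

def select_components(log_likelihoods, n_samples):
    """Pick K with the lowest BIC from {K: maximized log-likelihood}."""
    scores = {k: bic(ll, k, n_samples) for k, ll in log_likelihoods.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Toy usage with made-up log-likelihoods for K = 1, 2, 3 and N = 200 samples.
demo_k, demo_scores = select_components({1: -500.0, 2: -400.0, 3: -399.0}, 200)
```

In the toy example, moving from two to three components improves the log-likelihood only marginally, so the penalty term makes two components the preferred choice.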
The proposed AL method places no restrictions on the structure of the data model; in this paper, the GPR modelling method is chosen for ensemble model construction because of its better generalization behavior. Several GPR sub-models are thus driven by the local datasets. Further, by Bayesian inference, the posterior probability of an arbitrary observation $x_q$ with respect to each of the different phases can be formulated as follows:
$$p(C_k \mid x_q) = \frac{\pi_k \, p(x_q \mid \theta_k)}{\sum_{j=1}^{K} \pi_j \, p(x_q \mid \theta_j)}$$
Afterwards, the localized GPR models are adaptively incorporated into an ensemble inferential model, weighted by the posterior probabilities, through the finite mixture mechanism. The final online estimate of the key variable is the weighted combination of the individual predictions, which is formulated as follows:
$$\hat{y}_q = \sum_{k=1}^{K} p(C_k \mid x_q) \, y_q^k$$
where $x_q$ represents the new observation of the test samples, $C_k = \{x_k, y_k\}$, $k = 1, 2, \cdots, K$, represents the kth process phase, and $y_q^k$ represents the local output of the kth sub-model. Figure 4 compares the traditional global GPR modelling method based on the RS strategy with the ensemble GPR modelling method based on the AL strategy presented in this paper.
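The Bayesian fusion of local predictions can be illustrated with a short NumPy sketch. It assumes that the GMM parameters and the local GPR outputs for the query sample are already available; the function names `gaussian_density` and `fuse_predictions` and the toy values are hypothetical, and the sketch only illustrates the weighting scheme rather than the full soft sensor.

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Multivariate Gaussian density of a single sample x."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2.0 * np.pi) ** len(x) * np.linalg.det(cov))

def fuse_predictions(x_q, local_preds, pi, mu, cov):
    """Weight local predictions by the posterior probability of each phase."""
    weighted = np.array([pi[k] * gaussian_density(x_q, mu[k], cov[k])
                         for k in range(len(pi))])
    posterior = weighted / weighted.sum()            # p(C_k | x_q)
    y_hat = float(posterior @ np.asarray(local_preds, dtype=float))
    return y_hat, posterior

# Toy usage: two phases centred at 0 and 10 with local outputs 1.0 and 5.0.
pi_demo = [0.5, 0.5]
mu_demo = [[0.0], [10.0]]
cov_demo = [[[1.0]], [[1.0]]]
y_demo, post_demo = fuse_predictions([0.1], [1.0, 5.0], pi_demo, mu_demo, cov_demo)
```

A query sample close to the centre of one phase is dominated by that phase's sub-model, whereas a sample midway between two phases receives an equally weighted average of the two local outputs.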

Process Introduction
The purpose of this section is to demonstrate the feasibility and superiority of the smart soft sensing method. The penicillin fermentation process (PFP) is regarded as a typical chemical process with nonlinear, time-varying, dynamic, and multi-batch characteristics, and it has been widely used as a benchmark for evaluating the effectiveness of soft sensor modelling methods. It has three physiological stages: cell growth, penicillin synthesis, and cell autolysis. For illustration, Figure 5 shows the detailed flowchart of the PFP. During the cultivation process, many factors, such as temperature, pH, sterile substrate, acid/base and cold/hot water flow rates, and dissolved oxygen concentration, affect penicillin production [13,33,34]. It is important to monitor and predict the penicillin concentration. However, direct measurement of penicillin is difficult because of the cost of hardware sensors. Soft sensor development is an effective solution for realizing the real-time estimation of penicillin concentration. A simulator named PenSim has been proposed and widely applied to simulate the PFP under different operating conditions [34], and process data samples of the PFP can be collected easily via the PenSim platform. PenSim was developed by the Process Modeling, Monitoring, and Control Research Group of Illinois Institute of Technology and is available at the website: http://simulator.iit.edu/web/pensim/index.html. A total of 16 process variables can be measured in the simulation plant. Generally, multidimensional datasets with more input variables contain abundant process information, which benefits the construction of an informative model. However, undesired problems, such as information redundancy and a complex model structure, may also arise for soft sensors based on such models. Conversely, fewer input variables carry little process information, and models based on them may give inaccurate predictions. In this case, we select seven input variables according to the experience of process engineers, which are listed in Table 1.

Table 1. Input variables.
Input variable | Description | Unit
u1 | Culture volume | L
u2 | Agitator power | W
u3 | pH | -
u4 | Substrate feed temperature | K
u5 | Fermenter temperature | K
u6 | Substrate feed rate | g/h
u7 | Aeration rate | L/h

Performance Evaluation of the Proposed AL Strategy
For the AL strategy in this study, the learning step is set to 20 points, which means that 20 unlabeled samples are assigned their real labels and become part of the labeled training dataset in each learning iteration. Under the AL framework, all of the unlabeled candidates will have been queried and labeled after a total of 20 iterations. However, the iteration process can be stopped early once the soft sensor reaches satisfactory estimation accuracy, as it is then unnecessary to update the training dataset further. Furthermore, two different global GPR model based soft sensors are developed for performance comparison, designed with the RS strategy and the hierarchical sampling based AL strategy, respectively. The root-mean-square error (RMSE) is used to assess the fit of the soft sensors:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
where $n$ is the number of test data, and $y_i$ and $\hat{y}_i$ denote the real and estimated values of the ith test data point, respectively. The RMSE reflects the prediction accuracy and reliability of the soft sensor models being tested. In addition, 10 simulation runs are carried out for each of the two sample selection strategies, and the RMSE value reported for each iteration is the mean over the 10 runs. In our study, the HC method is introduced to explore the spatial information of all unlabeled samples collected in the PFP. On the clustering tree, samples and their neighborhoods that are merged at the same spatial level share similar process information. The higher two samples are connected on the tree, the more dissimilar they are. The hierarchical sampling based AL strategy is then used to evaluate each unlabeled sample and select the most valuable ones for labeling. The pruning results affect the estimation ability of the AL based soft sensors. Figure 6 shows the prediction performance of the soft sensors under different pruning results. Here, the RMSE values of the penicillin concentration are used for model performance evaluation.
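The RMSE index defined above, together with its average over repeated simulation runs, can be computed with the following minimal sketch. The function names `rmse` and `mean_rmse_per_iteration` and the array layout (one row per run, one column per learning iteration) are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between real and estimated values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_rmse_per_iteration(rmse_runs):
    """Average RMSE over runs; rmse_runs has shape (n_runs, n_iterations)."""
    return np.mean(np.asarray(rmse_runs, dtype=float), axis=0)

# Toy usage: two runs and three learning iterations with made-up RMSE values.
demo_curve = mean_rmse_per_iteration([[0.30, 0.20, 0.10], [0.50, 0.20, 0.30]])
```

Averaging over runs smooths out the variation caused by the random elements of sample selection, so that the resulting curves for the two strategies can be compared iteration by iteration.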
Generally, a more informative and detailed pruning of the HC tree leads to better generalization ability, as seen in Figure 6. However, it may also lead to high clustering and pruning costs as well as a complex model structure. In this case, the number of prunings is recommended to be set to 150 to balance estimation performance and model complexity. Figure 7 shows the RMSE values in each iteration for the global GPR models developed with the AL and RS strategies, respectively. The results reveal that, for both sampling strategies, the estimation accuracy of the GPR predictors improves as unlabeled samples are selected and labeled. This is because the modelling space is significantly enlarged by labeling unlabeled samples and adding them to the labeled dataset pool in each iteration. The GPR models with the AL strategy perform much better than those developed with the RS strategy, as the RMSE values of the former are much smaller than those of the latter in all iterations. The AL based GPR modelling method selects the most dissimilar samples, which carry the most valuable process information, whereas this cannot be guaranteed under the RS strategy. The estimation performance of the GPR models based on RS may not improve and may even deteriorate, because there is a risk that low-quality samples, usually affected by environmental noise, measurement error, or variable mismatch, are sampled and labeled for modelling, which may distort the structure of the soft sensing models.
It can be observed that the RMSE values of the AL based GPR soft models decrease sharply within the first two iterations, as do those of the RS based models. This means that, for both sampling strategies, the estimation performance has largely converged after the first two iteration steps because of the enlarged modelling space. However, the AL strategy converges faster than the RS strategy, especially in the third iteration step. This result shows that fewer unlabeled samples need to be selected and labeled under the AL framework. Moreover, the convergence speed seen in Figure 6 is much higher during the first three iterations than during the later ones. We can therefore infer that the estimation ability of the AL based soft model is effectively improved during the first three iterations, whereas the improvement from the third to the twentieth iteration is very limited. As the iteration number increases, the additional information in the samples selected in later iterations hardly has a significant impact on the development of a high-quality soft sensor. In addition, the influence of the number of query samples on AL based soft sensors is examined in this study. For this purpose, the AL step, that is, the number of samples labeled in each iteration, is varied from 5 to 40. Figure 8 compares the average index values for different numbers of points labeled in each iteration. Smart soft models with different numbers of selected data points are developed and compared at the same iteration. For example, after 10 iterations, the smart soft predictor with a learning step of 40 has queried all of the unlabeled samples and added them to the labeled dataset, whereas only 50 unlabeled samples have been labeled for the soft predictor with a learning step of five.
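The labeling budget in the example above follows from simple arithmetic, sketched below. The pool size of 400 unlabeled samples is an inference from the stated figures (20 samples per iteration exhaust the pool after 20 iterations, and 40 per iteration after 10); it is not stated explicitly in this section, and the function name is hypothetical.

```python
def labeled_after(iterations, step, pool_size=400):
    """Number of unlabeled samples queried so far.

    pool_size=400 is inferred from 20 iterations x 20 samples; the count
    cannot exceed the size of the unlabeled pool.
    """
    return min(iterations * step, pool_size)


for step in (5, 20, 40):
    print(step, labeled_after(10, step))
```

A larger step reaches a given labeled-set size in fewer iterations, at the price of more annotation work per iteration.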
In most cases, the prediction ability of the smart sensor is enhanced when more unlabeled data points with useful information are selected and queried in each iteration. However, this also increases the computational burden and requires more human effort for sample annotation during model construction.
Furthermore, we examine the estimation performance of different AL based soft sensors with the same total number of labeled data points. As shown in Figure 9, with an increasing number of labeled samples, the soft sensors converge faster in early iterations, whereas the additional process information from the samples selected in later iterations hardly improves the generalization of the GPR model. Under the AL framework, a desired GPR model can be constructed with fewer labeled data samples, which significantly reduces human effort.

Prediction Results and Discussions
Two further kinds of smart predictors are built with the updated training data in order to study the influence of GPR based ensemble learning when a large number of unlabeled data samples are available. Four soft sensors are built in the present work:
(1) GPR (GPR based on RS strategy): unlabeled samples are iteratively selected for labeling with the RS strategy, and a global GPR model is constructed.
(2) EGPR (ensemble GPR based on RS strategy): unlabeled samples are iteratively selected for labeling with the RS strategy, local GPR models are constructed on the regions divided by the GMM method, and the base predictors are integrated by applying the Bayesian fusion strategy.
(3) AL-GPR (GPR based on AL strategy): the most informative unlabeled samples are iteratively selected for labeling with the hierarchical sampling based AL strategy, and a global GPR model is constructed.
(4) AL-EGPR (ensemble GPR based on AL strategy): the most informative unlabeled samples are iteratively selected for labeling with the hierarchical sampling based AL strategy, local GPR models are constructed on the regions divided by the GMM method, and the base predictors are integrated by applying the Bayesian fusion strategy.
In addition to RMSE, the tracking precision (TP) criterion is applied to assess the generalization capabilities of these soft sensors. It is obtained by:
$$\mathrm{TP} = 1 - \frac{\sigma_{\mathrm{error}}^{2}}{\sigma_{\mathrm{true}}^{2}}$$
where $\sigma_{\mathrm{true}}^{2}$ denotes the variance of the true value and $\sigma_{\mathrm{error}}^{2}$ denotes the variance of the error between the output value and the true value. TP relates the variance of the estimation error to the variance of the actual outputs and thus measures the tracking performance of the regression model. The soft sensing model with the higher TP value is preferable.
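As a rough illustration of the TP criterion, the sketch below assumes the commonly used form TP = 1 - σ²_error / σ²_true, with both variances computed over the same test set. The function name and the sample values are hypothetical.

```python
from statistics import pvariance


def tracking_precision(y_true, y_pred):
    """TP = 1 - var(error) / var(true); higher values mean better tracking.

    Population variance is used for both terms, so the ratio does not
    depend on the choice between population and sample variance.
    """
    errors = [y_hat - y for y, y_hat in zip(y_true, y_pred)]
    return 1.0 - pvariance(errors) / pvariance(y_true)


# Illustrative values only, not data from the paper.
y_true = [1.0, 2.0, 3.0, 4.0]
print(tracking_precision(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect tracking
print(tracking_precision(y_true, [2.5, 2.5, 2.5, 2.5]))  # constant predictor
```

Because TP depends only on the variance of the error, a prediction with a constant offset still scores 1, which is one reason TP is reported alongside RMSE rather than instead of it.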
In this case, GMM is used to divide the updated process dataset into three subsets. The BIC criterion is applied for structure optimization, which avoids model over-fitting and improves the interpretability of the data partition. As shown in Figure 10, the BIC value decreases gradually as the number of Gaussian components K increases, but beyond a certain point further increases of K no longer reduce the BIC value. Combined with the prior knowledge that the PFP consists of three physiological stages, it can be judged that the optimal number of components is 3.
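The BIC based choice of the number of components can be sketched as follows, using the standard definition BIC = -2 ln L + p ln N and the parameter count of a full-covariance GMM. The log-likelihood values, the feature dimension, and the sample count are made up for illustration and are not results from the study.

```python
import math


def gmm_param_count(k, d):
    """Free parameters of a full-covariance GMM with k components in d dims."""
    # (k - 1) mixing weights + k*d means + k*d*(d+1)/2 covariance entries
    return (k - 1) + k * d + k * d * (d + 1) // 2


def bic(log_likelihood, k, d, n_samples):
    """Bayesian information criterion; lower is better."""
    return -2.0 * log_likelihood + gmm_param_count(k, d) * math.log(n_samples)


def select_components(log_likelihoods, d, n_samples):
    """log_likelihoods maps K to the maximized log-likelihood of a K-component fit."""
    scores = {k: bic(ll, k, d, n_samples) for k, ll in log_likelihoods.items()}
    return min(scores, key=scores.get), scores


# Hypothetical log-likelihoods: large gains up to K = 3, marginal gains after.
fits = {1: -900.0, 2: -700.0, 3: -600.0, 4: -595.0, 5: -592.0}
best_k, scores = select_components(fits, d=2, n_samples=500)
print(best_k)
```

Once the likelihood gain from an extra component is smaller than the penalty for its additional parameters, the BIC stops decreasing, which matches the flattening behavior described above.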
The estimation results of the predictors for penicillin concentration after the 3rd and the 7th iterations are tabulated in Table 3 for a detailed comparison of the different soft models. In terms of RMSE and TP, the AL-EGPR based soft sensor obtains the best generalization performance, as it has the lowest RMSE value and the highest TP value for penicillin concentration prediction. Comparing the RMSE and TP values of the AL based and RS based GPR models shows that the modelling ability improves markedly when informative unlabeled samples are taken into account. With the AL based hierarchical sampling strategy, both the global GPR model and the ensemble GPR model achieve higher prediction accuracy than their two RS based counterparts after the 3rd and the 7th iterations. Compared with the AL-GPR model, the AL-EGPR model under the ensemble learning framework performs better and obtains a smaller error, since it partitions the updated dataset into several subsets for sub-predictor construction. In general, for both data sampling strategies, the estimation accuracy and model capability improve as the number of iterations increases: the models after the 7th sampling iteration achieve better prediction performance than those developed after the 3rd. It should be noted that, similar to the previous case, the local GPR based soft sensors under the ensemble learning framework yield lower RMSE values and higher TP values than the single GPR predictors. However, in the present case the reduction in RMSE is relatively small, which can be attributed to the sufficient number of samples available for labelling and model training. The recommended soft sensor shows its superiority and high performance in modelling the uncertainty of estimation under a complex measurement environment.
Figures 11 and 12 illustrate the prediction results of the test samples by these four soft sensors after the 3rd and the 7th iterations, respectively, to show the prediction performance more intuitively. In Figure 11, the GPR predictor presents the worst prediction performance on account of its RS strategy and global model structure. By contrast, EGPR and AL-EGPR, on the basis of the ensemble model structure, further enhance the generalization capability by partitioning the training data into isolated regions for local modelling. The two ensemble model based soft sensing strategies are both able to track the main trend of the penicillin concentration. In comparison, the proposed AL-EGPR soft sensing model and the AL-GPR model perform much better than GPR and EGPR, as their predictions are closer to the real values. A similar conclusion can be drawn from the prediction results in Figure 12.
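How local predictors built on GMM regions might be combined can be sketched as follows. The exact fusion rule is not given in this section, so the sketch assumes one common form of Bayesian fusion: each local prediction is weighted by the posterior probability that the query sample belongs to the corresponding Gaussian component. A one-dimensional input is used for brevity, and all names and numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, weights, means, variances):
    """Posterior probability of each GMM component given x (Bayes' rule)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def fuse(x, local_predictions, weights, means, variances):
    """Posterior-weighted sum of the local model outputs (assumed fusion rule)."""
    post = posteriors(x, weights, means, variances)
    return sum(p * y for p, y in zip(post, local_predictions))


# Two hypothetical operating phases centred at 0 and 2; the query lies midway.
print(fuse(1.0, [1.0, 3.0], weights=[0.5, 0.5], means=[0.0, 2.0], variances=[1.0, 1.0]))
```

A query deep inside one phase is dominated by that phase's local model, while a query near a phase boundary blends the neighbouring models smoothly.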
Generally, enlarging the modelling space contributes to developing a satisfactory soft sensor with high prediction accuracy. In addition, dataset partition based ensemble learning is particularly effective in handling multiphase processes with high complexity, and thus further enhances the estimation behavior of the regression model. To further reveal the effectiveness of the proposed soft sensor, the estimation errors of the four methods after the 3rd and the 7th iterations are given in Figures 13 and 14, respectively. The closer the error curve is to the zero line, the more accurate the prediction is. By comparing these four prediction error results, we can readily conclude that the global GPR model based on the RS strategy performs worst among the four soft sensing models, and that the proposed soft sensor under the ensemble learning framework provides a more accurate prediction than AL-GPR, which uses the active learning strategy alone.

Conclusions
The data produced by every element of the industrial process drive the implementation of Industry 4.0. Our basic idea is to process large amounts of data with smart data-driven soft sensors that use machine learning and artificial intelligence to extract the useful process information contained in both labeled and unlabeled data. A hierarchical sampling based AL strategy has been proposed and introduced into the traditional ensemble GPR modelling method for soft sensing. Under the AL framework, the most representative and uncertain samples, which carry additional process information, are selected and labeled to enlarge the labeled dataset, so that much of the human effort and time associated with labeling samples can be saved. We use the hierarchical sampling strategy rather than RS to accelerate convergence and to maximize the prediction capacity of the ensemble models with a minimal number of labeled samples. We have evaluated the recommended soft sensor on the penicillin fermentation process, showing that at least half of the time and human resources can be saved.
The hierarchical sampling based AL strategy can facilitate the analysis and processing of unlabeled data. It helps engineers handle control and modelling problems when only a limited number of labeled samples is available. Another advantage of our smart soft sensing technique is that the ensemble learning based GPR model can address the strongly nonlinear, highly varying, and multiphase characteristics of complex industrial processes. We hope that these contributions will provide the leaders of Industry 4.0 with a novel data analysis and modelling method for achieving better sensor performance with only a small percentage of labeled process data.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: