## 1. Introduction

In recent years, artificial intelligence (AI) and machine learning (ML) have contributed greatly to the advancement of Industry 4.0 [1,2], which aims to ensure high-quality control of production-based industries in increasingly complex environments through increased process automation, more efficient data analysis, lower human effort, safer working conditions, and so on. Trillions of objects are connected to the Internet of Things (IoT), generating huge amounts of data, and sensors play an important role in collecting and processing these data efficiently and in real time. In industrial plants, many key variables are closely related to product or process quality but can hardly be measured online with conventional hardware sensors. Soft sensors, which aim to predict these desired variables in real time and at low cost by constructing a satisfactory inference model between quality variables (usually difficult to measure) and input variables (usually easily accessible with sensors), have therefore attracted growing attention in many industrial applications [3,4]. Data-driven soft sensors are far easier to construct, requiring little process knowledge, and bring great convenience to autonomous and intelligent control in industrial plants when compared with first-principle model-based soft sensors, which rely heavily on prior knowledge and human experience [4,5,6,7]. At present, plenty of multivariate statistical regression techniques, such as partial least squares (PLS) [7,8] and principal component regression (PCR) [8], and ML-based techniques, such as artificial neural networks (ANN) [8], support vector machines (SVM) [9,10], and Gaussian process regression (GPR) [11,12,13,14], have been introduced to soft sensing.

However, many traditional nonlinear soft sensors construct a single global model for estimation, which may perform poorly for processes with strongly nonlinear and highly varying characteristics over wide operating ranges. Ensemble learning was proposed to improve the generalization ability of a single predictor and has shown great superiority in improving stability [14,15,16]. The first step in ensemble model design is to construct a set of individual ensemble components; popular component generation approaches include bagging, boosting, clustering, and subspace methods [17,18]. Many clustering-based methods, which divide the dataset into different clusters by exploring the internal structure of the objects and the relationships between them, have been verified to be practically and theoretically useful, such as K-means, expectation maximization (EM), fuzzy C-means (FCM), and the Gaussian mixture model (GMM) [19,20,21,22]. Generally, given enough weighted Gaussian mixture components, the GMM technique can smoothly approximate any given non-Gaussian probability density, with each component regarded as a mode that effectively represents a local distribution [22]. The prediction combination mechanism is the other step of ensemble learning; its criteria include the simple averaging rule, weighted averaging, stacking, Bayesian posterior probability, and so on [12,13,22,23]. The Bayesian fusion strategy has proved to be a natural fit for model combination because of its strong statistical learning ability and efficient use of the collected dataset [13]; it contributes to better stability by reducing the estimation variance.

However, a practical difficulty encountered in traditional ensemble modelling methods is the effective utilization of unlabeled data. Compared with process variables, acquiring key quality variables is much costlier and more time-consuming, as it typically requires significant human expertise, expensive measuring instruments, or laboratory analyses [7,24]. The historical dataset collected from an industrial process therefore contains a large number of unlabeled samples consisting only of process variables. These unlabeled data contain rich process information and, if utilized effectively, could greatly advance soft sensing and intelligent process control in Industry 4.0. Traditional semi-supervised learning techniques, including self-training, co-training, probabilistic generative modelling, and graph-based methods [15], can greatly enhance the generalization behavior of models by exploiting unlabeled samples, but they also raise issues such as increased computational effort and model instability. Besides, these methods directly exploit unlabeled samples without any input from human experts, and the improvement achieved depends strongly on the designed model structure [12]. Hence, we intend to construct a smart modelling framework for ensemble learning, under which both data information and process engineers' knowledge can drive the soft sensor.

Fortunately, the active learning (AL) technique is highly effective at making full use of the process dataset by iteratively selecting valuable unlabeled samples to be labeled with the knowledge of human experts. The estimation capability of AL-based soft sensors can thus be improved with minimal time and human resources [24,25,26,27]. For the AL process, the most crucial issue is to determine a criterion that effectively evaluates the potential value of each unlabeled data point. Generally, the most meaningful unlabeled data, which carry useful process information, are expected to be selected for labeling. However, many existing soft sensors under the AL framework consider only the information of individual unlabeled samples and ignore the distribution information and spatial connectivity among them, which may result in more than one sample being selected in one small area during each learning iteration. In fact, it is unnecessary to select all of them, as they are likely to share the same process information. The hierarchical clustering (HC) method provides a feasible approach for exploring spatial information between samples and their neighborhoods [28,29]. Compared with partition clustering algorithms, one of its most important advantages is that it clearly shows the clustering of the dataset at different spatial levels [30]. By pruning the HC tree with the AL strategy, the spatial information can be effectively extracted and utilized to mitigate the sample selection problem.

Therefore, the motivation of this paper is to design a superior smart soft sensor, referred to as the ensemble GPR model with hierarchical sampling based AL strategy (AL-EGPR), to positively support real-time data processing and process control in Industry 4.0 environments. The limitations of traditional supervised regression methods raise nontrivial concerns regarding the efficient utilization of large amounts of unlabeled data. Accordingly, a novel AL strategy is proposed and incorporated into the soft sensor modelling method. Under the hierarchical sampling strategy, a new unlabeled sample that does not fall into any existing high-density cluster is considered highly informative and representative. In each learning iteration, a desired number of the most dissimilar unlabeled samples are selected, manually annotated, and added to the training dataset for the next model construction, until a satisfactory accuracy is achieved or all unlabeled samples have been labeled. Ensemble learning based on the GPR model is further introduced for robust soft sensor design, aiming at better generalization than single-model predictors. We choose the GPR model as the ensemble member because of its probabilistic structure and its strong ability to handle the abrupt changes and nonlinearity of industrial processes. In this method, the newly updated labeled training dataset is first divided into several local data domains by applying the GMM method, and multiple local GPR sub-models are built for these sub-datasets. Bayesian inference is then used to estimate the posterior probability of each query sample with respect to the local sub-models, and the local predictions of the GPR sub-models are integrated into the final prediction through the finite mixture mechanism. Besides, the Bayesian information criterion (BIC) [18,20] is applied to determine the optimal number of GMM components, reducing model complexity and enhancing estimation ability. The proposed soft sensor has been applied to the prediction of penicillin concentration in the penicillin fermentation process, demonstrating that high performance can be achieved at a low cost in terms of estimation accuracy and convergence speed.

The remaining parts of this paper are structured as follows. Section 2 briefly revisits the principles of the GPR model and the GMM method. Section 3 presents the detailed methodology of the AL strategy, including the HC method and the adaptive sampling strategy. Section 4 develops the ensemble GPR model based soft sensing technique with the AL strategy. Section 5 evaluates the effectiveness of the AL-EGPR method via simulation results on an industrial process, and Section 6 concludes this paper.

## 3. Hierarchical Sampling Strategy Based Active Learning Framework

The AL strategy has been developed and introduced into the traditional sampling procedure, shown in Figure 1, to reduce the sampling bias that results from randomly selecting unlabeled samples. However, traditional AL-based soft sensing cannot make full use of the spatial information between process samples; therefore, in this section, the HC method and an adaptive sampling strategy are introduced into the AL framework.

#### 3.1. Hierarchical Clustering Algorithm

The HC method has proven valuable for data clustering. An HC tree is obtained by calculating the similarity of different clusters. In the clustering tree, the data samples with different characteristics form the low levels of the tree, and the top level of the binary tree can be seen as the root node of the clustering. The farther apart two samples are on the cluster tree, the less similar they are. The method fully considers the spatial information between samples during the clustering process.

The first important step of the HC algorithm is to calculate the distances between the data samples. As the most common distance measure, the Euclidean distance is widely used to calculate the absolute distance between data points in multidimensional space, defined by:

$$d({x}_{1},{x}_{2})=\sqrt{\sum_{j=1}^{m}{({x}_{1j}-{x}_{2j})}^{2}}$$

where ${x}_{1}=[{x}_{11},{x}_{12},\cdots ,{x}_{1m}]$ and ${x}_{2}=[{x}_{21},{x}_{22},\cdots ,{x}_{2m}]$ represent $m$-dimensional data points.
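As a concrete illustration (not from the original paper), the Euclidean distance above can be computed in a few lines of Python:

```python
import math

def euclidean(x1, x2):
    """Euclidean distance between two m-dimensional points."""
    assert len(x1) == len(x2), "points must have the same dimension"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# classic 3-4-5 right triangle: distance between (0, 3) and (4, 0)
d = euclidean([0.0, 3.0], [4.0, 0.0])
```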

Another crucial task is the combination of different clusters. In this study, the Ward-linkage method, which minimizes the total variance of the clusters being merged, is employed for cluster combination [31,32]. At each union step, the pair of clusters that leads to the minimum increase in the total variance, or error sum of squares (ESS), is selected and merged.

Considering a dataset ${\left\{{x}_{i}\right\}}_{i=1}^{n}$ in one-dimensional space, the variance is expressed as follows:

$${s}^{2}=\frac{1}{n}\sum_{i=1}^{n}{({x}_{i}-\overline{x})}^{2}$$

where $n$ is the number of points and $\overline{x}$ is their mean. The ESS is then given by the following relation [31]:

$$\mathrm{ESS}=\sum_{i=1}^{n}{({x}_{i}-\overline{x})}^{2}$$
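A minimal sketch of the two quantities for a one-dimensional cluster (the example values are our own):

```python
def variance(xs):
    """Population variance of a one-dimensional dataset."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def ess(xs):
    """Error sum of squares: squared deviations from the cluster mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]
v, e = variance(xs), ess(xs)  # note: ess(xs) == len(xs) * variance(xs)
```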

In Ward-linkage clustering, the ESS value of the newly merged cluster is taken as the similarity of the two clusters, which can be formulated as:

$$\mathrm{ESS}({c}_{1}\cup {c}_{2})=\sum_{{x}_{i}\in {c}_{1}\cup {c}_{2}}{d({x}_{i},{o}_{{c}_{1}\cup {c}_{2}})}^{2}$$

where ${x}_{i}$ represents any data sample of the two clusters before merging, ${c}_{1}$ and ${c}_{2}$ are a pair of clusters, ${o}_{{c}_{1}\cup {c}_{2}}$ is the central data point of the new cluster, and $d({x}_{i},{o}_{{c}_{1}\cup {c}_{2}})$ is the Euclidean distance from each sample ${x}_{i}$ to ${o}_{{c}_{1}\cup {c}_{2}}$. In this way, the clusters with the highest similarity in the measured characteristics are merged, and the complete hierarchical structure is obtained by repeating the union process.

Figure 2 shows the implementation steps of the HC algorithm in detail.
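A toy sketch of one Ward union step, assuming one-dimensional clusters for brevity: among all candidate pairs, the pair whose union has the smallest ESS is merged.

```python
def ess(points):
    """Error sum of squares of a one-dimensional cluster."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)

def ward_merge_step(clusters):
    """One union step: merge the pair of clusters whose union has the
    smallest ESS, i.e. the most similar pair under Ward linkage."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            score = ess(clusters[i] + clusters[j])
            if best is None or score < best[0]:
                best = (score, i, j)
    _, i, j = best
    merged = clusters[i] + clusters[j]
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]

# the two nearby points at 1.0 and 1.2 are merged first
clusters = ward_merge_step([[1.0], [1.2], [8.0], [8.5]])
```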

#### 3.2. Adaptive Sampling Strategy

The HC result can be represented by a binary tree, and the adaptive sampling based AL strategy is introduced into smart soft sensor modelling. It adaptively removes redundant subtrees composed of homogeneous nodes from the HC tree according to a certain criterion; this combining process is denoted as pruning. The aim is to find an optimal pruning with minimum classification error and to select the most uncertain and informative samples for model training through an iterative process. Samples that are less similar to the labeled dataset are generally preferred, as they carry more useful information. The sampling probability can be effectively reduced in regions of the space that already contain relatively many labeled samples, which, compared with random selection (RS) and other sampling strategies, fully exploits the spatial information of the samples.

Given the labeled dataset $\left\{{X}_{L}\right\}\in {R}^{{n}_{l}\times m}$ and the unlabeled dataset $\left\{{X}_{U}\right\}\in {R}^{{n}_{u}\times m}$, where $m$ denotes the number of measured process variables and ${n}_{l}$ and ${n}_{u}$ denote the numbers of labeled and unlabeled data points, respectively, usually ${n}_{l}\ll {n}_{u}$ holds. Suppose the HC tree $T$ has ${n}_{u}$ leaves, and let the number of data points in a node $v\in T$ be ${n}_{v}$. The weight of the node is the proportion of sample points in the subtree ${T}_{v}$, and the class proportion within the node can be represented as:

$${w}_{v}=\frac{{n}_{v}}{{n}_{u}},\phantom{\rule{1em}{0ex}}{p}_{v,c}=\frac{{n}_{v,c}}{{n}_{v}}$$

where $c=1,2,\cdots ,k$ ranges over all possible classes and ${n}_{v,c}$ is the number of points belonging to class $c$. Generally, the class $c$ with the maximal probability is taken as the classification result of the corresponding node. However, ${n}_{v,c}$ is sometimes small, which may cause serious classification errors because the obtained classification probability ${p}_{v,c}$ is not sufficiently robust.

To address this issue, generalization bounds are used to assess the quality of the probability estimates. At any given time $t$, we associate with each node $v$ and class $c$ a confidence interval $[{p}_{v,c}^{LB},{p}_{v,c}^{UB}]$ that replaces ${p}_{v,c}$ [32]:

$${p}_{v,c}^{LB}=\max \left({p}_{v,c}\left(t\right)-{\Delta}_{v,c}\left(t\right),0\right),\phantom{\rule{1em}{0ex}}{p}_{v,c}^{UB}=\min \left({p}_{v,c}\left(t\right)+{\Delta}_{v,c}\left(t\right),1\right)$$

where ${\Delta}_{v,c}\left(t\right)=\frac{{d}_{v}\left(t\right)}{{n}_{v}\left(t\right)}+\sqrt{\frac{{d}_{v}\left(t\right){p}_{v,c}\left(t\right)\left(1-{p}_{v,c}\left(t\right)\right)}{{n}_{v}\left(t\right)}}$ and ${d}_{v}\left(t\right)=1-\frac{{n}_{v,c}\left(t\right)}{{n}_{v}\left(t\right)}$.
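The bound can be sketched directly from the ${\Delta}_{v,c}$ definition above; the clipping of the interval to $[0,1]$ is our assumption, and the counts are illustrative:

```python
import math

def confidence_interval(n_v, n_vc):
    """Clipped interval [p_LB, p_UB] around p = n_vc / n_v, using the
    Delta term from the text (clipping to [0, 1] is an assumption)."""
    p = n_vc / n_v
    d = 1.0 - n_vc / n_v  # d_v(t)
    delta = d / n_v + math.sqrt(d * p * (1.0 - p) / n_v)
    return max(p - delta, 0.0), min(p + delta, 1.0)

lb, ub = confidence_interval(n_v=50, n_vc=40)        # p = 0.8, few points
lb2, ub2 = confidence_interval(n_v=5000, n_vc=4000)  # p = 0.8, many points
```

Note that the interval shrinks as the node accumulates more points, which is exactly why larger ${n}_{v,c}$ yields more robust estimates.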

Class $c$ is considered admissible for node $v$, i.e., $\left(v,c\right)$ is admissible at time $t$, if taking $c$ as the class of node $v$ incurs at most $\beta $ times as much error as any other class:

$$1-{p}_{v,c}^{LB}\le \beta \left(1-{p}_{v,{c}^{\prime}}^{UB}\right),\phantom{\rule{1em}{0ex}}\forall {c}^{\prime}\ne c$$

In this study, we set $\beta =2$, in which case:

$$1-{p}_{v,c}^{LB}\le 2\left(1-{p}_{v,{c}^{\prime}}^{UB}\right),\phantom{\rule{1em}{0ex}}\forall {c}^{\prime}\ne c$$

For any node $v$, several classes may meet this criterion at time $t$, so we must determine which one is the optimal class for node $v$. To do so, the admissibility of every class is first evaluated, and then, among all admissible classes, the class with the greatest probability is chosen as the class to which the node belongs.
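The admissibility check can be sketched as follows, assuming the error-ratio form of the criterion described above (the probability values are made up):

```python
def is_admissible(p_lb_c, p_ub_others, beta=2.0):
    """Class c is admissible for a node if its error bound 1 - p_LB is
    at most beta times the error bound of every other class."""
    return all(1.0 - p_lb_c <= beta * (1.0 - p_ub) for p_ub in p_ub_others)

# with beta = 2: a fairly confident class stays admissible ...
ok = is_admissible(p_lb_c=0.7, p_ub_others=[0.3, 0.2])
# ... while a weak class loses to a near-certain competitor
bad = is_admissible(p_lb_c=0.1, p_ub_others=[0.9])
```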

The adaptive pruning strategy combines similar subtrees and seeks an effective pruning that minimizes the classification error, which directly determines the classification output. The classification error of assigning class $c$ to node $v$ can be defined as:

$${\stackrel{~}{\epsilon}}_{v,c}={w}_{v}\left(1-{p}_{v,c}^{LB}\right)$$

Generally, the procedure starts from the root node and moves toward the leaves, evaluating whether the child nodes should replace their parent node. For each node of the HC structure, its class and error are calculated; if it satisfies

$${\stackrel{~}{\epsilon}}_{v,c}>{\stackrel{~}{\epsilon}}_{{v}_{p},{c}_{p}}+{\stackrel{~}{\epsilon}}_{{v}_{q},{c}_{q}}$$

where nodes ${v}_{p}$ and ${v}_{q}$ are the child nodes of $v$, then node $v$ is replaced with its children ${v}_{p}$ and ${v}_{q}$, which reduces the overall classification error.

Once the optimal pruning is accomplished, a classification result with minimum error is obtained. Informative samples can then be queried to refine ${p}_{v,c}$ in the iterative procedure, further reducing the classification error. In this study, the AL strategy is introduced to select the queried samples for labeling. Normally, the node $v$ with the minimal value of ${p}_{v,c}^{LB}$ is chosen, and then one child node of $v$ is chosen according to its node division; these two steps are repeated until an informative sample is selected and labeled.

After the iterative sampling process, the most dissimilar samples have been selected, so the human effort and time required for labeling can be greatly reduced. As the labeled dataset is enlarged, ${p}_{v,c}^{LB}$ increases and the confidence of the classification improves.

Figure 3 presents a schematic illustration of the HC tree and different pruning strategy. Algorithm 1 summarizes the proposed hierarchical sampling strategy under the AL framework.

**Algorithm 1.** The proposed hierarchical sampling strategy under the AL framework.

**Input**: an HC tree $T$ of $n$ unlabeled data samples; iteration step ${n}_{s}$
**Process:**
- 1: Repeat the following steps until the labeled samples are sufficient for high-quality soft sensing or all unlabeled samples are labeled.
- 2: Choose the node $v\in T$ with the minimal value of probability ${p}_{v,c}^{LB}$, and replace node $v$ with its child nodes ${v}_{p}$ and ${v}_{q}$ if it satisfies ${\stackrel{~}{\epsilon}}_{v,c}>{\stackrel{~}{\epsilon}}_{{v}_{p},{c}_{p}}+{\stackrel{~}{\epsilon}}_{{v}_{q},{c}_{q}}$.
- 3: Choose one of the child nodes $z$ in the same way, until there are no child nodes; an informative sample $x$ is then selected.
- 4: Update ${p}_{u,c}^{LB}$ of all nodes $u\in T$.
- 5: Repeat steps 2 to 4 until ${n}_{s}$ unlabeled samples are selected.
- 6: Query the labels of the ${n}_{s}$ selected data points, and then configure the selected dataset ${x}^{s}$.
- 7: Update the labeled dataset as ${x}_{new}^{l}\leftarrow [{x}^{l}+{x}^{s}]$, ${y}_{new}^{l}\leftarrow [{y}^{l}+{y}^{s}]$, and the unlabeled dataset as ${x}_{new}^{u}\leftarrow [{x}^{u}-{x}^{s}]$.

**Output**: newly labeled dataset ${x}_{new}^{l}\leftarrow [{x}^{l}+{x}^{s}]$, ${y}_{new}^{l}\leftarrow [{y}^{l}+{y}^{s}]$.
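The tree-descent part of steps 2-3 can be sketched as follows; the dictionary-based tree layout and the probability values are purely illustrative, and the full algorithm would also apply the pruning test before descending:

```python
def select_query(node):
    """Descend from a node toward a leaf, always taking the child with
    the smallest lower-bound probability (the least-confident branch)."""
    while node["children"]:
        node = min(node["children"], key=lambda c: c["plb"])
    return node["sample"]

def leaf(plb, idx):
    return {"plb": plb, "children": [], "sample": idx}

tree = {"plb": 0.5, "sample": None, "children": [
    leaf(0.4, 0),
    {"plb": 0.2, "sample": None,
     "children": [leaf(0.6, 1), leaf(0.1, 2)]},
]}
picked = select_query(tree)  # descends the 0.2 branch, then the 0.1 leaf
```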

## 4. Ensemble GPR Modelling Method Under AL Framework

Traditional soft sensing based on the AL strategy constructs only a global model for quality prediction, as shown in Figure 1, and usually ignores the multiphase and multistage characteristics of complex chemical processes. Therefore, a novel smart soft sensing technique combining the AL strategy with ensemble learning is developed for better prediction performance. To guarantee the prediction capability of each ensemble sub-model, the GMM method is applied to obtain a set of local domains from the updated training samples, and sub-models are then built from the different datasets, which, if applied effectively, can greatly improve the generalization performance of the soft sensing model. Besides, the BIC criterion is chosen to determine the optimal number of Gaussian components, as it tends to establish a sound structure for the GMM model [22]; it can be formulated as:

$$\mathrm{BIC}=-2L(X|\Xi )+K\,\mathrm{ln}\left(N\right)$$

where $N$ denotes the number of training data, $K$ denotes the number of components, and $L(X|\Xi )$ represents the maximal value of the log-likelihood function. The BIC balances generalization performance against GMM model complexity, and the model with the lowest BIC value is preferred.
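A minimal sketch of this selection rule, using hypothetical log-likelihood values (in a full GMM the penalty would typically count every free parameter, not just the component number):

```python
import math

def bic(log_likelihood, k_components, n_samples):
    """BIC as written above: -2 L + K ln(N), lower is better."""
    return -2.0 * log_likelihood + k_components * math.log(n_samples)

# hypothetical log-likelihoods for K = 1, 2, 3 on N = 500 samples
scores = {k: bic(ll, k, 500)
          for k, ll in [(1, -900.0), (2, -820.0), (3, -815.0)]}
best_k = min(scores, key=scores.get)  # component number with lowest BIC
```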

In this paper, the GPR modelling method is chosen for ensemble model construction because of its good generalization behavior; the proposed AL method places no restrictions on the model structure. Several GPR sub-models are thus driven by the local datasets. Further, applying Bayesian inference, the posterior probability of an arbitrary observation ${x}_{q}$ with respect to each phase can be formulated as follows:

$$P({C}_{k}|{x}_{q})=\frac{P({x}_{q}|{C}_{k})P({C}_{k})}{\sum_{k=1}^{K}P({x}_{q}|{C}_{k})P({C}_{k})}$$

Afterwards, the localized GPR models are adaptively incorporated into an ensemble inferential model weighted by the posterior probabilities, using the finite mixture mechanism. The final online estimate of the key variable is the weighted combination of the individual predictions:

$${y}_{q}=\sum_{k=1}^{K}P({C}_{k}|{x}_{q}){y}_{q}^{k}$$

where ${x}_{q}$ represents a new test observation, ${C}_{k}=\{{x}^{k},{y}^{k}\}$, $k=1,2,\cdots ,K$ represents the $k$th process phase, and ${y}_{q}^{k}$ represents the local output.
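The fusion step can be sketched for the one-dimensional case; the two Gaussian components and the local sub-model outputs below are hypothetical stand-ins for fitted GMM components and GPR predictions:

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def ensemble_predict(x_q, components, local_preds):
    """Bayesian fusion: weight each local sub-model prediction by the
    posterior probability of x_q under the matching mixture component."""
    likes = [w * gauss_pdf(x_q, mu, var) for (w, mu, var) in components]
    total = sum(likes)
    return sum((l / total) * y for l, y in zip(likes, local_preds))

# two hypothetical components (weight, mean, variance) and the outputs
# of two hypothetical local sub-models
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
y_near_first = ensemble_predict(0.1, components, [1.0, 3.0])
y_near_second = ensemble_predict(4.9, components, [1.0, 3.0])
```

A query near one component's mean is dominated by that component's local prediction, which is the behavior the finite mixture mechanism relies on.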

Figure 4 compares the traditional global GPR modelling method based on the RS sampling strategy with the ensemble GPR modelling method based on the AL strategy presented in this paper.