Online-Dynamic-Clustering-Based Soft Sensor for Industrial Semi-Supervised Data Streams

In the era of big data, industrial process data are often generated rapidly in the form of streams. Thus, how to process such sequential and high-speed stream data in real time and provide key quality variable predictions has become a critical issue for facilitating efficient process control and monitoring in the process industry. Traditionally, soft sensor models are built through offline batch learning and remain unchanged during the online implementation phase. Once the process state changes, soft sensors built from historical data cannot provide accurate predictions. In practice, industrial process data streams often exhibit characteristics such as nonlinearity, time-varying behavior, and label scarcity, which pose great challenges for building high-performance soft sensor models. To address these issues, an online-dynamic-clustering-based soft sensor (ODCSS) is proposed for industrial semi-supervised data streams. The method achieves automatic generation and updating of clusters, as well as deletion of obsolete samples, through online dynamic clustering, thus enabling online dynamic identification of process states. Meanwhile, selective ensemble learning and just-in-time learning (JITL) are combined through an adaptive switching prediction strategy, which handles both gradual and abrupt changes in process characteristics and thus alleviates model performance degradation caused by concept drift. In addition, semi-supervised learning is introduced to exploit the information in unlabeled samples and obtain high-confidence pseudo-labeled samples to expand the labeled training set. The proposed method can effectively deal with nonlinearity, time-variability, and label scarcity in data stream environments and thus enables reliable target variable predictions. The results of two case studies show that the proposed ODCSS approach is superior to conventional soft sensors in semi-supervised data stream environments.


Introduction
In the process industry, real-time estimation of key quality parameters is of great importance for process monitoring, control, and optimization. However, due to technical and economic limitations, these key parameters related to product quality and process status often cannot be measured online. Instead, soft sensor technology, as an important indirect measurement tool, has been widely used in the process industry. The core of soft sensors is constructing mathematical models between easy-to-measure secondary variables and a hard-to-measure primary variable. In recent years, rapid advances in machine learning, data science, computing, and communication technologies have stimulated the development of data-driven soft sensor techniques [1,2]. Typical data-driven soft sensor modeling approaches include principal component regression (PCR), partial least squares (PLS), and neuro-fuzzy systems, among others.
Over the years, research on data streams has mainly focused on classification and clustering, while there are few studies on data stream regression. Among them, research on data stream regression mainly focuses on solving concept drift problems in nonstationary environments. The most commonly used algorithm is AMRules [20], the first streaming rule-learning algorithm for regression problems, which learns ordered and unordered rule sets from data streams. Another classical algorithm is the fast incremental model tree with drift detection (FIMT-DD) [21], a method for learning a regression model tree that identifies changes in the tree structure through explicit change detection and informed drift adaptation strategies. To date, many data stream regression algorithms have been built on these two algorithms to improve performance and achieve better prediction results. The main work is summarized as follows.
(1) Rule-based data stream regression algorithms. Shaker et al. [22] proposed a fuzzy rule-learning algorithm called TSKstream for adaptive data stream regression. The method introduces a new TSK fuzzy rule induction strategy that combines the merits of the rule induction concept implemented in AMRules with the expressive power of TSK fuzzy rules, thereby solving the problem of adaptive learning from evolving data streams. Yu et al. [23] proposed an online multi-output regression algorithm called MORStreaming, which learns instances based on a topological network and learns correlations between outputs based on adaptive rules, and can thus solve multi-output regression problems in data stream environments. (2) Tree-model-based data stream regression algorithms. Gomes et al. [24] proposed an adaptive random forest algorithm capable of handling data stream regression tasks (ARF-Reg). The algorithm uses the adaptive sliding window drift detection method and also experiments with the original Page-Hinkley test inside each FIMT-DD to detect and adapt to drift. Zhong et al. [25] proposed an online weight-learning random forest regression (OWL-RFR). This method focuses on a sequential dataset problem that has been ignored in most studies on online RFs and improves the predictive accuracy of the regression model by exploiting data correlation. Subsequently, Zhong et al. [26] proposed an adaptive long short-term memory online random forest regression, which designs an adaptive memory activation mechanism to handle static data streams and non-static data streams with different types of concept drift. Further, some researchers have attempted to introduce online clustering for data stream regression modeling. Ferdaus et al. [27] proposed a new type of fuzzy rule based on the concept of hyperplane clustering for data stream regression problems, called PALM; it can automatically generate, merge, and adjust hyperplane-based fuzzy rules in a single pass, which allows it to effectively handle concept drift in data streams with the advantages of low memory burden and low computational complexity. Song et al. [28] proposed a data stream regression method based on fuzzy clustering called FUZZ-CARE. The algorithm can accomplish dynamic identification, training, and storage of patterns, and the membership matrix obtained by fuzzy C-means clustering indicates the affiliation of subsequent samples to the corresponding patterns. This method can address the concept drift problem in non-stationary environments and effectively avoid under-training due to lack of new data.
It is evident from the above studies that state identification of a process data stream is the key to obtaining high prediction accuracy from data stream regression models. For this reason, data stream clustering has been widely used to achieve local process state identification. Unlike traditional offline clustering methods with a single, fixed number of clusters, data stream clustering has the advantage of online incremental learning and updating, which can provide concise representations of discovered clusters and enable processing of new samples in an incremental manner for clear and fast detection of outliers. Generally, data stream clustering can be classified into hierarchical methods, partition-based methods, grid-based methods, density-based methods, and model-based methods [29]. Hierarchical data stream clustering algorithms use a tree structure, have high complexity, and are sensitive to outliers; representative ones are ROCK [30], the evolution-based technique for stream clustering (E-stream) [31], and its extension, HUE-stream [32]. Partition-based clustering algorithms partition data into a predefined number of hyperspherical clusters, such as CluStream [33], StreamKM++ [34], and adaptive streaming k-means [35]. Grid-based clustering algorithms require determining the number of grids in advance; they can find arbitrarily shaped clusters and are more suitable for low-dimensional data, such as WaveCluster [36], a grid-based clustering algorithm for high-dimensional data streams (GCHDS) [37], and DGClust [38]. Density-based algorithms form micro-clusters by radius and density, can find arbitrarily shaped clusters, automatically determine the number of clusters, are suitable for high-dimensional data, and are capable of handling noise; examples include DBSCAN [39], DenStream [40], the online clustering algorithm for evolving data streams (CEDAS) [41], MuDi-Stream [42], and an improved data stream clustering algorithm [43].
The performance of model-based clustering algorithms is mainly influenced by the chosen model; representatives include CluDistream [44] and the SWEM algorithm [45].
Among the above-mentioned clustering methods, density-based clustering algorithms are frequently used due to their advantages: they do not require the number of clusters to be determined in advance, can identify outliers, handle noise, and find clusters of arbitrary shapes, and are applicable to high-dimensional data. Although traditional density-based clustering algorithms for data streams can discover clusters of arbitrary shapes, the generated clusters cannot evolve and cope well with unstable data streams. To address this issue, Hyde et al. proposed an improved version of CODAS [46], called CEDAS [41], which is the first fully online clustering algorithm for evolving data streams. It consists of two main phases. The first phase establishes micro-clusters, enabling the updating, generation, and disappearance of clusters, while the second phase forms macro-clusters from the micro-clusters, which can handle changing data streams as well as noise and provide high-quality clustering results. However, this algorithm requires the radius and density threshold to be defined in advance, which has a large influence on the clustering results. Thus, a buffer-based online clustering method (BOCEDS) was proposed to automatically determine the clustering parameters [47]. In addition, CEDGM has been proposed, which uses a grid-based approach as an outlier buffer to handle multi-density data and noise [48]. Considering the effectiveness of online clustering in overcoming data stream noise and achieving high-quality clustering results, this paper builds on it to achieve online dynamic clustering for industrial semi-supervised data streams and thus construct a high-performance data stream soft sensor model.
Despite the availability of numerous methods for data stream classification and clustering problems, few attempts have so far been made to study soft sensor applications from the perspective of process data streams. Since it is very common in the process industry that numerous unlabeled data and a small number of labeled data are generated within process data streams, this paper focuses on soft sensor modeling for industrial semi-supervised data streams and aims to address the following issues: (1) as with traditional soft sensor methods, data stream soft sensor models need to effectively deal with process nonlinearity; (2) it is desirable to equip soft sensor models with online learning capabilities for capturing the latest process states to prevent model performance deterioration; (3) it is appealing to mine both historical and new data information to avoid catastrophic forgetting of historical information by the newly acquired model; (4) the performance of soft sensor models needs to be enhanced by semi-supervised learning using both labeled and unlabeled data.
To solve the above-mentioned problems, an online-dynamic-clustering-based soft sensor method (ODCSS) is proposed for semi-supervised data streams. ODCSS is capable of handling nonlinearity, time-variability, and label scarcity issues in industrial data streams. Two case studies have been reported to verify the effectiveness and superiority of the proposed ODCSS algorithm. The main contributions of this paper can be summarized as follows.
(1) An online dynamic clustering method is proposed to enable online identification of process states concealed in data streams. Unlike offline clustering, this method can automatically generate and update clusters in an online manner; a spatio-temporal double-weighting strategy is used to eliminate obsolete samples in clusters, which can effectively capture the time-varying characteristics of process data streams. (2) An adaptive switching prediction strategy is proposed by combining selective ensemble learning and JITL. If the query sample is judged to be an outlier, JITL is used for prediction. Otherwise, selective ensemble learning is used. The method facilitates effective handling of both gradual and abrupt changes in process characteristics, which enables preventing high soft sensor performance from deteriorating in time-varying environments. (3) Online semi-supervised learning is introduced to mine both labeled and unlabeled sample information, thus expanding the labeled training set. This strategy can effectively alleviate the problem of insufficient labeled modeling samples and can obtain better prediction performance than supervised learning.
The rest of the paper is organized as follows. Section 2 provides details of the proposed ODCSS approach. Section 3 demonstrates the effectiveness of the proposed method through two case studies. Finally, Section 4 concludes the paper. Brief introductions to GPR, self-training, and JITL can be found in Appendices A, B, and C, respectively.

Proposed ODCSS Soft Sensor Method for Industrial Semi-Supervised Data Streams
Soft sensor modeling for data streams remains challenging for the following reasons. First, process data are often characterized by strong nonlinearity and time-variability, which causes linear and nonadaptive models to perform poorly. Second, many current data stream regression approaches rely on a single-model structure, thus limiting their prediction accuracy and reliability. Third, in industrial processes, it is often the case that labeled data are scarce while unlabeled data are abundant. In such situations, conventional supervised data stream regression models are ill-suited for semi-supervised data streams. Therefore, we propose a new soft sensor method, ODCSS, for industrial semi-supervised data streams. The main steps of ODCSS are: (1) online dynamic clustering; (2) adaptive switching prediction; and (3) sample augmentation and maintenance. The details are described in the following subsections.

Problem Definition
Semi-supervised data streams. Assume that a data stream is a continuous sequence containing n (n → ∞) instances, i.e., D = {s_0, s_T, s_2T, ..., s_{t-T}, s_t}, where s_t = {x_t, y_t} is the sample arriving at time t, x_t denotes the input features, and y_t is the label of the sample. In the context of soft sensor applications, y_t corresponds to hard-to-measure variables in industrial processes, such as product concentration, catalyst activity, etc. Ideally, all data in the stream would be labeled, which would allow us to perform supervised learning. However, it is often expensive and laborious to obtain labels, which creates a mixture of a small number of labeled samples and a large number of unlabeled samples, called semi-supervised data. Therefore, we assume that the data stream consists of successive arrivals of high-frequency unlabeled samples and low-frequency labeled samples, denoted by D_t = {s_0^l, s_T^u, ..., s_{mT-T}^u, s_{mT}^l, s_{mT+T}^u, ..., s_{t-T}^u, s_t^l}, where s_*^u and s_*^l represent unlabeled and labeled samples, respectively. Regression task for soft sensor modeling. Suppose the semi-supervised data stream D_t has been obtained up to time t. Given an online obtained x_t, the unknown label y_t needs to be estimated. Thus, a mathematical model f_t should be constructed based on the incoming semi-supervised data stream D_t; that is,

ŷ_t = f_t(x_t). (1)

As can be seen from Equation (1), soft sensor modeling for a data stream has the following characteristics. (1) Modeling data D_t change and accumulate over time. After a period of time, numerous historical data and a small set of recent process data will have been obtained. Thus, it is crucial to coordinate the roles of historical data and the latest data in soft sensor modeling. If only the latest information is addressed while valuable historical information is ignored, the generalization performance of the model cannot be guaranteed.
Conversely, focusing only on historical information makes the soft sensor model unable to capture the latest process state. (2) In most cases, the true value of y_t is unknown, and only a few observed values are obtained through low-frequency, large-delay analysis. Traditional supervised learning can only make effective use of labeled samples, ignoring the information in unlabeled samples. In practice, unlabeled samples also contain rich information about the process states. Thus, fully exploiting labeled and unlabeled samples through a semi-supervised learning framework is an important way to improve soft sensor models.
(3) Stream data D_t usually imply a complex nonlinear relationship between inputs and outputs. Therefore, the idea of local learning is often considered to obtain better prediction performance than a single global model. In addition, as the process runs and D_t changes, f_t is not constant and often exhibits significant time-varying characteristics. Therefore, to prevent degradation of the prediction performance of f_t, it is necessary to introduce a suitable adaptive learning mechanism to achieve online updating of f_t.
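For experimentation, the semi-supervised stream D_t described above can be emulated with a simple generator. In the sketch below, the three-dimensional inputs, the synthetic target y = Σx + noise, and the labeling period are illustrative assumptions, not details from the paper:

```python
import numpy as np

def semi_supervised_stream(n_steps, label_period, seed=0):
    """Yield (x_t, y_t) pairs; y_t is None except every label_period-th step.

    The target y = sum(x) + noise is a synthetic stand-in for a
    hard-to-measure quality variable.
    """
    rng = np.random.default_rng(seed)
    for t in range(n_steps):
        x = rng.normal(size=3)
        y = float(x.sum() + 0.01 * rng.normal())
        yield (x, y) if t % label_period == 0 else (x, None)

# Example: one labeled sample out of every five arrivals
stream = list(semi_supervised_stream(20, label_period=5))
n_labeled = sum(1 for _, y in stream if y is not None)
```

Such a generator reproduces the key property of the problem setting: most arriving samples carry only input features, while true labels arrive sparsely and must be exploited together with the unlabeled majority.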

Online Dynamic Clustering
Industrial process stream data D_t often contain rich process state information; however, it is difficult for traditional global models to achieve high prediction accuracy because local process states cannot be well characterized. For this reason, clustering algorithms are often used to implement process state identification. However, traditional clustering algorithms are usually implemented offline, and the resulting clusters remain unchanged once the clustering is completed. Such approaches are not suitable for handling data streams that evolve in real time. Thus, in data stream environments, local process state identification needs to be performed dynamically in an online manner. To this end, an online dynamic clustering (ODC) method based on density estimation is proposed to achieve online state identification of process data streams.
Traditionally, offline clustering algorithms for batch data can obtain multiple clusters at the same time. In contrast, ODC processes the data online one by one and assigns each sample to the appropriate clusters. Without loss of generality, Euclidean distance is chosen to measure the similarity between samples in this paper:

d(x_i, x_j) = ||x_i - x_j||_2, (2)

where x_i and x_j represent two arbitrary samples.

Initialization
The ODC process requires setting two important initial parameters: cluster radius R and minimum density threshold M. Given a dataset, the most appropriate cluster radius, i.e., the maximum allowable distance from the cluster center to the cluster edge, needs to be selected based on the data features. When the distances between data points are less than the radius and the number of such points reaches minimum density threshold M, a cluster C can be formed. The average of all sample features in the cluster is taken as cluster center c. Clustering is an unsupervised process and is completed using features only, where cluster center c is simply calculated as

c = (1/n) Σ_{i=1}^{n} x_i, (3)

where n is the sample size in the cluster and x_i is the i-th sample. The first cluster is constructed to store information related to the clusters, including cluster center c and the data samples stored in the cluster. Subsequently, online dynamic clustering is performed for each sequentially arriving query sample x_t.

Updating the Cluster
Assume that multiple clusters {C_t^m}_{m=1}^{M} have been obtained and a new query sample x_t arrives; then, the Euclidean distance between x_t and each existing cluster center is calculated as

d_t^m = ||x_t - c^m||_2. (4)

If d_t^m ≤ R, x_t is included in cluster C_t^m. Considering the fuzzy boundaries between different process states, a soft assignment strategy is used to group the sample into all eligible clusters. In addition, if d_t^m ≤ 2R/3, cluster center c is updated to accommodate concept drift:

c' = (n·c + x_t)/(n + 1), (5)

where C_t^m is the m-th cluster, d_t^m is the Euclidean distance between the new sample and the m-th cluster center, n is the number of samples in the m-th cluster before updating, and c and c' denote the cluster centers before and after updating.
It should be noted that the shape of a cluster is not fixed but evolves further as cluster center c migrates.
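The assignment and update rules can be sketched as follows. The incremental-mean update c' = (n·c + x_t)/(n + 1) is one consistent reading of the center update, and the dictionary-based cluster record is an illustrative choice:

```python
import numpy as np

def assign_and_update(clusters, x, R):
    """Assign x to every cluster whose center lies within radius R
    (soft assignment); move the center when x falls within 2R/3 of it."""
    assigned = []
    for c in clusters:
        d = np.linalg.norm(x - c["center"])
        if d <= R:
            c["data"].append(x)
            assigned.append(c)
            if d <= 2.0 * R / 3.0:
                n = len(c["data"]) - 1  # cluster size before adding x
                c["center"] = (n * c["center"] + x) / (n + 1)
    return assigned  # empty list => x is (for now) an outlier

cluster = {"center": np.zeros(2), "data": [np.zeros(2), np.zeros(2)]}
hits = assign_and_update([cluster], np.array([0.1, 0.0]), R=1.0)
```

Because the sample may fall within radius R of several centers at once, it is stored in all eligible clusters, matching the soft assignment described above.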

Generating the New Cluster
Since the process state is always changing, the clusters need to adapt to new state changes as samples accumulate. Thus, it is desirable to generate new clusters to accommodate concept drift as the process data grow. If the Euclidean distances between x_t and all existing cluster centers are larger than the radius, the sample is regarded as an outlier. In such a case, distances d_t^o between the existing outliers are calculated, and a new cluster is generated if the number of outliers with mutual distances less than the radius reaches minimum density threshold M. The center of the new cluster is calculated using Equation (3). The remaining outliers that do not form clusters are retained and exist separately in space.
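New-cluster generation from the outlier buffer can be sketched as below; the first-found grouping and the cluster record layout are illustrative simplifications:

```python
import numpy as np

def try_form_cluster(outliers, R, M):
    """Form a new cluster if any outlier has at least M neighbours
    (itself included) within radius R; return (cluster, remaining)."""
    for o in outliers:
        group = [p for p in outliers if np.linalg.norm(p - o) <= R]
        if len(group) >= M:
            center = np.mean(group, axis=0)  # cluster center as in Equation (3)
            ids = {id(p) for p in group}
            remaining = [p for p in outliers if id(p) not in ids]
            return {"center": center, "data": group}, remaining
    return None, outliers

pts = [np.array([5.0, 5.0]), np.array([5.2, 5.0]), np.array([9.0, 0.0])]
new_cluster, rest = try_form_cluster(pts, R=0.5, M=2)
```

Here the two nearby points seed a new cluster, while the isolated point stays in the outlier buffer, mirroring how outliers that do not form clusters are retained separately in space.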

Removing Outdated Data in the Cluster
Ideally, a data stream is an infinitely growing sequence. However, the update burden of the clusters as well as the computational cost of the soft sensor models will grow as process data accumulate. Hence, it is appealing to remove outdated samples from the clusters.
Since Euclidean distance only considers spatial similarity and tends to ignore the temporal relevance of samples, a spatio-temporal double-weighting strategy is proposed to account for both spatial and temporal similarity between historical and recent samples. For this purpose, spatio-temporal weight w_i is calculated in Equation (6) for each sample in the cluster, decreasing with both d_i, the distance from the i-th sample in the cluster to the cluster center, and t_i, the time interval between the i-th sample in the cluster and query sample x_t, where α is a parameter controlling the relative influence of the spatial and temporal information. The smaller the weight is, the smaller the influence of the corresponding historical sample on query sample x_t. The spatio-temporal weights are sorted in ascending order, and a fixed proportion of historical samples is removed. The pseudocode for the ODC method is given in Algorithm 1.

Algorithm 1. Online dynamic clustering (ODC).
INPUT: Query sample x_t, cluster radius R, density threshold M, deletion proportion P
1: Initialize the first cluster C with the first sample x_1:
2:   C(c) = x_1; %% Cluster center
3:   C(Data) = x_1; %% Save data
4:   O = ∅; %% Outlier set, initially empty
5: Calculate distances d_t between x_t and all cluster centers c using Equation (4).
6: if d_t ≤ R %% Updating the cluster
7:   x_t is stored in all clusters where d_t ≤ R;
8:   if d_t ≤ 2R/3
9:     Update cluster center c using Equation (5);
10:  end if
11: else if d_t > R %% Generating a new cluster
12:   x_t is stored in the outliers;
13:   Calculate distances d_t^o between the samples stored in the outliers;
14:   if the number of outliers with distances less than R reaches M
15:     Generate a new cluster;
16:     x_t is stored in the new cluster;
17:     Calculate the new cluster center c using Equation (3);
18:   end if
19: end if
%% Removing the outdated data in the cluster
20: Calculate the spatio-temporal weights for each sample in the cluster using Equation (6);
21: Sort w in ascending order;
22: According to the sorting, delete the samples with the smallest weights in the cluster, with the proportion of deleted samples fixed to P;
OUTPUT: Clustering results C
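The pruning step can be sketched as follows. The exact form of the weight in Equation (6) is not reproduced above, so the exponential decay in both d_i and t_i used here, balanced by α, is an assumption consistent with the description:

```python
import numpy as np

def prune_cluster(dists, ages, alpha, P):
    """Return indices of samples to KEEP after deleting the fraction P
    with the smallest spatio-temporal weights.

    dists: distance of each in-cluster sample to the cluster center
    ages:  time interval between each sample and the query sample
    alpha: trade-off between spatial and temporal influence (assumed form)
    """
    w = np.exp(-alpha * np.asarray(dists)) * np.exp(-(1 - alpha) * np.asarray(ages))
    n_del = int(P * len(w))
    order = np.argsort(w)          # ascending: smallest weights first
    return sorted(order[n_del:].tolist())

# Sample 0 is old and far from the center, so it is pruned first
keep = prune_cluster(dists=[0.1, 0.9, 0.2], ages=[5.0, 0.5, 1.0],
                     alpha=0.5, P=1 / 3)
```

Whatever the exact weighting function, the mechanism is the same: rank in-cluster samples by a weight that penalizes both spatial distance and staleness, then drop the fixed proportion P with the smallest weights.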

Adaptive Switching Prediction
By applying online dynamic clustering, query sample x_t is assigned either to existing clusters or to the outliers. In comparison with samples within clusters, outliers reveal significantly different statistical characteristics of the process variables. Thus, an adaptive switching prediction method is proposed, combining adaptive selective ensemble learning with JITL to achieve real-time predictions for within-cluster samples and outliers, respectively. In addition, GPR is used as the base modeling technique; it is a nonparametric regression method that can learn arbitrary forms of functions, with advantages such as smoothness, parameter adaptation, and a strong capability of fitting nonlinear data.

Adaptive Selective Ensemble Learning for Online Prediction
Suppose that M_t clusters have been obtained at moment t, for which M_t GPR base models {f_t^m}_{m=1}^{M_t} are built. When query sample x_t arrives and is classified into clusters, the prediction is obtained using an adaptive selective ensemble learning strategy. The three key steps are as follows: Step 1: evaluate distances d between x_t and the corresponding cluster centers c, and select the m_t (m_t ≤ M_t) GPR models corresponding to the clusters with small distances d.
Step 2: provide m_t prediction values {ŷ_{t,1}, ŷ_{t,2}, ..., ŷ_{t,m_t}} based on the obtained models, and use the simple averaging rule to obtain final prediction output ŷ_t:

ŷ_t = (1/m_t) Σ_{i=1}^{m_t} ŷ_{t,i}. (7)

Step 3: if a new labeled sample or high-confidence pseudo-labeled sample is added to the clusters, the corresponding GPR models are rebuilt.
It is worth noting that new labeled samples are often obtained by offline analysis, while pseudo-labeled samples are obtained by self-training, which is detailed in Section 2.4. Figure 1 shows the schematic diagram of the adaptive selective ensemble learning framework.
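The selective averaging over the nearest clusters' models can be sketched as below; the constant lambdas stand in for trained per-cluster GPR models, and the data are illustrative:

```python
import numpy as np

def selective_ensemble_predict(x, centers, models, m):
    """Average the predictions of the models attached to the m clusters
    nearest to x (simple averaging rule of Equation (7))."""
    d = [np.linalg.norm(x - c) for c in centers]
    nearest = np.argsort(d)[:m]
    preds = [models[i](x) for i in nearest]
    return float(np.mean(preds))

centers = [np.array([0.0]), np.array([10.0]), np.array([20.0])]
# Stand-ins for per-cluster GPR models (constant predictors for illustration)
models = [lambda x: 1.0, lambda x: 3.0, lambda x: 100.0]
y_hat = selective_ensemble_predict(np.array([4.0]), centers, models, m=2)
```

The query at 4.0 is closest to the first two cluster centers, so only those two models contribute to the average; the distant third model is excluded, which is the point of the selective strategy.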


Just-In-Time Learning for Online Prediction
If query sample x_t is judged to be an outlier, JITL is used for prediction. Since outliers are samples deviating from the clusters, predictions obtained from the models built on the clusters may deviate greatly from the actual values. Therefore, using all labeled samples as the database, a small dataset D_simi = {X_simi, y_simi} similar to query sample x_t is constructed to build a JITGPR model for online prediction of ŷ_t.
Thus far, various similarity measures have been proposed for JITL methods [49], including Euclidean distance similarity, cosine similarity, covariance weighted similarity, Manhattan distance similarity, Pearson coefficient similarity, etc.
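Under Euclidean-distance similarity, a JITL prediction can be sketched as below. The minimal RBF-kernel GPR is a self-contained stand-in for a full GPR implementation, and the neighborhood size k, length scale, and noise level are illustrative choices:

```python
import numpy as np

def jit_gpr_predict(x_q, X, y, k=10, length_scale=1.0, noise=1e-4):
    """Select the k samples most similar to x_q (Euclidean distance) and
    predict with a minimal RBF-kernel GPR built on that local set."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    idx = np.argsort(np.linalg.norm(X - x_q, axis=1))[:k]
    Xs, ys = X[idx], y[idx]

    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / length_scale**2)

    # GPR mean prediction: k_*^T (K + noise*I)^{-1} y
    K = rbf(Xs, Xs) + noise * np.eye(len(Xs))
    alpha = np.linalg.solve(K, ys)
    return float((rbf(x_q[None, :], Xs) @ alpha)[0])

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(X[:, 0])
pred = jit_gpr_predict(np.array([0.5]), X, y, k=20)
```

A fresh local model is fitted for every outlier query and then discarded, which is the defining trait of JITL: the model is built on demand from the samples most relevant to the current query.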

Algorithm 2. Adaptive switching prediction.
INPUT: Query sample x_t, clustering results C, historical labeled samples
1: if x_t belongs to clusters %% Adaptive selective ensemble learning for online prediction
2:   Select the GPR models corresponding to the nearest m_t clusters;
3:   Obtain predicted values ŷ_{t,i} from the selected GPR models;
4:   Calculate the average of ŷ_{t,i} using Equation (7) to obtain final prediction output ŷ_t;
5:   for each selected cluster i
6:     if there is an update to the samples in the i-th cluster
7:       Rebuild a new GPR model using the updated samples;
8:     else if the samples in the i-th cluster are not updated
9:       Keep the old GPR model;
10:    end if
11:  end for
12: else if x_t is judged to be an outlier %% Just-in-time learning for online prediction
13:   Select the samples most similar to x_t from the historical labeled samples as training set D_simi;
14:   Build a JITGPR model with D_simi;
15:   Predict x_t using the JITGPR model;
16:   Obtain final predicted result ŷ_t;
17: end if
OUTPUT: Prediction result ŷ_t

Sample Augmentation and Maintenance
Although the production process produces a large number of data records in the form of streams, the proportion of labeled samples is small. In practice, for an arbitrary query sample x_t, its label can be estimated using the adaptive switching prediction method. Such predictions are called pseudo labels, which can be used to update the model if they are highly accurate. However, the actual labels of most unlabeled samples are unknown due to the absence of offline analysis. For this reason, we borrow the idea of self-training, a widely used semi-supervised learning paradigm, to obtain high-confidence pseudo-labeled samples and then update the models.
One main difficulty of self-training is defining confidence evaluation criteria for selecting high-quality pseudo-labeled samples. Thus, we evaluate the improvement of prediction performance before and after introducing pseudo-labeled samples. The specific steps are as follows: Step 1: select a certain proportion of the labeled samples similar to query sample x_t as the online validation set and use the remaining labeled samples as the online training set.
Step 2: build two GPR models based on the training set before and after adding the pseudo-labeled data {x_t, ŷ_t^u}, respectively. Step 3: evaluate the prediction RMSE values of the two models on the validation set; then, the improvement rate (IR) can be calculated as

IR = (RMSE_before - RMSE_after)/RMSE_before, (8)

where RMSE_before and RMSE_after are the root mean square errors of the GPR model on the validation set before and after the pseudo-labeled sample is added to the training set.
Step 4: if the IR value of pseudo-label ŷ_t^u is greater than confidence threshold IR_th, {x_t, ŷ_t^u} is added to the corresponding cluster to update the training set. Otherwise, the sample is removed from the clusters.
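The confidence check of Steps 2–4 can be sketched as follows, with IR computed as the relative RMSE reduction on the validation set (one consistent reading of the improvement rate). An ordinary least-squares fit stands in for the paper's GPR base model, and the data are illustrative:

```python
import numpy as np

def fit(X, y):
    # Least-squares stand-in for the GPR base model
    return np.linalg.lstsq(X, y, rcond=None)[0]

def rmse(w, X, y):
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

def improvement_rate(X_tr, y_tr, X_val, y_val, x_new, y_pseudo):
    """Relative RMSE reduction on the validation set after adding the
    pseudo-labeled sample (x_new, y_pseudo) to the training set."""
    before = rmse(fit(X_tr, y_tr), X_val, y_val)
    X_aug = np.vstack([X_tr, x_new])
    y_aug = np.append(y_tr, y_pseudo)
    after = rmse(fit(X_aug, y_aug), X_val, y_val)
    return (before - after) / before

X_tr = np.array([[0.0], [1.0], [2.0]])
y_tr = np.array([0.1, 0.9, 2.1])          # roughly y = x
X_val = np.array([[3.0], [4.0]])
y_val = np.array([3.0, 4.0])
ir = improvement_rate(X_tr, y_tr, X_val, y_val,
                      x_new=np.array([[4.0]]), y_pseudo=4.0)
accept = ir > 0.0                          # IR_th = 0 for illustration
```

A consistent pseudo-label pulls the refitted model toward the validation data, yielding a positive IR and acceptance; a misleading pseudo-label would increase the validation RMSE and be discarded.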
Although the latest pseudo-labeled samples added to the clusters can improve the prediction performance for the query sample, accumulating too many pseudo-labeled samples can cause error accumulation, so timely deletion of outdated historical samples is essential for reducing prediction deviations. To make full use of the information from the recent unlabeled samples, the sample deletion procedure is started only when the latest true label is detected, and the pseudo-labeled data that have the least impact on query sample x_t are deleted. This process is accomplished by online dynamic clustering, which reduces the update burden of clustering on the one hand and improves the prediction efficiency of the soft sensor models on the other.

Implementation Procedure of ODCSS
The overall framework of the ODCSS soft sensor method is illustrated in Figure 2.
With the process data arriving in the form of stream, ODCSS is implemented mainly through three steps: online dynamic clustering, adaptive switching prediction, and sample augmentation and maintenance.

Implementation Procedure of ODCSS
The overall framework of the ODCSS soft sensor method is illustrated in Figure 2.
With the process data arriving in the form of streams, ODCSS is implemented mainly through three steps: online dynamic clustering, adaptive switching prediction, and sample augmentation and maintenance.
Step 1: when query sample x_t arrives at time t, online dynamic clustering is performed to assign x_t to a cluster or recognize x_t as an outlier.
Step 2: if x_t belongs to clusters, first select the GPR models corresponding to the nearest m_t (m_t ≤ M_t) clusters, then obtain a set of predicted values {ŷ_t,1, ŷ_t,2, ..., ŷ_t,m_t} from the selected GPR models, and finally average these predictions to obtain final prediction output ŷ_t. If x_t is an outlier, use a JITGPR model to obtain final prediction output ŷ_t.
Step 3: the confidence of {x_t, ŷ_t^u} in the cluster is evaluated based on the strategy proposed in Section 2.4. If IR exceeds IR_th, the obtained pseudo-labeled sample is added to the clusters to update the models. Otherwise, this sample is discarded.
Step 4: when actual label y_t of x_t becomes available, the sample {x_t, y_t} is used to update the training set and base models, and the corresponding pseudo-labeled sample is removed. Meanwhile, outdated samples are removed using the proposed ODC method.
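The adaptive switching prediction of Steps 1 and 2 can be sketched as below. The radius test for cluster membership, the cluster-mean predictor standing in for the per-cluster GPR models, and the k-NN mean standing in for the JITGPR model are simplifying assumptions; the actual method trains and updates GPR models online.

```python
import math

def odcss_predict(query, clusters, labeled, R=1.0, m=2, k=3):
    """Adaptive switching prediction (Steps 1-2), with a cluster-mean
    stand-in for the per-cluster GPR models and a k-NN mean stand-in
    for the JITGPR model used on outliers."""
    # Step 1: assign the query to clusters within radius R, else outlier.
    near = sorted(clusters, key=lambda c: math.dist(query, c["center"]))
    near = [c for c in near if math.dist(query, c["center"]) <= R]
    if near:
        # Step 2a: selective ensemble over the nearest m clusters,
        # averaging their local predictions.
        preds = [sum(c["labels"]) / len(c["labels"]) for c in near[:m]]
        return sum(preds) / len(preds)
    # Step 2b: outlier -> just-in-time local model from k nearest samples.
    nn = sorted(labeled, key=lambda s: math.dist(query, s[0]))[:k]
    return sum(y for _, y in nn) / len(nn)
```

A query falling inside existing clusters is served by the local ensemble, while an outlier falls back to just-in-time learning, which is the switching behavior the steps above describe.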

Methods for Comparison
In this section, the proposed ODCSS soft sensor method is evaluated through applications to the Tennessee Eastman (TE) chemical process and an industrial fed-batch chlortetracycline (CTC) fermentation process. The compared methods are as follows: (i) MWGPR: moving window Gaussian process regression. (ii) JITGPR [49]: just-in-time learning Gaussian process regression. (iii) OSELM [50]: online sequential extreme learning machine, a sequential learning algorithm that can learn the training data not only one by one but also block by block (with fixed or varying length). (iv) OSVR [51]: online support vector regression, which achieves incremental updating through a moving window strategy. (v) PALM [27]: parsimonious learning machine, a data stream regression method that builds new fuzzy rules based on the concept of hyperplane clustering and can automatically generate, merge, and adjust the hyperplane-based fuzzy rules. The authors propose two types of PALM models, type-1 and type-2, each of which can adopt a local or a global updating strategy. To stay closer to the idea of local modeling in this paper, we select the better-performing type-2 PALM with the local update strategy as the comparison method. (vi) OSEGPR: online selective ensemble Gaussian process regression. The basic idea of this approach is that, assuming m GPR models have been established by time t, when a new query sample arrives, a global GPR model is established using all obtained historical samples, and the prediction performance of all retained models is then evaluated on an online validation set. Next, the models with high performance are selected to provide the ensemble prediction results. The above process is repeated as new query samples arrive. (vii) SS-OSEGPR: semi-supervised online selective ensemble Gaussian process regression, which introduces unlabeled samples into OSEGPR.
Using the confidence evaluation strategy in Section 2.4 of this paper, we select pseudo-labels with high confidence to expand the training set and update the model. (viii) ODCSS_S: a degraded version of the proposed online-dynamic-clustering-based soft sensor that handles industrial supervised data streams; that is, the online soft sensor modeling process is completed using only the labeled data stream. (ix) ODCSS: the proposed online-dynamic-clustering-based soft sensor for industrial semi-supervised data streams.

Experimental Setup and Evaluation Metrics
To obtain high-performance prediction results for each soft sensor model, the key model parameters need to be chosen carefully. In particular, the number of modeling samples for the compared methods is set equal to the number of initial training samples for ODCSS, including the width of the moving window for MWGPR, the size of the local modeling set for JITGPR, the size of the initial training set for OSELM, and the size of the online validation set for OSEGPR and SS-OSEGPR. In addition, with reference to [50], the prediction block in OSELM is set to 1, and the number of hidden neurons should be smaller than the initial number of training samples. The parameter setting of the PALM method follows [27]. The two clustering-related parameters R and M in the ODCSS method are determined according to [41] and adjusted for different application scenarios. The remaining model parameters are determined within a reasonable range through trial and error. Moreover, for JITGPR, OSEGPR, and SS-OSEGPR, the best similarity measure is selected from Euclidean distance similarity, cosine similarity, covariance weighted similarity, Manhattan distance similarity, and Pearson correlation coefficient similarity, whose definitions can be found in [49]. Further, the Matern covariance function with a noise term is used for all GPR-based models.
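Four of the candidate similarity measures can be written compactly as below; the exponential conversion of a distance into a (0, 1] similarity is one common convention and may differ from the exact definitions in [49], and the covariance weighted similarity is omitted because its definition is not given here.

```python
import math

def euclidean_sim(a, b):
    # Distance converted to a similarity in (0, 1]; the exp form is
    # an assumed convention, not necessarily that of [49].
    return math.exp(-math.dist(a, b))

def manhattan_sim(a, b):
    return math.exp(-sum(abs(x - y) for x, y in zip(a, b)))

def cosine_sim(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def pearson_sim(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a)) * \
          math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / den
```

In a JITL setting these functions rank the historical samples by relevance to the query, and the most similar ones form the local modeling set.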
To evaluate the prediction performance of soft sensor models, the following evaluation metrics are considered: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R²) [52]. Among them, RMSE and MAE measure the closeness between the predicted and true outputs. MAPE is a measure of the deviation of the dependent series from its model-predicted level; the smaller the MAPE, the better the model performance, and, for a perfect fit, the value of MAPE is zero [53]. R² is the square of Pearson's correlation coefficient. It represents a squared correlation between the actual output and the predicted output and measures how much of the total variance in the output variable data can be explained by the model. The closer R² is to 1, the better the performance of the model. Usually, if the value of R² is greater than 0.5, the model predictions can be judged as satisfactory [54]. The above evaluation metrics are defined as follows:

RMSE = sqrt( (1/n_test) Σ_{i=1}^{n_test} (y_test,i − ŷ_test,i)² )
MAE = (1/n_test) Σ_{i=1}^{n_test} |y_test,i − ŷ_test,i|
MAPE = (100%/n_test) Σ_{i=1}^{n_test} |(y_test,i − ŷ_test,i)/y_test,i|
R² = [ Σ_{i=1}^{n_test} (y_test,i − ȳ_test)(ŷ_test,i − ȳ̂_test) ]² / [ Σ_{i=1}^{n_test} (y_test,i − ȳ_test)² · Σ_{i=1}^{n_test} (ŷ_test,i − ȳ̂_test)² ]

where n_test represents the number of test samples, y_test,i and ŷ_test,i are the actual and predicted values of the ith test sample, respectively, and ȳ_test and ȳ̂_test are the mean values of the actual and predicted test outputs.

The computer configuration for the experiments is as follows. OS: Windows 10 (64 bit); CPU: Intel(R) Core(TM) i7-10700 (2.90 GHz × 2); RAM: 16.00 GB; MATLAB version: 2018b.
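The four metrics can be computed directly from the predicted and true outputs; this sketch follows the definitions above, with R² implemented as the squared Pearson correlation as stated in the text.

```python
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    # Percentage error; assumes no true value is zero.
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, yhat))

def r2(y, yhat):
    # Squared Pearson correlation between actual and predicted outputs.
    n = len(y)
    my, mh = sum(y) / n, sum(yhat) / n
    num = sum((a - my) * (b - mh) for a, b in zip(y, yhat))
    den = math.sqrt(sum((a - my) ** 2 for a in y) *
                    sum((b - mh) ** 2 for b in yhat))
    return (num / den) ** 2
```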

Process Description
The TE chemical process has been widely used to test control, monitoring, and fault diagnosis models [55]. The TE process flow diagram is shown in Figure 3; it mainly consists of several operating units, such as a continuous stirred reactor, a splitting condenser, a gas-liquid separation tower, a vapor extraction tower, a reboiler, and a centrifugal compressor. Gaseous reactants enter the reactor directly through feeds A, D, E, and C, and a certain amount of feed A enters through the condenser. The numbers 1-13 are the stream orders, representing A feed, B feed, C feed, D feed, stripper, reactor feed, reactor product, circulation, purification, separation liquid, product, condenser water, and condenser water, respectively.
The TE process involves a total of 12 manipulated variables and 41 process variables; 22 of the process variables are easy to measure and the remaining 19 are difficult to measure. It should be noted that the sampling interval for the 22 process variables is 3 min, whereas, for the 19 difficult-to-measure variables, it is 6 min. To validate the performance of the proposed soft sensor model, the input variables selected for this case study are listed in Table 1, which includes 23 easily measured process variables and 9 manipulated variables, with the E component in stream 11 taken as the primary variable under the conditions of a G/H mass ratio of 40/60 and a production rate of 19.45 m³/h. The obtained data are further divided chronologically into two subsets: an initial training set with 50 labeled samples and 1601 samples arriving online to simulate process data streams, including 1200 unlabeled samples and 451 labeled samples. Note that both the labeled and unlabeled samples from the data streams are used for online modeling and prediction, whereas only the labeled samples are used to assess the prediction performance of the soft sensor models.

Parameter Settings
The optimal parameters for the different algorithms are set as follows: (i) MWGPR: the width of the moving window is set to 50. (ii) JITGPR: the number of local modeling samples is set to 50, and the best similarity is covariance weighted similarity. (iii) OSELM: the prediction block is set to 1 to provide one prediction value at a time, the number of hidden neurons is set to 45, and the number of initial training samples used in the initial phase is set to 50. (iv) OSVR: penalty parameter C is set to 10, kernel tuning parameter g is set to 0.01, and precision threshold p is set to 0.001. (v) PALM: the rule merging mechanism involves parameters b_1, b_2, c_1, and c_2. b_1 and b_2 are used to calculate the angle and distance between two interval-valued hyperplanes and are set to 0.02 and 0.01, respectively. c_1 and c_2 are predefined thresholds for the rule merging conditions, both set to 0.01. The remaining parameters are set as in the original paper. (vi) OSEGPR: the number of online validation samples is set to 50, the ensemble size is set to 5, and Manhattan distance similarity is chosen.
(vii) SS-OSEGPR: the number of online validation samples is set to 50, the ensemble size is set to 4, the confidence threshold for selecting pseudo-labels is set to 0.03, and Euclidean distance similarity is chosen. (viii) ODCSS_S: clustering radius R is set to 8, minimum density threshold M is set to 10, and maximal ensemble size m is set to 2. (ix) ODCSS: clustering radius R is set to 9, minimum density threshold M is set to 10, controlling parameter α is set to 0.4, proportion of deleted data P is set to 0.5, confidence threshold IR_th is set to 0.1, and maximal ensemble size m is set to 2.

Table 2 compares the best prediction performance of the different soft sensor methods. OSELM has the highest RMSE, MAE, and MAPE and the lowest R², implying the worst performance. This is mainly because the method has poor local learning ability and cannot effectively characterize the local characteristics of the process. In contrast, JITGPR, MWGPR, and OSVR adopt the idea of local modeling and have a stronger capability of handling local process features, so their prediction accuracy is significantly improved. Although PALM also has the ability of online dynamic clustering, it cannot handle abrupt-change concept drift well and thus provides poor performance. Unlike the single-model methods, OSEGPR predicts by combining various global models with different performance, but its performance is only comparable to that of MWGPR and OSVR, which is mainly due to insufficient local process characterization. Compared with OSEGPR, SS-OSEGPR introduces semi-supervised learning and selects high-confidence pseudo-labels to expand the labeled training set, thus improving model performance to some extent, but its prediction errors are still high. Among all the compared methods, the proposed ODCSS method provides the lowest RMSE, MAE, and MAPE and the highest R². In addition, ODCSS obtains better results than ODCSS_S, mainly due to the introduction of semi-supervised learning.
Overall, when using RMSE as the baseline, the prediction accuracy of the proposed ODCSS method is enhanced by 20%, 25.7%, 48.7%, 18.7%, 29.8%, 17.8%, and 15.2% compared to MWGPR, JITGPR, OSELM, OSVR, PALM, OSEGPR, and SS-OSEGPR, respectively. After adding pseudo-labeled samples, ODCSS shows a performance improvement of 3.6% compared to ODCSS_S. The excellent performance of ODCSS is mainly attributed to four aspects. First, online dynamic clustering helps to effectively achieve local representation of complex process features. Second, adaptive switching prediction can effectively deal with gradual- and abrupt-change concept drift and thus overcome the problem of model degradation. Third, the adaptive selective ensemble strategy makes maximal use of the information of both historical and recent samples, while JITL is good at handling predictions for outliers. Fourth, the introduction of semi-supervised learning makes full use of the information of unlabeled samples and thus improves model performance.

In the ODCSS method, online dynamic clustering is the key to realizing online local learning, which is crucial for ensuring model performance. To graphically illustrate the clustering process, the 32 input variables of the TE data are reduced to three dimensions using PCA, as shown in Figure 5. The red triangles in the figure indicate the outliers, and the blue and pink circles represent the two clusters formed. Figure 5a shows the first cluster and 3 outliers formed when the 15th sample arrives. When the 29th sample arrives, some outliers have accumulated, as shown by the red triangles in Figure 5b. When the 30th sample arrives, the outliers within the radius accumulate to the set threshold, and a new cluster is formed from the outliers, as shown by the pink points in Figure 5c.
As time increases, new samples are accumulated in the already built clusters, while some of the less influential historical unlabeled samples are removed, as shown in Figure 5d-f. Among them, Figure 5f shows the final clustering graph formed after the last sample arrives. As can be seen from the figure, the final number of samples in the clusters is not very large because almost all the labeled samples are retained, while several pseudo-labeled samples are gradually eliminated in the clustering process through spatio-temporal weighting.
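The cluster-generation rule described above can be sketched minimally: a sample is assigned to a cluster within radius R, otherwise buffered as an outlier, and once M outliers gather within the radius a new cluster is promoted from them. The incremental centroid update and the exact density rule of the ODC algorithm in [41] may differ from this sketch.

```python
import math

def odc_step(x, clusters, outliers, R=1.0, M=3):
    """One online-dynamic-clustering step: assign x to the nearest
    cluster within radius R, otherwise buffer it as an outlier; when
    M outliers gather within radius R of x, they become a new cluster."""
    for c in clusters:
        if math.dist(x, c["center"]) <= R:
            c["members"].append(x)
            n = len(c["members"])
            # incremental centroid update (assumed form)
            c["center"] = tuple((cc * (n - 1) + xi) / n
                                for cc, xi in zip(c["center"], x))
            return "cluster"
    nearby = [o for o in outliers if math.dist(x, o) <= R] + [x]
    if len(nearby) >= M:
        for o in nearby[:-1]:
            outliers.remove(o)
        center = tuple(sum(v) / len(nearby) for v in zip(*nearby))
        clusters.append({"center": center, "members": nearby})
        return "new-cluster"
    outliers.append(x)
    return "outlier"
```

This mirrors the behavior in Figure 5: early samples sit as outliers until their local density reaches the threshold, at which point a new cluster appears.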
To ensure the efficiency of the online dynamic clustering algorithm, the clustering radius and threshold should be determined carefully. For this purpose, as shown in Table 3, we evaluated the prediction performance of ODCSS over the combinations of clustering radius R ∈ [8, 9, 10, 11, 12, 13] and minimum density threshold M ∈ [10, 12, 14] after fixing the other four parameters, i.e., controlling parameter α to 0.05, the ratio of the amount of deleted data P to 0.4, the confidence threshold IR_th to 0.1, and the maximum ensemble size m to 2. The overall performance of the proposed method is better than that of the comparison methods; that is, within a reasonable range of parameters, this method can overcome the influence of parameter changes on the prediction performance and has better stability.

Table 3. Performance comparison of the proposed method under different parameters for prediction of E composition in stream 11.

As a data-streams-oriented soft sensor, ODCSS performs model building, model maintenance, and target variable prediction in an online manner. As new samples are accumulated, its prediction performance gradually changes.
Figure 6 presents the evolving trends of the cumulative predicted RMSE of the different soft sensor methods, that is, the predicted RMSE of the test samples from the first to the current prediction. Not surprisingly, in the early stage of prediction, large prediction errors are observed for all methods due to insufficient labeled samples. In particular, OSELM performs poorly throughout the prediction process. With the accumulation of labeled samples, the prediction performance of the remaining methods improves. It is worth noting that JITGPR and MWGPR obtain good prediction performance over roughly test samples 50-260 but show large error growth over samples 260-400. The prediction performance of PALM starts degenerating at about the 50th test sample. In contrast, both ODCSS_S and ODCSS have the smallest prediction errors throughout the prediction process. Comparing ODCSS_S with ODCSS, we can observe that ODCSS_S shows an increase in prediction errors during the stage of samples 370-400, while ODCSS maintains a low prediction error all the time. These results fully illustrate the prediction accuracy and reliability of ODCSS.
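The cumulative predicted RMSE plotted in Figure 6 is simply the RMSE over test samples 1 through t; a running sum of squared errors avoids recomputing it from scratch at every step:

```python
import math

def cumulative_rmse(y_true, y_pred):
    """RMSE of the first t predictions, for t = 1..n, maintained
    from a running sum of squared errors."""
    out, sse = [], 0.0
    for t, (y, yhat) in enumerate(zip(y_true, y_pred), start=1):
        sse += (y - yhat) ** 2
        out.append(math.sqrt(sse / t))
    return out
```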
To further assess whether there are significant differences between ODCSS and the other methods, the Statistical Tests for Algorithms Comparison (STAC) platform [56] is applied. For this purpose, the test set is equally divided into 15 groups in chronological order to obtain the prediction RMSE values on each group. Based on these RMSE values, a non-parametric Friedman test with a Finner post hoc method is performed. Specifically, with ODCSS as the control method and the significance level set to 0.05, the Friedman test is conducted on the group RMSE values of the different methods, and the statistical test results are provided in Table 4. According to the principle of the Friedman test, null hypothesis H_0 in this experiment is that there is no difference between the compared methods. The p-value indicates the probability of supporting hypothesis H_0. When the p-value is less than 0.05, null hypothesis H_0 is rejected, which means that there is a difference between the compared methods.
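The Friedman statistic behind Table 4 can be computed directly from the group-wise RMSE values. This sketch ranks the methods within each group (using average ranks for ties) and omits the tie correction that the STAC platform may apply.

```python
def friedman_statistic(rmse_groups):
    """rmse_groups: one row per test group, one column per method.
    Ranks each row (rank 1 = lowest RMSE) and returns the chi-square
    statistic chi2_F = 12/(n*k*(k+1)) * sum_j Rj^2 - 3*n*(k+1),
    where Rj is the sum of ranks of method j over the n groups."""
    n, k = len(rmse_groups), len(rmse_groups[0])
    rank_sums = [0.0] * k
    for row in rmse_groups:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:  # assign average ranks over tied values
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for idx in order[i:j + 1]:
                ranks[idx] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))
```

The statistic is then compared against a chi-square distribution with k − 1 degrees of freedom to obtain the p-value reported by the test.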
As shown in Table 4, MWGPR, JITGPR, OSELM, OSVR, PALM, OSEGPR, and SS-OSEGPR reject the H_0 hypothesis, which indicates that the prediction performance of ODCSS is remarkably different from that of these methods in the TE process. In addition, we can also observe that ODCSS_S accepts the H_0 hypothesis. This is mainly because it is a degraded version of the proposed ODCSS method without pseudo-labeled data, and the introduction of semi-supervised learning does not bring a statistically significant improvement over it. Additionally, to further explore the performance of the compared methods at different stages of testing, Table 5 lists the RMSE performance of the compared methods on each subset of the testing set and ranks the performance of the proposed ODCSS method. It can be seen that ODCSS has poor prediction performance in the early stage (testing subsets 1-6) and gradually achieves the best prediction results in the later stage (testing subsets 11-15). The main reason is that, in the early stage of prediction, the clusters formed by local process state identification contain few data, and some samples are spatially isolated as outliers, so the established local models do not have enough learning ability, resulting in poor prediction performance. With the accumulation of data, the proposed ODCSS method gradually learns different local process states, so the best prediction performance is achieved in the late stage of prediction. These results imply that the proposed method has strong learning capability in data stream environments.

Process Description
With the development of science and technology in the pharmaceutical, food, biological, and chemical industries, as well as agriculture, microbial fermentation has made a great impact on human daily life. As a feed antibiotic additive, chlortetracycline has become the most used bacterial growth promoter in the farming industry due to its advantages of bacterial inhibition, growth promotion, high feed utilization, and low drug residues. At the same time, with the expansion of production demand and scale, enterprises have imposed higher requirements on the automation and intelligence level of the fermentation process. Figure 7 shows the flow chart of the CTC production process. CTC fermentation is an intermittent production process, and each batch of fermentation takes 80-120 h, mainly occurring through batch and fed-batch operation stages. Many parameters of this process are measured online using hardware sensors; however, the biomass concentration, substrate concentration, amino nitrogen concentration, and viscosity are usually not available online and can only be analyzed through offline sampling.
In this paper, the CTC fermentation process data from Charoen Pokphand Group are used for model evaluation, where the substrate concentration is used as the difficult-to-measure variable and the variables listed in Table 6 are used as auxiliary variables. With an online sampling interval of 5 min and an offline analysis interval of 4 h, a total of 15 batches of data from the same fermenter were collected for the experiments. The first two batches of labeled samples, comprising a total of 43 samples, are taken as the initial training set, while the remaining 13 batches of data are used for online prediction, including 1015 unlabeled samples and 324 labeled samples. The labeled and unlabeled samples are used for online modeling and prediction, and the labeled samples are used for model performance evaluation.

Table 6. Input variables of soft sensor models for the industrial fed-batch CTC fermentation process.

No. Variable Description
1 Cultivation time (min)
2 Temperature
Dissolved oxygen concentration (%)
5 Air stream rate (m³/h)
6 Volume of air consumption (m³)
7 Substrate feed rate (L/h)
8 Volume of substrate consumption (L)
9 Volume of ammonia consumption (L)

Parameter Settings
Similar to the case study of the TE chemical process, the optimal parameters for the CTC process are determined as follows: (i) MWGPR: the width of the moving window is set to 43. (ii) JITGPR: the number of local modeling samples is set to 43, and the best similarity is cosine similarity. (iii) OSELM: the prediction block is set to 1 to predict one value at a time, the number of hidden neurons is set to 38, and the number of initial training samples used in the initial phase is set to 43. (iv) OSVR: penalty parameter C is set to 24, kernel tuning parameter g is set to 0.02, and precision threshold p is set to 0.007. (v) PALM: the optimal parameters are the same as for the TE process; b_1, b_2, c_1, and c_2 are set to 0.02, 0.01, 0.01, and 0.01, respectively. (vi) OSEGPR: the number of online validation samples is set to 43, the ensemble size is set to 5, and Manhattan distance similarity is chosen. (vii) SS-OSEGPR: the number of online validation samples is set to 43, the ensemble size is set to 5, the confidence threshold for selecting pseudo-labels is set to 0.1, and Manhattan distance similarity is chosen. (viii) ODCSS_S: clustering radius R is set to 4.9, minimum density threshold M is set to 12, and maximal ensemble size m is set to 2. (ix) ODCSS: clustering radius R is set to 4.8, minimum density threshold M is set to 14, controlling parameter α is set to 0.6, proportion of deleted data P is set to 0.9, confidence threshold IR_th is set to 0.1, and maximal ensemble size m is set to 3.

Table 7 compares the best prediction performance of the different soft sensor methods on the CTC process. As can be seen, OSVR and OSELM achieve poor performance on substrate concentration prediction. JITGPR provides better performance than MWGPR, and PALM performs better than the other single-model methods. In comparison, the proposed ODCSS method still shows the lowest RMSE, MAE, and MAPE and the highest R², implying the best prediction performance.
Overall, when using RMSE as the baseline, the performance improvement of the proposed ODCSS approach compared to the MWGPR, JITGPR, OSELM, OSVR, PALM, OSEGPR, and SS-OSEGPR methods is 21.4%, 16.2%, 25.8%, 26.3%, 11.7%, 10.6%, and 9%, respectively. After adding the pseudo-labeled data, the proposed ODCSS shows a 7% performance improvement compared with ODCSS_S. These results further confirm that ODCSS has significantly better prediction accuracy than traditional soft sensors for semi-supervised data streams.

In addition, the dynamic clustering process is also illustrated through three-dimensional variables obtained by PCA. As shown in Figure 9, the red triangular points are outliers and the remaining colors correspond to clusters. Although some of the data are mixed after visualization using PCA, this does not affect the understanding of the clustering process. As Figure 9b shows, by the time the 48th sample arrives, the amount of data in the cluster and the number of outliers have increased over time. When the 49th sample arrives, the outliers within the radius accumulate to the set threshold and a new cluster is formed, as shown by the pink points in Figure 9c. When the data in a cluster accumulate to a certain extent, the obsolete samples in the cluster are deleted in order to improve operation efficiency and ensure prediction accuracy, as shown in Figure 9d,e. Figure 9f-i shows the repetition of the above process.

Prediction Results and Discussion
Figure 9. Illustrations of online dynamic clustering for the CTC process: (a) the first cluster and some outliers, at x_15; (b) more outliers accumulated, at x_48; (c) a new cluster generated from the outliers, at x_49; (d) more data accumulated, at x_67; (e) outdated data deleted from the cluster, at x_68; (f) more outliers accumulated, at x_359; (g) a new cluster generated from the outliers, at x_360; (h) more outliers accumulated, at x_580; (i) a new cluster generated from the outliers, at x_581.
Moreover, we further explore the influence of the parameters on the proposed ODCSS algorithm, as shown in Table 8. After fixing clustering radius R at 4.8, minimum density threshold M at 14, and maximum ensemble size m at 3, the prediction performance of ODCSS is compared under different combinations of the ratio of deleted data P ∈ [0.4, 0.6, 0.8], controlling parameter α ∈ [0.7, 0.9], and confidence threshold IR_th ∈ [0.05, 0.15, 0.2]. These results show the strong stability and excellent accuracy of the proposed ODCSS method under varying model parameters.

Similar to the TE process, a Friedman test with a Finner post hoc method is conducted to further assess the different soft sensor methods in the CTC fermentation process. For this purpose, the test set is divided in batch order, a total of 13 batches of RMSE values are obtained, and a non-parametric Friedman test is then performed. The statistical test results are tabulated in Table 9. It can be readily observed that, similar to the TE process, MWGPR, JITGPR, OSELM, and OSVR reject the H_0 hypothesis, which indicates that the prediction performance of ODCSS is remarkably different from that of these methods in the CTC process. In comparison, PALM, OSEGPR, SS-OSEGPR, and ODCSS_S accept the H_0 hypothesis, which reveals that there is no significant difference between ODCSS and these methods in terms of overall prediction performance.
To further explore the prediction capability of the different methods at local prediction stages, the RMSE values of the different soft sensor methods on the 13 test batches are presented in Table 10. As can be seen from the table, the proposed ODCSS method shows relatively poor prediction performance in the first two batches but then provides the best prediction accuracy in most of the later batches. As in the TE process, these results once again confirm the strong online learning capability of ODCSS. However, we can also notice that ODCSS_S provides a poor prediction RMSE for batch 11. Such a problem may be addressed by introducing other adaptation mechanisms to further enhance the online learning capability of ODCSS in complex data stream environments.

Conclusions
This paper presents an online-dynamic-clustering-based soft sensor (ODCSS) for industrial semi-supervised data streams. By applying online dynamic clustering to process data streams, ODCSS enables automatic generation and updating of clusters and deletion of obsolete samples, thus realizing dynamic identification of process states. In addition, an adaptive switching prediction strategy combining online selective ensemble learning with JITL is used to effectively handle gradual and abrupt time-varying characteristics, thus preventing model degradation. Moreover, to tackle the label scarcity issue, semi-supervised learning is introduced to obtain high-confidence pseudo-labeled samples online. The proposed ODCSS is a fully online soft sensor method that can effectively deal with nonlinearity, time variability, and the shortage of labeled samples in industrial data streaming environments.
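The adaptive switching idea summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the drift flag, and the uniform ensemble weights are all assumptions made for exposition.

```python
import numpy as np

def switching_predict(x, ensemble_models, jitl_fit, drift_detected):
    """Hypothetical sketch of ODCSS-style adaptive switching.

    On an abrupt change (drift detected for the query x), fall back to
    just-in-time learning (JITL): fit a local model around x. Otherwise,
    use the online selective ensemble to track gradual changes.
    """
    if drift_detected:
        # Abrupt change: build a local model on the fly near the query
        local_model = jitl_fit(x)
        return local_model(x)
    # Gradual change: weighted combination of the selected ensemble members
    preds = np.array([m(x) for m in ensemble_models])
    weights = np.ones(len(preds)) / len(preds)  # accuracy-based in practice
    return float(weights @ preds)

# Toy usage with stand-in models
models = [lambda x: 1.0, lambda x: 3.0]
y_hat = switching_predict(np.zeros(2), models, None, drift_detected=False)
print(y_hat)
```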
To verify the effectiveness and superiority of the proposed ODCSS method, two application cases are considered, in which seven representative soft sensor methods and ODCSS_S (ODCSS without pseudo-labeled samples) are compared with the proposed ODCSS. In terms of RMSE, MAE, MAPE, and R2, the proposed method outperforms all compared methods on every evaluation metric. In particular, in the TE process, taking RMSE as the baseline, ODCSS improves prediction accuracy by 48.7% compared to OSELM, and the introduction of semi-supervised learning improves prediction performance by 3.6% compared to ODCSS_S. For the CTC fermentation process, although ODCSS does not show significant differences from some methods in terms of overall testing performance, the superiority of the proposed method becomes increasingly obvious with the accumulation of streaming data and the advancement of online learning. Both the TE and CTC application results confirm that the proposed ODCSS method can well address the time-variability, nonlinearity, and label scarcity problems and thus achieve high-precision real-time prediction of subsequently arriving samples by using only very few labeled samples together with high-confidence pseudo-labeled data.
Currently, research on soft sensor modeling for data streams is still scarce, and this study is only a preliminary attempt; several issues require further attention. First, although the proposed algorithm achieves good prediction accuracy, the computational burden of online modeling will inevitably increase as process data streams accumulate, so improving the efficiency of online modeling is a major concern. Second, as process data streams evolve, the optimal model parameters also change, so adjusting the model parameters adaptively is appealing. Third, the proposed method only identifies local features based on the spatial relationships between samples; for data streams, the temporal relationships between samples are also worth considering. Fourth, as streaming data accumulate, mining their hidden features using incremental deep learning is also an interesting research direction. These issues remain to be studied in our future work.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data cannot be made public because of privacy restrictions.

Conflicts of Interest: The authors declare no competing financial interest.

Appendix A. Gaussian Process Regression
The Gaussian process (GP) is a set of any finite random variables with a joint Gaussian distribution [57]. For a dataset $D = \{X, y\} = \{x_i, y_i\}_{i=1}^{n}$, where $n$ is the number of samples and $X$ and $y$ represent the input and output matrices, the regression model can be described as

$$y = f(x) + \varepsilon,$$

where $f(\cdot)$ denotes the unknown function and $\varepsilon$ is Gaussian noise with zero mean and variance $\sigma_n^2$. From the perspective of function space, a Gaussian process can be specified by its mean function $m(x)$ and covariance function $C(x, x')$ as follows:

$$m(x) = \mathbb{E}[f(x)], \qquad C(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big].$$

Therefore, the Gaussian process can be described as $f(x) \sim \mathcal{GP}(m(x), C(x, x'))$. After preprocessing the data, it is assumed that the training outputs follow a zero-mean Gaussian process:

$$y \sim \mathcal{GP}(0, C),$$

where $C$ is an $n \times n$ covariance matrix whose $ij$-th element is $C_{ij} = C(x_i, x_j)$. When a new test sample $x_*$ arrives, the predictive mean $\hat{y}_*$ and variance $\sigma_*^2$ are given as

$$\hat{y}_* = k_*^{T} C^{-1} y, \qquad \sigma_*^2 = C(x_*, x_*) - k_*^{T} C^{-1} k_*,$$

where $k_* = [C(x_*, x_1), \cdots, C(x_*, x_n)]^{T}$ represents the covariance vector between $x_*$ and the training inputs.
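The predictive equations above can be sketched directly in NumPy. This is an illustrative implementation under stated assumptions: the squared-exponential kernel, the length scale, and the noise level are choices made here for the toy example, not the settings used in the paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.2, variance=1.0):
    """Squared-exponential covariance C(x, x') between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X, y, X_star, noise_var=1e-2):
    """Zero-mean GP regression: predictive mean and variance at X_star."""
    C = rbf_kernel(X, X) + noise_var * np.eye(len(X))  # n x n covariance (with noise)
    k_star = rbf_kernel(X, X_star)                     # covariance of training inputs and x*
    mean = k_star.T @ np.linalg.solve(C, y)            # y_hat* = k*^T C^-1 y
    v = np.linalg.solve(C, k_star)
    var = rbf_kernel(X_star, X_star).diagonal() - (k_star * v).sum(0)
    return mean, var                                   # sigma*^2 = C(x*,x*) - k*^T C^-1 k*

# Toy check: predicting at a training point should nearly recover its label
X = np.linspace(0, 1, 10)[:, None]
y = np.sin(2 * np.pi * X).ravel()
mean, var = gp_predict(X, y, X[:1])
print(mean[0], var[0])
```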