1. Introduction
There are many complex problems in the soft sensor modeling of industrial processes, such as process nonlinearity, process dynamics, quality variables lagging behind process variables, and quality variables not matching process variables [1,2,3]. When dealing with complex industrial processes and massive industrial data, shallow neural networks cannot adequately handle high-dimensional data and dynamic process scenarios, which leads to problems such as insufficient prediction accuracy and poor generalization of soft sensor models [4,5].
With the development of artificial intelligence, the feature representation ability of deep neural networks has become increasingly prominent [6]. A deep neural network transforms the original input into higher-dimensional and more complex features, which can better model the nonlinearity and dynamics of an industrial process and thus yield a more accurate soft sensor model [7]. Deep neural networks can be divided into two categories: static networks, represented by the deep belief network (DBN) [8] and the stacked auto-encoder (SAE) [9], and dynamic networks, represented by the recurrent neural network (RNN) [10]. Static networks generally assume that there is no time-dynamic dependence between samples in the industrial process. To establish a soft sensor model, a static model constructs an augmented matrix according to the input dimension of the data to account for process dynamics [11], but a reasonable construction of the augmented matrix can only rely on expert experience.
The correlations among samples in the process data hide the dynamic information of the industrial process, and capturing this information is the key to improving the accuracy of the soft sensor model. As a sequential model, the RNN has been applied to industrial processes [12]. To overcome the long-term dependence problem of the RNN, soft sensor models using the long short-term memory (LSTM) network [13] or the gated recurrent unit (GRU) [14,15] have achieved good prediction results in some industrial scenarios. Based on the LSTM, Yuan et al. constructed a supervised LSTM unit for learning the quality-related hidden dynamic information in the industrial process and verified the effectiveness of the proposed supervised LSTM soft sensor model on multiple datasets [16].
In research on sequential models, Liu et al. proposed the sample convolution and interaction network (SCINet) [17] for time series forecasting. SCINet enhances the predictability of the original sequence by capturing the temporal dependence of features at different temporal resolutions [18]. It has a stronger feature extraction ability and superior long- and short-term prediction performance in applications such as wind power prediction and stock trading [19].
All the above sequential models are designed as end-to-end models. Although these models have powerful dynamic feature extraction capabilities, their regressors are generally single- or multi-layer fully connected networks, which limits model performance when performing regression fitting on the extracted features. Ensemble models in machine learning achieve strong generalization ability by constructing multiple sub-regressors, but such ensemble models cannot be trained directly with deep networks.
Therefore, to combine the advantages of both, some researchers have used deep neural networks to extract complex dynamic features and used ensemble models as regressors that receive the transferred dynamic features [20,21]. Lian et al. used a DBN combined with a particle swarm optimization algorithm to extract features as the input of support vector regression (SVR) to establish a soft sensor model, achieving good results [22]. Wang et al. used a DBN to extract the features of auxiliary variables and fed them into an extreme learning machine for training to obtain the soft sensor model [23]. Fan et al. proposed a hidden-layer feature extractor based on the continuous restricted Boltzmann machine (CRBM) and established the CRBM-SVR soft sensor model [24].
In the historical data of many industrial processes, due to inconsistent sampling frequencies, the high cost of quality variable analysis, and the restrictions of the field environment, the easy-to-collect process variables generally outnumber the quality variables that are difficult to collect and analyze. If only the labeled samples are used and the valuable information contained in the unlabeled samples is ignored, the performance of the soft sensor model is restricted by the limited prior information.
Semi-supervised soft sensor modeling has evolved from generating pseudo-labels to unsupervised pre-training with supervised fine-tuning. To use a large number of unlabeled samples to improve model performance, Li et al. proposed a semi-supervised ensemble SVR soft sensor model based on the idea of generating pseudo-labels in ensemble learning and used the extended pseudo-labeled dataset to improve the performance of the soft sensor model [25]. Considering the dynamic characteristics of industrial processes and the mismatch between quality samples and process samples caused by irregular sampling, Tang et al. proposed a historical feature fusion attention semi-supervised LSTM (HFFA-SSLSTM) soft sensor model to extract the historical dynamic information of samples, which significantly improved the prediction accuracy [26].
To obtain a better soft sensor model, it is important to consider both the process dynamics of the samples and the generalization ability of the model. Together with the mismatch between process samples and quality samples, these open problems and the shortcomings of existing methods motivate this research.
Considering these problems of industrial processes, this paper proposes a semi-supervised soft sensor modeling method based on SCINet dynamic feature extraction. First, a sample convolution and interaction network with an encoder–decoder structure extracts effective dynamic features in an unsupervised manner. Then, the dynamic features are transferred to the XGBoost model [27], which has strong generalization ability, to establish a soft sensor model. In this way, both the feature extraction ability of the deep network and the strong generalization ability of the ensemble model are exploited. Finally, the effectiveness of the proposed method is verified by case studies on the debutane column dataset [28] and the sulfur recovery dataset [29].
The main contributions of this paper are as follows:
(1) A dynamic feature extractor based on SCINet and an autoencoder is designed to extract the dynamic features of all samples in an unsupervised manner, making full use of the process information contained in unlabeled samples.
(2) The dynamic features of labeled samples are transferred to the XGBoost model to train a regressor, fully combining the feature extraction ability of the deep neural network with the generalization ability of the ensemble model.
The rest of the paper is organized as follows: The network structure of SCINet is introduced in Section 2. Section 3 describes the XGBoost model. Section 4 presents the dynamic feature based XGBoost model and explains how dynamic features are transferred to XGBoost, together with the modeling process. In Section 5, full experimental validation is carried out on industrial datasets for performance evaluation. Finally, the conclusion is given in Section 6.
2. SCINet
SCINet [17] is a hierarchical network that enhances the predictability of an original time series by capturing the time dependence of features at multiple temporal resolutions. SCINet has a binary tree structure, as shown in Figure 1a. The basic component of SCINet is the SCI-Block, as shown in Figure 1b. In each SCI-Block, the original sequence is decomposed into two sub-sequences. Different convolution modules extract homogeneous and heterogeneous information from the decomposed sub-sequences, and new sequence representations are formed through interactive learning and information complementarity. To capture the dynamic features at different time granularities, SCINet downsamples the input sequence F in the time dimension to obtain F_odd and F_even. These two sub-sequences have a relatively coarse temporal resolution and retain most of the information of the original sequence. In each SCI-Block, four different convolution modules, namely ψ, ϕ, η, and ρ, are employed to extract the features of F_odd and F_even.
To deal with the potential information loss caused by downsampling, an interactive learning strategy is introduced in the SCI-Block to facilitate the information exchange between sub-sequences. First, F_odd and F_even are mapped to the hidden layer states through the two convolution modules ψ and ϕ. This stage can be regarded as performing scaling transformations on F_odd and F_even, as in Formula (1):

F_odd^s = F_odd ⊙ exp(ψ(F_even)),  F_even^s = F_even ⊙ exp(ϕ(F_odd))   (1)

Then, using the other two convolution modules η and ρ, the scaled features from the first stage are further mapped and interacted to obtain the two updated features F′_odd and F′_even, as shown in Formula (2):

F′_odd = F_even^s + ρ(F_odd^s),  F′_even = F_odd^s − η(F_even^s)   (2)
where ψ, ϕ, η, and ρ are convolutional filters, ⊙ denotes the element-wise product, and exp is the exponential transformation.
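As an illustration of the two-stage computation in Formulas (1) and (2), the following is a minimal PyTorch sketch of a single SCI-Block. The single-convolution modules, kernel size, and Tanh bounding are simplifying assumptions, not the authors' implementation; SCINet [17] uses deeper convolution stacks with expanded hidden channels.

```python
import torch
import torch.nn as nn

class SCIBlock(nn.Module):
    """Minimal SCI-Block sketch: even/odd split, scaling, interaction."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        def conv():  # stand-in for the psi/phi/eta/rho convolution modules
            return nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2),
                nn.Tanh(),  # bounds the output before exp()
            )
        self.psi, self.phi = conv(), conv()
        self.eta, self.rho = conv(), conv()

    def forward(self, x):
        # x: (batch, channels, time); downsample into two sub-sequences
        f_even, f_odd = x[..., ::2], x[..., 1::2]
        # Stage 1, Formula (1): exponential scaling transformations
        f_odd_s = f_odd * torch.exp(self.psi(f_even))
        f_even_s = f_even * torch.exp(self.phi(f_odd))
        # Stage 2, Formula (2): additive/subtractive interactive learning
        f_odd_new = f_even_s + self.rho(f_odd_s)
        f_even_new = f_odd_s - self.eta(f_even_s)
        return f_even_new, f_odd_new

blk = SCIBlock(channels=7)
f_even, f_odd = blk(torch.randn(2, 7, 16))  # each: (2, 7, 8)
```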
The input data can be divided into K time sequences X_seq^k (k = 1, 2, …, K). Each sequence X_seq^k, whose window length is T, is decomposed layer by layer and processed by SCI-Blocks at different levels, which effectively learns the features at different temporal resolutions. Feature information from previous layers is accumulated, meaning that deeper layers contain the time-scale features of shallower layers. The concatenated features are accumulated with the original sequence X_seq through a residual connection to obtain the hidden layer dynamic features h_v. Then, h_v is fed to the fully connected layer for decoding to obtain the predicted sequence X̂_seq of step length τ. The absolute error loss function between the predicted value and the true value is given as follows:

L = (1/τ) Σ_{i=1}^{τ} |x̂_i − x_i|   (3)
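To show how SCI-Blocks compose into the binary tree with a residual connection and fully connected decoding, the sketch below builds on the SCIBlock class from the previous snippet. SCITree, SCINetSketch, the re-interleaving scheme, and the depth and horizon values are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
# Assumes the SCIBlock class from the previous sketch.

class SCITree(nn.Module):
    """Recursive binary-tree decomposition built from SCI-Blocks."""
    def __init__(self, channels, depth):
        super().__init__()
        self.block, self.depth = SCIBlock(channels), depth
        if depth > 1:
            self.even_tree = SCITree(channels, depth - 1)
            self.odd_tree = SCITree(channels, depth - 1)

    def forward(self, x):
        f_even, f_odd = self.block(x)
        if self.depth > 1:  # decompose layer by layer
            f_even, f_odd = self.even_tree(f_even), self.odd_tree(f_odd)
        # Re-interleave the sub-sequences to restore the original length
        out = f_even.new_zeros(x.shape)
        out[..., ::2], out[..., 1::2] = f_even, f_odd
        return out

class SCINetSketch(nn.Module):
    def __init__(self, channels, window, horizon, depth=2):
        super().__init__()
        self.tree = SCITree(channels, depth)
        self.decoder = nn.Linear(window, horizon)  # fully connected decoding

    def forward(self, x_seq):
        h_v = self.tree(x_seq) + x_seq   # residual connection -> features h_v
        return self.decoder(h_v)         # predicted sequence

model = SCINetSketch(channels=7, window=16, horizon=4)
pred = model(torch.randn(8, 7, 16))
loss = nn.functional.l1_loss(pred, torch.randn(8, 7, 4))  # absolute error loss
```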
3. XGBoost Model Analysis
The eXtreme Gradient Boosting (XGBoost) ensemble tree model is a widely used method in ensemble learning and has achieved excellent results in many regression and classification problems [27]. A common way to train a good model is to minimize the loss function on the training data, that is, to minimize the empirical risk. However, a model trained with this objective alone tends to have high complexity. To avoid this problem, a model complexity term, namely structural risk minimization, is usually introduced into the objective function of the ensemble model, as shown in Equation (4):

Obj = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{m=1}^{M} Ω(f_m)   (4)

where n represents the number of samples, M is the number of subtrees, and x_i and y_i are the input and output of the ith sample, respectively. In Equation (4), the objective function consists of two parts: the training error of the model and the regularization term that controls the complexity of the model. The regularization term is obtained by summing the regularization terms of all subtrees, and Ω(f_m) is the regularization term of the mth tree.
The XGBoost model assigns different weights to the CART trees using a scoring function. These CART trees are combined in a weighted form to form a strong learner, effectively reducing the model error and variance. The complexity of each tree is given by Equation (5):

Ω(f) = γp + (1/2) λ Σ_{j=1}^{p} w_j²   (5)

where p represents the number of leaf nodes, γ is the regularization coefficient controlling the number of leaf nodes, λ is the regularization coefficient controlling the weights of the leaf nodes, and w_j is the weight of the jth leaf node.
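For reference, the γ and λ of Equation (5) correspond directly to the gamma and reg_lambda parameters of the XGBoost library; the values below are placeholders rather than the tuned settings used later in the experiments.

```python
import xgboost as xgb

# gamma      -> γ in Eq. (5): penalty per leaf node (γ · p)
# reg_lambda -> λ in Eq. (5): L2 penalty on leaf weights ((1/2) λ Σ w_j²)
model = xgb.XGBRegressor(
    n_estimators=300,              # number of boosted CART trees
    max_depth=6,
    learning_rate=0.1,
    gamma=0.1,                     # complexity cost of adding a leaf
    reg_lambda=1.0,                # shrinks leaf weights w_j
    objective="reg:squarederror",
)
```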
5. Experiment
5.1. Evaluation Index
To verify the effectiveness of the proposed model, this paper selects three metrics, the mean absolute error (MAE), the root mean squared error (RMSE), and the coefficient of determination (R²), to quantitatively evaluate the performance of soft sensor models. The formulas of these metrics are as follows:

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|
RMSE = √( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² )
R² = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)² / Σ_{i=1}^{N} (y_i − ȳ)²

where N is the number of samples, y_i and ŷ_i are the true and predicted values of the ith sample, and ȳ is the mean of the true values.
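A minimal NumPy implementation of the three metrics:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, RMSE, and R^2 for a soft sensor model."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mae, rmse, r2
```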
To eliminate the influence of different dimensions of the data, in the data preprocessing stage, all datasets used for experimental validation are preprocessed using standardization for each process variable and quality variable. The hardware platforms and software versions used in this experiment are as follows: CPU: Intel(R) Core(TM) i7-9700 (3.00 GHz); memory: 16 GB; operating system: Windows 11 (64-bit). All code is implemented in Python (3.7.13), and the main Python libraries used are PyTorch (1.11.0) and XGBoost (1.5.0).
5.2. Case 1. Debutane Column Process
In this section, experiments are carried out on the industrial process of the debutane column to prove the effectiveness of the feature transfer. In this process, there are complex coupling relationships between samples, and the process dynamics are strong. Figure 5 shows the schematic diagram of the debutanizer. The descriptions and units of the primary process and quality variables are listed in Table 1. The process variables (u1, u2, …, u7) and the quality variable (butane concentration, y) are collected every 15 min. The butane concentration must be analyzed using a gas chromatograph, which takes 30 min. Therefore, the acquisition time of the quality variable lags the process variables by 45 min. This lag prevents the production process from obtaining real-time feedback in a timely manner, affecting subsequent process control.
The acquisition of the quality variable lags the process variables by four time steps, so a soft sensor model is needed to measure the butane concentration in real time. Fortuna et al. developed a nonlinear autoregressive moving average (NARMA) model, as shown in (11) [28]:

ŷ(i) = f(u1(i), u2(i), u3(i), u4(i), u5(i), u5(i−1), u5(i−2), u5(i−3), (u6(i) + u7(i))/2, y(i−4), y(i−5), y(i−6))   (11)

where ŷ represents the predicted value of the soft sensor model. At moment i, only the butane concentrations at moment i−4 and earlier are available from the database; the butane concentrations from i−3 to i are obtained by iterating the model prediction, which leads to larger errors. To avoid this accumulation of iterative errors, SCI-XGBoost predicts the butane concentration at time i using only the historical data up to and including time i.
The debutane column dataset has a total of L = 2394 samples. A sliding window is used to construct a sequence dataset: the data are sliced by sliding the window from front to back along the time dimension. Given the quality variable lag of P = 4 steps and a sliding window size of T = 16, the number of samples L and the number of sequence samples N satisfy N = L − T − P + 1, so there are N = 2375 sequence samples in total. The sequence dataset is divided according to the ratio training set/validation set/test set = 7:1:2, and the numbers of samples in the three parts are 1662:238:475.
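The window slicing can be sketched as follows; the alignment of the target with the end of each window is one plausible reading consistent with N = L − T − P + 1, and all names are illustrative.

```python
import numpy as np

def make_sequences(u, y, T=16, P=4):
    # Each input is a T-step window of process data; the target is the
    # quality variable P steps after the window ends, so N = L - T - P + 1.
    L = len(u)
    N = L - T - P + 1
    X = np.stack([u[t:t + T] for t in range(N)])        # (N, T, n_vars)
    targets = np.array([y[t + T + P - 1] for t in range(N)])
    return X, targets

u, y = np.random.randn(2394, 7), np.random.randn(2394)  # placeholder data
X, targets = make_sequences(u, y)                       # N = 2375 sequences
n_tr, n_val = int(0.7 * len(X)), int(0.1 * len(X))      # ~7:1:2 split
X_train, X_val, X_test = np.split(X, [n_tr, n_tr + n_val])
```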
Regarding the hyperparameters of the SCI-XGBoost model, the optimal values are determined by a grid search algorithm and repeated trials. Parameters such as the convolution kernel size in the feature extractor, the expansion multiple of the hidden layer dimension, and the maximum downsampling depth are tuned over several trials. An adaptive learning rate adjustment strategy is used during training: the learning rate is reduced slightly if the training loss does not decrease after five consecutive iterations. In addition, training can be halted to prevent overfitting when the model's loss stabilizes and shows little to no decrease, even if the maximum number of iterations has not yet been reached. The grid search method is also used to determine the critical parameters of the XGBoost regressor. After many trials, the model trained with the hyperparameters in Table 2 performs best on the validation set.
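The adaptive learning rate and early-stopping strategy described above can be sketched with PyTorch's ReduceLROnPlateau; model and train_one_epoch are hypothetical placeholders, and the factor, threshold, and stopping patience are assumptions.

```python
import torch

# `model` and `train_one_epoch` are hypothetical placeholders.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)  # shrink LR after 5 stalled epochs

best_loss, stall = float("inf"), 0
for epoch in range(100):                  # maximum number of iterations
    loss = train_one_epoch(model, optimizer)
    scheduler.step(loss)                  # adaptive learning rate adjustment
    if loss < best_loss - 1e-6:
        best_loss, stall = loss, 0
    else:
        stall += 1
    if stall >= 15:                       # loss has stabilized: stop early
        break
```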
To verify the effectiveness of the SCI-XGBoost model, ANN, XGBoost, LSTM, SLSTM, and SCINet were selected as comparison models. For the non-sequential models, ANN and XGBoost, the model input is the sequence data flattened into one dimension to increase their dynamic learning ability. The network structure of the ANN is [128-64-16-1], and the activation function is ReLU. For both LSTM and SLSTM, the number of hidden layer neurons is set to 100, the input sequence length is the same as the input window length T of SCI-XGBoost, the learning rate is 0.01, and the number of iterations is 100.
Under the above hyperparameter settings, the prediction curves of the models on the test set of the debutane column are shown in Figure 6, and the evaluation metrics of the models are listed in Table 3. When only the sequence data at time k−4 and earlier are used to predict the quality variable at time k, the predicted values of SLSTM deviate considerably from the true values at some peak moments, and its prediction performance is not as good as that of LSTM and SCINet. LSTM also performs worse than SCINet at some peak moments. SCI-XGBoost outperforms SCINet, which indicates that even when the dynamic features are extracted by SCINet, a fully connected layer alone can hardly reach the regression accuracy of the ensemble model; this confirms the strong generalization ability of XGBoost. For XGBoost, the predicted values deviate significantly from the true values at the peak moments, indicating that XGBoost alone has insufficient ability to capture the dynamic characteristics of the process. Therefore, SCI-XGBoost combines strong dynamic feature capture ability with better generalization performance, so its prediction results are better than those of the other models.
For further comparison, boxplots of the absolute prediction errors of the six models on the testing dataset are shown in Figure 7. Compared with the other models, SCI-XGBoost has the narrowest box, and its bottom is also the closest to zero.
For the XGBoost model, an input feature that accounts for a higher proportion of the splits in the CART tree-splitting process is more important to the model. Figure 8 shows a bar chart of the split proportions in the XGBoost model trained with dynamic features. Compared with the original features used by the XGBoost model without dynamic features, some dynamic features have a similar importance, while the importance of other dynamic features is much higher, indicating that the added dynamic features are effective in improving the performance of XGBoost.
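Split-proportion importance of this kind can be read directly from a trained XGBoost model; a minimal sketch, assuming a fitted XGBRegressor named model:

```python
# 'weight' counts how many times each feature is used to split a node,
# i.e., the proportion-of-splits importance plotted in Figure 8.
booster = model.get_booster()
scores = booster.get_score(importance_type="weight")
total = sum(scores.values())
for feature, count in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {count / total:.3f}")
```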
5.3. Case 2. Sulfur Recovery Unit
To verify the effectiveness of the semi-supervised model SSCI-XGBoost and prove the ability of the unsupervised dynamic feature extractor to extract the dynamic information of unlabeled samples, the sulfur recovery industrial process dataset was selected for experimental verification.
The sulfur recovery unit is an important device for treating industrial waste gas; it removes gases such as SO2 and H2S from the acid gas stream. Its simplified industrial process flow is shown in Figure 9. The exhaust gas treated by the sulfur recovery unit still contains residual SO2 and H2S. To avoid pollution, the concentrations of SO2 and H2S must be monitored before the exhaust gas is discharged into the atmosphere. The strongly corrosive acid gas requires the sensors to be removed and replaced frequently, so using a soft sensor to predict the gas concentrations can reduce the monitoring cost. The sulfur recovery dataset [29] is described in Table 4. The dataset consists of five process variables (u1, u2, …, u5) and two quality variables, SO2 (y1) and H2S (y2).
The sulfur recovery dataset has a total of L = 10,081 samples. A sliding window of size T = 16 is used to construct a sequence dataset. The number of samples L and the number of sequence samples N satisfy N = L − T + 1, so there are N = 10,066 sequence samples in total. The sequence dataset is divided according to the ratio training set/validation set/test set = 7:1:2, and the numbers of samples in the three parts are 7046:1026:1994.
To verify the effectiveness of the semi-supervised model, it is assumed that some quality variables are missing, and the quality variables of some samples in the training set are randomly erased. With some quality variables missing, the training samples of the sulfur recovery dataset are organized as shown in Figure 10 (T = 3 is assumed to simplify the illustration, whereas T = 16 is used during the actual training). The training set is divided into a labeled-sample training set and an unlabeled-sample training set; both are used for training the unsupervised feature extractor, while only the labeled-sample training set is used for training the XGBoost regressor.
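A minimal sketch of this label-erasing setup, assuming arrays X_train and y_train from the sliding-window step and a trained feature extractor with a hypothetical encode method:

```python
import numpy as np

rng = np.random.default_rng(0)
labeled_frac = 0.5                        # 33%, 50%, or 67% in the experiments
mask = rng.random(len(y_train)) < labeled_frac

# All windows (labeled or not) train the unsupervised feature extractor;
# only the labeled subset trains the XGBoost regressor.
X_labeled, y_labeled = X_train[mask], y_train[mask]
features = extractor.encode(X_labeled)    # hypothetical encode() method
xgb_model.fit(features, y_labeled)
```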
Regarding the parameter setting of SSCI-XGBoost, referring to the SCI-XGBoost settings on the debutane column dataset and combining the grid search algorithm with repeated experiments on this dataset, the final optimal hyperparameters of SSCI-XGBoost on the sulfur recovery dataset are shown in Table 5.
SS-SAE, XGBoost, and SCI-XGBoost were selected as comparison models. SS-SAE is a semi-supervised model, while XGBoost and SCI-XGBoost use only labeled samples for model training. On the sulfur recovery dataset, the input of the SS-SAE method follows the paper [29] that provides the sulfur recovery dataset, and the hyperparameter settings follow the paper [20] that proposed the SS-SAE method. To overcome the process dynamics, the network structure of SS-SAE is set to [20-16-12-6-1], the pre-training learning rate is 0.1, the number of pre-training iterations is 10, the fine-tuning learning rate is 0.01, and the number of fine-tuning iterations is 60.
For the training set, three cases are assumed for the proportion of labeled samples: 33%, 50%, and 67%. Under the above hyperparameter settings and the different proportions of labeled samples, SSCI-XGBoost and the other models are trained separately, and the experimental results of each model on the test set are shown in Table 6.
From Table 6, it can be seen that SCI-XGBoost outperforms XGBoost under all three proportions of labeled samples, indicating that the proposed model can fully extract the process dynamic information. This result is consistent with the feature-extraction experiments on the debutane column dataset in the previous section and again verifies the effectiveness of the feature transfer. SSCI-XGBoost performs better than SCI-XGBoost trained only with labeled samples under all proportions of labeled samples, which indicates that the unsupervised dynamic feature extractor can capture the hidden dynamic information in the unlabeled samples as a supplement. With 33% and 50% labeled samples, XGBoost performs better than SS-SAE, indicating that the ensemble model has stronger generalization ability when the number of labeled samples is small. The performance of the semi-supervised model SS-SAE is not as good as that of SSCI-XGBoost, which indicates that the dynamic feature extractor is more capable of capturing the process dynamic information.
To represent more intuitively the performance improvement brought by using unlabeled samples in SSCI-XGBoost, the RMSE line charts of the three models XGBoost, SCI-XGBoost, and SSCI-XGBoost are shown in Figure 11. Under the three different proportions of labeled samples, the RMSE values of SCI-XGBoost are smaller than those of XGBoost, which indicates that the feature transfer is effective. The performance of SSCI-XGBoost is better than that of SCI-XGBoost, which proves that the feature extractor captures the hidden information of the unlabeled samples. By combining the dynamic feature extraction ability of a sequential model with the generalization performance of an ensemble model, the semi-supervised soft sensor model SSCI-XGBoost, based on unsupervised dynamic feature extraction, can fully capture the hidden dynamic information in unlabeled samples and maximize the model performance.
6. Conclusions
In this paper, an unsupervised dynamic feature extractor is designed based on SCINet and an autoencoder. The hidden layer features encoded by the dynamic feature extractor are transferred to the XGBoost ensemble model, which has stronger generalization performance, and a semi-supervised soft sensor model is established. Aiming at the mismatch in the quantitative relationship between process variables and quality variables caused by inconsistent sensor sampling rates, the proposed semi-supervised modeling method based on dynamic feature extraction fully captures the potential dynamic information in industrial processes and effectively utilizes both unlabeled and labeled samples.
The dynamic feature extractor designed in this paper does not adopt a stacked structure. In future work, when the industrial process data have higher dimensionality and larger volume, the dynamic feature extractor can be extended to a stacked model. Meanwhile, because residual connections are used in the unsupervised dynamic feature extractor, the dimension of the encoded features is the same as that of the input. An excessively large feature dimension may introduce disturbing information and degrade the performance of the regressor. Therefore, adding an attention layer to the extractor to reduce the feature dimension may achieve better modeling results.