Development of Robust and Physically Interpretable Soft Sensor for Industrial Distillation Column Using Transfer Learning with Small Datasets

In the development of soft sensors for industrial processes, the availability of data for data-driven modeling is usually limited, which leads to overfitting and a lack of interpretability when conventional deep learning models are used. In this study, a soft sensor development methodology combining first-principle simulations and transfer learning was proposed to address these problems. Source-domain models were obtained using a large amount of data generated by dynamic simulations. They were then fine-tuned with a limited amount of real plant data to improve their prediction accuracy on the target domain and to ensure that the models carried correct domain knowledge. An industrial C4 separation column operating at a refining unit was used as an example to illustrate the effectiveness of this approach. Results showed that the fine-tuned networks achieved better accuracy and improved interpretability compared to a simple feedforward network with or without regularization, especially when the amount of actual data available was small. For secondary effects, such as the interaction gains, the interpretability of the target models mainly depends on the interpretability of the corresponding source models.


Introduction
Soft sensors are virtual sensors that estimate hard-to-measure variables in real time, such as concentrations, which are traditionally measured by low-frequency laboratory analysis, from easy-to-measure variables, such as pressures, temperatures, and flowrates. In the past few decades, soft sensors have been extensively studied and implemented in the process industries. Typically, soft sensors can be divided into two general classes: model-driven (white box) and data-driven (black box). Model-driven soft sensors are commonly based on first-principle models, while data-driven ones are usually based on regression techniques such as principal component analysis, partial least squares, neuro-fuzzy systems, support vector machines, and artificial neural networks (ANNs).
Recently, with the rapid progress in deep learning, ANN variants have once again caught the attention of process engineers due to their powerful nonlinear regression ability. However, ANN variants are black boxes that are usually difficult to interpret with domain knowledge [1]. This drawback has kept scientists and engineers from implementing ANNs more widely on the systems they work on, thus slowing their adoption. With these concerns, explainable artificial intelligence (AI), which aims to make AI interpretable and trustworthy, has become a focal field of machine learning [2]. For process engineering and control, it is also critical to implement interpretable models so that the predictions of these models are not merely accurate but also interpretable based on domain knowledge.
The quality of the data is key to training a good AI model. Udugama et al. [3] reported that the data of chemical plants require four properties: volume, variety, velocity, and veracity. It is difficult to obtain an accurate model when any one of these properties is lacking. For conventional machine learning methods, a large amount of such data is necessary for training robust models with both accuracy and interpretability. However, in the process industries, most of the critical quality variables, such as concentration and viscosity, are measured by low-frequency offline analysis in laboratories. Due to the low frequencies, the corresponding databases are usually small and require long periods of time to grow large enough for training neural networks. Furthermore, for some newly started processes with a short operation history, it is impossible to gather big data. Small datasets are an inherent problem in soft sensor development [4]. To overcome the lack of data, one common approach is to use linear models such as partial least squares (PLS). However, industrial processes are nonlinear over a large operating range. For instance, distillation columns in chemical plants exhibit very nonlinear behavior when producing high-purity products. Hence, linear models require constant updating [5,6]. Nonlinear models such as ANNs are commonly used to predict nonlinear systems such as distillation columns, but the generalization ability of ANN models must be checked using validation data and regularization [7]. Our recent study showed that a simple validation test might not be sufficient to ensure generalizability and physical consistency when the datasets are limited.
It should be pointed out that some form of prior knowledge must exist when we try to build a data-driven model. Hybrid models have been used to alleviate the problem of small datasets and improve soft sensor accuracy [8,9]. Prior knowledge may be in the form of data from a similar system or an approximate simulator based on a first-principle model. In machine learning, the technique of building a data-driven model for the current problem, the target domain, from a model of another similar system, the source domain, is known as transfer learning [10-12]. The purpose of this study was to present a new data-driven soft sensor development methodology that combines first-principle simulations and transfer learning to overcome the overfitting issue and ensure the interpretability of the soft sensor when only a limited dataset is available. An industrial C4 separation column was used to demonstrate the performance of this approach. Furthermore, gain consistency analysis was used to ensure the interpretability of the soft sensors.

Transfer Learning
One of the most common approaches to performing transfer learning is to fine-tune the parameters (weights and biases) of the networks [12], which was employed in this study. During the fine-tuning procedure, it is critical to determine which layers should be frozen (nontrainable). There is no common consensus on which layers should be frozen, but usually the weights of the first or last few layers are updated.

Process Simulator in Transfer Learning Framework
In the process industries, process simulators serve as core tools to calculate, analyze, and optimize chemical and refining processes. These simulators provide engineers with reasonable results, based on first-principle theories and empirical correlations, for operating decision making. With the rapid progress in computational power, process simulators, especially dynamic ones, are ideal candidates to provide the source-domain datasets in a transfer learning framework for the development of soft sensors. With the help of dynamic simulators, big datasets can be obtained within a short period of time.
To generate a dataset from first-principle simulators, their operating conditions, such as feed stream conditions and controller setpoints, should be set to give periodic random variations within reasonable ranges. The methods of data extraction from simulators were reported in [13,14]. In this study, we used MATLAB Simulink, connected to the simulators, to extract and collect the simulation data.
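The excitation scheme itself is not detailed here; as a minimal hypothetical sketch (the function name, the ±5% range, and the hold length are illustrative assumptions, not values from the study), piecewise-constant random setpoint variations could be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_setpoint_schedule(nominal, rel_range, n_steps, hold=36):
    """Piecewise-constant random excursions around a nominal value.

    nominal   : nominal operating value (e.g., a feed flowrate setpoint)
    rel_range : maximum relative deviation (0.05 means +/- 5 %)
    n_steps   : total number of simulation steps to cover
    hold      : steps each random level is held before redrawing
    """
    n_levels = -(-n_steps // hold)  # ceiling division
    levels = nominal * (1.0 + rng.uniform(-rel_range, rel_range, n_levels))
    return np.repeat(levels, hold)[:n_steps]

# e.g., a feed-flow setpoint varied within +/- 5 % over 720 steps
schedule = random_setpoint_schedule(100.0, 0.05, n_steps=720)
print(schedule.shape, schedule.min() >= 95.0, schedule.max() <= 105.0)
```

A schedule like this would be fed to the dynamic simulator's setpoints so that the collected data excite the process over its normal operating envelope.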

Process Description
In this study, an industrial C4 separation column was used as an example to illustrate the effectiveness of this approach. The column separated the C4 and C5+ components of the reactor effluent. The main product, C4 (over 90% of the feed), left as the liquid distillate, while some noncondensable light impurities left via the vapor distillate and the C5+ components left from the bottom. Quality control of the liquid distillate and bottom product was the top priority of this distillation operation, especially the quality of the liquid distillate; namely, the concentration of C5+ impurities in the distillate and the C4 losses at the bottom should be controlled within acceptable ranges. Hence, two soft sensors were built in this case study to monitor the C5+ impurities at the distillate and the C4 losses at the bottom, respectively.
According to the domain knowledge of distillation unit operations, 14 critical process variables, including pressures, temperatures, and flow rates, were selected as the input variables for the soft sensors, as shown in Figure 1. The selected variables can be divided into two types: six manipulated variables (MVs) and eight sensor variables (SVs). MVs were manipulated by manual or automatic approaches, while the SVs were only the measured values, as shown in Table 1.

Data Preprocessing
For soft sensors of distillate quality, there were 929 available samples for modeling, where 838 samples were used for learning (training and validation) and 91 samples for testing. For soft sensors of bottom quality, there were 453 available samples for modeling, where 414 samples were used for learning and 39 samples for testing.
The moving window method was applied to account for the dynamic behaviors of the process [15]. The window length (W) was 1 h of backtracking from each sampling instant t, and each input variable was averaged every 10 min. The input-output relations of the soft sensor can be mathematically expressed as:

y_t = f(mv_t, mv_{t-1}, ..., mv_{t-W}, sv_{t-1}, ..., sv_{t-W})  (1)

where subscript t represents time, W represents the window length (in lagged samples), mv represents the manipulated variables, and sv represents the sensor variables.
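As an illustration of this windowing scheme, the following sketch assembles one input vector from 10-min-averaged records. The lag count of five (plus the current MV value) is inferred from the 76-feature input dimension stated later, so treat it as an assumption; the function and array names are ours:

```python
import numpy as np

def build_window_features(mv, sv, t, n_lags=5):
    """Assemble one soft sensor input vector at sampling instant t.

    mv : (T, 6) array of manipulated variables (10-min averages)
    sv : (T, 8) array of sensor variables (10-min averages)
    Uses mv at t, t-1, ..., t-n_lags and sv at t-1, ..., t-n_lags,
    giving 6*(n_lags+1) + 8*n_lags = 76 features for n_lags = 5.
    """
    mv_part = mv[t - n_lags : t + 1][::-1].ravel()  # mv_t ... mv_{t-5}
    sv_part = sv[t - n_lags : t][::-1].ravel()      # sv_{t-1} ... sv_{t-5}
    return np.concatenate([mv_part, sv_part])

# toy check of the feature count
mv = np.zeros((10, 6))
sv = np.zeros((10, 8))
print(build_window_features(mv, sv, 6).shape)  # (76,)
```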

Network Structure and Hyperparameters
A feedforward network (FFN) is the simplest neural network, consisting of a multilayer perceptron (MLP). In this study, fully connected FFNs with five hidden layers based on five different models were tested and compared. The five models were as follows: a simple FFN trained from scratch; an FFN with L2-norm regularization (R-FFN) trained from scratch; and three fine-tuned FFNs (FT-FFNs) with L2-norm regularization, transferred from different source-domain models.
For the FFN, the R-FFN, the three FT-FFNs, and the source-domain models, the number of inputs was 76 features, and the number of outputs was one. The number of parameters for all models was 32,161. To train the FFN and R-FFN from scratch, Glorot uniform initialization [17] (the default option of the Keras library) was applied. The regularization rate λ, the penalty weighting of the L2-norm of the parameters, was fixed to 0.01 for both the R-FFN and the FT-FFNs.
According to the universal approximation theorem for width-bounded networks, deep networks with rectified linear unit (ReLU) [18] activation and N + 4 neurons per layer can approximate any Lebesgue-integrable function, where N is the number of features [19]. In this case, there were 76 features (mv_t, mv_{t-1}, ..., mv_{t-W}, sv_{t-1}, ..., sv_{t-W}) as the inputs, so 80 neurons were used in each layer. The algorithm for gradient descent optimization was Adam [20]. Additionally, to avoid overfitting when modeling with extremely small datasets, an L2-norm penalty was added to the loss function. All the modeling work was done in Python using the Keras library.
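As a cross-check, the quoted total of 32,161 parameters follows directly from this fully connected architecture (76 inputs, five hidden layers of 80 units, one output); a short sketch:

```python
def ffn_param_count(n_inputs, hidden_widths, n_outputs=1):
    """Total dense-layer parameters: weights (fan_in * width) plus biases."""
    total, fan_in = 0, n_inputs
    for width in hidden_widths + [n_outputs]:
        total += fan_in * width + width
        fan_in = width
    return total

# 76 features in, five ReLU layers of 80 units each, one output
print(ffn_param_count(76, [80] * 5))  # 32161
```

The first hidden layer contributes 76 × 80 + 80 = 6160 parameters, each of the four remaining hidden layers 80 × 80 + 80 = 6480, and the output layer 80 + 1 = 81, summing to 32,161.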

Metrics of Performance
For the development of robust and interpretable neural models, both predictive accuracy and interpretability (descriptive accuracy) should be cautiously considered. In this study, the root-mean-square error (RMSE) was used as the accuracy metric for the soft sensors and was calculated with the following equation:

RMSE = sqrt( (1/n) Σ_{t=1}^{n} (y_t − ŷ_t)^2 )  (2)

where y_t is the measured value, ŷ_t is the predicted value, and n is the number of samples.
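The RMSE metric in code form (a minimal sketch):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ~ 1.155
```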
Alongside the metrics for predictive accuracy, the models were also interpreted using post hoc analysis. Post hoc interpretability is a concept and approach of interpretable machine learning [2,21]. It aims to interpret black boxes globally or locally using domain-knowledge-based models. Local interpretation aims to identify the contribution of each input feature toward a specific model prediction and usually attributes a model's decision to its input features (Du et al., 2018). Such interpretation is usually done by posing perturbations on certain features of the input.
For soft sensors regressing the input-output relationships of chemical processes, the responding behaviors of the outputs to disturbances (perturbations) of the inputs, usually called the process gains, should be physically consistent with chemical engineering domain knowledge. The dynamic process gain (K^dyn_ij) of quality variable i posed by manipulated variable j can be defined as:

K^dyn_ij,t = Δqv_i,t / Δu_j,t  (3)

where Δu_j,t is the perturbation of manipulated variable j at sampling instant t and Δqv_i,t is the resulting change of the predicted quality variable i. For soft sensors of distillation columns, the inputs include the manipulated variables, such as the reflux rate and reboiler temperature, and the outputs are the qualities of the distillate and bottom products. Thus, there are four process gains, including two main gains (i = j) and two interaction gains (i ≠ j).
It is reasonable, based on common knowledge of distillation unit operation, to expect the signs of the dynamic and steady-state process gains to be consistent. Therefore, the percentage of testing samples whose dynamic and steady-state process gain signs agree was defined as the gain consistency (Con_ij), as follows:

Con_ij = (1/n) Σ_{t=1}^{n} Hv(K^dyn_ij,t × K^ss_ij) × 100%  (4)

where Hv is the Heaviside function and n is the number of testing samples. An interpretable soft sensor should at least have high gain consistency so that it responds reasonably to changes of the manipulated variables.
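The gain consistency metric can be sketched as follows; the sign test is implemented as the Heaviside function of the product of the dynamic and steady-state gains (function and variable names are ours):

```python
import numpy as np

def gain_consistency(k_dyn, k_ss):
    """Percentage of testing samples whose dynamic gain sign matches
    the steady-state gain sign.

    k_dyn : array of dynamic gains K^dyn_ij over the testing samples
    k_ss  : scalar steady-state gain K^ss_ij (e.g., from the simulator)
    """
    k_dyn = np.asarray(k_dyn, dtype=float)
    # heaviside(x, 0) = 1 for x > 0, else 0: counts matching signs
    return float(np.mean(np.heaviside(k_dyn * k_ss, 0.0))) * 100.0

# e.g., three of four dynamic gains share the negative steady-state sign
print(gain_consistency([-0.4, -0.1, 0.2, -0.3], k_ss=-1.0))  # 75.0
```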

Degree of Freedom
There is often an issue of the degrees of freedom (DoF) of the networks, i.e., whether they are over-parameterized, which implies overfitting. Intuitively, the parameter-to-data ratio, conventionally calculated in the form of Equation (5), is an appropriate way to estimate the DoF of networks:

ratio = N_p / N_s  (5)

where N_p is the number of network parameters and N_s is the number of learning samples. However, some studies [22,23] showed that the equivalent DoF of multilayer FFNs is related only to the units in the highest hidden layer, with the other layers performing only geometric transformations of the data. Thus, we instead considered the DoF of the networks using Equation (6):

ratio = (M + 1) / N_s  (6)

where M is the number of units in the highest hidden layer. To consider the effect of the number of learning samples, the neural networks were trained with 360, 270, 180, 90, and 45 samples (with parameter-to-data ratios of 0.225, 0.3, 0.45, 0.9, and 1.8, respectively); 20% of these samples were used as the validation set during learning.
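A quick numerical check of the quoted ratios, using the equivalent DoF of the highest hidden layer (80 units plus one bias term):

```python
# Equivalent DoF of the FFN: units in the highest hidden layer plus one
# bias term (80 + 1 = 81), divided by the number of learning samples.
equiv_dof = 80 + 1

for n_samples in (360, 270, 180, 90, 45):
    print(n_samples, round(equiv_dof / n_samples, 3))
# -> 0.225, 0.3, 0.45, 0.9, 1.8
```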

Source-Domain Models
Three ASPEN Plus Dynamics simulators were constructed to serve as the source domains in this study. The first source domain (D1) mimicked the actual plant, a debutanizer. The number of trays, type of trays, locations of feeds and draws, and feed flowrates were similar to those of the actual column. The hardware parameters, such as the actual column diameter, sump size, and accumulator sizes, were obtained using auto-sizing in ASPEN Plus. It should be noted that it is tedious, time-consuming, and somewhat unrealistic to build a rigorous simulation that dynamically matches the real process exactly. Furthermore, as shown in the following discussion, such a simulation is unnecessary with transfer learning techniques. The second source domain (D2) was also a debutanizer, taken from the literature [24,25]. The third source domain (D3) was a methanol/water splitter, also taken from the literature [24,25]. Since the processes were the same, D1 and D2 used the same thermodynamic model, the Peng-Robinson equation of state. For the methanol/water splitter, a UNIQUAC model was used.
All three source domains were separation tower processes with the same MVs shown in Table 1. The temperature sensors in D1 were located at the same positions as those in the plant. D2 and D3 had different numbers of trays, and hence the corresponding SVs were the temperatures of trays selected by their relative positions with respect to the condenser and the reboiler. The quality outputs of D2 and D3 were set as the light or heavy component.
In general, the qualities of the distillate and bottom are affected by the reflux flowrate and the reboiler temperature. Thus, in this study, the gain sign analysis focused on the responses of the distillate quality (qv1) and bottom quality (qv2) to the reflux flowrate (u1) and the reboiler temperature (u2). For these source domains, the corresponding steady-state process gains shared the same signs; namely, the main gains (K^ss_11 and K^ss_22) were all negative, and the interaction gains (K^ss_12 and K^ss_21) were all positive, as shown in Table 2. With the datasets generated by these source domains, three source-domain neural network models were pretrained, and their gain consistencies were calculated. There were two potential factors affecting the result of transfer learning: (1) the domain similarity between the source domains and the target domain, and (2) the gain consistency of the source-domain models. Both effects were considered in this case study. To observe the gain consistency effect, source-domain models with low gain consistencies were intentionally included; namely, Con_12 of the D3 model was 0%, as shown in Table 3.

Fine-Tuning Recipe
There is no general criterion for neural network fine-tuning [26]. The most common practices are done by fine-tuning deep layers while freezing shallow ones [12]. However, Li et al. [27] stated that shallow layers also had some effects during domain adaptation. Thus, to obtain better fine-tuning results, the trial-and-error method was used to figure out the best one before performing fine-tuning.
The trial results are shown in Figure 2. As the figure shows, the shallowest layer gave the most significant contribution to minimizing the RMSE in fine-tuning procedures of soft sensors of the distillate and bottom. Note that the six digits in tuning recipes represented positions of the hidden layers and output layer; 1 represented trainable on the specific layer, and 0 represented nontrainable. Generally, the recipes freezing the shallowest layer performed worse than the ones updating their weights, and the ones freezing the intermediate layers performed better than the ones updating their weights. Compared with the shallowest layer, the deeper layers contributed much less effect during fine-tuning, but the deeper layers still gave the contribution to reducing the RMSE. Thus, in this study, the recipe "100011" was chosen, marked by the red arrow in Figure 2, where the output layer, the shallowest hidden layer, and the deepest hidden layer were fine-tuned, while the intermediate ones were frozen. The number of trainable parameters with this recipe was 12,721, and the number of nontrainable parameters was 19,440.
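The quoted trainable/nontrainable split for recipe "100011" can be reproduced from the layer dimensions (76 inputs, five hidden layers of 80 units, one output); a short sketch:

```python
def dense_params(fan_in, width):
    """Parameters of a fully connected layer: weights plus biases."""
    return fan_in * width + width

widths = [80] * 5 + [1]  # five hidden layers and the output layer
recipe = "100011"        # 1 = trainable, 0 = frozen, per layer
fan_in, trainable, frozen = 76, 0, 0
for flag, width in zip(recipe, widths):
    params = dense_params(fan_in, width)
    if flag == "1":
        trainable += params
    else:
        frozen += params
    fan_in = width

print(trainable, frozen)  # 12721 19440
```

The shallowest hidden layer (6160), the deepest hidden layer (6480), and the output layer (81) sum to 12,721 trainable parameters, while the three frozen intermediate layers hold 3 × 6480 = 19,440.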

Predictive Accuracy
Figure 3 plots the predictive accuracy on the testing set of the distillate soft sensors. Both the R-FFN and the three FT-FFNs largely improved the accuracy compared with the simple FFN, even when the parameter-to-data ratio was larger than one (the red area in the figure). The FT-FFNs based on different source domains performed slightly better, and they had similar accuracy, which indicated that the source domain does not need to be very accurate with respect to the target domain. With narrower statistical distributions, the FT-FFNs also showed better robustness than their pure data-driven counterparts.
Figure 4 plots the predictive accuracy on the testing set of the bottom soft sensors. The FT-FFNs still performed better than the simple FFN and the R-FFN regardless of the dataset size. However, the R-FFN performed better than the FFN only when small datasets were used, and it failed to improve the accuracy when bigger datasets were available. Such phenomena are called under-fitting, which occurred because the λ parameter of the L2-norm was so strict that the model ended up over-penalized. The three FT-FFNs showed a similar order of accuracy when the dataset size was small. When there is more data, the performance of FT-FFN#3 becomes inferior. The introduction of some useful prior knowledge, regardless of its accuracy, is very helpful when there is not enough data; as more data become available, a more accurate source domain is required.
Figure 5a-d shows the C4 loss and C5+ impurity predictions for the R-FFN and FT-FFN#1 with a parameter-to-data ratio of 0.225. Since the predictions of FT-FFN#2 and FT-FFN#3 were similar to those of FT-FFN#1, only the result of FT-FFN#1 is displayed.

Main Gain Consistency
Figures 6 and 7 plot the main gain consistencies Con_11 and Con_22, respectively, for the different modeling methods. Both the R-FFN and the FT-FFNs improved the gain consistency even when the parameter-to-data ratio was larger than one, i.e., when only small datasets were available. As with predictive accuracy, domain similarity had little effect on obtaining the correct directions of the main process gains.

Interaction Gain Consistency
Figures 8 and 9 show the interaction gain consistencies Con_12 and Con_21, respectively, for the different modeling methods. The results showed that the FT-FFNs improved the gain consistency when the source-domain models themselves had gain consistency. In Table 3, the Con_12 values of the D1 and D2 source-domain models were 100% and 89%, respectively, which ensured that the target models based on them had high gain consistency. Contrarily, the Con_12 of the D3 source-domain model was 0%, which led to target models with low gain consistency. Additionally, although the R-FFN improved the predictive accuracy (shown in Figure 3), it not only failed to improve Con_12 but lowered it instead, leading to low model interpretability.
In Figure 9, the three FT-FFNs improved the gain consistency due to the high Con_21 values of the three source-domain models. The R-FFN improved Con_21, but it failed to provide high predictive accuracy when bigger datasets were available (shown in Figure 4).



Conclusions
In this paper, a new methodology combining first-principle simulation and transfer learning was proposed to address the problems of overfitting and low interpretability posed by the small datasets often encountered in industrial processes. The method was applied to a real distillation process. It showed its advantages in enhancing both predictive accuracy and physical interpretability over other conventional deep learning methods, especially when the amount of available real data was small compared to the number of network parameters. Transfer learning was implemented by fine-tuning the weights of the networks, freezing inner layers and updating outer layers. Through fine-tuning, the input-output relationships were modified to accomplish the adaptation from the source domains to the target domain. The results showed that the similarity between the source and target domains had nearly no effect on the fine-tuning results, while the gain consistency of the target models was strongly determined by the gain consistency of their corresponding source-domain models. Additionally, the concept and definition of gain consistency were used as a metric to quantify the physical interpretability of the networks.