Soft Sensor Transferability: A Survey

Soft Sensors (SSs) are inferential dynamical models employed in industries to predict hard-to-measure process variables based on their relation with easily accessible ones. They allow implementation of real-time control and monitoring of the plants and present other advantages in terms of costs and efforts. Given the complexity of industrial processes, these models are generally designed with data-driven black-box machine learning (ML) techniques. ML methods work well only if the data on which the prediction is performed share the same distribution as the data on which the model was trained. This is not always possible, since plants can often show new working conditions. Even similar plants show different data distributions, making SSs not scalable between them. Models should then be created from scratch with highly time-consuming procedures. Transfer Learning (TL) is a field of ML that re-uses the knowledge from one task to learn a new, different but related, one. TL techniques are mainly used for classification tasks. Only recently have TL techniques been adopted in the SS field. This survey reports the state of the art of TL techniques for nonlinear dynamical SS design. Methods and applications are discussed and the new directions of this research field are outlined.


Introduction
The advent of Industry 4.0 improved the automation and monitoring of traditional manufacturing and process industries [1]. Implementing efficient plant monitoring and control policies is performed through measurement acquisition and data elaboration systems.
Models of real industrial processes, able to perform prediction of process variables by exploiting their dependence on other ones, are known as Soft Sensors (SSs) [2,3].
Machine Learning (ML) and Deep Learning (DL) have greatly increased the capabilities of such data-driven systems in the industrial automation field over the last few years [4]. However, the practical implementation of those techniques is hampered by two characteristics of ML [5,6]. The first is that the training dataset and the actual system variables have to share the same feature space and the same distribution for predictions to be reliable: collected data must represent the whole dynamics of the system, since the model cannot provide more information than that stored in the training data themselves. This means that training data have to be large and varied enough to represent uncommon occurrences too.
In an industrial environment, data acquisition already poses a limitation, since production systems and industrial processes cannot be stopped to perform experiments and generate suitable learning data. Ad-hoc experiments are indeed difficult, time-consuming and costly. For this reason, the SS designer refers to data stored in historical databases, which typically contain outliers and multi-rate acquisitions and suffer from labeled data scarcity [4].
The second ML characteristic refers to model retraining, which is needed to cope with time-varying processes and signal drift [7]. Retraining can sometimes be comparable to training a new model from scratch, requiring large amounts of computational power and large datasets. Moreover, either in the presence of new working conditions or when new processes are considered, the amount of available data is often too small to design a new model [8].
Nowadays, the gaps between domains and distributions of data and the changes in the processes are the main obstacles to the use of ML techniques for SS modelling of complex industrial processes.
The reported issues can be mitigated by Transfer Learning (TL), a set of approaches, increasingly popular in the ML field, which take advantage of the knowledge already acquired by a model on a source task to transfer it to a target task [9]. TL would allow reduction in the amount and quality of data needed by transferring knowledge among tasks. TL can also be used to approach model scalability from a process to a similar one [10,11].
TL has increasingly gained attention since the inconsistency of data distributions is a common barrier in many general ML applications. TL has therefore been applied to different fields, as reviewed by different publications, such as medical imaging [12], email spam [13] and speech recognition [14]. On the other hand, in the industrial automation field, it is still a new research topic. Most of the studies focus on classification problems, anomaly detection, fault diagnosis and quality monitoring [15]. Only in the last two years have applications of TL methods for SS design appeared in the literature.
This survey covers the recent developments of TL techniques to nonlinear dynamical SSs design. Methods and applications reported in the recent literature are discussed along with the description of the future trends in the field.
The remainder of the paper is organized as follows: firstly, a brief description of SSs and their design procedure is given in Section 2 to explain the need for TL; the main TL methodologies are then introduced in Section 3; in Section 4 current applications of TL methods for SSs are examined and classified. Final discussions and future trends are drawn in Section 5.

Soft Sensors
In industrial plants, many variables are monitored through online sensors. There are cases, though, in which some of them cannot be measured online due to the lack of online measurement instruments or to hostile environments. This requires frequent sensor maintenance and/or introduces high delays due to laboratory analysis. SSs can be a solution to these issues. They are inferential models that make use of the underlying relations between the accessible process variables to provide an estimation of the required hard-to-measure physical variables, as schematized in Figure 1. Depending on the problem complexity, an SS can be realized by using static/dynamic, linear/nonlinear, time-invariant/variant models [2]. SSs also allow for real-time plant control, measuring system back-up, what-if analysis, sensor validation and fault diagnosis [3,16].
SSs can be designed by using first principle models when a priori physical knowledge of the plant is given [3]. When such knowledge is not available, or the modelling process is overcomplicated, the design of SSs relies on data-driven black-box techniques [26].
Data-driven SS design is a highly time-consuming task that involves pattern recognition [27] and system identification [28] steps, as summarized in the following. Data are retrieved from industries' historical databases and must be selected to represent the whole dynamics of the system. Historical databases usually suffer from oversampling, outliers, missing data, offsets, seasonal effects and high-frequency noise. Accurate pre-processing is therefore needed before the subsequent step of input selection, in which the inputs that are most informative about the chosen output are selected among the many available ones [29].
The model choice should be conducted taking into consideration different characteristics, i.e., linear/nonlinear, static/dynamic, time-variant/invariant. Linear models, in general, do not show good performances for industrial processes. Nonlinear models are therefore widely used.
The following model classes for linear systems are usually considered:

$$y(t) = G(z^{-1})\,u(t) + H(z^{-1})\,e(t)$$

where $u(t)$ and $y(t)$ are the system input and output, $G(\cdot)$ and $H(\cdot)$ are transfer functions, $z^{-1}$ is the time delay operator and $e(t)$ a white noise signal. The identification procedure aims at determining a good estimate of $G(\cdot)$ and $H(\cdot)$, so the model can produce one-step-ahead predictions with a low variance error. The one-step-ahead predictor can be written in its regressor form as:

$$\hat{y}(t \mid t-1) = \phi^{T}(t)\,\theta$$

where $\theta$ is the parameter vector and $\phi$ the regression vector that can contain past samples of system inputs and outputs and/or residuals. The model is then determined by identifying the parameters of the transfer functions. Model structures are defined by imposing the structure of the transfer functions, i.e., of the regression vector. The main parametric structure families are:
• FIR, characterized by the following regression vector:

$$\phi(t) = [u(t-d), \ldots, u(t-m)]$$

with $d$ and $m$ being the delays of the input samples;
• ARX, characterized by the following regression vector:

$$\phi(t) = [y(t-1), \ldots, y(t-l), u(t-d), \ldots, u(t-m)]$$

where $l$ is the maximum delay needed for the output variables;
• ARMAX, characterized by the following regression vector:

$$\phi(t) = [y(t-1), \ldots, y(t-l), u(t-d), \ldots, u(t-m), \varepsilon(t-1), \ldots, \varepsilon(t-k)]$$

where $\varepsilon$ is the model residual and $k$ is the associated maximum time delay.
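As a minimal sketch of the regressor form, assuming a single-input single-output process, the following Python snippet builds an ARX regression vector from past outputs and inputs and estimates the parameter vector by ordinary least squares. The function names (`arx_regressors`, `fit_arx`), the chosen delays and the simulated first-order system are illustrative assumptions and are not taken from the surveyed works.

```python
import numpy as np

def arx_regressors(u, y, l, d, m):
    """Stack phi(t) = [y(t-1..t-l), u(t-d..t-m)] for every usable t."""
    start = max(l, m)                      # first t for which all lags exist
    rows, targets = [], []
    for t in range(start, len(y)):
        past_y = [y[t - i] for i in range(1, l + 1)]
        past_u = [u[t - j] for j in range(d, m + 1)]
        rows.append(past_y + past_u)
        targets.append(y[t])
    return np.array(rows), np.array(targets)

def fit_arx(u, y, l, d, m):
    """Least-squares estimate of theta in y_hat(t) = phi(t)^T theta."""
    Phi, Y = arx_regressors(u, y, l, d, m)
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return theta

# Hypothetical process: y(t) = 0.8 y(t-1) + 0.5 u(t-1) + small noise
rng = np.random.default_rng(0)
u = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + 0.5 * u[t - 1] + 0.01 * rng.normal()

# One output lag and one input lag; theta should land near [0.8, 0.5]
theta = fit_arx(u, y, l=1, d=1, m=1)
```

An FIR structure would follow from `l=0`, while an ARMAX structure would additionally require the residuals, which are not available before a first model has been fitted.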
The linear structures above can be extended to their nonlinear counterparts, respectively NFIR, NARX and NARMAX, where a nonlinear function is considered between the regressor vector and the estimated output. For data-driven designed SSs, these model structures can be implemented with a wide variety of ML techniques such as Artificial Neural Networks (ANN) [30], Convolutional Neural Networks (CNN) [31], Generative Adversarial Networks (GAN) [32], Deep Belief Networks (DBN) [33], Support Vector Regression (SVR) [34] and Gaussian Process Regression (GPR) [35], just to mention a few.
Finally, the identification step allows empirical estimation of the model unknown parameters based on the training dataset, whereas the validation step exploits test data to verify whether the model can adequately represent the system and be generalized to new samples.
Data-driven ML methods give better performance under the common assumption that training data and test data share the same distribution [5]. This characteristic should also be maintained during the future use of the SS.
The problem of data distribution inconsistency between the training and the test phases can be mitigated by the adoption of TL techniques. TL is a research field in ML that aims to exploit the knowledge gained while learning a task (source domain) to learn a different but related one (target domain) more efficiently [9].
Nevertheless, even though such techniques gained success more than ten years ago, little attention has been paid to their application to the SS field and process systems monitoring.
Few studies have employed TL methods in the fault detection and diagnosis of industrial systems, as reported in Maschler and Weyrich [15], and in regression problems for condition monitoring purposes [36]. Only in the last year have new results appeared in the literature addressing the application of TL techniques to SSs.

Transfer Learning
As stated before, TL methodologies are useful when the data distribution of the target domain is different from the data distribution of the source domain [9]. Figure 2 shows the difference in the learning process in cases of traditional ML and TL. In the first case, the task is learned from scratch each time, while in the TL case the knowledge acquired from a previously learned source task is, in some way, transferred while learning a target task, to overcome scarcity in high-quality target data.
A practical and common example is given by the problem of sentiment classification, which consists of classifying reviews of a specific type of product as positive or negative. To train the classifier, many reviews are first collected and labeled. When the type of product changes, such a classifier will not maintain good performance if the distribution of the data differs from that of the previous product. This means that new data should be collected and labeled for each type of product to create a new classifier. Since such a procedure is very expensive, TL techniques allow a classification model trained for some specific products to be adapted to others of a different type, saving a great amount of effort [37].
Many other examples of TL are observable in nature as well, as shown in Figure 3. Humans themselves can intelligently apply the knowledge previously learned from one task to solve new ones faster or more efficiently. For instance, the knowledge acquired by learning a musical instrument would allow one to learn a new instrument faster.
Formal definitions to introduce the TL problem will now be provided. Given a feature space $\mathcal{X}$, an instance set $X$ defined as $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$ and its marginal probability distribution $P(X)$, a domain $\mathcal{D}$ is defined as

$$\mathcal{D} = \{\mathcal{X}, P(X)\}$$

This means that two domains differ when either their $\mathcal{X}$ or $P(X)$ differ. Given a domain $\mathcal{D}$, a prediction, or a task, $\mathcal{T}$ is defined by a label space $\mathcal{Y}$ and a predictive function $f(\cdot)$:

$$\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$$

This again implies that two tasks differ when either their $\mathcal{Y}$ or $f(\cdot)$ differ. The predictive function $f(\cdot)$ is not known a priori, but learned from the training data, which consist of the labeled pairs $\{x_i, y_i\}$, where $x_i \in X$, $y_i \in \mathcal{Y}$, with $i = 1, \ldots, n$. Given a new instance $x$, $f(\cdot)$ can be used to predict its corresponding label $f(x)$, which from a probabilistic point of view can be written as $P(y|x)$.
In the TL problem, a distinction is made between a source domain $\mathcal{D}_S$ with its corresponding source task $\mathcal{T}_S$, and a target domain $\mathcal{D}_T$ with its target task $\mathcal{T}_T$.
The source domain $\mathcal{D}_S$ is usually observed via the instance-label pairs as

$$\mathcal{D}_S = \{(x_{S_1}, y_{S_1}), \ldots, (x_{S_{n_S}}, y_{S_{n_S}})\}$$

whereas the target domain data are denoted as

$$\mathcal{D}_T = \{(x_{T_1}, y_{T_1}), \ldots, (x_{T_{n_T}}, y_{T_{n_T}})\}$$

In most real applications, an observation of the target domain consists of unlabeled instances or just a limited number of labeled ones, meaning that usually $0 \leq n_T \ll n_S$. TL aims to improve the learning of the target predictive function $f_T(\cdot)$ in $\mathcal{D}_T$ using the knowledge in $\mathcal{D}_S$ and $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
When the target and source domains are the same, $\mathcal{D}_S = \mathcal{D}_T$, and their tasks are the same, $\mathcal{T}_S = \mathcal{T}_T$, the problem becomes a usual ML problem.
As already stated, the condition $\mathcal{D}_S \neq \mathcal{D}_T$ implies that

$$\mathcal{X}_S \neq \mathcal{X}_T \quad \text{or} \quad P(X_S) \neq P(X_T)$$

Analogously, $\mathcal{T}$ being defined as $\mathcal{T} = \{\mathcal{Y}, P(Y|X)\}$, the condition $\mathcal{T}_S \neq \mathcal{T}_T$ means that

$$\mathcal{Y}_S \neq \mathcal{Y}_T \quad \text{or} \quad P(Y_S|X_S) \neq P(Y_T|X_T)$$

Finally, when there exists some kind of relationship between the feature spaces of the two domains, the domains are said to be related. In some cases, when the two domains are not related, a knowledge transfer could be unsuccessful to the point of worsening the learning in the target domain. When the target learner is indeed hurt by the transfer, the phenomenon is referred to as negative transfer [38].
The above definitions allow a categorization of TL techniques. Approaches can be grouped on a different basis, in particular from a problem point of view and from an approach point of view. In the first case, the categorization can be performed on either the presence of labels in the source and target datasets (label setting) or the consistency of the feature space (space setting); the second categorizes the types of TL techniques based on "what" part of the knowledge from the source is actually transferred to the target. The classification is reported in Table 1 and described in detail as follows. From the label-setting aspect, TL techniques can be categorized into three types, based on the possible different situations between domains and tasks of the source and the target: inductive transfer learning, transductive transfer learning and unsupervised transfer learning (see Figure 4).

• Inductive TL: target and source tasks are different, regardless of the domains, and the label information of the target domain instances is available. Target-labeled data induce the learning of the target predictive model, hence the name;
• Transductive TL: target and source domains differ and the label information only comes from the source domain. In this case, if the feature spaces are the same, $\mathcal{X}_S = \mathcal{X}_T$, but the marginal probability distributions of the inputs differ, $P(X_S) \neq P(X_T)$, such a TL setting is referred to as domain adaptation [39];
• Unsupervised TL: target and source tasks are different and the label information is unknown for both the source and the target domains. This means that by definition such a setting regards clustering and dimensionality reduction tasks, and not classification or regression as in the previous cases. For this reason, given the application of SSs, unsupervised solutions are not considered.

Another categorization is based on the consistency between the feature and label spaces from the source and the target.
Besides the label settings or the consistency of the spaces, TL techniques can be categorized based on "what" is transferred, leading to four groups: instance-based, feature-based, parameter-based and relational-based (see Figure 5).
• Instance-based transfer: these approaches assume that, due to the difference in distributions between the source and the target domains, certain parts of the data in the source domain can be reused, by reweighting them so as to reduce the effect of the "harmful" source data, while encouraging the "good" data;
• Feature-based transfer: the idea behind this approach is to learn a "good" feature representation for the target domain, to minimize the marginal and the conditional distribution differences, preserving the properties or the potential structures of the data, and classification or regression model error. For instance, a solution is to find the common latent features through feature transformation and use them as a bridge to transfer knowledge: this case is referred to as a symmetric feature-based transfer since both the source and the target features are transformed into a new feature representation; in contrast, in the asymmetric case, the source features are transformed to match the target ones. When performing feature transformation to reduce the distribution difference, one issue is how to measure such differences or similarities. This is done through specific ad-hoc metrics, which are described in Section 3.1;
• Parameter-based transfer: this approach performs the transfer at the model/parameter level by assuming that models for related tasks should share some parameters or prior distributions of hyperparameters. So, by discovering them, knowledge is transferred across the tasks themselves;
• Relational-based transfer: these approaches deal with transfer learning for relational domains, where the data are not independent and identically distributed (i.i.d.) and can be represented by multiple relations. The assumption is that some relationship among the data in the source and target domains is similar and that this is the knowledge to be transferred, by transferring the logical relationships or rules learned in the source domain to the target domain.
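To make the parameter-based idea more concrete, the sketch below shows one simple instance of it, assuming a linear model: the target parameters are estimated by a ridge regression that is regularized towards the source parameters instead of towards zero. The function name `transfer_ridge`, the weight `lam` and the synthetic data are hypothetical choices made for illustration; this is not a method taken from the surveyed papers.

```python
import numpy as np

def transfer_ridge(X_t, y_t, theta_s, lam):
    """Minimize ||X_t theta - y_t||^2 + lam * ||theta - theta_s||^2.

    Setting the gradient to zero gives the closed form
    theta = (X_t^T X_t + lam I)^-1 (X_t^T y_t + lam theta_s).
    """
    n_features = X_t.shape[1]
    A = X_t.T @ X_t + lam * np.eye(n_features)
    b = X_t.T @ y_t + lam * theta_s
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)

# Assumed source parameters and a slightly different (related) target process
theta_source = np.array([1.0, -2.0, 0.5])
theta_target_true = np.array([1.2, -1.8, 0.5])

# Only five labeled target samples are assumed to be available
X_t = rng.normal(size=(5, 3))
y_t = X_t @ theta_target_true + 0.05 * rng.normal(size=5)

# lam balances trust in the source model against the scarce target data
theta_transfer = transfer_ridge(X_t, y_t, theta_source, lam=1.0)
```

A large `lam` keeps the target model close to the source one, whereas `lam` close to zero discards the source knowledge and falls back to a fit on the target data alone.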
The references reported in this paper are classified based on the above categories. In the next section, some metrics generally used in the TL framework are briefly introduced.

Distribution Distance Metrics
Metrics to evaluate the differences among data distributions are commonly adopted in feature-based TL to learn a new space that reduces the distribution difference between the two domains. How to measure the distribution difference or the similarity between domains is, therefore, an important task. Such metrics are used in instance-based TL methods as well, to produce the weights of the instances by minimizing the adopted metric between the domains [40]. The most adopted metrics in TL techniques are reported in Table 2 and defined in the following.
• Maximum Mean Discrepancy (MMD) [41]
Given two distributions $P$ and $Q$, MMD is defined as the distance between their means mapped into a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$:

$$\mathrm{MMD}(P, Q) = \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}$$

where $\mu$ represents the mean value of the distribution. The MMD is one of the most used measures in TL. One known feature representation method for TL called Transfer Component Analysis (TCA) [42] learns some transfer components across domains in an RKHS using MMD. Another unsupervised feature transformation technique called Joint Distribution Adaptation (JDA) jointly adapts both the marginal and conditional distributions of the domains in a dimensionality reduction procedure based on Principal Component Analysis (PCA) and the MMD measure [43].
• Kullback-Leibler Divergence ($D_{KL}$) [44]
$D_{KL}$ is an asymmetric measure of how one probability distribution differs from another. Given two discrete probability distributions $P$ and $Q$ on the same probability space $\mathcal{X}$, $D_{KL}$, or the relative entropy, from $Q$ to $P$ is defined as:

$$D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

In Zhuang et al. [45] a supervised representation learning method based on deep autoencoders for TL is introduced so that the distance in distributions of the instances between the source and the target domains is minimized in terms of $D_{KL}$. Feature-based TL realized through autoencoders is proposed in Guo et al. [46], where $D_{KL}$ is adopted to measure the similarity of new samples with respect to historical data samples.
• Jensen-Shannon Divergence (JSD) [47][48][49]
JSD is a symmetric and smooth version of $D_{KL}$, defined as:

$$\mathrm{JSD}(P \parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M)$$

with $M = \frac{1}{2}(P + Q)$. JSD is used in Dey et al. [48] as a distance metric in a clustering technique for domain adaptation purposes. A classification method for TL proposed in Chen et al. [49] exploits the JSD measure with a PCA feature mapping technique.
• Bregman Divergence ($D_F$)
$D_F$ is a difference measure between two points defined in terms of a strictly convex function called Bregman function $F$. The points can be interpreted as probability distributions. Given $F : \Omega \to \mathbb{R}$, a continuously-differentiable, strictly convex function defined on a closed convex set $\Omega$, the Bregman distance $D_F$ associated with $F$ for points $p, q \in \Omega$ is defined as the difference between the value of $F$ at point $p$ and the value of the first-order Taylor expansion of $F$ around point $q$ evaluated at point $p$:

$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$$

A TL method for hyperspectral image classification proposed in Shi et al. [51] employs a regularization based on $D_F$ to find a common feature representation for both the source domain and the target domain. A domain adaptation approach introduced in Sun et al. [52] reduces the discrepancy between the source domain and the target domain in a latent discriminative subspace by minimizing a $D_F$ matrix divergence function.
• Hilbert-Schmidt Independence Criterion (HSIC) [53]
Given separable RKHSs $\mathcal{F}$, $\mathcal{G}$ and a joint measure $p_{xy}$ over $(\mathcal{X} \times \mathcal{Y}, \Gamma \times \Lambda)$, HSIC is defined as the squared HS-norm of the associated cross-covariance operator $C_{xy}$:

$$\mathrm{HSIC}(p_{xy}, \mathcal{F}, \mathcal{G}) = \lVert C_{xy} \rVert_{HS}^{2}$$

A domain adaptation method called Maximum Independence Domain Adaptation (MIDA) finds a latent feature space in which the samples and their domain features are maximally independent in the sense of HSIC [54]. Another method to find the structural similarity between source and target domains is proposed in Wang and Yang [55]. The algorithm extracts the structural features within each domain and then maps them into the RKHS. The dependency estimations across domains are performed using the HSIC.
• Wasserstein Distance ($W$) [56]
Given two distributions $P$ and $Q$, the $p$th Wasserstein distance metric $W$ is defined as:

$$W_p(P, Q) = \left( \int_0^1 \left| F_P^{-1}(u) - F_Q^{-1}(u) \right|^p \, du \right)^{1/p}$$

where $F_P$ and $F_Q$ are the corresponding cumulative distribution functions and $F_P^{-1}$ and $F_Q^{-1}$ the respective quantile functions. $W$ is employed in Shen et al. [57] for an algorithm that aims to learn a domain invariant feature representation. It utilizes an ANN to estimate the empirical $W$ distance between the source and target samples and optimizes a feature extractor network to minimize the estimated $W$ in an adversarial manner. A $W$-based asymmetric adversarial domain adaptation is also proposed in Ying et al. [58] for unsupervised domain adaptation for fault diagnosis.
• Central Moment Discrepancy (CMD) [59]
CMD is a distance function on probability distributions on compact intervals. Given two bounded random vectors $X = (X_1, \ldots, X_N)$ and $Y = (Y_1, \ldots, Y_N)$ i.i.d. and two probability distributions $P$ and $Q$ on the compact interval $[a, b]^N$, CMD is defined as

$$\mathrm{CMD}(P, Q) = \frac{1}{|b - a|} \lVert E(X) - E(Y) \rVert_2 + \sum_{k=2}^{\infty} \frac{1}{|b - a|^k} \lVert c_k(X) - c_k(Y) \rVert_2$$

where $E(X)$ is the expectation of $X$ and $c_k(X)$ is the central moment vector of order $k$ defined in Zellinger et al. [59]. In a domain adaptation method for fault detection presented in Li et al. [60], a CNN is applied to extract features from two differently distributed domains and the distribution discrepancy is reduced using the CMD criterion. Another CNN- and CMD-based method for fault detection is proposed in Xiong et al. [61].
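Some of the metrics listed above can be estimated directly from samples. The following sketch, given under simplifying assumptions, computes $D_{KL}$ and JSD for discrete probability vectors, the empirical Wasserstein distance for two one-dimensional samples of equal size, and a biased estimate of the squared MMD with a Gaussian (RBF) kernel. The function names, the kernel width `gamma` and the synthetic source and target samples are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                          # terms with P(x) = 0 contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Symmetric JSD built from two KL terms against the mixture M."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1d(x, y, p=1):
    """Empirical p-th Wasserstein distance for equal-size 1-D samples.

    Sorting both samples pairs up their empirical quantiles.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD with k(a, b) = exp(-gamma ||a - b||^2)."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean())

# Hypothetical source and target samples with a shift in the mean
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 2))
target = rng.normal(1.0, 1.0, size=(200, 2))

mmd_shifted = mmd_rbf(source, target)     # grows with the distribution shift
mmd_same = mmd_rbf(source, source)        # zero for identical samples
w_first_feature = wasserstein_1d(source[:, 0], target[:, 0])
```

In a feature-based TL method, a quantity such as `mmd_shifted` would typically be used as a term of the loss to be minimized with respect to the feature transformation, rather than only as a diagnostic.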
Since it is often difficult to design metrics that are well-suited to the particular data and task of interest, an ML field called Distance Metric Learning (DML) aims at automatically constructing task-specific distance metrics from supervised data [62]. As an ML task, DML suffers from the same problems described so far, requiring a large amount of label information. For this reason, TL methods have been extended to this sub-field as well, in what is called Transfer Metric Learning (TML) [62]. These fields fall outside the scope of this paper.
The metrics introduced so far are also adopted in some TL implementations for SSs, for both feature- and instance-based TL methods, as described in the next section.

Transfer Learning in SS Design
The implementations of TL on SSs from the literature here considered are categorized in Table 3. Because of a lack of comparability between the different scenarios and cases, listing and comparing results (in terms of final performance or implementation burden) is not feasible. In the following sub-sections, the different solutions applied in the field of SS modelling are illustrated.
To better highlight the motivation behind the application of TL in SSs, works are here classified based on the use case as proposed in Maschler and Weyrich [15] (see Figure 6).
The cases considered are the following:
• Cross-phase, which is the case in which plants meet new working conditions and models lose accuracy: this can happen because of signal drift or different operative stages in multi-grade processes or, in the case of production processes, because of changes in products, tools, machines or materials;
• Cross-entity, which is the case in which TL is adopted to transfer knowledge between similar but physically different processes.
The classification is then performed from a problem and solution point of view: the former considers the TL settings described in Section 3, whereas in the latter the approach adopted and the chosen ML method are considered.
Finally, works are grouped into four groups based on the type of process, namely: batch processes, production processes, multi-grade chemical processes and industrial process systems.

Batch Processes
Industrial processes in which the output appears in quantities of materials or lots, and which present characteristics of both continuous and discrete processes, are called batch processes. The product of a batch process is called a batch. Quality control of this kind of process is a difficult task due to nonlinearity and time variance, so designing models able to accurately capture the process behaviour is difficult as well. One of the problems affecting batch process modelling is that sufficient data are often unavailable. The amount of new batch data is indeed not sufficient to build a reliable process model. In Wang and Zhao [63], to solve this issue, a novel transfer and incremental SS is developed with the support of multiple historical process modes. The proposed algorithm, with the constant increase in new samples from the cloud of historical modes, can incrementally update model parameters to flexibly accommodate new process modes. To quantify the progressive prediction performance, the Root Mean Squared Error (RMSE) of eight different test batches is adopted. Prediction results of the proposed model are graphically compared with those of a general phase-based PLS (Partial Least Squares) model. The RMSE curve of the proposed model fluctuates around 0.05, whereas that of the general model is high and unstable, indicating that the proposed model predicts the real product qualities well, whereas the general model does not.
The works [64][65][66][67] address the problem of TL for batch processes, both within the same process and for knowledge transfer between similar processes. Similar batch processes employ the same or similar raw materials, equipment and control strategies, and the relationships between the process variables are the same or similar. One difficulty in applying TL in similar batch processes is that there are always differences between them, and this leads to a serious plant-model mismatch. These papers address the application of TL in batch processes for quality prediction to solve both the problem of data scarcity and that of plant-model mismatch. The method proposed is based on the latent variable model (LVM) [78] and the joint-Y partial least squares (JY-PLS) regression [79]. The transfer of the process knowledge is achieved through a common latent variable space and the mismatch between variables is addressed through an adaptive control strategy. Results are evaluated by comparing the RMSE of the proposed model with that of a Kernel-PLS (KPLS) model. The proposed model showed a 56% reduction in the RMSE.
The JY-PLS method is also adopted in Jia et al. [68], where the transfer method is based on domain adaptation between the source and target domains (DAJY-PLS). In particular, an index, which is the difference between the variance in source and target domains, is used to realize the trade-off between minimizing the difference of distributions of the domains, quantified through the MMD measure, and maximizing the covariance between the latent and output variables. The efficiency of the proposed approach is verified by comparing the DAJY-PLS and its JY-PLS counterpart, adopting RMSE and MAE (Mean Absolute Error) as measures of performance. The DAJY-PLS showed an average reduction of 67% in the RMSE and of 68% in the MAE over ten different experiments.
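For reference, the performance indices used in these comparisons can be computed as in the short sketch below, which also shows how a percentage reduction with respect to a baseline model is obtained. The helper names and the numerical values are hypothetical and serve only as an illustration; they do not reproduce the results of the cited works.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between measured and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error between measured and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def percent_reduction(baseline_error, proposed_error):
    """Relative error reduction of the proposed model, in percent."""
    return 100.0 * (baseline_error - proposed_error) / baseline_error

# Made-up measurements and predictions of a quality variable
y_measured = [1.0, 2.0, 3.0]
y_baseline = [1.0, 2.0, 5.0]      # e.g., a model without transfer
y_transfer = [1.0, 2.0, 3.5]      # e.g., a model with transfer

rmse_gain = percent_reduction(rmse(y_measured, y_baseline),
                              rmse(y_measured, y_transfer))
mae_gain = percent_reduction(mae(y_measured, y_baseline),
                             mae(y_measured, y_transfer))
```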

Production Processes
Quality prediction is tackled in Tercan et al. [36] and Yao et al. [69] for production processes. Every time a change in production occurs, the process changes behavior, leading to the need for what is called incremental learning. In such a case, new target data are incrementally used to extend the source model's knowledge. In Tercan et al. [36], an injection molding process is considered. To address the changes in the process behavior, an ANN is first trained on the source data and, when the produced part is changed, a new block of neurons, specially trained for the new part, is added, so that the model does not forget the knowledge from the previously learned parts. Such an incremental learning approach is graphically compared, in terms of Correlation Coefficient (CC), to a baseline approach that does not adopt incremental learning but rather jointly trains the ANN on all data available to that point, when six different parts are produced. Results showed that the incremental learning approach maintains a value of CC over 0.95 with every part, while the baseline approach cannot handle the increasing complexity in the data, with a further drop in CC for every newly produced part.
In Yao et al. [69], quality prediction on cement clinker is performed through the prediction of the concentration of free calcium oxide (f-CaO). Incremental learning is needed because of the process time variance. A data-driven model based on deep dynamic feature extraction and transfer methods is applied to build an SS for cement quality prediction. A large semi-supervised dataset is used to extract nonlinear dynamic features through a deep Long Short-Term Memory (LSTM) encoder-decoder network with an attention mechanism. The features are then transferred to an eXtreme Gradient Boosting (XGBoost) regression model for output prediction. The method is compared, in terms of CC and RMSE, to other models, both static and dynamic, proposed in other studies for f-CaO content prediction. In particular, the proposed method showed an improvement of 285% in the CC and a reduction of 74% in the RMSE with respect to a simple static PLS model, and an improvement of 67% in the CC and a reduction of 65% in the RMSE with respect to a dynamic LSTM model.

Multi-Grade Chemical Processes
Multi-grade processes present multiple operating grades, each of them with an unknown distribution discrepancy of process data with respect to the others. The same production line commonly produces different product grades after modifying the operating conditions and/or the ratio of ingredients in the process feed. Since key product qualities cannot be measured online and need laboratory analysis, manual operations for grade changeover are commonly implemented in practice, often leading to inefficient and off-grade products. The use of TL in such a context allows the information from different grades to be applied to enlarge the prediction domain to some extent, even in the case of limited labeled samples.
In Yang et al. [70] and Liu et al. [71], the knowledge transfer between different grades is performed through an extension of Extreme Learning Machines (ELM), called Domain Adaptation ELM (DAELM). To implement the transfer, the empirical error on the labeled target-domain instances is used as a regularization term. The DAELM method is compared with a regularized ELM (RELM) model over three grades, adopting each of them in turn as the source domain in three different experiments. In particular, DAELM reduced the RMSE by 89% with respect to RELM when grade 3 data were used as the source and grade 2 data as the target.
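The role of the target error as a regularization term can be made explicit with a worked objective. The following is a sketch of a source-type DAELM formulation as it is commonly written in the domain adaptation ELM literature; the symbols are assumptions introduced here ($H_S$, $H_T$ are the hidden-layer output matrices on source and labeled target samples, $y_S$, $y_T$ the corresponding outputs, $\beta$ the output weights, $C_S$, $C_T$ the penalty coefficients), and the surveyed works may adopt a different variant.

```latex
\min_{\beta}\;
  \frac{1}{2}\lVert \beta \rVert^{2}
  + \frac{C_S}{2}\lVert H_S \beta - y_S \rVert^{2}
  + \frac{C_T}{2}\lVert H_T \beta - y_T \rVert^{2}

% Setting the gradient with respect to \beta to zero gives the closed form
\beta^{*} =
  \left( I + C_S H_S^{\top} H_S + C_T H_T^{\top} H_T \right)^{-1}
  \left( C_S H_S^{\top} y_S + C_T H_T^{\top} y_T \right)
```

Under these assumptions, setting $C_T = 0$ recovers a regularized ELM trained on the source grade only, while increasing $C_T$ gives more weight to the few labeled target samples.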
The method is further investigated in Liu et al. [72], where the distribution discrepancy between the grades is first reduced through a GAN-based feature transformation, and the DAELM method is then applied.

Industrial Process Systems
Signal drifts often affect process systems. These changes in data distribution over time lead to a decrease in SS performance.
In such cases, a possible solution for the designer is to fine-tune the model on the new working points. In Hsiao et al. [73], ANN fine-tuning strategies over small datasets are explored to adapt the SS of a refinery distillation column over time. To avoid losing previously learned knowledge, the inner layers are frozen and only the outer ones are updated. Performance is evaluated in terms of RMSE, plotted against target datasets of different sizes, showing that the fine-tuned ANN performed better than its counterpart without fine-tuning.
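The freezing strategy can be illustrated on a toy network. The following is a minimal sketch, not the architecture of the cited work: it assumes a single tanh hidden layer whose weights `W`, `b` stay frozen, a linear output layer `v`, `c` that is updated by gradient descent, and a hypothetical drifted target process. All names and values are illustrative.

```python
import math

def hidden(x, W, b):
    """Frozen inner layer: tanh features of a scalar input x."""
    return [math.tanh(w * x + bi) for w, bi in zip(W, b)]

def predict(x, W, b, v, c):
    """Network output: linear outer layer on top of the frozen features."""
    return sum(vj * hj for vj, hj in zip(v, hidden(x, W, b))) + c

def mse(data, W, b, v, c):
    """Mean squared error of the network on a list of (x, y) pairs."""
    return sum((predict(x, W, b, v, c) - y) ** 2 for x, y in data) / len(data)

def fine_tune_outer(data, W, b, v, c, lr=0.05, epochs=500):
    """Gradient descent on the outer layer only (v, c); W and b are never modified."""
    v = list(v)
    n = len(data)
    for _ in range(epochs):
        grad_v = [0.0] * len(v)
        grad_c = 0.0
        for x, y in data:
            h = hidden(x, W, b)
            err = sum(vj * hj for vj, hj in zip(v, h)) + c - y
            for j, hj in enumerate(h):
                grad_v[j] += 2.0 * err * hj / n
            grad_c += 2.0 * err / n
        v = [vj - lr * g for vj, g in zip(v, grad_v)]
        c -= lr * grad_c
    return v, c

# Hypothetical source model (weights assumed to come from training on source data).
W, b = [1.0, -0.5, 2.0], [0.0, 0.5, -1.0]
v_src, c_src = [0.8, -0.3, 0.5], 0.1

# Small target dataset from a hypothetical drifted process:
# same inner features, different output mapping.
target = [(x, predict(x, W, b, [1.0, -0.1, 0.4], 0.3))
          for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

v_new, c_new = fine_tune_outer(target, W, b, v_src, c_src)
print(mse(target, W, b, v_src, c_src), mse(target, W, b, v_new, c_new))
```

Because only the linear output layer is trained, the fine-tuning problem in this sketch is a convex least-squares fit, which is one reason why such a strategy can remain stable on very small target datasets.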
In Curreri et al. [10], fine-tuning and hyperparameter adaptation strategies are investigated in a cross-entity setting for different-sized target labeled datasets, in the design of a transferable SS for a Sulfur Recovery Unit (SRU). Experiments are performed using LSTMs and Recurrent Neural Networks (RNNs) to compare their transferability performance. To evaluate the performances of the proposed cross-entity method, results were compared between an optimized SS whose design procedure took 100 h and a transferred one whose transferring procedure took 7 min. In particular, RNN-based transferred SS showed an average degradation of only 8% of the CC on test data, whereas the LSTM-based transferred SS showed the same CC as the originally optimized one.
Some modelling methods, such as Moving Window (MW) and Just-In-Time-Learning (JITL), are known to be effective against gradual and abrupt-recurrent changes in process characteristics. These two methods are adopted in Alakent [74,75], where an adaptive learning framework is proposed to develop an SS able to counteract signal drift. The proposed method retunes the hyperparameters of the algorithm using a historical dataset through a user-controlled weighting parameter, which represents the trade-off between the information extracted from the new target samples and the JITL predictions. The technique is tested on the SS design of a debutanizer column (DC) and an SRU, and its accuracy is evaluated through the average RMSE for the studied cases. For the DC, the transfer procedure reduced the RMSE by 66%, whereas for the SRU the RMSE was reduced by 31% for the first output and by 23% for the second output.
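The two building blocks named above can be sketched generically. The code below is not the adaptive framework of the cited works: it only illustrates, under simplifying assumptions (scalar input, a least-squares line as local model, hypothetical function names and data), how an MW model refits on the most recent samples while a JITL model refits on the historical samples closest to the query.

```python
def fit_line(samples):
    """Least-squares line y = a * x + c through a list of (x, y) samples."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0.0:  # all inputs identical: fall back to the mean output
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    return a, mean_y - a * mean_x

def moving_window_predict(stream, query, window):
    """MW: fit the model on the `window` most recent samples only."""
    a, c = fit_line(stream[-window:])
    return a * query + c

def jitl_predict(history, query, k):
    """JITL: fit a local model on the k historical samples nearest to the query."""
    nearest = sorted(history, key=lambda s: abs(s[0] - query))[:k]
    a, c = fit_line(nearest)
    return a * query + c

# Hypothetical stream with a drift: y = 2x at first, then y = 2x + 5.
stream = [(float(x), 2.0 * x) for x in range(5)]
stream += [(float(x), 2.0 * x + 5.0) for x in range(5)]
print(moving_window_predict(stream, 2.0, window=5))  # follows the recent regime

# Hypothetical nonlinear process: the local line adapts to the query region.
history = [(float(x), float(x * x)) for x in range(5)]
print(jitl_predict(history, 2.0, k=3))
```

The window length and the number of neighbors `k` play the role of the hyperparameters that an adaptive scheme would retune as new target samples arrive.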
A JITL-based model is used again in Guo et al. [46], where the transfer is performed through feature extraction using a Gaussian Mixture Variational Autoencoder (GMVAE). When a new sample is considered, the Kullback-Leibler divergence (D_KL) is adopted to measure its similarity with the historical data samples. Based on the result, weighted input and output historical data are obtained and used in the final model. The method is validated by comparing it, in terms of RMSE, MAE and Mean Relative Error (MRE), with different JITL-based models from the literature. In particular, results on the adopted dataset showed a reduction of 48% in the RMSE and of 60% in the MAE and MRE with respect to a distance-based JITL model.
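The idea of turning a divergence into sample weights can be sketched as follows. This is an illustration under strong assumptions, not the procedure of the cited work: each sample is represented by a univariate Gaussian (mean, standard deviation), the divergence is computed from the query to each historical sample, and the mapping from divergence to weight, exp(-D_KL) followed by normalization, is a hypothetical choice.

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL divergence D_KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2))."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

def similarity_weights(query, history):
    """Normalized weights that decrease as the divergence from the query grows.

    `query` and each element of `history` are (mean, std) pairs; the
    exp(-D_KL) mapping is an assumption made for this sketch.
    """
    raw = [math.exp(-kl_gaussian(query[0], query[1], mu, sigma))
           for mu, sigma in history]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical latent descriptions of the query and of three historical samples.
query = (0.0, 1.0)
history = [(0.1, 1.0), (2.0, 1.0), (0.0, 3.0)]
outputs = [1.0, 5.0, 2.0]  # hypothetical historical output values

weights = similarity_weights(query, history)
weighted_output = sum(w * y for w, y in zip(weights, outputs))
print(weights, weighted_output)
```

Since the KL divergence is not symmetric, the direction in which it is evaluated is itself a design choice that affects the resulting weights.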
Feature-based knowledge transfer methods are investigated, in both cross-phase and cross-entity settings, in Farahani et al. [11,76] for power plant SS design. A Domain Adversarial training Neural Network (DANN) approach is employed to first perform feature extraction and then the actual regression. In particular, the architecture of the DANN consists of three parts: a feature extractor, a regression model and a domain discriminator. The first maps m-dimensional input data into a one-dimensional feature representation. The second maps the latter into the output space, performing the regression task. At the same time, the one-dimensional feature representation is also fed to the domain discriminator, which detects whether the input instances come from the source or the target domain. To reduce the difference between the samples, adversarial training between the feature extractor and the domain discriminator is performed, so that the extracted features become less distinguishable between the source and target domains. In this way, the regression part, trained solely on the source data, can predict target data more accurately, without needing the corresponding labels. To quantify the performance of the TL, an index called Transfer Ratio (TR) is introduced. TR is the target Mean Squared Error (MSE) obtained without TL divided by the target MSE obtained with the TL technique:
$$\text{Transfer Ratio} = \frac{\text{Target MSE without TL}}{\text{Target MSE with TL}}$$
Results showed an average TR of 2.93 between the studied cases in the cross-entity case and an average TR of 1.81 between the studied cases in the cross-phase case.
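The TR defined above is straightforward to compute from predictions on the target domain. The following is a minimal sketch with hypothetical function names and data; a TR above 1 means that TL lowered the target MSE, while a TR below 1 indicates negative transfer.

```python
def mean_squared_error(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def transfer_ratio(mse_without_tl, mse_with_tl):
    """Target MSE without TL divided by target MSE with TL."""
    return mse_without_tl / mse_with_tl

# Hypothetical target-domain data and predictions of two models.
y_target = [1.0, 2.0, 3.0, 4.0]
pred_without_tl = [1.6, 2.6, 2.4, 4.6]  # every error has magnitude 0.6
pred_with_tl = [1.3, 2.3, 2.7, 4.3]     # every error has magnitude 0.3

tr = transfer_ratio(mean_squared_error(y_target, pred_without_tl),
                    mean_squared_error(y_target, pred_with_tl))
print(tr)
```

Read this way, an average TR of 2.93 corresponds to a target MSE with TL of roughly one third of the MSE obtained without TL.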
Besides signal drifts, one of the problems afflicting SS design is the scarcity of labeled data, since process variables are sampled at a higher rate than quality variables. Incremental learning techniques to improve the performance of an SS when few labeled data are available in the target domain are reported in Graziani and Xibilia [77]. The performance of an ANN-based SS for a refinery Sour Water Stripping (SWS) plant is improved by combining a preliminary PCA phase with a data selection procedure based on the DUPLEX and SPXY data selection algorithms. The approach is evaluated in terms of CC and RMSE by comparing a simple random-selection procedure without PCA against the DUPLEX and SPXY algorithms, each combined with PCA. Results showed an average improvement of 13% in the CC and a reduction of 14% in the RMSE for PCA + DUPLEX with respect to the simple random-selection procedure.
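To give an idea of how such a data selection procedure works, the sketch below implements an SPXY-style selection as it is usually described: pairwise distances in the input space and in the output space are each normalized by their maximum and summed, and samples are then picked with a Kennard-Stone max-min rule. It is an illustrative implementation with hypothetical data, without the PCA step, and it is not the code of the cited work.

```python
import math

def spxy_select(X, y, n_select):
    """Return the indices of n_select samples chosen by an SPXY-style rule.

    X is a list of input vectors, y a list of scalar outputs. Both the input
    and the output distances are assumed to have a non-zero maximum.
    """
    n = len(X)
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]
    max_dx = max(max(row) for row in dx)
    max_dy = max(max(row) for row in dy)
    # Joint distance: normalized input distance plus normalized output distance.
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
         for i in range(n)]

    # Start from the two most distant samples.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda pair: d[pair[0]][pair[1]])
    selected = [first, second]

    # Repeatedly add the sample whose nearest selected neighbor is farthest away.
    while len(selected) < n_select:
        remaining = [i for i in range(n) if i not in selected]
        best = max(remaining, key=lambda i: min(d[i][s] for s in selected))
        selected.append(best)
    return selected

# Hypothetical one-dimensional dataset with one isolated sample.
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
y = [0.0, 1.0, 2.0, 3.0, 10.0]
print(spxy_select(X, y, 3))
```

Compared with random selection, a rule of this kind tends to spread the selected samples over both the input and the output ranges, which matters most when only a few labeled samples can be used.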

Conclusions and Future Trends
The use of TL methods for SS design in industrial processes is a growing field of research. The existing literature shows that TL can significantly enhance SS performance, yielding SSs that can handle both cross-phase and cross-entity scenarios. Many questions are still open, however, and need further research to make TL an efficient solution in industrial environments.
The first issue is the lack of a proper strategy for selecting the best transfer method based on the process and dataset characteristics. The second issue is the definition of suitable metrics that allow the applicability of each method to be evaluated and provide an estimate of the performance of the transfer procedure. Another issue is determining the minimum size of the target dataset, in terms of both input and output variables, that guarantees the applicability of a given method. This size is expected to depend on both the process characteristics and the applied method. Moreover, most of the implemented methods are parameter- and feature-based in homogeneous settings. Instance-based methods and heterogeneous settings, as well as relational-based approaches, still need further investigation.
The current applications and methods are still limited; further research is therefore also needed to evaluate the benefits of TL and to compare the different strategies on real-world case studies. Hybrid solutions could also be investigated in the future to merge the advantages of different methods.