A Review of Kernel Methods for Feature Extraction in Nonlinear Process Monitoring

Kernel methods are a class of learning machines for the fast recognition of nonlinear patterns in data. In this paper, the applications of kernel methods for feature extraction in industrial process monitoring are systematically reviewed. First, we describe the reasons for using kernel methods and contextualize them among other machine learning tools. Second, by reviewing a total of 230 papers, this work has identified 12 major issues surrounding the use of kernel methods for nonlinear feature extraction. For each issue, we discuss why it is important and how it has been addressed through the years by many researchers. We also present a breakdown of the commonly used kernel functions, parameter selection routes, and case studies. Lastly, this review provides an outlook into the future of kernel-based process monitoring, which can hopefully instigate more advanced yet practical solutions in the process industries.


Introduction
Process monitoring refers to various methods used for the detection, diagnosis, and prognosis of faults in industrial plants [1,2]. In the literature, the term "fault" has been defined as any unpermitted deviation of at least one process parameter or variable in the plant [3]. Although controls are already in place to compensate for process upsets and disturbances, process faults can still occur [1]. These faults include sensor faults (e.g., measurement bias), actuator faults (e.g., valve stiction), fouling, loss of material, drifting reaction kinetics, pipe blockages, etc. Fault detection, diagnosis, and prognosis methods aim to, respectively, determine the presence, identify the cause, and predict the future behavior of these process anomalies [2,4]. Thus, process monitoring is a key layer of safety for maintaining the efficient and reliable operation of industrial plants [5].
In general, process monitoring can be performed using either a physics-driven, knowledge-driven, or data-driven approach (see Figure 1) [1,6]. Among these, the data-driven approach may be preferred for the following reasons. Physics-driven methods rely on a first-principles model of the system, i.e., mass-and-energy balances and physical/chemical principles, which is used to check how well the theory agrees with the observed plant data. However, these models are difficult to construct given the complexity of modern industrial plants [6]. Similarly, knowledge-driven methods rely on expert knowledge and the experience of plant operators to judge process conditions, but a comprehensive knowledge base is likewise difficult to acquire and maintain in practice.
The popularity of data-driven multivariate statistical process monitoring (MSPM) methods has increased in the past few decades, especially with the advent of the Industry 4.0 era. Applications of machine learning [9][10][11], Big Data [12,13], artificial intelligence (AI) [14], and process data analytics [15,16] to the process systems engineering (PSE) field are now gaining acceptance. Deep neural nets, support vector machines, fuzzy systems, principal components analysis, k-nearest neighbors, K-means clustering, etc., are now being deployed to analyze plant data, generate useful information, and translate results into key operational decisions. For instance, Patwardhan et al. [17] recently reported real-world applications of these methods for predictive maintenance, alarm analytics, image analytics, and control performance monitoring, among others. Applications of MSPM methods to an industrial-scale multiphase flow facility at Cranfield University have also been reported in [18,19]. New methods are still being developed within the machine learning and AI community, and so are their applications in PSE. This means that it may be difficult to select which data-driven methods to use. Nevertheless, chemical engineers can apply their domain expertise to match the right solutions to the right engineering problems.
Despite the benefits of data-driven techniques, it is still challenging to use them for process monitoring due to many issues that arise in practice. One key issue that is highlighted in this paper is the fact that real-world systems are nonlinear [20]. More precisely, the relationships between the process variables are nonlinear. For example, pressure drop and flow rate have a squared relationship according to Bernoulli's equation, outlet stream temperature and composition in a chemical reactor are nonlinearly related due to complex reaction kinetics, and so on. These patterns must be learned and taken into account in the statistical models. If the analysis of data involves linear methods alone, fault detection may be inaccurate, yielding many false alarms and missed alarms. Note, however, that linear methods can still be applied provided that the plant conditions are kept sufficiently close to a single operating point. This is because a first-degree (linear) Taylor series approximation of the variable relationships can be assumed close to a fixed point. Linear methods are attractive because they rely only on simple linear algebra and matrix theory, which are elegant and computationally accessible. However, if the plant is operating over a wide range of conditions, the resulting nonlinear dynamic behavior must be addressed with more advanced techniques.
Kernel methods or kernel machines are a class of machine learning methods that can be used to handle the nonlinear issue. The main idea behind kernel methods is to pre-process the data by projecting them onto higher-dimensional spaces where linear methods are more likely to be applicable [21]. Thus, kernel methods can discover nonlinear patterns from the data while retaining the computational elegance of matrix algebra [22]. In the process monitoring context, kernel learning is mostly used in the feature extraction step of the analysis of plant data. In this paper, we review the applications of kernel methods for feature extraction in nonlinear process monitoring.
In detail, the objectives of this review are: (1) To motivate the use of kernel methods for process monitoring; (2) To identify the issues regarding the use of kernel methods to perform feature extraction for nonlinear process monitoring; (3) To review the literature on how these issues were addressed by researchers; and (4) To suggest future research directions on kernel-based process monitoring. This work is mainly dedicated to the review of kernel-based process monitoring methods, which, to the best of the authors' knowledge, has not appeared before. Other related reviews that may be of interest to the reader are also available, as listed in Table 1, along with their relationship to this paper.
This review paper is timely for two reasons. First, the earliest kernel feature learner, kernel principal components analysis (KPCA), was proposed by Bernhard Schölkopf in a 1998 paper [22], together with Alexander Smola and Klaus-Robert Müller. KPCA paved the way for further kernel extensions of linear machines, known today as kernel methods. For his contributions, Schölkopf was awarded the Körber Prize in September 2019, which is "the scientific distinction with the highest prize money in Germany" [23]. This recognition highlights the impact kernel methods have made on the field of data analytics, and the purpose of this paper is to showcase this impact in the process monitoring field. Second, Lee et al. [24] were the first to use KPCA for nonlinear process monitoring in 2004. Hence, this paper is timely as it reviews the development of kernel-based process monitoring research over the 15 years since this first application. This paper is organized as follows. In Section 2, we first motivate the use of kernel methods and situate them among other machine learning tools. Section 3 provides the methodology on how the literature review was conducted, together with a brief summary of the review results. The main body of this paper is Section 4, where we detail the issues surrounding the use of kernel methods in practice and the many ways researchers have addressed them through the years. A future outlook on this area of research is given in Section 5. Finally, the paper is concluded in Section 6.

Motivation for Using Kernel Methods
To motivate the use of kernel methods, we first discuss how a typical data-driven fault detection framework works (see Figure 2). A plant data set for model training usually consists of N samples of M variables collected at normal operating conditions. This data is normalized so that the analysis is unbiased to any one variable, i.e., all variables are treated equally. Firstly, the data set undergoes a feature extraction step. We refer to feature extraction as any method of transforming the data in order to reveal a reduced set of mutually independent signals, called features, that are most sensitive to process faults. In Figure 2, this step is carried out by multiplying a projection matrix of weight vectors to a vector of samples, x_k, at the kth instant. Secondly, a statistical index is built from the features, which serves as a health indicator of the process. The most commonly used index is Hotelling's T², which is computed as shown in the figure as well. Finally, the actual anomaly detector is trained by analyzing the distribution of T². In this step, the aim is to find an upper bound or threshold on the normal T² values, called the upper control limit or UCL. This threshold is based on a user-defined confidence level, e.g., 95%, which represents the fraction of the area under the distribution of T² that lies below the UCL. During the online phase, an alarm is triggered whenever the computed T² exceeds the fixed T²_UCL, signifying the presence of a fault.
When a fault is detected, fault diagnosis is usually achieved by identifying the variables with the largest contributions to the value of T² at that instant. Lastly, fault prognosis can be performed by predicting the future evolution of the faulty variables or of the T² index itself.
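The framework above can be illustrated with a minimal sketch in Python, using a simple PCA projection for the feature extraction step; the function names are our own, and the empirical 95% quantile of the training T² values stands in for the UCL estimation step:

```python
import numpy as np

def train_t2_monitor(X, n_components=2, confidence=0.95):
    """Fit a PCA-based T^2 monitor on normal operating data X (N samples x M variables)."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    Xn = (X - mean) / std                       # normalize so all variables are treated equally
    C = np.cov(Xn, rowvar=False)                # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:n_components]
    W, lam = eigvecs[:, order], eigvals[order]  # projection matrix W_n and retained variances
    feats = Xn @ W                              # features f_k for every training sample
    t2 = np.sum(feats**2 / lam, axis=1)         # Hotelling's T^2 per sample
    ucl = np.quantile(t2, confidence)           # empirical upper control limit (UCL)
    return dict(mean=mean, std=std, W=W, lam=lam, ucl=ucl)

def t2_score(model, x):
    """T^2 of a new sample x; an alarm is raised whenever it exceeds model["ucl"]."""
    f = ((x - model["mean"]) / model["std"]) @ model["W"]
    return float(np.sum(f**2 / model["lam"]))
```

During the online phase, `t2_score(model, x_k) > model["ucl"]` signals the presence of a fault at instant k.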

Feature Extraction Using Kernel Methods
Among the three basic steps in Figure 2, feature extraction is found to have the greatest impact on process monitoring performance. Even in other contexts, feature engineering is regarded as the one aspect of machine learning that is domain-specific and, hence, requires creativity from the user [39,40]. As such, traditional MSPM methods mainly differ in how the weight vectors are obtained. Weights can be computed via principal components analysis (PCA), partial least squares (PLS), independent components analysis (ICA), Fisher/linear discriminant analysis (FDA or LDA), or canonical correlation analysis (CCA) [1]. However, only a linear transformation of the data is involved in these methods. Mathematically, a linear transformation can be written as:

f_k = W_n^T x_k, (1)

where W_n ∈ R^(M×n) is the projection matrix, f_k ∈ R^n are the features, and x_k ∈ R^M is the normalized raw data at the kth instant. For the case of PCA, W can be computed by diagonalizing the sample covariance matrix, C = cov(x_k, x_k), as [1]:

C = V Λ V^T, W = V, (2)

where V contains the eigenvectors, with the corresponding eigenvalues in Λ. Only the first n columns of W are taken to finally yield W_n. The weights from PCA are orthogonal basis vectors that describe directions of maximum variance in the data set [1]. In order to generate nonlinear features, a nonlinear mapping, φ(x), can be used to transform the data, so that Equation (1) becomes:

f_k = W_n^T φ(x_k). (3)

However, the mapping φ(·) is unknown and difficult to design. In 1998, Schölkopf et al. [22] proposed to replace the sample covariance matrix, C = cov(φ(x_k), φ(x_k)), by a kernel matrix K, with elements K_ij = k(x_i, x_j) computed by a kernel function, k(·, ·). They showed that if the kernel function satisfies certain properties, it can act as a dot product in the feature space. That is, K can take the role of a covariance matrix of nonlinear features.
By adopting a kernel function, the need to specify φ(·) is avoided, a realization that has been termed the kernel trick [22]. The result is a method called kernel principal components analysis (KPCA) [22], a nonlinear learner trained by merely solving the eigenvalue decomposition of the kernel matrix K, analogous to Equation (2). As mentioned in Section 1, KPCA was the first kernel method applied to process monitoring as a feature extractor [24].
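As a sketch of this training procedure, the following Python fragment builds an RBF kernel matrix, centers it (the counterpart of mean-centering φ(x) in the feature space), and eigendecomposes it. The function names and the kernel width are our own illustrative choices:

```python
import numpy as np

def rbf_kernel(X, Y, c):
    """Gaussian RBF kernel matrix with entries exp(-||x_i - y_j||^2 / c)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / c)

def kpca_train(X, n_components, c=1.0):
    """KPCA training: eigendecompose the centered kernel matrix
    instead of the (unavailable) covariance of the mapped data phi(x)."""
    N = X.shape[0]
    K = rbf_kernel(X, X, c)
    H = np.eye(N) - np.ones((N, N)) / N     # centering matrix
    Kc = H @ K @ H                          # implicitly mean-centers phi(x) in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, A = eigvals[order], eigvecs[:, order]
    A = A / np.sqrt(lam)                    # normalize the expansion coefficients
    feats = Kc @ A                          # nonlinear features of the training samples
    return feats, A, lam
```

Note that the mapping φ(·) never appears explicitly; only kernel evaluations between pairs of samples are required.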
Upon using kernel methods, the nonlinear transformation is now equivalent to [22]:

f_k = W^T k(X, x_k), with k(X, x_k) = [k(x_1, x_k), ..., k(x_N, x_k)]^T, (4)

where W = [w_1, ..., w_n] contains the column weight vectors, f_k ∈ R^n are the features, x_k ∈ R^M is the new data to be projected, X ∈ R^(N×M) is the training data set, and k(·, ·) is the kernel function. The kernel function is responsible for projecting the data onto high-dimensional spaces where, according to Cover's theorem [21], the features are more likely to be linearly separable. This high-dimensional space is known in functional analysis as a Reproducing Kernel Hilbert Space (RKHS) [22]. The usual choices of kernel functions found in this review are as follows:

Gaussian radial basis function (RBF): k(x, y) = exp(−‖x − y‖² / c), (5)

Polynomial kernel (POLY): k(x, y) = (a⟨x, y⟩ + b)^d, (6)

Sigmoid kernel (SIG): k(x, y) = tanh(a⟨x, y⟩ + b), (7)

where a, b, c, d are kernel parameters to be determined by various selection routes.
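The three kernel functions above can be written compactly as follows (a minimal sketch; the parameter naming convention varies across the surveyed papers, and the defaults here are purely illustrative):

```python
import numpy as np

def rbf(x, y, c=1.0):
    """Gaussian RBF kernel: k(x, y) = exp(-||x - y||^2 / c)."""
    return np.exp(-np.sum((x - y) ** 2) / c)

def poly(x, y, a=1.0, b=1.0, d=2):
    """Polynomial kernel: k(x, y) = (a <x, y> + b)^d."""
    return (a * np.dot(x, y) + b) ** d

def sig(x, y, a=1.0, b=0.0):
    """Sigmoid kernel: k(x, y) = tanh(a <x, y> + b)."""
    return np.tanh(a * np.dot(x, y) + b)
```

Each function takes two samples and returns a scalar similarity, which is how the kernel matrix entries K_ij are populated.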
To understand what happens in the kernel mapping, Figure 3 shows three sample data sets and their projections in the kernel feature space. The red and blue data points belong to different classes, and evidently, it is impossible to separate them by a straight line in the original data space. However, after a kernel transformation onto a higher-dimensional space, it becomes possible to separate them using a linear plane (white contour), which translates to a nonlinear boundary in the original space. In these examples, an RBF kernel, Equation (5), with various values of c was used, and the transformation was computed using support vector machines (SVM). More theoretical details on kernel methods, KPCA, and SVM can be found in other articles [22,41,42].

Figure 3. Illustration of kernel nonlinear transformation. Generated with code available at https://uk.mathworks.com/matlabcentral/fileexchange/65232-binary-and-multi-class-svm.

Kernel Methods in the Machine Learning Context
Aside from kernel methods, other tools from machine learning can also be applied to process monitoring. Figure 4 gives an overview of learning methods that are relevant to process monitoring, from the authors' perspective. Each method in this figure represents a body of associated techniques, and so the reader can search using these keywords to learn more. More importantly, the methods that were marked with an asterisk (*) have a "kernelized" version, and so they belong to the family of kernel methods. To kernelize means to apply the kernel trick to a previously linear machine. For example, PCA becomes Kernel PCA, Ridge Regression becomes Kernel Ridge Regression, K-means clustering becomes Kernel K-means, and so on. All these methods were developed to solve a particular learning problem or learning task, such as classification, regression, clustering, etc.
Supervised and unsupervised learning are the two main categories of learning tasks (although semi-supervised, reinforcement, and self-supervised learning categories also exist [9,11,46]). According to Murphy [47], learning is supervised if the goal is to learn a mapping from inputs to outputs, given a labeled set of input-output pairs. On the other hand, learning is unsupervised if the goal is to discover patterns from a data set without any label information. In the context of process monitoring, examples of learning problems under each category can be listed as follows:

Supervised learning

Classification: Given data samples labeled as normal and faulty, find a boundary between the two classes; or, given samples from various fault types, find a boundary between the different types.

Regression: Given samples of regressors (e.g., process variables) and targets (e.g., key performance indicators), find a function of the former that predicts the latter; or, find a model for predicting the future evolution of process variables, whether at normal or faulty conditions.

Unsupervised learning

Dimensionality reduction, clustering, and density estimation: Given unlabeled plant data, discover a reduced set of features, natural groupings of samples, or the underlying probability distribution, respectively.

In relation to the framework in Figure 2, one possible correspondence would be the following: (1) Use dimensionality reduction or clustering for feature extraction; (2) Use density estimation for threshold setting; (3) Use classification for diagnosis; and (4) Use regression for prognosis and other predictive tasks. It is clear from Figure 4 that kernel methods can participate in any stage of the process monitoring procedure, not just in the feature extraction step. In fact, many existing frameworks already used kernel support vector machines (SVM) for fault classification, kernel density estimation (KDE) for threshold setting, etc. We also note that many other alternatives to kernel methods can be used to perform each learning task. For instance, an early nonlinear extension of PCA for process monitoring was based on principal curves and artificial neural networks (ANN) by Dong and McAvoy [48] in 1996. Even today, ANNs are still a popular alternative to kernel methods.

Relationship between Kernel Methods and Neural Networks
Neural networks are attractive due to their universal approximation property [49]; that is, they can theoretically approximate any function to an arbitrary degree of accuracy [45]. Both ANNs and kernel methods can be used for nonlinear process monitoring. However, one important difference between them is computational. Kernel methods such as KPCA are faster to train (see Section 2.1), whereas ANNs require an iterative training process (i.e., gradient descent) because of the need to solve a nonlinear optimization problem [44]. During the online phase, however, kernel methods may be slower, since they need to store a copy of the training data in order to make predictions for new test data (see Equation (4)) [45]. In ANNs, once the parameters have been learned, the training data set can be discarded [45]. Thus, kernel methods have issues with scalability. Another distinction is provided by Pedro Domingos in his book The Master Algorithm [50] in terms of learning philosophy: if ANNs learn by mimicking the structure of the brain, kernel methods learn by analogy. Indeed, the reason kernel methods need to store a copy of the training data is so that they can compute the similarity between any test sample and the training samples. The similarity measure is provided by the kernel function, k(·, ·) [44]. However, selecting a kernel function is also a long-standing issue. Later on, this review includes a survey of the commonly used kernel functions for process monitoring.
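To make the storage point concrete, the following kernel ridge regression sketch (a representative kernel machine, not a method from the reviewed papers) shows that the trained model necessarily carries the training inputs, since every prediction is a similarity-weighted sum over them; the RBF kernel and the regularization value are illustrative:

```python
import numpy as np

def krr_fit(X, y, c=1.0, reg=1e-4):
    """Kernel ridge regression fit: solve (K + reg*I) alpha = y."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / c)                      # RBF kernel matrix on the training inputs
    alpha = np.linalg.solve(K + reg * np.eye(len(X)), y)
    return X, alpha                          # the training inputs X stay inside the model

def krr_predict(model, x_new, c=1.0):
    """Prediction = similarity-weighted sum over all stored training samples."""
    X, alpha = model
    k = np.exp(-np.sum((X - x_new) ** 2, axis=1) / c)  # k(x_i, x_new) for every stored x_i
    return float(k @ alpha)
```

In contrast, a trained ANN would keep only its learned weights, so its prediction cost does not grow with the size of the training set.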
Despite the many distinctions between kernel methods and ANNs, neither is clearly superior to the other. Presently, many of the drawbacks of each are being addressed, and their unique benefits are being enhanced. Moreover, the two approaches are connected in some ways, as explained in [45]. For instance, the nonlinear kernel transformation in Equation (4) can be interpreted as a two-layer network [51]: the first layer corresponds to x_k → k(X, x_k), while the second layer corresponds to k(X, x_k) → f_k with weights w_i.
ANNs have found success in many areas, especially in computer vision where deep ANNs [52] have reportedly surpassed human-level performance for image recognition [53]. Opportunities for applying deep ANNs to the field of PSE were also given in [9]. Meanwhile, kernel methods were shown to have matched the accuracy of deep ANNs for speech recognition [54]. In the real world, kernel methods have been applied successfully to wind turbine performance assessment [55], machinery prognostics [56], and objective flow regime identification [57], to name a few.
In the AI community, methods that combine kernel methods with deep learning are now being developed, such as neural kernel networks [58,59], deep neural kernel blocks [60], and deep kernel learning [61,62]. A soft sensor based on deep kernel learning was recently applied to a polymerization process [63]. Based on these recent advances, Wilson et al. [62] concluded that the relationship between kernel methods and deep ANNs should not be competitive, but rather, complementary. Perhaps a more forward-looking claim is that of Belkin et al. [51], who said that "in order to understand deep learning we need to understand kernel learning". Therefore, kernel methods are powerful and important machine learning tools that are worthwhile to consider in practice.

Methodology and Results Summary
Having motivated the importance of kernel methods in the previous section, the rest of the paper is dedicated to a review of their applications to process monitoring.

Methodology
The scope of this review is limited to the applications of kernel methods in the feature extraction step of process monitoring. This is because we are after the important issues in feature extraction that may drive future research directions. Papers that used kernelized MSPM tools such as kernel PCA, kernel ICA, kernel PLS, kernel FDA, kernel SFA, kernel CCA, kernel LPP, kernel CVA, etc., were included, although their details are not given here. Meanwhile, papers that used kernel methods in other stages of process monitoring (e.g., SVMs for fault classification, Gaussian processes (GP) for fault prediction, and KDE for threshold setting) may also appear, but these are not the main focus. Moreover, this review only includes papers with industrial process case studies, such as the Tennessee Eastman Plant benchmark. A review of the literature on the condition monitoring of electro-mechanical systems (e.g., rotating machinery) can be found elsewhere [64,65]. Interested practitioners are also referred to Wang et al.
The keywords used for searching were "kernel and fault". Keywords such as "monitoring", "detection", and "diagnosis" were not used because not all intended papers contain these words in the text. From the search results, only the papers that fit the aforementioned scope were included; 155 papers were found this way. Selected papers from other journals and conference proceedings were also found by following citations forwards and backwards. However, a comprehensive search is not guaranteed. The entire search process was performed in October 2019, and hence, only works published up to this time were found. In the end, a total of 230 papers were included in this review. Figure 5 shows the distribution of the reviewed papers by year of publication. The overall increasing trend in the number of papers indicates that kernel-based feature extraction is being adopted by more and more process monitoring researchers.
Figure 6a then shows the most commonly used kernelized feature extractors for nonlinear process monitoring. Kernel PCA is the most widely used, followed by kernel PLS, kernel ICA, kernel FDA, kernel CVA, and so on. The widespread use of kernel PCA can be attributed to the fact that linear algorithms can be kernelized by performing kernel PCA followed by the linear algorithm itself. For instance, kernel ICA is equivalent to kernel PCA + ICA [66]. Likewise, kernel CVA can be performed as kernel PCA + CVA [67]. Hence, kernel PCA is cited more frequently than other techniques.
In the reviewed papers, application case studies were also used for evaluating the effectiveness of the proposed kernel methods for process monitoring. Figure 6b shows the breakdown of papers according to the type of case study they used: simulated or real-world. As shown, only 27% of the papers indicated the use of at least one real-world data set, taken from either industrial processes or laboratory experiments. The rest of the papers used simulated data sets alone for testing. The Tennessee Eastman Plant (TEP) is found to be the most commonly used simulated case study. It may still be advantageous to use simulated case studies, since the characteristics of the simulated data are usually known or can be built into the simulator. Hence, the user can highlight the strengths of a particular method by its ability to handle certain data characteristics. Another advantage of using simulated data is that tests can be repeated many times by performing Monte Carlo simulations. Nevertheless, the ultimate goal should still be to assess the proposed methods on real-world data. For instance, in a paper by Fu et al. [68], kernel PCA and kernel PLS were applied to three different real-world data sets: two from the chemical process industry and one from a laboratory mixing experiment. Among the chemical processes is a butane distillation system. Vitale et al. [69] also used real-world data sets from the pharmaceutical industry to test kernel methods. Results from these examples demonstrate that handling the nonlinear issue is important for monitoring real-world industrial processes.

Results Summary
Many issues arise in the application of kernel methods for nonlinear process monitoring. After a careful study of the papers, 12 major issues were identified, as listed in Table 2. The table includes the number of papers that addressed each of them. Although some of these issues are not unique to kernel methods alone, we review them within the context of kernel-based feature extraction. The bulk of this paper is devoted to the discussion of these issues.
A list of all the reviewed papers is then given in Table 3. For each paper, the table shows the kernelized method, the case studies, and the kernel functions used and, more importantly, the issues addressed. The purpose of this table is to help the reader choose a specific issue of interest (A to L) and peruse down its column for papers that addressed it. In the column on case studies, the real-world or industrial applications are highlighted in bold. The reader is referred to the Appendix for the list of all abbreviations in this table.

Review Findings
In this section, the major issues on kernel-based process monitoring, as identified and presented in Table 2, are discussed one by one. We first motivate why they are important and then give examples of how they were addressed by many researchers through the years.

Batch Process Monitoring
Monitoring batch processes is important so as to reduce batch-to-batch variability and maintain the quality of products [70]. The first application of kernel PCA to process monitoring was in a continuous process [24], wherein the plant data set is a matrix of M variables × N samples (2-D) (see Section 2). In contrast, for a batch process, the plant data set is a tensor of K batches × M variables × N samples (3-D) and, hence, must be handled differently. A multi-way approach is commonly adopted, where the tensor data is unfolded into matrix data, either variable-wise or batch-wise, so that kernel MSPM methods can be applied. This led to multi-way kernel PCA [71], multi-way kernel ICA [72,73], multi-way kernel FDA [74,75], and so on. Variable correlation analysis (VCA) and its kernelized version were also proposed for batch process monitoring in [76,77]. Common batch process case studies include the fed-batch fermentation process for producing penicillin (PenSim), available as a simulation package from Birol et al. [78], the hot strip mill process (HSMP) as detailed in [79], the injection moulding process (IMP) [80], and other pharmaceutical processes [69,81].
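The two unfolding schemes can be sketched as follows, assuming the K batches × M variables × N samples ordering described above (the array names and sizes are illustrative):

```python
import numpy as np

# Toy batch data set: K = 3 batches, M = 4 variables, N = 100 time samples.
K, M, N = 3, 4, 100
tensor = np.random.default_rng(0).normal(size=(K, M, N))

# Batch-wise unfolding: each batch becomes one long row (K x MN),
# preserving the within-batch trajectories.
batch_wise = tensor.reshape(K, M * N)

# Variable-wise unfolding: time samples from all batches are stacked (KN x M),
# preserving the variable dimension.
variable_wise = tensor.transpose(0, 2, 1).reshape(K * N, M)
```

After either unfolding, the resulting 2-D matrix can be passed to a kernel MSPM method exactly as continuous-process data would be.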
If batch data sets have uneven lengths, the trajectories must be synchronized prior to analysis. Dynamic time warping (DTW) is one technique to handle this issue, as adopted by Yu [75] and Rashid and Yu [82]. Another problem is related to the multi-phase characteristic of batch process data. Since a whole batch consists of steady-state and transition phases, each phase must be modelled differently. Phase division has been employed to address this issue, as done by Tang et al. [77] and Peng et al. [83]. In all these studies, the RBF and POLY kernels were mostly used to generate nonlinear features for process monitoring. In particular, Jia et al. [84] found, using a genetic algorithm (GA), that the POLY kernel is optimal for the PenSim case study. We refer the reader to the reviews by Yao and Gao [297] and Rendall et al. [298] for more information on batch process data analytics beyond the application of kernel methods.

Dynamics, Multi-Scale, and Multi-Mode Monitoring
Recall that in the framework of Figure 2, a column vector of samples at instant k is used to generate the statistical index for that instant. This scheme is merely static, however. It does not account for the trends and dynamic behaviors of the plant in the statistical model. Dynamic behaviors manifest in the data as serial correlations or trends at multiple time scales, which can arise from varying operating conditions. It is important to address both the nonlinear and dynamic issues, as doing so can improve the accuracy of fault detection significantly [25].
To address dynamics, features must be extracted from time-windows of data samples (lagged samples) rather than from sample vectors at one instant only. Dynamic extensions of kernel PCA [85,96,115,116,260], kernel PLS [101], and kernel ICA [66] have used this approach. In addition, some MSPM tools are inherently capable of extracting dynamic features effectively, such as canonical variate analysis (CVA) [299], slow feature analysis (SFA) [300], and dynamic latent variable (DLV) models. Kernel CVA is the kernelized version of CVA and has been used in many works [67,166,172,177,178,223,224,281,290,291]. Meanwhile, kernel slow feature analysis has appeared in [174,215,216,259], and more recently, the kernel dynamic latent variable model was proposed in [225]. The details of kernel CVA, kernel SFA, and kernel DLV can be found in these references. For mining the trends in the data at multiple time scales, wavelet analysis is commonly used. Multi-scale kernel PCA was first proposed by Deng and Tian [91], followed by similar works in [94,95,134,169,210], which include multi-scale kernel PLS and multi-scale kernel FDA. A wavelet kernel was also proposed by Guo et al. [137], which was applied to the Tennessee Eastman Plant (TEP).
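The time-window (lagged-sample) construction can be sketched as follows; the function name and window layout are illustrative:

```python
import numpy as np

def lagged_matrix(X, lags):
    """Augment each sample with its previous `lags` samples so that serial
    correlations become visible to an otherwise static feature extractor."""
    N, M = X.shape
    rows = []
    for k in range(lags, N):
        # window [x_k, x_{k-1}, ..., x_{k-lags}] flattened into one row
        rows.append(X[k - lags:k + 1][::-1].reshape(-1))
    return np.array(rows)
```

The resulting matrix has M × (lags + 1) columns, and a static method such as KPCA applied to it effectively models the process dynamics within each window.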
Multi-modality is a related issue found in processes that are designed to work at multiple operating points [38]. Figure 7 shows an example of a data set taken from the multiphase flow facility at Cranfield University [18], which exhibits multi-modality in the air flow measurements. The challenge is to distinguish whether transitions in the data are due to a change in operating mode or due to a fault. If this issue is not addressed, changes in operating mode will trigger false alarms [38]. To address this issue, Yu [75] used k-nearest neighbors to classify the data prior to performing localized kernel FDA for batch process monitoring. Meanwhile, Khediri et al. [131] used kernel K-means clustering to identify the modes, and then support vector data description (SVDD) to detect faults in each cluster. Other ways to identify modes include a kernel Gaussian mixture model [136], hierarchical clustering [139,142], and kernel fuzzy C-means [199,234]. More recently, Tan et al. [295,296] proposed a new kernel design, called the non-stationary discrete convolution kernel (NSDC), for multi-mode monitoring (see Section 4.7). The NSDC kernel was found to yield better detection performance than the RBF kernel on the multiphase flow facility data [18].
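A minimal sketch of mode identification followed by per-mode limit setting is given below; a plain k-means and a simple distance-based index stand in for the kernelized clustering and monitoring statistics used in the cited works, and all names are our own:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means for identifying operating modes in the training data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

def per_mode_limits(X, labels, k, confidence=0.95):
    """Fit a separate control limit in each identified mode, so that a mode
    transition is not mistaken for a fault."""
    limits = []
    for j in range(k):
        Xj = X[labels == j]
        d = ((Xj - Xj.mean(axis=0)) ** 2).sum(axis=1)  # simple distance index per sample
        limits.append(np.quantile(d, confidence))
    return limits
```

Online, a new sample is first assigned to its nearest mode, and only the limit of that mode is used to decide whether to raise an alarm.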

Fault Diagnosis in the Kernel Feature Space
Diagnosis is a key process monitoring task. When a fault is detected in the plant, it is imperative to determine where it occurred, what type of fault it is, and how large its magnitude is. The issue is that when nonlinear feature extraction is employed, fault diagnosis becomes more difficult to perform.

Diagnosis by Fault Identification
The usual practice is to first identify the faulty variables based on their influence on the value of the statistical index. This scheme is called fault identification. It is beneficial to identify which variables are associated with the fault, especially when the plant is highly integrated and the number of process variables is large [1]. There are two major ways to perform fault identification: variable contributions and variable reconstructions. Variable contributions are computed by taking the first-order Taylor series expansion of the statistical index to reveal which variables contribute the most to its value [87]. In the other approach, each variable is reconstructed in terms of the remaining variables to estimate the fault magnitude (the amount of reconstruction) along that direction [117]. Hence, the variables with the largest reconstruction amounts are associated with the fault. Results can be visualized in contribution plots or contribution maps [301] to convey the diagnosis.
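For the linear case, a contribution-style decomposition of T² can be sketched as follows (this is one common decomposition, chosen here because the per-variable contributions sum exactly to T²; the surveyed papers use several variants):

```python
import numpy as np

def t2_contributions(x_n, W, lam):
    """Per-variable contributions to T^2 for a linear PCA model.
    x_n: normalized sample (M,), W: projection matrix (M x n), lam: retained variances (n,).
    Writing T^2 = sum_i f_i^2 / lam_i with f = W^T x_n, the contribution of
    variable j is x_j * sum_i (f_i / lam_i) * w_ji, so the contributions sum to T^2."""
    f = x_n @ W
    return x_n * (W @ (f / lam))
```

Plotting these values as a bar chart gives the familiar contribution plot; the variables with the largest bars are implicated in the fault.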
Fault identification is straightforward if the feature extraction involves only a linear machine. For kernel methods, however, it is complicated by the fact that the data went through a nonlinear mapping. Both approaches entail differentiating the statistical index, which is difficult if the chain rule involves a kernel function [86]. Nevertheless, many researchers have derived analytical expressions for either kernel contributions-based diagnosis [66,79,81,83,87,94,119,127,133,136,146,150,156,157,162,164,194,213,241,268,275,276,278,279,288,289,293] or kernel reconstructions-based diagnosis [86,117,140,155,161,163,176,217,236,254,265,285]. However, most derivations are applicable only when the kernel function is the RBF, Equation (5). More generally, Tan and Cao [251] proposed a deviation contribution plot that performs fault identification for any nonlinear feature extractor.
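For intuition, here is a minimal sketch of contribution-based fault identification in the linear PCA case (the data, the injected fault, and the particular contribution definition cont_j = Σ_a (t_a/λ_a) p_{a,j} x_j are illustrative assumptions; the kernelized derivations cited above generalize this differentiation step to the feature space):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # training data at normal operation
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each variable

# PCA model: eigendecomposition of the covariance matrix
eigvals, P = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]
a = 3                                     # number of retained components

def t2_contributions(x):
    """Per-variable contributions to the T^2 index (one common definition)."""
    t = P[:, :a].T @ x                    # scores of the test sample
    return np.array([(t / eigvals[:a] * P[j, :a]).sum() * x[j]
                     for j in range(len(x))])

x_fault = X[0].copy()
x_fault[2] += 8.0                         # inject a sensor bias on variable 3
cont = t2_contributions(x_fault)
print(np.argmax(np.abs(cont)))            # index of the dominant contributor
```

A bar chart of `cont` would be the usual contribution plot; the biased variable should stand out.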

Diagnosis by Fault Classification
The fault identification approach assumes that no prior fault information is available for making a diagnosis. If fault information is available, then the learning problem becomes that of finding the boundary between normal and faulty samples, or between different fault types, within the feature space (see Section 2.2). This learning problem pertains to fault classification, and the three common approaches are similarity factors, discriminant analysis, and SVMs.
The similarity factor method (SFM) was proposed by Krzanowski [302] to measure the similarity of two data sets using PCA. For fault classification, the idea is to compute the similarity between the test samples against a historical database of fault samples, and find the fault type that is most similar. A series of works by Deng and Tian [91,95,148] used SFM for diagnosis, after performing multi-scale KPCA for fault detection. Ge and Song [303] also proposed the ICA similarity factor, although it was not performed in a kernel feature space. SFM was also applied to features derived from kernel slow feature analysis (SFA) [175] and serial PCA [257].
Discriminant analysis, notably Fisher discriminant analysis (FDA), is a linear MSPM method that transforms the data as in Equation (1), where the weights are obtained by maximizing the separation of samples from different classes while minimizing the scatter within each class [1]. This means that the features generated by FDA are discriminative in nature. Kernel FDA, its nonlinear extension, has been used extensively, such as in [74,75,80,92,98,102,103,105,118,130,151,169,175,183,195,204,222,232,238,258,266,294]. One variant of FDA is exponential discriminant analysis (EDA), which solves the singularity problem in the FDA covariance matrices by taking their exponential forms [281,283]. Another variant is scatter-difference-based discriminant analysis (SDA), whose kernel version first appeared in [99], and then in [104,124]. SDA differs from FDA in that the difference of the between-class and within-class scatter matrices is maximized rather than their ratio, which avoids any matrix inversion or singularity problems [99]. Lastly, a kernel PLS discriminant analysis variant was used for batch process monitoring in [69].
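To make the scatter-ratio objective concrete, here is a minimal sketch of linear FDA on synthetic two-class data (the data, class means, and dimensionality are illustrative assumptions; kernel FDA applies the same eigenproblem to kernel-mapped features):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(100, 3))   # class 1: normal samples
X1 = rng.normal(2.0, 1.0, size=(100, 3))   # class 2: faulty samples
mean = np.vstack([X0, X1]).mean(axis=0)

# Within-class and between-class scatter matrices
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
Sb = (np.outer(X0.mean(0) - mean, X0.mean(0) - mean)
      + np.outer(X1.mean(0) - mean, X1.mean(0) - mean))

# FDA weights solve the generalized eigenproblem Sb w = lambda Sw w,
# i.e., maximize between-class over within-class scatter
vals, vecs = eigh(Sb, Sw)
w = vecs[:, -1]                            # leading discriminant direction

s0, s1 = X0 @ w, X1 @ w
separation = abs(s0.mean() - s1.mean()) / np.sqrt(s0.var() + s1.var())
print(round(separation, 2))                # classes are well separated in 1-D
```

The projected scores `s0` and `s1` are the discriminative features on which a simple threshold or classifier can operate.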
SVM is a well-known method of choice for classification in machine learning, originally proposed by Cortes and Vapnik [304]. It is also regarded as the most popular kernel method, according to Domingos [50], although he also advocates that simpler classifiers (e.g., kNN) be tried before SVM [40]. In this regard, Zhang [106,305] used SVM on kernel PCA and kernel ICA features to perform diagnosis. Xu and Hu [121] and Xiao and Zhang [203] used a similar approach for classification, but also employed multiple kernel learning [306]. Meanwhile, Md Nor et al. [232] used SVM on the features from multi-scale kernel FDA. Aside from SFM, FDA, and SVM, an ANN-based fault classifier was also used by Bernal de Lazaro [183] on kernel PCA and kernel FDA features.
The Tennessee Eastman Plant (TEP) is the case study in most of these papers, as it contains samples from normal plant operation as well as from each of 20 different fault scenarios. Once the fault classifier is trained, it can automatically assign every new test sample to normal operation or to any of the fault scenarios it was trained on. However, fault classification methods require a database of samples from many different fault scenarios a priori in order to provide a comprehensive diagnosis.

Diagnosis by Causality Analysis
The methods discussed so far are unable to perform a root cause diagnosis. Root cause diagnosis is valuable when the fault has already propagated to multiple locations, making it difficult to locate its origin. To perform such a task, the causal relationships between process variables must be known so that the fault propagation can be traced throughout the plant [307]. Causal information can be supplied by process knowledge, plant operator experience, or model-based principles. One such work is by Lu and Wang [101], who used a signed digraph (SDG) model of the TEP consisting of 127 nodes and 15 root-cause nodes, together with 20 local dynamic kernel PLS models for the subsystems. However, as a consequence of the kernel mapping, traversing the SDG backwards is difficult since the inverse function from the kernel feature space to the original space cannot be found [101]. Hence, the diagnosis in that work was performed only qualitatively [101].
Bayesian networks provide an architecture for causality analysis, in which concepts such as Granger causality and transfer entropy are used to decide whether one variable is caused by another based on their time series data. In 2017, Gharahbagheri et al. [236,237] used these concepts, together with the residuals from kernel PCA models, to generate causal maps for a fluid catalytic cracking unit (FCCU) and the TEP. The statistical software EViews was used to perform the causality analysis.
In the future, fault diagnosis by causality analysis can potentially benefit from the combination of knowledge-, physics-, and data-driven approaches [1].

Handling Non-Gaussian Noise and Outliers
Recall that in the feature extraction step in Figure 2, it is desired to yield features that are mutually independent so that the T² statistical index can be built. However, methods such as PCA and PLS (even their kernelized versions) may fail to yield such features, especially if the data is laden with non-Gaussian noise or outliers. This issue is widely recognized in practice [25]. Instinctively, MSPM methods can themselves be used for detecting outliers. However, if outliers are present in the training data itself, the accuracy of the MSPM algorithms will be seriously affected.
Independent components analysis (ICA) and its kernelized version, kernel ICA, are widely used MSPM methods that can handle the non-Gaussianity issue. Here, the data is treated as a mixture of independent source signals, so that the aim of ICA is to de-mix the data and recover these sources [308]. To do this, the projection matrix in ICA, W n (also known as a de-mixing matrix), is chosen so that the ICA features are as statistically independent as possible [308]. More concretely, the goal is usually to maximize negentropy, which is a measure of the distance of a distribution from Gaussianity [309]. Kernel ICA can be performed by doing kernel PCA for whitening, followed by linear ICA, as did many researchers [66,72,73,82,90,97,100,106,107,133,140,145,154,155,157,188,203,213,233,239,265,275,276,283,305].
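As a minimal sketch of this KPCA-whitening-plus-linear-ICA route (the toy data, kernel parameter, and component counts are illustrative assumptions, not taken from any of the cited works):

```python
import numpy as np
from sklearn.decomposition import KernelPCA, FastICA

rng = np.random.default_rng(1)
# Toy data: a nonlinear mixture of two non-Gaussian (Laplacian) sources
S = rng.laplace(size=(500, 2))
X = np.column_stack([S[:, 0] + 0.2 * S[:, 1] ** 2,
                     S[:, 1] - 0.1 * S[:, 0] ** 2,
                     0.5 * S[:, 0] * S[:, 1]])

# Step 1: kernel PCA with the RBF kernel, acting as a nonlinear whitening stage
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
T = kpca.fit_transform(X)
T = (T - T.mean(axis=0)) / T.std(axis=0)

# Step 2: linear ICA on the whitened kernel scores to maximize non-Gaussianity
ica = FastICA(n_components=2, random_state=0)
features = ica.fit_transform(T)
print(features.shape)
```

The resulting `features` are the kernel independent components on which the monitoring statistics would then be built.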
A variant of kernel ICA that avoids the usual KPCA-ICA combination was also proposed by Feng et al. [262]. Aside from kernel ICA, the non-Gaussianity issue can also be handled using a kernel Gaussian mixture model [136], the statistical local approach for building the statistical index [112], and kernel density estimation (KDE) for threshold setting [67,194,251].
To handle outliers in the data, Zhang et al. [134] and Deng and Wang [255] incorporated a sliding median filter and a local outlier factor method, respectively, into kernel PCA. Other outlier-robust methods include the spherical kernel PLS [153], the joint kernel FDA [204] and the kernel probabilistic latent variable regression model [235].

Improved Sensitivity and Incipient Fault Detection
Despite the use of advanced MSPM tools, it may be desired to improve their detection sensitivity further. This is beneficial in particular for detecting incipient faults, which are small-magnitude faults with a drifting behavior. These faults are difficult to detect at the initial stage because they are masked by noise and process control [67]. Yet because they are drifting, they can seriously escalate if no action takes place. Kernel MSPM solutions to these issues already exist, which we review as follows.
An early approach for improved detection is dissimilarity analysis (DISSIM), proposed by Kano et al. [310]. DISSIM is mathematically equivalent to PCA, but its statistical index differs from the T² in that it quantifies the dissimilarity between data distributions. Its kernel version, kernel DISSIM, was developed by Zhao et al. [113], and further used in Zhao and Huang [263]. The concept of dissimilarity was also adopted by Pilario et al. [67] and Xiao [291] for kernel CVA and by Rashid and Yu [311] for kernel ICA. Related to DISSIM is statistical pattern analysis (SPA), used in [148,221,258] for kernel PCA. The idea of SPA, as proposed by He and Wang [312], is to build a statistical index from the dissimilarity between the higher-order statistics of two data sets.
Another approach is to use an exponentially weighted moving average (EWMA) filter to increase the sensitivity to drifting faults, as did Yoo and Lee [88], Cheng et al. [116], Fan et al. [154], and Peng et al. [283]. The shadow variables by Feng et al. [262] also involve applying EWMA to the statistical indices for smoothing purposes. For batch processes, a method for detecting weak faults was also proposed by Wang et al. [139]. The works of Jiang and Yan [143,144] improved the sensitivity of kernel PCA by investigating the rate of change of the statistical index and by giving a weight to each feature. Lastly, a new statistic based on the generalized likelihood ratio test (GLRT) can also improve detection for kernel PCA and kernel PLS, as shown by Mansouri et al. [192,193,210,270,271].
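As a hedged illustration of the EWMA idea (the drift profile, smoothing factor, and empirical control limits below are illustrative assumptions):

```python
import numpy as np

def ewma(index, lam=0.2):
    """Exponentially weighted moving average of a statistical index.
    Smaller lam -> heavier smoothing and higher sensitivity to slow drifts."""
    out = np.empty_like(index, dtype=float)
    out[0] = index[0]
    for t in range(1, len(index)):
        out[t] = lam * index[t] + (1 - lam) * out[t - 1]
    return out

rng = np.random.default_rng(0)
t2 = rng.chisquare(df=3, size=300)        # noisy index under normal operation
t2[150:] += np.linspace(0, 3, 150)        # slow incipient drift from sample 150
smoothed = ewma(t2)

# Control limits set empirically from the normal-operation segment
ucl_raw = np.quantile(t2[:150], 0.99)
ucl_sm = np.quantile(smoothed[50:150], 0.99)   # skip the filter start-up transient
n_raw = int(np.sum(t2[150:] > ucl_raw))
n_sm = int(np.sum(smoothed[150:] > ucl_sm))
print(n_sm, n_raw)
```

Because the filter suppresses noise, the control limit of the smoothed index sits much closer to its normal-operation mean, so a slow drift tends to cross it earlier and more persistently than on the raw index.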

Quality-Relevant Monitoring
Before the widespread use of MSPM methods, the traditional approach to process monitoring was to monitor only the quality variables [8], as embodied by statistical quality control. MSPM methods are more beneficial in that they utilize the entire plant data set, rather than just the quality variables, to perform fault detection. However, as noted by Qin [25], it is imperative to link the results from MSPM methods to the quality variables. The kernel MSPM methods discussed thus far have not established this link. This issue can be addressed by performing quality-relevant monitoring.
Partial least squares (PLS) is an MSPM method associated with quality-relevant monitoring, as it finds a relationship between the process and quality variables. The first kernel PLS application was to a biological anaerobic filter process (BAFP) by Lee et al. [89], where the quality variables were the total oxygen demand of the effluent and the flow rate of the exiting methane gas. Zhang and Zhang [107] combined ICA and kernel PLS for monitoring the well-known penicillin fermentation (PenSim) process and predicting the CO₂ and dissolved O₂ concentrations. Hierarchical kernel PLS, dynamic hierarchical kernel PLS, and multi-scale kernel PLS were introduced in [128,135], and [129], respectively. Total PLS (T-PLS) was proposed to make PLS more comprehensive, and its kernel version was developed by Peng et al. [79,141]. The application was to the HSMP, wherein both quality-related and non-quality-related faults were investigated. Further developments on kernel PLS can be found in [146,160,163,164,168,173,196,197,199,206,229,231,242,243,268,284]. Concurrent PLS was also proposed to solve some drawbacks of T-PLS; kernel concurrent PLS was developed by Zhang et al. [176] and Sheng et al. [205].
The other, more recent MSPM tool for relating process and quality variables is canonical correlation analysis (CCA). CCA differs from PLS in that it finds projections that maximize the correlation between two data sets. Kernel CCA first appeared in process monitoring as a modified ICA by Wang and Shi [123], but it was not utilized for quality-relevant monitoring. The same is true of Cai et al. [181], where kernel CCA was merely used to build a complex network for the process. In 2017, Zhu et al. [240] first proposed kernel concurrent CCA for quality-relevant monitoring, and Liu et al. [241] followed with its dynamic version. In a very recent work, Yu et al. [277] proposed a faster version of kernel CCA, to be discussed later in Section 4.8.

Kernel Design and Kernel Parameter Selection
The issue of kernel design is often cited as the reason why researchers prefer other nonlinear techniques over kernel methods. It is difficult to decide which kernel function to use (see Equations (5)-(7)) and how the kernel parameters should be chosen. (Note, however, that similar decisions also exist in ANNs, e.g., how to set the depth of the network, the number of hidden neurons, and the learning rate, and which activation function and regularization method to use.) These choices also depend on the decisions made at other stages of process monitoring. For instance, choosing one kernel function over another may change the number of retained kernel principal components necessary for good performance. Moreover, the quality of the training data can influence all of these decisions. Even if these parameters were carefully tuned based on fixed data sets for training and validation, the detection model may still yield too many false alarms if the data sets are not representative of all behaviors of the normal process. Process monitoring performance greatly depends on these aspects. We review existing efforts that address these issues, as follows.

Choice of Kernel Function
The main requirement for a kernel function to be valid is to satisfy Mercer's condition [22]. According to Mercer's theorem, as quoted from [313]: a necessary and sufficient condition for a symmetric function k(·, ·) to be a kernel is that for any set of samples x_1, . . . , x_N and any set of real numbers λ_1, . . . , λ_N, the function k(·, ·) must satisfy ∑_i ∑_j λ_i λ_j k(x_i, x_j) ≥ 0 (Equation (8)), which translates to the function k(·, ·) being positive definite. This means that if a function satisfies the condition in Equation (8), it can act as a dot product in the mapping of x defined by φ(·), and hence, it is a valid Mercer kernel function. If k(·, ·) acts as a dot product, then for any two samples x and z, the function is symmetric, i.e., k(x, z) = k(z, x), and also satisfies the Cauchy-Schwarz inequality: k²(x, z) ≤ k(x, x)k(z, z) [313].
Although many kernel functions exist [44,314], only a few common ones are used in process monitoring, namely, Equations (5)-(7). We identified the kernels used in each of the 230 papers included in this review. In the tally shown in Figure 8a, the RBF kernel is found to be the most popular choice, by a wide margin. Even outside the process monitoring community, the Gaussian RBF kernel (also known as the squared exponential kernel) is the most widely used kernel in the field of kernel machines [314], possibly owing to its smoothness and flexibility. Other kernels found in the review are the cosine kernel [105], the wavelet kernel [137], the recent non-stationary discrete convolution (NSDC) kernel [295,296], and the heat kernel [182,266,290] for manifold learning (see Section 4.9). Other advances relate to the kernel design itself. For instance, Shao et al. [108] and Luo et al. [182] proposed data-dependent kernels for kernel PCA, which are used to learn manifolds. A robust alternative to kernel PLS was proposed by Hu et al. [153], which uses a sphered kernel matrix. Meanwhile, Zhao and Xue [163] used a mixed kernel for kernel T-PLS to discover both local and global patterns. The mixed kernel consists of a convex combination of the RBF and POLY kernels. Mixed kernels were also used by Pilario et al. [67] for kernel CVA, motivated by the monitoring of incipient faults. This additive principle was also used to design a kernel for batch processes by Yao and Wang [170]. More recently, Wang et al. [288,289] proposed to use the first-order expansion of the RBF kernel to save computational cost. However, it is not clear if the new design retains the flexibility of the original RBF kernel in handling nonlinearity, or how it compares to polynomial kernels of the same order.
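Since any convex combination of Mercer kernels is itself a Mercer kernel, a mixed kernel matrix can be formed directly from the RBF and POLY kernel matrices. A minimal sketch follows (the kernel forms assume the common definitions exp(-||x-z||²/c) and (xᵀz + 1)^d; the weight and parameter values are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(X, Z, c):
    """Gaussian RBF kernel matrix: k(x, z) = exp(-||x - z||^2 / c)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / c)

def poly_kernel(X, Z, d):
    """Polynomial kernel matrix: k(x, z) = (x'z + 1)^d."""
    return (X @ Z.T + 1.0) ** d

def mixed_kernel(X, Z, alpha=0.7, c=10.0, d=2):
    """Convex combination of a local (RBF) and a global (POLY) kernel."""
    return alpha * rbf_kernel(X, Z, c) + (1.0 - alpha) * poly_kernel(X, Z, d)

X = np.random.default_rng(0).normal(size=(50, 4))
K = mixed_kernel(X, X)
# A valid Mercer kernel matrix is symmetric positive semi-definite
print(K.shape, np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-6)
```

The weight alpha trades off the local (RBF) against the global (POLY) behavior and would itself be tuned alongside c and d.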

Kernel Parameter Selection
The kernel parameters for the RBF, POLY, and SIG kernels in Equations (5)-(7) are the kernel bandwidth c, the polynomial degree d, and the sigmoid scale a and bias b, respectively. These kernels satisfy Mercer's conditions for c > 0, d ∈ ℕ, and only some combinations of a and b [22,67]. There is currently no theoretical basis for specifying the values of these parameters, yet they must be specified prior to performing any kernel method. We review some of the existing ways to obtain their values, as follows.
We have tallied the various parameter selection routes used by the 230 papers included in this review. Based on the results in Figure 8b, the most popular approach is to select the parameters empirically. For the RBF kernel, c is usually computed based on the data variance (σ²) and dimensionality (m), i.e., c = rmσ² [24,72,96,97], where r is an empirical constant. Another heuristic is based solely on the dimensionality, such as c = 5m [86][87][88] or c = 500m [66,118,130,204] for the TEP case study. For the TEP alone, many values have been used, such as c = 6000 [157,213], c = 1720 [177], c = 4800 [205], c = 3300 [220], and so on. However, note that the appropriate value of c does not depend on the case study, but rather on the characteristics of the data that enters the kernel mapping. Hence, the chosen values will differ when different data pre-processing steps are used, even for the same case study. Other notable heuristics for c can be found in [68,126,131,164,248,280].
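The two heuristics above can be sketched in a few lines (the data, the constant r, and the resulting values are illustrative assumptions):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(500, 8))  # pre-processed training data
m = X.shape[1]                                      # dimensionality
sigma2 = X.var(axis=0).mean()                       # average variable variance

r = 10.0                       # empirical constant (a hypothetical choice)
c_variance = r * m * sigma2    # variance-based heuristic: c = r*m*sigma^2
c_dim = 5.0 * m                # dimensionality-only heuristic: c = 5m
print(c_variance, c_dim)
```

Either value would then serve as a starting point to be refined, e.g., by the cross-validation or optimization routes below.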
A smaller number of papers have used cross-validation to decide kernel parameter values. In this scheme, the detection model is tuned according to some objective, such as minimizing false alarms, using a validation data set that must be independent of the training data [67]. Another scheme is to perform k-fold cross-validation, as did [85], in which the data set is split into k groups: k − 1 groups are used for training while the remaining group is used for validation, repeating k times with a different held-out group each time. Typically, k = 5 or 10. Grid search is a common approach for the tuning stage, where the kernel parameters are chosen from a grid of candidates, as did [67,79,98,121,124,141,151,170,171,195,201,215,259]. Based on a recent study by Fu et al. [68], cross-validation was found to yield better estimates of the kernel parameters than the empirical approach.
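A hedged sketch of this grid-search route for a kernel PCA detector follows (the data, the candidate grid, the number of retained components, and the false-alarm objective are illustrative assumptions; scikit-learn's gamma corresponds to the inverse bandwidth 1/c for the RBF form exp(-||x - z||²/c)):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))   # normal operating data for training
X_val = rng.normal(size=(200, 5))     # independent validation set, also normal

def false_alarm_rate(gamma, n_comp=3):
    """Train a kernel PCA T^2 detector and count alarms on normal data."""
    kpca = KernelPCA(n_components=n_comp, kernel="rbf", gamma=gamma)
    T_train = kpca.fit_transform(X_train)
    lam = T_train.var(axis=0)                         # score variances
    t2 = lambda T: np.sum(T ** 2 / lam, axis=1)       # T^2 index
    ucl = np.quantile(t2(T_train), 0.99)              # empirical 99% limit
    return np.mean(t2(kpca.transform(X_val)) > ucl)

grid = [0.01, 0.05, 0.1, 0.5, 1.0]                    # candidate gamma values
far = [false_alarm_rate(g) for g in grid]
best = grid[int(np.argmin(far))]
print(best, min(far))
```

In practice, the validation objective would also weigh detection performance on known faults, and the number of retained components would be searched jointly with the kernel parameter, as emphasized in [67,68].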
A more detailed approach to compute kernel parameters is via optimization. It is known that if certain objectives are set, these parameters will have an optimal value. For instance, as explained by Bernal de Lazaro [184], if the RBF kernel bandwidth c is too large, the model loses the ability to discover nonlinear patterns, but if it is too small, the model will become too sensitive to the noise in the training data. Hence, the value of c can be searched such that the false alarm rate is minimum and the detection rate is maximum [184]. Exploring these trade-offs is key to the optimization procedure. Other criteria for optimizing kernel parameters were proposed in [183]. Some search techniques include the bisection method [162], Tabu search [247,250,274], particle swarm optimization [184,276], differential evolution [184], and genetic algorithm [84,93,102,108,154]. More recent studies have emphasized that kernel parameters must be optimized simultaneously with the choice of latent components (e.g., no. of kernel principal components) since these choices depend on each other [67,68].
Finally, there are also some papers that investigated the effect of varying the kernel parameters and presented their results (see [67,80,98,165,185,256,295,296]). In case the reader is interested in such an investigation, we have provided MATLAB code for visualizing the contours of kernel PCA statistical indices for any 2-D data set, available in [315]. This code was used to generate one of the figures in [67]. Understanding the effect of the kernel parameters and the kernel function is important, especially as process monitoring methods become more sophisticated in the future.

Fast Computation of Kernel Features
Recall from Section 2.3 that one of the issues of kernel methods is scalability. This is because the computational complexity of kernel methods grows in proportion to the size of the training data. Hence, although they are fast to train, they are slow in making predictions [45]. Addressing the scalability of kernel methods is important, especially since samples are now being generated at large volumes in the plant [8]. The time complexity of naïve kernel PCA in the online testing phase is O(N²), where N is the number of training samples. Assuming that a typical CPU can perform 10⁸ operations per second [316], kernel PCA can only afford at most about 10⁴ training samples if a prediction is also desired within a second. In the following, we review the many approaches adopted by process monitoring researchers to compute kernel features faster.
An early approach to reduce the computational cost of kernel MSPM methods is to select only a subset of the training samples so that their mapping is as expressive as if the entire data set were used. By reducing the number of samples, the kernel matrix reduces in size, and hence the transformation in Equation (4) can be computed faster. Feature vector selection (FVS) is one such method in this regard, as proposed by Baudat and Anouar [317], and then adopted by Cui et al. [98] for kernel PCA based process monitoring. FVS aims to preserve the geometric structure of the kernel feature space by an iterative error minimization process. Cui et al. [98] have shown that for the TEP, even if only 30 of the 480 training samples were selected by FVS and stored by the model, the average fault detection rate changed by only 0.7%. FVS was further adopted in [77,104,105,125,149,256]. A related feature points extraction scheme was also proposed for batch processes by Wang et al. [142]. Another idea is similarity analysis, wherein a sample is rejected from the mapping if it is found to be similar to the current set by some criteria (this is not to be confused with the similarity factor method, SFM, discussed in Section 4.3.2). Similarity analysis was adopted by Zhang and Qin [100] and Zhang [106]. Meanwhile, Guo et al. [278] reformulated kernel PCA itself to sparsify the projection matrix using elastic net regression. Other techniques for sample subset selection include feature sample extraction [73], fuzzy C-means clustering [159], reduced KPCA [207], partial KPCA [249], and dictionary learning [246,250,270,271,274]. These methods are efficient enough to warrant an online adaptive implementation (see Section 4.10).
The other set of approaches involves a low-rank approximation of the kernel matrix for large-scale learning. The Nyström approximation and random Fourier features are the typical approaches in this set. The Nyström method approximates the kernel matrix by sampling a subset of its columns. It was adopted recently by Yu et al. [277] for kernel CCA. Meanwhile, random Fourier features were adopted by Wu et al. [279] for kernel PCA. This scheme exploits Bochner's theorem [59,279], in which the kernel mapping is approximated by passing the data through a randomized projection followed by cosine functions. This results in a map of lower dimension, which saves computational cost. For more information, see the theoretical and empirical comparison of the Nyström method and random Fourier features by Yang et al. [318]. Other related low-rank approximation schemes were proposed by Peng et al. [283], which applies to kernel ICA, and by Zhou et al. [286], called randomized kernel PCA. Lastly, a different approximation using the Taylor expansion of the RBF kernel, called kernel sample equivalent replacement, was derived by Wang et al. [288,289].
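Both approximations are available off the shelf; a minimal sketch using scikit-learn (the data, gamma, and component counts are illustrative assumptions):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.default_rng(0).normal(size=(1000, 10))
K_exact = rbf_kernel(X, gamma=0.1)          # exact N x N kernel matrix, O(N^2)

# Nystrom: approximate K from a random subset of its columns
nys = Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)
Z = nys.fit_transform(X)                    # N x 100 explicit feature map
K_nys = Z @ Z.T

# Random Fourier features: randomized projection + cosines (Bochner's theorem)
rff = RBFSampler(gamma=0.1, n_components=500, random_state=0)
W = rff.fit_transform(X)                    # N x 500 explicit feature map
K_rff = W @ W.T

print(np.abs(K_exact - K_nys).mean(), np.abs(K_exact - K_rff).mean())
```

Downstream methods then work with the explicit low-dimensional maps Z or W instead of the full kernel matrix, so the prediction cost scales with the number of components rather than with N.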

Manifold Learning and Local Structure Analysis
The kernel MSPM methods described thus far are limited in their ability to learn local structure. A famous example that exhibits local structure is the S-curve data set, described in [319], which is a sheet of points forming an "S" in 3-D space (see Figure 9a). In this case, manifold learning methods are more appropriate for dimensionality reduction. While kernel PCA aims to preserve nonlinear global directions with the maximum variance, manifold learning methods are constrained to preserve the distances between data points in their local neighborhoods [320]. For the S-curve data, this means that manifold learning methods will be able to "unfold" the curve in a 2-D mapping so that the points from either end of the curve become farthest apart, whereas kernel PCA would undesirably map them close together. In Figure 9c, locally linear embedding (LLE) was used as the manifold learner. The concept of manifold learning, sometimes called local structure analysis, has already been adopted by many process monitoring researchers, which we review as follows.

The first few efforts to learn nonlinear manifolds via kernels for process monitoring were made by Shao et al. [108,109] in 2009. The techniques in [108,109] are related to maximum variance unfolding (MVU), which is a variant of kernel PCA that does not require selecting a kernel function a priori. Instead, MVU automatically learns the kernel matrix from the training data [109,320]. However, a parameter for defining the neighborhood must still be adjusted, for instance, the number of nearest neighbors, k. The strategy in [109] is to set k as the smallest integer that makes the entire neighborhood graph fully connected. Shao and Rong [109] have shown that the spectrum of the kernel matrix from MVU reveals a sharper contrast between the dominant and non-dominant eigenvalues than that from kernel PCA for the TEP case study.
This result is important as it indicates that the salient features were separated from the noise more effectively. Other than MVU, a more popular technique is locality preserving projections (LPP), originally proposed by He and Niyogi [321] and then adopted by Hu and Yuan [322] for batch process monitoring. MVU computes an embedding only for the training data; hence, it requires a regression step to find an explicit mapping function for any test data. In contrast, an explicit mapping is readily available in LPP. The kernel version of LPP was adopted by Deng et al. [149,150] for process monitoring. Meanwhile, generalized LPP and discriminative LPP (and its kernel version) were proposed by Shao et al. [110] and Rong et al. [151], respectively. Other works that adopted variants of LPP can be found in [218,234,252,258,266,273,290]. The heat kernel (HK) is commonly used as the weighting function in LPP.
More recently, researchers have recognized that both global and local structure must be learned rather than focusing on one or the other. Hence, Luo et al. [182,187] proposed the kernel global-local preserving projections (GLPP). The projections from GLPP are in the middle of those from LPP and PCA because the local (LPP) and global (PCA) structures are simultaneously preserved. Other works in this regard can be found in [204,215,222,279,282]. To learn more about manifold learning, we refer the reader to a comparative review of dimensionality reduction methods by Van der Maaten et al. [320]. The connection between manifold learning and kernel PCA is also discussed by Ham et al. [323].
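The S-curve contrast described above can be reproduced with scikit-learn (the sample size, neighborhood size, and kernel parameter are illustrative assumptions):

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_s_curve(n_samples=1000, random_state=0)   # 3-D "S" sheet

# LLE preserves local neighborhoods, "unfolding" the sheet into 2-D
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y_lle = lle.fit_transform(X)

# Kernel PCA instead preserves nonlinear global variance directions
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
Y_kpca = kpca.fit_transform(X)
print(Y_lle.shape, Y_kpca.shape)
```

Coloring both 2-D embeddings by the position along the curve (the `color` output) visualizes how LLE keeps the two ends of the "S" far apart.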

Time-Varying Behavior and Adaptive Kernel Computation
When an MSPM method is successfully trained and deployed for process monitoring, it is usually assumed that the normal process behavior represented in the training data is the same behavior to be monitored during the testing phase. This means that the computed projection matrices and upper control limits (UCLs) are fixed or time-invariant. However, in practice, the process behavior continuously changes. Even if sophisticated detection models were used, a changing process behavior would require the model to be adaptive. That is, the model must adapt to changes in the normal behavior without accommodating any fault behavior. However, it would be time-consuming for the model to be re-trained from scratch every time a new sample arrives. Hence, a recurrence relation or a recursive scheme must be formulated to make the model adaptive. For kernel methods, the actual issue is that kernel matrix adaptation is not straightforward. As noted by Hoegaerts et al. [324], adapting a linear PCA covariance matrix to a new data point will not change its size, whereas doing so for a kernel matrix would expand both its row and column dimensions. Hence, to keep its size, the kernel matrix must be updated and downdated at the same time. In addition, the eigendecomposition of the kernel matrix must also be adapted, wherein the number of retained principal components may change. These notions are important for addressing the time-varying process behavior.
In 2009, Liu et al. [111] proposed a moving window kernel PCA by implementing the adaptive schemes from Hoegaerts et al. [324] and Hall et al. [325]. It was applied to a butane distillation process where the fresh feed flow and the fresh feed temperature are time-varying. During implementation, adaptive control charts were produced, where the UCLs vary with time and the number of retained principal components varied between 8 and 13 as well. Khediri et al. [126] then proposed a variable moving window scheme where the model can be updated with a block of new data instead of a single data point. Meanwhile, Jaffel et al. [191] proposed a moving window reduced kernel PCA, where "reduced" pertains to an approach for easing the computational burden as discussed in Section 4.8. Other related works that utilize the moving window concept can be found in [190,207,208,209,238,293]. A different adaptive approach is to use multivariate EWMA to update any part of the model, such as the kernel matrix, its eigen-decomposition, or the statistical indices [116,132,179,224,253,281,283,292]. Finally, for the dictionary learning approach by Fezai et al. [246,247] (see Section 4.8), the Woodbury matrix identity is required to update the inverse of the kernel matrix, thereby updating the dictionary of kernel features as well. This scheme was adopted later in [250,270,271].
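A minimal sketch of the update-downdate step for a moving window kernel matrix follows (the window length, kernel form, and bandwidth are illustrative assumptions; adapting the eigendecomposition, as in [324,325], is omitted here):

```python
import numpy as np

def rbf_row(x, Z, c=10.0):
    """RBF kernel values k(x, z) for one sample x against all rows z of Z."""
    return np.exp(-((Z - x) ** 2).sum(axis=1) / c)

def window_update(K, X_win, x_new, c=10.0):
    """Slide the window one step: downdate (drop the oldest sample) and
    update (append the newest), patching K instead of rebuilding it."""
    K = K[1:, 1:]                                  # downdate: remove row/col 0
    X_win = np.vstack([X_win[1:], x_new])          # shift the sample window
    k_new = rbf_row(x_new, X_win, c)               # new row, incl. k(x_new, x_new)
    K = np.vstack([np.hstack([K, k_new[:-1, None]]),
                   k_new[None, :]])                # update: append row/col
    return K, X_win

rng = np.random.default_rng(0)
X_win = rng.normal(size=(100, 4))                  # initial window of samples
d2 = ((X_win[:, None, :] - X_win[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / 10.0)                             # initial kernel matrix

K, X_win = window_update(K, X_win, rng.normal(size=(1, 4)))
print(K.shape)                                     # window size is preserved
```

Each slide costs only one new kernel row, O(N), instead of the O(N²) rebuild of the full matrix.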

Multi-Block and Distributed Monitoring
Due to the enormous scale of modern industrial plants, a centralized process monitoring system for the entire plant has its limitations. According to Jiang and Huang [326], a centralized system may be limited in terms of: (1) fault tolerance: it may fail to recognize faults if many of them occur simultaneously at different locations; (2) reliability: because it handles all data channels, it is more likely to fail if one of the channels becomes unavailable; (3) economic efficiency: it does not account for geographically distant process units that should naturally be monitored separately; and (4) performance: its monitoring performance can still be improved by decomposing the plant into blocks. These reasons have led to the rise of multi-block, distributed, or decentralized process monitoring methods, of which the kernel-based ones are reviewed as follows.
Kernel PLS is widely applied to decentralized process monitoring, as found in [101,119,129,206,284]. Lu and Wang [101] utilized a signed digraph, a technique noted in Section 4.3.3 to achieve fault diagnosis by incorporating causality. Zhang et al. [119] proposed multi-block kernel PLS to monitor the continuous annealing process (CAP) case study, exploiting the fact that each of the 18 rolls in the process constitutes a block of variables. By monitoring each of the 18 blocks rather than the entire process as one, it becomes easier to diagnose the fault location. An equivalent multi-block multi-scale kernel PLS was used by Zhang and Hu [129] in the PenSim and the electro-fused magnesia furnace (EFMF) case studies. Multi-block kernel ICA was proposed by Zhang and Ma [133] to monitor the CAP case study as well. Enhanced results for the CAP were achieved by Liu et al. [241] using dynamic concurrent kernel CCA with multi-block analysis for fault isolation. Peng et al. [283] also used prior process knowledge of the TEP to partition the 33 process variables into 3 sub-blocks, each monitored by adaptive dynamic kernel ICA.
To perform block division when process knowledge is not available, Jiang and Yan [327] proposed mutual information (MI) based clustering. This idea was fused with kernel PCA based process monitoring by Jiang and Yan [180], Huang and Yan [245], and Deng et al. [287]. All of these works used the TEP as a case study, and they consistently arrived at 4 sub-blocks for the TEP. For instance, in [245], the method initially produced 12 sub-blocks of variables, but 7 of them contained only one variable; hence, some sub-blocks were merged into others, yielding only 4 sub-blocks in the end. Another approach is to divide the process into blocks that give optimal fault detection performance, as proposed by Jiang et al. [198], who used the genetic algorithm for optimization and kernel PCA for performance evaluation. Different from the above, Cai et al. [181] used kernel CCA to model the plant as a complex network and then used PCA for process monitoring. Li et al. [80,267] also proposed a hierarchical process modelling concept that separates the monitoring of linearly from nonlinearly related variables. More recently, Yan et al. [284] used self-organizing maps (SOM) for block division, where the quality-related variables are monitored by kernel PLS and the quality-unrelated variables by kernel PCA.
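As a minimal sketch of MI-based block division (the MI estimator, the linkage rule, and the toy data below are illustrative choices, not the exact algorithm of Jiang and Yan [327]), one can cluster variables on a pairwise mutual information matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.feature_selection import mutual_info_regression

def mi_blocks(X, n_blocks):
    """Group the columns of X into n_blocks by hierarchical clustering on a
    pairwise mutual information (MI) matrix: high MI -> small distance."""
    m = X.shape[1]
    mi = np.zeros((m, m))
    for j in range(m):
        mi[:, j] = mutual_info_regression(X, X[:, j], random_state=0)
    mi = 0.5 * (mi + mi.T)            # symmetrize the kNN-based estimate
    np.fill_diagonal(mi, 0.0)         # ignore self-information
    dist = mi.max() - mi              # convert similarity to distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_blocks, criterion="maxclust")

# Toy data: variables 0-2 share one latent factor, variables 3-5 another.
rng = np.random.default_rng(1)
f1, f2 = rng.standard_normal((2, 500))
X = np.column_stack([f1, 2 * f1, -f1, f2, 3 * f2, -2 * f2])
X += 0.05 * rng.standard_normal(X.shape)
labels = mi_blocks(X, n_blocks=2)
```

Variables driven by the same latent factor end up in the same block, which is the behavior these works rely on when no process knowledge is available.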
For a systematic review of plant-wide monitoring methods, the reader can refer to Ge [33].

Advanced Methods: Ensembles and Deep Learning
Ensemble learning and deep learning are two emerging concepts that have now become standard in the AI community [40]. The idea of ensemble learning is to build an enhanced model by combining the strengths of many simpler models [308]. The case for ensembles is strengthened by the many data science competitions won by exploiting the concept: the winner of the Netflix Prize for a video recommender system was an ensemble of more than 100 learners [40]; the winner of the Higgs Boson machine learning challenge was an ensemble of 70 deep neural networks that differed in initialization and training data sets [328]; and 17 of the 29 challenges published in 2015 alone on the machine learning competition site Kaggle were reportedly won using an ensemble learner called XGBoost [329]. Meanwhile, deep learning methods are general-purpose learning procedures for the automatic extraction of features using a multi-layer stack of input-output mappings [52]. Because features are learned automatically, deep learning avoids the task of designing feature extractors by hand, which would otherwise require domain expertise. The case for deep learners is strengthened by the fact that they have beaten many records in computer vision, natural language processing, video games, etc. [52,330]. In the process monitoring community, ensemble and deep architectures have also started appearing among kernel-based methods.
In 2015, Li and Yang [167] proposed an ensemble kernel PCA strategy wherein the base learners are kernel PCA models with various RBF kernel widths. For the TEP, 11 base models with kernel widths c = 2^(i-1)·5m, i = 1, ..., 11, were used and gave better detection rates than any single RBF kernel alone. Later on, Deng et al. [220] proposed Deep PCA by stacking together linear PCA and kernel PCA mappings. Bayesian inference was used to consolidate the monitoring statistics from each layer into a single final result. Using the TEP as a case study, the detection rates of a 2-layer Deep PCA model were shown to improve over those of linear PCA and kernel PCA alone. Further work in [256] used more layers in Deep PCA, as well as the FVS scheme (see Section 4.8) for reducing the computational cost. Deng et al. [257] also proposed serial PCA, where kernel PCA is performed on the residual space of an initial linear PCA transformation; in that work, the similarity factors method was also used for fault classification (see Section 4.3.2). A different way to hybridize PCA and kernel PCA is in parallel instead of in series, as proposed by Jiang and Yan [261]. Meanwhile, Li et al. [80,267] also used multi-level hierarchical models involving both linear PCA and kernel PCA. More recently, ensemble kernel PCA was fused with local structure analysis by Cui et al. [273] for manifold learning (see Section 4.9).
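A minimal sketch of such an ensemble is given below, assuming RBF kernels of the form exp(-||x-y||^2/c), a Hotelling-type T^2 statistic for each base model, and a simple OR fusion rule; the actual monitoring statistic and fusion rule in [167] may differ:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def ensemble_kpca_alarms(X_train, X_test, widths, n_comp=3, q=0.99):
    """One kernel PCA detector per RBF width c; each raises an alarm when its
    T^2 statistic exceeds a control limit set at the q-th quantile of the
    training statistics. Alarms are fused with a simple OR rule."""
    alarms = np.zeros(len(X_test), dtype=bool)
    for c in widths:
        kpca = KernelPCA(n_components=n_comp, kernel="rbf", gamma=1.0 / c)
        T_train = kpca.fit_transform(X_train)
        lam = T_train.var(axis=0) + 1e-12          # per-component variances
        t2 = lambda T: np.sum(T**2 / lam, axis=1)  # Hotelling-type statistic
        limit = np.quantile(t2(T_train), q)
        alarms |= t2(kpca.transform(X_test)) > limit
    return alarms

# Toy run: normal training data versus a mean-shifted (faulty) test batch.
rng = np.random.default_rng(2)
m = 4
X_train = rng.standard_normal((300, m))
X_fault = rng.standard_normal((50, m)) + 4.0
widths = [2**(i - 1) * 5 * m for i in range(1, 6)]   # c = 2^(i-1) * 5m
alarms = ensemble_kpca_alarms(X_train, X_fault, widths)
```

The rationale is the one given in [167]: base models with different widths are sensitive to different fault magnitudes, so the ensemble covers cases that any single width would miss.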
We refer the reader to Lee et al. [9] for a more general outlook of the implications of advanced learning models to the process systems engineering field.

A Future Outlook on Kernel-Based Process Monitoring
Despite the many advances in kernel-based process monitoring research, more challenges are still emerging. It is likely that kernel methods, and other machine learning tools, as presented in Figure 4, will have a role in addressing these challenges towards safer operations in the industry. A few of these challenges are discussed as follows.

Handling Heterogeneous and Multi-Rate Data
As introduced in Section 2, plant data sets are said to consist of N samples of M process variables. However, process measurements are not the only source of plant data. Process monitoring can be made more effective by also exploiting image data analytics, video data analytics, and alarm analytics. One notable work by Feng et al. [262] used kernel ICA to analyze video information for process monitoring, while a more recent integration of alarm analytics into fault detection and identification was developed by Lucke et al. [331]. Aside from these, spectroscopic data could be another information source from the plant, since it is used for elucidating chemical structure. In addition, process monitoring can be improved by combining information from both low- and high-frequency process measurements. Most of the case studies in the papers reviewed here generate only low-frequency data, e.g., the 3-min sampling interval of the TEP, but there also exist high-frequency data from pressure transducers (5 kHz), vibration measurements (0.5 Hz-10 kHz), and so on. Ruiz-Carcel et al. [332], for instance, combined such multi-rate data to perform fault detection and diagnosis using CVA. It is projected that more efforts to handle heterogeneous and multi-rate data will appear in the future.
Although the above issues are recognized, the way forward is to first establish benchmark case studies that exhibit heterogeneous and multi-rate data. This will help ensure that new methods for handling these issues can be compared fairly. One such data set, from a real-world multiphase flow facility, has been generated and made publicly available by Stief et al. [333]; see that reference for details on the data set and how to acquire it.

Performing Fault Prognosis
Fault detection and diagnosis are the main objectives of the papers found in this review. As noted in Section 1, the third component of process monitoring is fault prognosis. After a fault has been detected and localized, prognosis methods aim to predict the future behavior of the process under faulty conditions. If the fault would lead to process failure, it is important to know in advance when failure would occur, along with a measure of the uncertainty of that estimate. This predicted time is known as the remaining useful life or time-to-failure of the process [334]. Once these quantities are computed, the appropriate maintenance or repair actions can be performed, and hence failure or emergency situations can be prevented.
To perform prognosis, the first step is to extract from the measured variables an incipient fault signal that is separated as clearly as possible from noise and other disturbances. This means that the method used for feature extraction should handle the incipient fault detection issue very well (see Section 4.5). Secondly, the drifting behavior of the incipient fault must be extrapolated into the future using a predictive model. This predictive element is key to the prognosis performance. The model must have a satisfactory extrapolation ability, that is, the ability to make reliable predictions beyond the data space where it was initially trained [20]. For instance, a detection model based on the widely used RBF kernel has poor extrapolation ability, as noted in Pilario et al. [67]. To address this, a mixture of the RBF and POLY kernels was used to improve both interpolation and extrapolation abilities; these mixed kernels were adopted into kernel CVA for incipient fault monitoring. Another kernel method for prediction is Gaussian processes (GP), which was used by Ge [335] under the PCA framework. Also, Ma et al. [265] used the fault reconstruction approach in kernel ICA to generate fault signals for prediction, while Xu et al. [186] used a neural network for prediction together with local kernel PCA based monitoring.
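The mixed-kernel idea can be sketched as a convex combination of RBF and polynomial kernels; since a nonnegative sum of Mercer kernels is itself a Mercer kernel, the resulting Gram matrix remains positive semi-definite. The parameter values below are placeholders, not those tuned in [67]:

```python
import numpy as np

def mixed_kernel(X, Y, c=10.0, d=2, alpha=0.5):
    """k(x, y) = alpha * exp(-||x - y||^2 / c) + (1 - alpha) * (x.y + 1)^d,
    a convex combination of an RBF and a polynomial (POLY) kernel."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)                       # pairwise squared distances
    return alpha * np.exp(-sq / c) + (1 - alpha) * (X @ Y.T + 1.0) ** d

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 3))
K = mixed_kernel(X, X)                           # Gram matrix on training data
eigvals = np.linalg.eigvalsh(0.5 * (K + K.T))    # check positive semi-definiteness
```

The RBF term supplies the interpolation ability near the training data, while the polynomial term keeps the kernel informative far from it, which is what improves extrapolation for drift prediction.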
Despite these efforts, predictive tasks are generally considered difficult, especially in nonlinear dynamic processes. For nonlinear processes, predictions will be inaccurate if the hypothesis space of the assumed predictive model is not rich enough to capture the complex process behavior. Even if the hypothesis space is sufficient, enough training data must be acquired to find the correct model within it, yet training data are scarce during the initial stage of process degradation. In other words, it is difficult to determine whether the future trend will be linear, exponential, or of any other shape on the basis of only a few degradation samples. Furthermore, a process is dynamic if its behavior at one point in time depends on its behavior at a previous time. This means that if the current prediction is fed back into a dynamic model as the input for the next prediction, then small errors will accumulate as predictions are made farther into the future. It is important to be aware of these issues when developing fault prognosis strategies for industrial processes.
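The compounding of recursive prediction errors can be demonstrated with a toy example (not drawn from any of the reviewed works): a one-step model of a chaotic logistic map with only a ~0.1% parameter error stays accurate for a single step, yet drifts badly when its own outputs are fed back as inputs:

```python
# Plant: logistic map x_{t+1} = r x_t (1 - x_t); model: slightly wrong r.
f_true = lambda x: 3.9 * x * (1.0 - x)
f_model = lambda x: 3.9039 * x * (1.0 - x)    # ~0.1% mis-identified gain

x = 0.3
truth, recursive = [x], [x]
for _ in range(30):
    x = f_true(x)
    truth.append(x)                            # measured trajectory
    recursive.append(f_model(recursive[-1]))   # prediction fed back as input

one_step_err = abs(f_model(truth[0]) - truth[1])           # tiny
late_errs = [abs(p - t) for p, t in zip(recursive[20:], truth[20:])]
```

Here the one-step error is below 0.001, yet the multi-step predictions become useless within a few dozen iterations, which is the accumulation effect described above.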

Developing More Advanced Methods and Improving Kernel Designs
Due to the recent advances in AI research, more process monitoring methods that rely on ensembles and deep architectures are expected to appear in the future (see Section 4.12). As mentioned in Section 2.3, both kernel methods and deep ANNs can be exploited, possibly in combined form, to create more expressive models. In addition, more creative kernel designs can be used, especially via the multiple kernel learning approach, as noted in [67,163,277]. Multiple kernels can be created by combining single kernels additively or multiplicatively while still satisfying Mercer's conditions [44,306]; the combination can be done in series, in parallel, or both. For instance, the proposed serial PCA [257] and Deep PCA [220] architectures can pave the way for deep kernel learning for process monitoring. The concept of automatic relevance determination [314] can also be considered in future works, wherein the Gaussian kernel width is allowed to take a different value in each dimension of the data space. New kernel designs can likewise be inspired by the challenge of handling heterogeneous data, as mentioned in Section 5. In parallel with these developments, a more careful approach to kernel parameter selection must be carried out, for example, via cross-validation or optimization techniques. To ensure that new results can be replicated and verified, we encourage researchers to always state the kernel functions chosen, the kernel parameter selection route, and how all other settings were obtained. The repeatability of results strengthens the understanding of new concepts, which in turn leads to newer concepts more quickly. These efforts are necessary to further the development of the next generation of methods for fault detection, fault diagnosis, and fault prognosis in industrial plants.
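As a sketch of automatic relevance determination (ARD), the RBF kernel can be given a separate lengthscale per input dimension. In practice the lengthscales are learned, e.g., by maximizing a Gaussian process marginal likelihood; here they are fixed by hand for illustration:

```python
import numpy as np

def ard_rbf_kernel(X, Y, lengthscales):
    """ARD RBF kernel: k(x, y) = exp(-0.5 * sum_j (x_j - y_j)^2 / l_j^2).
    A very large l_j makes dimension j effectively irrelevant to the kernel."""
    Xs, Ys = X / lengthscales, Y / lengthscales
    sq = (np.sum(Xs**2, axis=1)[:, None] + np.sum(Ys**2, axis=1)[None, :]
          - 2.0 * Xs @ Ys.T)
    return np.exp(-0.5 * sq)

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 2))
K_ard = ard_rbf_kernel(X, X, np.array([1.0, 1e6]))    # dimension 1 switched off
K_dim0 = ard_rbf_kernel(X[:, :1], X[:, :1], np.array([1.0]))
```

With the second lengthscale set very large, the Gram matrix is indistinguishable from one computed on the first dimension alone, which is how ARD performs implicit variable selection inside the kernel.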
It is important to note, however, that the development of new methods must be driven by the needs of the industry rather than for the sake of simply implementing new techniques. This means that, although it is tempting to develop a sophisticated method that can handle all the issues discussed in this article, it is more beneficial to understand the case study and the characteristics of the plant data at hand so that the right solutions are delivered to the end users.

Conclusions
In this paper, we reviewed the applications of kernel methods to nonlinear process monitoring. We first discussed the relationship between kernel methods and other techniques from machine learning, most notably neural networks. Within this context, we motivated why kernel methods are worth considering for nonlinear feature extraction from industrial plant data.
Based on 230 collected papers from 2004 to 2019, this article then identified 12 major issues that researchers aim to address regarding the use of kernel methods as feature extractors. We discussed issues such as how to choose the kernel function, how to decide kernel parameters, how to perform fault diagnosis in kernel feature space, how to compute kernel mappings faster, how to make the kernel computation adaptive, how to learn manifolds or local structures, and how to benefit from ensembles and deep architectures. The rest of the topics include how to handle batch process data, how to account for process dynamics, how to monitor quality variables, how to improve detection, and how to distribute the monitoring task across the whole plant. By addressing these issues, we have seen how nonlinear process monitoring research has progressed extensively in the last 15 years, through the impact of kernel methods.
Finally, potential future directions in kernel-based process monitoring research were presented. Emerging topics on new kernel designs, handling heterogeneous data, and performing fault prognosis were deemed worthwhile to investigate. To move the field forward, we encourage more researchers to venture into this area of process monitoring. For interested readers, this article is also supplemented by publicly available MATLAB codes for SVM and kernel PCA (see Figure 3 and Ref. [315]). We hope that this article contributes to a further understanding of the role of kernel methods in process monitoring and provides new insights for researchers in the field.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in the manuscript text: