Identification of Unknown Abnormal Conditions in Catalytic Cracking Process Based on Two-Step Clustering Analysis and Signed Directed Graph

Abstract: There are many unknown abnormal working conditions in industrial production. They are difficult to identify because few relevant samples and little experience are available in this field. To solve this problem, a new identification method combining two-step clustering analysis and signed directed graph (TSCA-SDG) is proposed. Firstly, through correlation analysis and R-type clustering analysis, the variables are effectively selected and extracted. Then, a two-step clustering analysis is carried out on the selected variables to obtain the cluster results. Through the establishment of the signed directed graph (SDG) model, the causes of abnormal working conditions and their mutual influence are deduced from the mechanism. The application of the TSCA-SDG method in the catalytic cracking process shows that this method has good performance for abnormal condition identification.


Introduction
As heavy and inferior crude oil becomes more and more common, the fluid catalytic cracking (FCC) process, one of the core processes for converting heavy oil into lighter products, has received more and more attention [1][2][3]. In China, the diesel and gasoline produced by FCC units account for about 30% and 70% of the finished diesel and gasoline, respectively [4]. With the application of distributed control systems (DCS), the control of FCC has been computerized. At the same time, with the advent of the big data era, the digitization and intelligence of the FCC process have developed widely.
The FCC process involves flammable and explosive chemicals under high-temperature, high-pressure conditions. Accidents can cause serious casualties and property losses, as well as irreversible environmental pollution. With the development of computers, industrial production has become increasingly automated, but abnormal alarms are still mostly handled by operators. Operators who lack sufficient skill and experience find it difficult to make correct judgments and act quickly when an abnormality occurs, which may lead to more serious subsequent accidents. According to industry statistics, abnormal events caused by operators accounted for about 70% of overall events [5]. Therefore, current industrial production needs more effective computer-based systems for fault detection and diagnosis. Fault diagnosis technology has developed rapidly since the 1980s [6][7][8] and is generally divided into knowledge-based, mechanism-based and data-based technologies [9]. Data-based fault diagnosis technology does not rely heavily on rich expert experience or accurate analytical models because it makes full use of the large amount of data generated during the operation of machinery and equipment. With the rapid development of industrial big data and computer technology, data-based fault diagnosis technology is more and more widely applied [10][11][12][13]. DCS also gives impetus to the application and innovation of data-driven methods in the study of chemical process failures [14].
Data-based fault diagnosis technology can be classified into qualitative and quantitative methods, and the latter can be further divided into two categories: statistical and non-statistical [15][16][17]. Cluster analysis is a statistical technique and a typical unsupervised learning technique in the field of data mining and machine learning. It can be used to explore and discover hidden patterns in data. Cluster analysis divides the sample set autonomously, based mainly on the similarity between sample points. In the clustering process, the sample points are divided into several clusters such that the similarity of sample points within the same cluster is as high as possible and the similarity between different clusters is as low as possible. At present, the commonly used clustering algorithms include Two-Step Clustering (TSC) [18], K-means [19], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [20], Gaussian Mixture Clustering (GMC) [21], Hierarchical Agglomerative Clustering (HAC) [22] and so on. DBSCAN can cluster dense datasets of any shape without biased results. However, when the density of the sample set is uneven or the clustering distances differ greatly, the clustering quality is poor. GMC describes each cluster by its mean and standard deviation and can therefore obtain elliptical clusters rather than only circular ones. HAC is a bottom-up clustering algorithm; its disadvantage is that the computational complexity is high and the efficiency is low. K-means has the advantages of simplicity, high efficiency, short running time and low space complexity for large datasets. However, when the dataset is large, the result is prone to a local optimum. Moreover, K-means needs the value of K to be set in advance and is, therefore, very sensitive to the selection of the K value [23]. TSC is a recently developed clustering method. It occupies fewer memory resources and has a fast computing speed for large datasets. TSC has an excellent clustering effect, so it is widely used in medical, nuclear engineering and other fields. In the identification of working conditions from industrial big data, TSC can accurately identify and cluster data of abnormal working conditions.
Although the above clustering methods have been developed in depth, their analysis of the specific industrial mechanism is insufficient. The signed directed graph (SDG) is one of the labeling methods for mechanism analysis. SDG is a qualitative fault identification method, which has the advantages of simple modeling and flexible reasoning. SDG is a good way to show the relationship between complex system variables and reveal the propagation path of potential hazards and failures, and it has a wide range of applications. Yang et al. summarized the background and development of the SDG method, and reviewed three modeling methods of SDG and their application in the field of safety evaluation and fault diagnosis [24]. Gao et al. proposed a semi-quantitative validation method for a simulation model based on SDG and qualitative trends, where qualitative trends were added to the SDG model and the complete testing cases were produced by positive inference. The semi-quantitative validation was carried out by comparing the testing cases with outputs of the simulation model in different scales [25]. Wu et al. determined candidate faults based on SDG backward inference from the alarm parameters. According to the candidate faults, SDG forward inference was applied to obtain candidate parameters and then identify real faults [26]. Guo et al. proposed a general framework for the translation of multi-attribute graphs. In order to discover and preserve the consistency of the generated nodes and edges, a spectral graph regularization based on a non-parametric Laplacian graph was designed [27].
This paper proposes a method for identifying unknown abnormal conditions in the catalytic cracking process by combining the two-step clustering analysis with the SDG model (TSCA-SDG). The TSCA-SDG method identifies abnormal working conditions through two-step clustering analysis, and analyzes the propagation path of abnormal working conditions by the SDG model. The outline of this paper is organized as follows. In Section 2, the framework of the TSCA-SDG method is introduced in detail, followed by the principles of two-step clustering analysis and SDG. The performance of TSCA-SDG is demonstrated by a case study in Section 3. Section 4 provides a summary of this paper.
Processes 2021, 9, 2055

Proposed Method
The TSCA-SDG method consists of four parts: (1) data preprocessing; (2) feature extraction and selection; (3) two-step clustering analysis; (4) SDG modeling. Its framework is shown in Figure 1.
In the data preprocessing part, Z-score standardization is performed on the data of the control parameter and related variables to remove the influence of magnitudes.
In the feature selection and extraction part, correlation analysis and R-type cluster analysis are carried out on the preprocessed variables. By calculating the Pearson correlation coefficient and the distance between variables, variables are selected and extracted to effectively achieve the purpose of dimensionality reduction.
In the two-step clustering analysis part, the variables after screening are clustered using the two-step clustering method. The optimal number of clusters is obtained through the Schwarz Bayesian Information Criterion for quick and effective clustering.
In the SDG model part, the SDG model is connected to the DCS system to realize process monitoring. For abnormal working condition data, the SDG model can accurately describe the fault characteristics as a consistent path through bidirectional inference. The abnormal type marked with characteristics can be output and displayed to the operator as a warning of potential abnormal occurrences.

Data Preprocessing
The dimensions of the variables are different and their magnitudes vary greatly. To make these data comparable, they are preprocessed first. Z-score standardization is used to eliminate the influence of dimension, so that the mean value of the transformed data is 0 and the standard deviation is 1, as shown in Equation (1):

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{S_j} \qquad (1)$$

where $x_{ij}$ is the $i$-th observation of variable $j$, $z_{ij}$ is its standardized value, $\bar{x}_j$ is the mean of the data and $S_j$ is the standard deviation of the data.
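The standardization step can be illustrated with a minimal Python sketch. The readings below are hypothetical, and the use of the sample standard deviation (denominator n − 1) is an assumption, since the text does not say which estimator is used.

```python
import math

def z_score(values):
    """Standardize one variable so that its mean is 0 and its standard deviation is 1."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator); this choice is an
    # assumption, the source does not state which estimator is used.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

# Hypothetical readings of one process variable (illustrative values only).
temperatures = [500.0, 502.0, 498.0, 504.0, 496.0]
standardized = z_score(temperatures)
```

After this transformation, variables measured in different units (for example temperature and pressure) can be compared on the same scale.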

Feature Extraction and Selection
In the application of actual industrial big data, some closely related variables in industrial production show low correlation due to time lag and other reasons. If only the correlation analysis is considered, some variables that are correlated in practice may be ignored. Combining expert experience, this paper proposes a feature extraction and selection method that comprehensively considers correlation analysis and R-type clustering analysis to effectively solve this problem.

Correlation Analysis
The most familiar relationship between variables is the deterministic one, in which fixing the value of one variable completely determines the values of the others. In contrast, variables may also have a non-deterministic relationship, in which, once the value of one variable is given, the value of another variable can still vary within a certain range. This non-deterministic relationship is called correlation. It must be studied with the help of statistical methods and is therefore also called statistical correlation [28].
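The Pearson correlation coefficient is used to analyze the correlation of variables, as shown in Equation (2):

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2)$$

where $n$ is the sample size, $x_i$ and $y_i$ are the values of the two variables, respectively, and $\bar{x}$ and $\bar{y}$ are their means.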
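The following Python sketch computes the Pearson correlation coefficient and also illustrates the time-lag effect mentioned earlier: two signals that are perfectly related but shifted in time can show almost no correlation until they are aligned. The sinusoidal signals, the period and the lag of 5 samples are hypothetical choices made only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical signals: y repeats x with a delay of LAG samples.
PERIOD, LAG, N = 20, 5, 200
x = [math.sin(2 * math.pi * t / PERIOD) for t in range(N)]
y = [math.sin(2 * math.pi * (t - LAG) / PERIOD) for t in range(N)]

r_unaligned = pearson_r(x, y)                # close to 0: the lag hides the link
r_aligned = pearson_r(x[:N - LAG], y[LAG:])  # close to 1 once the lag is removed
```

This is why correlation analysis alone may discard variables that are closely related in practice, and why it is combined here with R-type clustering and expert experience.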

R-Type Clustering Method
The R-type clustering method separates variables with large differences and clusters similar variables together. A few representative variables can be selected from similar variables to participate in other analyses to achieve the purpose of reducing the number of variables and dimensionality of variables.
The R-type clustering method used in this paper adopts the agglomerative approach. The process of agglomerative clustering is as follows. First, each observed individual is treated as its own class. Then, the degree of closeness between all individuals is measured by the between-groups linkage distance method, and the closest individuals are grouped into a small class to form n − 1 classes. Next, the degree of closeness between the remaining observed individuals and subclasses is measured again, and the current closest individuals and subclasses are grouped into one class. The above process repeats until all the individuals are grouped together to form the largest group [29]. The flowchart of the R-type clustering method is shown in Figure 2.
The between-groups linkage distance is the average distance between an individual and each individual in the subclass. Because it uses the information of all distances between individuals and subclasses, the between-groups linkage distance method overcomes the weakness of the nearest neighbor and farthest neighbor distances, which are easily affected by extreme values. As the agglomerative clustering progresses, the degree of closeness within the clusters gradually decreases. The n observed individuals are agglomerated into one large class in n − 1 steps.
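The agglomeration procedure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the variable names and the distance matrix are hypothetical, and the way distances between variables are obtained (for example from correlations) is left open because the text does not specify it.

```python
def average_linkage(dist, labels):
    """Agglomerative clustering with between-groups (average) linkage.

    dist is a symmetric matrix of pairwise distances; the function returns the
    merge history as (members_a, members_b, linkage_distance) tuples.
    """
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Average of all pairwise distances between the two groups.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical variables and pairwise distances (illustrative values only).
labels = ["T1", "T2", "P1", "F1"]
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.9],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.9, 0.3, 0.0],
]
merges = average_linkage(dist, labels)
for group_a, group_b, d in merges:
    print(group_a, "+", group_b, "at distance", round(d, 3))
```

With four variables the loop performs three merges, matching the n − 1 steps described above, and the linkage distance grows from one merge to the next in this example.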

TSCA Method
Cluster analysis is an important part of the data mining discipline. It finds meaningful clusters from huge, seemingly chaotic data by mining the hidden patterns behind the data. The clustering algorithm is an unsupervised algorithm because there is no need to define the class in advance. Without taking the known classification information into consideration, all classification information can be generated by the clustering algorithm.
The two-step clustering algorithm is also called the two-stage clustering algorithm. The first stage is pre-clustering, and the second stage clusters again using the results of the first stage. In the pre-clustering stage, the cluster tree growth idea of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm [30] is used to process the data points one by one. While processing data points, the clustering tree continually adds and updates a set of split leaf nodes to form many small subclusters [31]. In the second stage, agglomerative clustering is used to merge and group the pre-clustered subclusters. The optimal number of clusters is determined with the Schwarz Bayesian information criterion and the Akaike information criterion.
The Euclidean distance function is used to calculate both the degree of dissimilarity between two objects and the degree of closeness and similarity between data individuals, as shown in Equation (3):

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \qquad (3)$$

where $k = 1, 2, 3, \ldots, n$ indexes the internal characteristics of the data individuals. In the case of sufficient information, a weighting value $w_k$ is assigned to each feature to obtain the weighted Euclidean distance, as shown in Equation (4):

$$d_w(x_i, x_j) = \sqrt{\sum_{k=1}^{n} w_k (x_{ik} - x_{jk})^2} \qquad (4)$$
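As a small illustration of the two distance measures, the sketch below implements the plain and the weighted Euclidean distance in one function. The placement of the weights (multiplying each squared difference inside the square root) is the common convention and is assumed here; the points and weights are hypothetical.

```python
import math

def euclidean(a, b, weights=None):
    """Euclidean distance between two points; weighted if weights are given."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    if weights is None:
        weights = [1.0] * len(a)  # unweighted case
    if len(weights) != len(a):
        raise ValueError("one weight per feature is required")
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)))

# Hypothetical two-feature points (illustrative values only).
plain = euclidean([0.0, 0.0], [3.0, 4.0])                   # 5.0
weighted = euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 0.25])   # sqrt(9 + 4) = sqrt(13)
```

Lowering the weight of the second feature shrinks its contribution, so the weighted distance is smaller than the plain one.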
For the two-step clustering method, the optimal number of classifications is judged according to the Schwarz Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC). In the statistical analysis, the smaller the BIC and AIC values, the better the clustering effect. However, in practice, the BIC change ratio and distance measurement ratio should also be considered. The greater the BIC variation and distance measurement ratio are, the better the clustering effect becomes.

SDG Model
SDG is a qualitative analysis graph that expresses the interaction between process variables. The directed arcs between nodes help to reveal the propagation relationship between variables. The nodes in the model can be physical variables such as pressure and temperature in the system, or operating variables such as valves and controllers. The status value of a node is "+", "0" or "−", indicating that the variable is above the upper threshold, in the normal state or below the lower threshold, respectively. If the changing trends of two nodes are the same, that is, an increase of the previous node leads to an increase of the next node, the two nodes are connected by a solid arrow. If the trends of two nodes are opposite, that is, an increase of the previous node leads to a decrease of the next node, the two nodes are connected by a dashed arrow. A simple SDG model structure is shown in Figure 3. The states of M, N and P are "+", "+" and "−", respectively. The relationship between M and N is represented by a solid arrow, while the relationship between N and P is represented by a dashed arrow, meaning that an increase in M will lead to an increase in N and then a decrease in P [32].
A sample of the SDG model $\gamma = (G_0, \phi)$ is a function $\psi$ that assigns each node a state value in $\{+, 0, -\}$; $\psi(n_k)$ is the sign of node $n_k$, as shown in Equations (5)-(7) [24]:

$$\psi(n_k) = \begin{cases} 0, & \left| X_{n_k} - \bar{X}_{n_k} \right| < \varepsilon_{n_k} \\ +, & X_{n_k} - \bar{X}_{n_k} \geq \varepsilon_{n_k} \\ -, & \bar{X}_{n_k} - X_{n_k} \geq \varepsilon_{n_k} \end{cases} \qquad (5)\text{-}(7)$$

where $X_{n_k}$ represents the actual value of the variable corresponding to the node, $\bar{X}_{n_k}$ represents the normal value of the variable corresponding to the node and $\varepsilon_{n_k}$ represents the threshold value of the node $n_k$ in the normal state.
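The node sign rule and the three-node example (M, N, P) can be sketched in Python as follows. The readings, normal values and thresholds are hypothetical, and the consistency check uses the usual SDG convention that an arc is consistent when the sign of the source node multiplied by the sign of the arc equals the sign of the target node; treat it as an illustrative assumption rather than the exact inference procedure of the paper.

```python
def node_sign(value, normal, threshold):
    """Return +1, 0 or -1 for a node, following the threshold rule for node states."""
    deviation = value - normal
    if deviation >= threshold:
        return 1
    if deviation <= -threshold:
        return -1
    return 0

# Arc signs: +1 for a solid arrow (same trend), -1 for a dashed arrow (opposite trend).
EDGES = {("M", "N"): 1, ("N", "P"): -1}

def is_consistent_path(path, signs, edges):
    """Check that every arc on the path propagates the deviation consistently."""
    for src, dst in zip(path, path[1:]):
        if signs[src] == 0 or signs[dst] == 0:
            return False  # a normal node cannot carry a fault along the path
        if signs[src] * edges[(src, dst)] != signs[dst]:
            return False
    return True

# Hypothetical (value, normal value, threshold) triples for each node.
readings = {"M": (12.0, 10.0, 1.0), "N": (55.0, 50.0, 2.0), "P": (3.0, 5.0, 1.0)}
signs = {name: node_sign(*triple) for name, triple in readings.items()}
consistent = is_consistent_path(["M", "N", "P"], signs, EDGES)
```

Here M and N are high and P is low, so the path M → N → P is consistent and M is a candidate root cause of the deviation observed at P.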

Industrial Applications
To evaluate the effectiveness of TSCA-SDG, an industrial application is carried out on a catalytic cracking unit for the identification of a reaction temperature anomaly.

Process Description
Petrochemical catalytic cracking technology was developed from the thermal cracking process and can effectively improve the processing depth of crude oil and the product quality. It is the core technology with which modern refineries upgrade heavy distillates and residual oil. In recent years, with global petroleum resources in short supply, catalytic cracking has become an inevitable choice for petroleum refining companies pursuing intensive, energy-saving and environmentally friendly production.
FCC is an important profit-generating unit in the oil refining sector, which can flexibly adjust the product structure. The reaction regeneration system is the core of the catalytic cracking unit, consisting of a reaction part and a regeneration part. The reaction temperature is the main control parameter of the FCC unit and an important means to adjust the reaction depth. Increasing the reaction temperature increases the conversion rate, while the yield and quality of the products also change. The yield of dry gas and liquid hydrocarbons increases as the reaction temperature rises, although the rate of change differs between temperature ranges. The yield of gasoline and diesel first rises with the reaction temperature and reaches a maximum. Beyond that maximum, a further increase in the reaction temperature reduces the yield of gasoline and diesel because the cracked products are cracked again. Therefore, it is necessary to select an appropriate reaction temperature according to different production schemes.
The process structure of catalytic cracking is complex and its operating environment is harsh, leading to some abnormal working conditions and many unplanned shutdowns. In this paper, the catalytic cracking unit of a petrochemical enterprise is taken as an example for unknown abnormal condition identification. According to the two-year historical data of 347,520 observations in the operation, the abnormal working conditions of the reaction temperature are identified. There are 1700 variables in the whole device, and the data collection cycle is once every three minutes. The collection time for different parts of the process is the same. Abnormal condition identification has far-reaching significance for the long-term stable operation of the device and the improvement of economic benefits.
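As a quick consistency check on these figures (assuming uninterrupted sampling, which the text does not state explicitly), a three-minute collection cycle gives

$$\frac{60\ \text{min/h}}{3\ \text{min}} \times 24\ \text{h/day} = 480\ \text{observations per day}, \qquad \frac{347{,}520}{480} = 724\ \text{days} \approx 2\ \text{years},$$

which agrees with the stated two-year span of the historical data.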
The flowchart of the catalytic cracking unit is shown in Figure 4. The process consists of three main parts: the reaction regeneration system, the fractionation system and the absorption stabilization system. The reaction regeneration system mainly includes reactor R-101 and regenerator R-102. The fresh oil is mixed with the refining slurry and, after heat exchange, enters the riser (lift tube) reactor for reaction, where FEED1 indicates the vapor extraction steam. The reaction products enter the fractionation system, which includes feedstock buffer tank D-101, fractionation tower T-101 and diesel vapor extraction tower T-102. Reaction oil and gas flow through the fractionation tower from the bottom to the top. The product at the top of the tower is rich gas and crude gasoline, while the product at the bottom of the tower is oil slurry. OUT1 indicates the side line product, light diesel. The absorption and stabilization system mainly consists of absorption tower T-103, reabsorption tower T-106, desorption tower T-104 and stabilization tower T-105. The rich gas and crude gasoline from the fractionation system are separated into liquefied gas OUT2, stabilized gasoline OUT3 and dry gas OUT4 by absorption stabilization, while the rich absorbed oil OUT5 is returned to the fractionation tower.

Data Preprocessing
There are 1700 variables in the whole catalytic cracking process, and analyzing all of them would greatly and unnecessarily increase the difficulty of data analysis. This paper therefore only considers variables related to the reaction temperature. Through communication with field experts, combined with actual work experience, process knowledge and mechanism analysis, 17 variables were selected. These 17 variables were then standardized by Z-score standardization, which eliminates the impact of different dimensions: the mean value of the converted data is 0 and the standard deviation is 1.

Feature Extraction and Selection of the Reaction Temperature
The 17 variables are shown in Table 1, which also lists the Pearson correlation coefficients between these variables and the reaction temperature. The variables are clustered by the R-type method using the between-groups linkage distance, and the clustering results are shown in Figure 5.
Together, the correlation analysis in Table 1 and the R-type clustering of the variables in Figure 5 intuitively reflect the correlation between each variable and the reaction temperature. The ordinate in Figure 5 gives the order of clustering among the variables, and the abscissa shows how the variables are grouped into classes at different distances. After feature selection and extraction, 11 highly correlated variables remain, as shown in Table 2, so the dimensionality of the data is effectively reduced.

Two-Step Cluster Analysis
The 11 variables obtained were clustered by the two-step clustering method. The BIC automatic clustering results and the AIC automatic clustering results are shown in Tables  3 and 4, respectively. In the rows where the number of clusters in Tables 3 and 4 is 2, the position circled in red box shows that the BIC and AIC values are greatly reduced, and the mutation rate is the smallest and the distance measurement ratio is the largest. The clustering result with good clustering quality is obtained, so the number of clusters is determined to be 2.   The results of two-step clustering are shown in Table 5. The cluster classes in the twostep clustering results are the first and second classes. The numerical characteristics of class 2 fluctuate in the normal range. Class 2 is, thus, the normal working condition, with 321,856 observations clustered into this class. Compared with the normal value, the numerical characteristics of class 1 have a large fluctuation range. Class 1 is, thus, the abnormal working condition, with 25,664 observations. After cluster analysis, there are data on 25,664 abnormal working conditions. The standardized data of the 11 variables and their two-step clustering results are shown in Figure 6. It can be clearly seen that the two-step clustering has obtained good clustering results. Observation with large fluctuations caused by meter damage, meter calibration, shutdown of the device, etc., are grouped into abnormal working condition class 1.  The results of two-step clustering are shown in Table 5. The cluster classes in the twostep clustering results are the first and second classes. The numerical characteristics of class 2 fluctuate in the normal range. Class 2 is, thus, the normal working condition, with 321,856 observations clustered into this class. Compared with the normal value, the numerical characteristics of class 1 have a large fluctuation range. Class 1 is, thus, the abnormal working condition, with 25,664 observations. 
After cluster analysis, there are data on 25,664 abnormal working conditions. The standardized data of the 11 variables and their two-step clustering results are shown in Figure 6. It can be clearly seen that the two-step clustering has obtained good clustering results. Observation with large fluctuations caused by meter damage, meter calibration, shutdown of the device, etc., are grouped into abnormal working condition class 1.
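The choice of the number of clusters from the information criterion can be sketched as follows. This is a simplified illustration that uses only the "greatly reduced" criterion (the largest drop in BIC between consecutive cluster counts); it does not reproduce the mutation rate or the distance measurement ratio, and the BIC values are hypothetical, not those of Table 3.

```python
def bic_changes(bic_by_k):
    """Change in BIC when moving from k-1 to k clusters (negative = improvement)."""
    ks = sorted(bic_by_k)
    return {k: bic_by_k[k] - bic_by_k[prev] for prev, k in zip(ks, ks[1:])}

def pick_k_by_largest_drop(bic_by_k):
    """Return the cluster count at which BIC falls the most."""
    changes = bic_changes(bic_by_k)
    return min(changes, key=changes.get)

# Hypothetical BIC values for 1 to 5 clusters (illustration only).
toy_bic = {1: 1000.0, 2: 600.0, 3: 560.0, 4: 550.0, 5: 548.0}

best_k = pick_k_by_largest_drop(toy_bic)
print(bic_changes(toy_bic), best_k)
```

The same function can be applied to AIC values, since only the differences between consecutive cluster counts are used.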

Figure 6. Variables after standardization and two-step clustering results.

Comparison with K-Means Clustering Method
The K-means clustering algorithm uses distance as the measure of similarity: the closer two data points are, the more similar they are considered to be. Clusters are composed of data points that lie close together, and the ultimate goal of clustering is to obtain compact and independent clusters.
K-means is an iterative clustering algorithm. Its calculation steps are shown in Figure 7.
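A minimal sketch of the standard K-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members) is given below. It is a generic version that may differ in detail from the steps of Figure 7; the function name, the fixed initial centroids and the toy points are assumptions for illustration.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain K-means on points given as tuples; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[j])))
            for point in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Toy standardized observations (hypothetical): three near the origin,
# two far away from it.
toy_points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (3.0, 2.9), (3.2, 3.1)]

# K = 2, with the first and fourth points as assumed initial centroids.
labels, centers = kmeans(toy_points, [toy_points[0], toy_points[3]])
print(labels, centers)
```

Because K-means relies only on distances to cluster centers, its result depends on the initial centroids and on the scale of the variables, which is why the data are standardized before clustering.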

To make the results of the two clustering methods comparable, the value of K in K-means clustering is set to 2. The K-means clustering results are shown in Table 6. Class 2 is the normal working condition, with 322,913 observations, and class 1 is the abnormal working condition, with 24,607 observations. The comparison between the two-step clustering results and the K-means clustering results is shown in Figure 8.
In Figure 8, the reaction temperature values circled in red fluctuate greatly. The two-step clustering method identifies these observations and classifies them as an abnormal condition, while the K-means clustering method does not. In this case, the two-step clustering method used in this paper therefore performs better than the K-means clustering method.
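The class sizes reported for the two methods can be checked with a short calculation. The counts are those given in the text (Tables 5 and 6) and the total of 347,520 observations is the figure stated in the Conclusions; the variable names are illustrative. The difference in counts alone does not show which individual observations are labeled differently.

```python
total = 347_520  # number of observations stated in the Conclusions

two_step = {"normal": 321_856, "abnormal": 25_664}  # counts from the text
k_means = {"normal": 322_913, "abnormal": 24_607}   # counts from the text

# Both partitions must cover every observation exactly once.
assert sum(two_step.values()) == total
assert sum(k_means.values()) == total

share_two_step = two_step["abnormal"] / total
share_k_means = k_means["abnormal"] / total
extra_abnormal = two_step["abnormal"] - k_means["abnormal"]

print(f"two-step abnormal share: {share_two_step:.2%}")
print(f"K-means abnormal share:  {share_k_means:.2%}")
print(f"additional observations flagged by two-step: {extra_abnormal}")
```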

Establishment of the SDG Model
In actual engineering applications, it is difficult to obtain the algebraic and differential equations relating the parameters of large and complex devices and equipment, so an SDG built on expert experience is more effective. However, establishing an SDG model from expert experience alone has certain limitations and may fail to describe the system adequately. At the same time, an SDG model established by mathematical analysis alone cannot specifically analyze the relationships between the variables. In this paper, feature selection and extraction are carried out through correlation analysis and R-type clustering analysis, which plays a key role in reducing the dimensionality of the data. The Pearson correlation coefficients between variables reflect their correlation effectively and intuitively. By combining expert experience with mathematical analysis, the SDG model of the reaction temperature is established. The Pearson correlation coefficients among the variables are shown in Table 7. The relationships among the nodes of the variables related to the reaction temperature are analyzed using both the process mechanism and process data.
As shown in Figure 9, the listed influence relationships are combined into a complete SDG model. The nodes are connected by dotted and solid arrows, which indicate negative and positive correlations between the nodes, respectively.
Figure 9. SDG of the reaction temperature.

Abnormal Identification
In the SDG model, bidirectional inference is used to find the consistent paths and all possible cause nodes. First, the states of all nodes are detected and the abnormal nodes that exceed their thresholds are found. The transient states of the abnormal nodes are shown in Table 8. Then the related nodes are searched backward from the alarm node T1. Five compatible paths with different cause nodes are obtained, including:
• T1←L1←P3←P2←F3←F1
• T1←F6
The first path is from V1 to T1, and all correlations along it are positive: an increase in the valve position of the regenerated catalyst slide valve increases the reaction temperature. The second path, from T2 to T1, shows the influence of T2 on T1: an increase in the preheating temperature of the raw materials also increases the reaction temperature. In the third and fourth paths, an increase of F1 increases F2 and F3, and then T1. In the fifth path, F6 is negatively correlated with T1, so an increase of F6 leads to a decrease of T1. The SDG model under abnormal conditions is shown in Figure 10. In short, abnormalities of nodes V1, F1, T2 and F6 cause the reaction temperature T1 to fluctuate.
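The backward search from the alarm node can be sketched on a small signed graph. The sketch below contains only the two paths listed above, not the full SDG of Figure 9. The negative sign of the F6 edge follows the text, while the positive signs on the individual edges of the F1 chain, the node states and the function names are assumptions made for illustration.

```python
# effect -> list of (cause, sign); +1 = positive (solid arrow),
# -1 = negative (dotted arrow). Edge signs on the F1 chain are assumed.
SDG = {
    "T1": [("L1", +1), ("F6", -1)],
    "L1": [("P3", +1)],
    "P3": [("P2", +1)],
    "P2": [("F3", +1)],
    "F3": [("F1", +1)],
}

def backward_paths(graph, alarm):
    """Enumerate all paths from the alarm node back to root cause nodes."""
    paths = []

    def walk(node, path):
        causes = graph.get(node, [])
        if not causes:  # no incoming edge: candidate root cause
            paths.append(path)
            return
        for cause, _sign in causes:
            if cause not in path:  # guard against cycles
                walk(cause, path + [cause])

    walk(alarm, [alarm])
    return paths

def is_consistent(graph, path, states):
    """True if every edge on the path agrees with the observed deviations."""
    for effect, cause in zip(path, path[1:]):
        sign = dict(graph[effect])[cause]
        effect_state = states.get(effect, 0)
        if effect_state == 0 or states.get(cause, 0) * sign != effect_state:
            return False
    return True

# Hypothetical node states: +1 = above threshold, -1 = below, 0 = normal.
states = {"T1": +1, "L1": +1, "P3": +1, "P2": +1, "F3": +1, "F1": +1, "F6": +1}

all_paths = backward_paths(SDG, "T1")
consistent = [p for p in all_paths if is_consistent(SDG, p, states)]
for p in consistent:
    print("←".join(p))
```

With these assumed states, a high F6 cannot explain a high T1 because the edge between them is negative, so only the path that ends at F1 is reported as consistent.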

Conclusions
A new TSCA-SDG method is proposed to detect and identify unknown abnormal working conditions in the catalytic cracking process. Through correlation analysis and R-type clustering analysis, 11 variables are selected, such as the feed quantity, the preheating temperature of the raw materials and the valve position of the regenerated catalyst slide valve. Two-step cluster analysis is performed on 347,520 observations of the 11 variables and yields two classes, one for normal working conditions and the other for abnormal working conditions. The K-means clustering method is used to further verify the two-step clustering method. The SDG model accurately describes the characteristics of abnormal working conditions through the information propagation paths between nodes with alarm thresholds. By combining cluster analysis with SDG, data dimensionality reduction and feature selection and extraction are carried out effectively, and abnormal working conditions are identified quickly. Because it draws on mechanism analysis, the identification of unknown abnormal working conditions in catalytic cracking is better and more accurate than identification based on experience alone. At present, there is much research on the identification of known working conditions but little on the identification of unknown working conditions, although the latter is urgently needed because many abnormal working conditions occur in industrial production. The TSCA-SDG method proposed in this paper addresses this problem meaningfully. The quality of the clustering algorithm limits the identification of abnormal conditions, so its further development will promote more in-depth research.