# A Novel Intelligent Method for Fault Diagnosis of Steam Turbines Based on T-SNE and XGBoost

## Abstract

## 1. Introduction

## 2. Methods

#### 2.1. Performance Indicator Extraction Based on t-SNE and K-Means

#### 2.2. Imbalanced Data Recognition Model Based on SMOTE and XGBoost

#### 2.3. Model Assessment Method

## 3. Experiments, Results and Discussion

#### 3.1. Introduction of Data Set

#### 3.2. Setting Labels for Different or Normal Faults

#### 3.3. Dealing with Data Imbalance

#### 3.4. Test Results

#### 3.5. Results and Discussion

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

Algorithm A1. t-SNE algorithm.

```python
#!/usr/bin/env python
# coding: utf-8
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Run relative to the directory that contains this script.
os.chdir(os.path.split(os.path.realpath(sys.argv[0]))[0])

# Load the monitoring data and show how many samples each label has.
df1 = pd.read_excel('D:/data/gz5.xlsx')
print(df1.label.value_counts())


def get_data(data):
    """Split a data frame into the feature matrix X and the label vector y."""
    X = data.drop(columns=['time', 'label']).values
    y = data.label.values
    n_samples, n_features = X.shape
    return X, y, n_samples, n_features


X1, y1, n_samples1, n_features1 = get_data(df1)

# Reduce the features to two dimensions with t-SNE (PCA initialization).
X_tsne = TSNE(n_components=2, init='pca', random_state=0).fit_transform(X1)


def plot_embedding(X, y, title=None):
    """Plot the min-max-scaled embedding, colouring each point by its label."""
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    plt.figure()
    plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], '.',
                 color=plt.cm.Set1(y[i] * 3 / 10.),
                 fontdict={'weight': 'bold', 'size': 9})
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)


plot_embedding(X_tsne, y1)
plt.show()

# Cluster the two-dimensional features into two groups with K-means.
estimator = KMeans(n_clusters=2)
res = estimator.fit_predict(X_tsne)
label_pred = estimator.labels_
centroids = estimator.cluster_centers_
inertia = estimator.inertia_

# Save the cluster assignment of every sample.
XA = pd.DataFrame(res)
XA.to_csv('D:/data/gz5out.csv')
```

Algorithm A2. XGBoost algorithm.

```python
#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
import xgboost as xgb
from matplotlib import pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_importance

# Load data: every column except 'label' is a feature.
# The labels must be the integers 0..5 for a six-class softprob objective.
data = pd.read_csv('D:/data/suanfa/kyq.csv')
x = data.loc[:, data.columns.difference(['label'])].values
y = data['label'].values.astype('int')
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
print(data.label.value_counts())

# Native XGBoost API. `n_estimators` belongs to the scikit-learn wrapper and
# `silent` has been replaced by `verbosity`, so both are omitted here; the
# number of boosting rounds is passed to xgb.train() as `num_boost_round`.
params = {
    'learning_rate': 0.1,
    'max_depth': 2,
    'objective': 'multi:softprob',
    'seed': 0,
    'num_class': 6,
    'eta': 0.9,  # alias of learning_rate; the original listing sets both
}
model = xgb.train(params, xgb.DMatrix(x_train, y_train), num_boost_round=10)
y_pred = model.predict(xgb.DMatrix(x_test))
yprob = np.argmax(y_pred, axis=1)  # index of the highest class probability
model.save_model('testXGboostClass.model')

# Evaluate predictions.
accuracy = accuracy_score(y_test, yprob)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
plot_importance(model)
plt.show()

# scikit-learn wrapper. The number of classes is inferred from y_train and
# the number of trees is set by `n_estimators`.
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=20,
    max_depth=2,
    random_state=0,
    objective='multi:softprob',
    eta=0.9,  # alias of learning_rate; the original listing sets both
)
xgb1.fit(x_train, y_train)
y_pred1 = xgb1.predict_proba(x_test)
yprob1 = np.argmax(y_pred1, axis=1)  # index of the highest class probability

print(confusion_matrix(y_test, yprob1))
print('Accuracy of Classifier:', xgb1.score(x_test, y_test))
print(classification_report(y_test, yprob1))
```

## Appendix B

No. | Description |
---|---|
F0 | Time stamp |
F1 | Turbine Speed |
F2 | Main Steam Pressure |
F3 | Reheat Steam Pressure |
F4 | Main Steam Temp |
F5 | Bearing Bushing 11 |
F6 | Bearing Bushing 12 |
F7 | Bearing Bushing 21 |
F8 | Bearing Bushing 22 |
F9 | Bearing Bushing 31 |
F10 | Bearing Bushing 32 |
F11 | Bearing Bushing 41 |
F12 | Bearing Bushing 42 |
F13 | Bearing Bushing 51 |
F14 | Bearing Bushing 61 |
F15 | Bearing Vibration 1X |
F16 | Bearing Vibration 1Y |
F17 | Bearing Vibration 1Z |
F18 | Bearing Vibration 2X |
F19 | Bearing Vibration 2Y |
F20 | Bearing Vibration 2Z |
F21 | Bearing Vibration 3X |
F22 | Bearing Vibration 3Y |
F23 | Bearing Vibration 3Z |
F24 | Bearing Vibration 4X |
F25 | Bearing Vibration 4Y |
F26 | Bearing Vibration 4Z |
F27 | Bearing Vibration 5X |
F28 | Bearing Vibration 5Y |
F29 | Bearing Vibration 5Z |
F30 | Bearing Vibration 6X |
F31 | Bearing Vibration 6Y |
F32 | Bearing Vibration 6Z |
F33 | Turbine Differential Expansion |
F34 | Rotor Eccentricity |

**Figure 2.** Two-dimensional features of five faults. (**a**) Two-dimensional fusion features of Fault 1. (**b**) Two-dimensional fusion features of Fault 2. (**c**) Two-dimensional fusion features of Fault 3. (**d**) Two-dimensional fusion features of Fault 4. (**e**) Two-dimensional fusion features of Fault 5.

**Figure 3.** Time series data of five faults. (**a**) Clustering results of Fault 1 based on time series. (**b**) Clustering results of Fault 2 based on time series. (**c**) Clustering results of Fault 3 based on time series. (**d**) Clustering results of Fault 4 based on time series. (**e**) Clustering results of Fault 5 based on time series.

Item | Proposed Method | Other Literature |
---|---|---|
Data set source | Operating data from an actual plant | Experimental data or numerical simulation data |
Data length | Larger (months or even years) | Smaller (hours or days) |
Fault label | Partly missing or ambiguous | Identified by the experiment |
Fault verification | Based on real faults in the plant | Based on simulated faults |
Iterative strategy for research | Determined by the actual operation of the plant | Unable to iterate |
Significance of research | Solving practical problems | Continuous improvement of research algorithms |

Data Set | Sample Size | Time Range |
---|---|---|
Steam turbine | 340,468 | January to August 2018 |

No. | Fault Discovery Time |
---|---|
1 | 3 Feb 2018 2:07 |
2 | 11 Feb 2018 6:19 |
3 | 13 Mar 2018 7:28 |
4 | 10 Jun 2018 7:44 |
5 | 7 Aug 2018 23:17 |

No. | Start Time | End Time | Advance Time (min) |
---|---|---|---|
1 | 3 Feb 2018 0:14 | 3 Feb 2018 6:45 | 113 |
2 | 10 Feb 2018 22:02 | 11 Feb 2018 16:16 | 497 |
3 | 12 Mar 2018 19:32 | 13 Mar 2018 11:10 | 716 |
4 | 9 Jun 2018 14:53 | 10 Jun 2018 17:25 | 1011 |
5 | 7 Aug 2018 12:07 | 8 Aug 2018 6:25 | 670 |
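
The minutes in the last column can be reproduced from the two tables above. The following is a minimal sketch, assuming that the advance time is the gap between the start time of the labelled interval and the fault discovery time; the names `faults` and `advance_minutes` are illustrative and do not come from the original code.

```python
from datetime import datetime

FMT = "%d %b %Y %H:%M"

# (fault no., fault discovery time, start time), copied from the two tables.
faults = [
    (1, "3 Feb 2018 2:07", "3 Feb 2018 0:14"),
    (2, "11 Feb 2018 6:19", "10 Feb 2018 22:02"),
    (3, "13 Mar 2018 7:28", "12 Mar 2018 19:32"),
    (4, "10 Jun 2018 7:44", "9 Jun 2018 14:53"),
    (5, "7 Aug 2018 23:17", "7 Aug 2018 12:07"),
]


def advance_minutes(discovered: str, start: str) -> int:
    """Minutes between the start time and the fault discovery time."""
    delta = datetime.strptime(discovered, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


advance = {no: advance_minutes(found, start) for no, found, start in faults}
print(advance)  # {1: 113, 2: 497, 3: 716, 4: 1011, 5: 670}
```

Under this assumption, all five computed values agree with the tabulated ones.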

Class | Original Data | After SMOTE |
---|---|---|
Normal | 78,513 | 78,513 |
Fault 1 | 392 | 5832 |
Fault 2 | 1095 | 16,823 |
Fault 3 | 939 | 14,402 |
Fault 4 | 1593 | 24,655 |
Fault 5 | 1099 | 16,801 |
Ratio (normal:fault) | 15:1 | 1:1 |
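
The ratio row can be checked directly from the sample counts. The sketch below assumes that the ratio compares the normal class with the sum of the five fault classes; the dictionary and function names are illustrative only.

```python
# Sample counts copied from the table above.
original = {"Normal": 78513, "Fault 1": 392, "Fault 2": 1095,
            "Fault 3": 939, "Fault 4": 1593, "Fault 5": 1099}
after_smote = {"Normal": 78513, "Fault 1": 5832, "Fault 2": 16823,
               "Fault 3": 14402, "Fault 4": 24655, "Fault 5": 16801}


def normal_to_fault_ratio(counts):
    """Ratio of normal samples to the total number of fault samples."""
    fault_total = sum(n for name, n in counts.items() if name != "Normal")
    return counts["Normal"] / fault_total


ratio_before = normal_to_fault_ratio(original)
ratio_after = normal_to_fault_ratio(after_smote)

# Factor by which each fault class grows through oversampling.
growth = {name: after_smote[name] / original[name]
          for name in original if name != "Normal"}

print(round(ratio_before, 2))  # 15.34
print(ratio_after)             # 1.0
```

With these counts the oversampled fault classes sum to exactly the size of the normal class, and each fault class grows by a factor of roughly 15. A per-class target such as `after_smote` could, for instance, be passed as the `sampling_strategy` dictionary of imbalanced-learn's `SMOTE`; the source does not state which implementation or settings were used.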

Actual Label vs. Predicted Result (%) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 97.06 | 0.08 | 1.09 | 0.67 | 0.37 | 0.73 |
1 | 0.06 | 99.94 | 0 | 0 | 0 | 0 |
2 | 1.24 | 0 | 98.76 | 0 | 0 | 0 |
3 | 2.36 | 0 | 0 | 97.64 | 0 | 0 |
4 | 0.41 | 0 | 0 | 0 | 99.59 | 0 |
5 | 0.27 | 0 | 0 | 0 | 0 | 99.72 |
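
A short check helps when reading this matrix. The sketch below assumes that rows are actual labels, columns are predicted labels, and label 0 is the normal class; the variable names are illustrative.

```python
labels = [0, 1, 2, 3, 4, 5]

# Percentages copied from the confusion matrix above (one row per label).
matrix = [
    [97.06, 0.08, 1.09, 0.67, 0.37, 0.73],
    [0.06, 99.94, 0, 0, 0, 0],
    [1.24, 0, 98.76, 0, 0, 0],
    [2.36, 0, 0, 97.64, 0, 0],
    [0.41, 0, 0, 0, 99.59, 0],
    [0.27, 0, 0, 0, 0, 99.72],
]

# Each row should sum to about 100 if the matrix is normalized per row.
row_sums = [round(sum(row), 2) for row in matrix]

# Share of samples of one fault class that are predicted as another fault.
fault_to_fault = sum(matrix[i][j]
                     for i in labels[1:] for j in labels[1:] if i != j)

# Label with the smallest diagonal entry.
worst_label = min(labels, key=lambda i: matrix[i][i])

print(row_sums)
print(fault_to_fault, worst_label)
```

Every row sums to 100% up to rounding, which is consistent with row-wise normalization. All fault-to-fault entries are zero, so every misclassification in this matrix involves label 0, and label 0 also has the smallest diagonal entry (97.06%).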

Fault Label | Precision | Recall | F1-Score |
---|---|---|---|
0 | 99.18% | 96.80% | 97.98% |
1 | 98.74% | 100.00% | 99.37% |
2 | 94.54% | 99.02% | 97.07% |
3 | 96.52% | 97.63% | 97.07% |
4 | 98.52% | 99.70% | 99.11% |
5 | 96.58% | 99.72% | 98.13% |
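
The F1-score is the harmonic mean of precision and recall, $F_1 = 2PR/(P+R)$, so the last column can be recomputed from the other two. The following is a minimal sketch using the rounded percentages from the table; the names `metrics` and `f1_from` are illustrative.

```python
# label: (precision %, recall %, tabulated F1 %), copied from the table above.
metrics = {
    0: (99.18, 96.80, 97.98),
    1: (98.74, 100.00, 99.37),
    2: (94.54, 99.02, 97.07),
    3: (96.52, 97.63, 97.07),
    4: (98.52, 99.70, 99.11),
    5: (96.58, 99.72, 98.13),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {label: round(f1_from(p, r), 2)
              for label, (p, r, _) in metrics.items()}

for label, (_, _, tabulated) in metrics.items():
    print(label, recomputed[label], tabulated)
```

Recomputing from the rounded inputs reproduces the tabulated F1-score to within 0.01 percentage points for labels 0, 1, 3, 4 and 5. For label 2 the recomputed value is about 96.73%, against 97.07% in the table; the source does not explain this difference.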

