## 1. Introduction

Due to its comfort, speed, energy efficiency and environmental friendliness, the high-speed railway carries an increasing share of passengers and cargo, which places ever higher requirements on the safety of the railway system. According to the 2019 China Railway Statistical Bulletin [1], the number of railway passengers sent across China in 2019 was 3.66 billion, an increase of 8.4% over the previous year. Moreover, according to a report of the European Commission, passenger and cargo transportation on European railways has, respectively, doubled and tripled in the past decade [2].

The turnout, which is usually installed between two or more strands of track, is responsible for switching the train direction and is consequently a key part of the whole signal system in railway infrastructure [3,4,5].

However, due to its complicated mechanical and electrical structure, its exposure to the outdoor environment and the need to be operated frequently, the turnout is more prone to failure. Based on the statistics given by China Railway Jinan Group Co., Ltd. in 2016 [6] and China Railway Guangzhou Group Co., Ltd. in 2018 [7], turnout failures account for approximately 36% and 55%, respectively, of the total railway signal infrastructure failures. It is clear that turnout system damage will increase considerably along with the growth in railway transportation demand.

In addition, the cost of turnout maintenance is very high. In England, for example, the maintenance cost of turnouts is about 3.4 million pounds per 1000 km of railway [2]. According to a report of the International Union of Railways, the annual maintenance cost of the turnout system accounts for about 30% of the entire maintenance budget [8], and in the United States, the cost is about ten times that of ordinary tracks [9]. Therefore, research on turnout fault-diagnosis methods is very important not only for improving railway reliability and safety, but also for reducing maintenance costs.

The fault-diagnosis methods proposed so far can be divided into three categories: expert system-based, model-based and feature-based.

Xue et al. [10,11,12,13] designed and implemented a fault diagnosis and analysis system for railway signal equipment by establishing an expert knowledge base of railway signal equipment. This method overcomes the overreliance on mathematical models and can effectively identify faults. However, applying an expert system requires sufficient and comprehensive expert knowledge and experience of fault diagnosis, which is difficult to obtain. In addition, it is also difficult to generalize and summarize the experience of experts into the knowledge base [14].

Eker, O. F. et al. [15,16] established a mathematical model of the turnout to achieve fault recognition and prediction of the turnout operating state. This method does not require a large number of samples; in practice, however, it is very difficult to establish an accurate mathematical model for complex equipment.

For the feature-based diagnosis method, Witczak, Marcin et al. [17] proposed a turnout fault-diagnosis method based on a neural network. They determined the input parameters and designed a variable threshold to adapt to new faults. Silmon, J. A. et al. [18] used qualitative trend analysis to extract the data trends of normal and fault samples and established a rule-matching mechanism for different fault types.

However, these methods are more costly when dealing with new types of faults. Therefore, incremental learning is introduced in this paper. In other engineering areas, some researchers have proposed different classification (fault-diagnosis) methods based on incremental learning. Generally, these methods can be divided into two kinds: hierarchy-based methods and improved models [19].

For hierarchy-based methods, a particular learning hierarchy is designed for incremental learning. Nong Ye et al. [20] discussed scalable and incremental learning of a non-hierarchical cluster structure from training data. This cluster structure serves as a function mapping the attribute values of new data to the target class of these data, that is, classifying new data. Jun Wu [21] proposed a new online semantic classification framework, in which two sets of optimized classification models, local and global, are trained online by sufficiently exploiting both local and global statistical characteristics of videos.

For improved models, an original model is modified to adapt to the incremental learning scene. Max Kochurov et al. [22] presented an improved Bayesian method to update the posterior approximation for each batch of new data so as to reduce the cost of training a deep learning network. David A. Ross [23] proposed an incremental learning target-tracking method based on principal component analysis, which effectively adapts to changes in target appearance. Marko Ristin et al. [24] introduced two improved random forest (RF) algorithms, one based on k-means and the other based on SVM, which avoid retraining RFs from scratch when merging new classes. Stefan et al. [25] proposed an incremental learning SVM method, which inputs the data into the algorithm in several batches and produces initial results after each step of training. In addition, many other neural network-based methods have been proposed [26,27].

Some researchers have also proposed other incremental methods for classification problems [28,29,30]. However, many of these methods cannot be well combined with the field experience of workers (especially for the turnout system), and some of them cannot update themselves, which may reduce the efficiency of the model when new samples or fault types appear, since the training samples are quite limited from the very beginning.

Our research group has long been engaged in the safety of the railway turnout system, from turnout fault analysis [31] and simulated fault data generation [32] to fault-diagnosis methods [33,34,35]. On this basis, this paper proposes a Bayes-based online turnout fault-diagnosis method with high accuracy, adaptability and efficiency, which realizes incremental learning and scalable fault recognition; i.e., the model can update itself and deal with new turnout fault types. This makes it more applicable to fieldwork and, ultimately, of great significance for labor saving, timely maintenance and, further, the safety and efficiency of railway transportation.

The remainder of this paper is organized as follows: Section 2 introduces the basic concepts of the turnout system, including its structure, fault types and monitoring data. The methods of data processing and modeling are explained in Section 3, including feature processing, imbalanced data preprocessing, the incremental learning method and scalable fault recognition. Section 4 presents a numerical experiment using monitoring data from Guangzhou Railway, followed by conclusions in Section 5. The research framework of this paper is shown in Figure 1.

## 3. Online Learning and Diagnosis Method

#### 3.1. Knowledge Graph-Based Feature Engineering

Feature engineering, which to a large extent determines the upper limit of machine learning, is of great importance for obtaining valuable input features for the online diagnosis model of this paper. Therefore, the first step is feature extraction from the original data. The feature extraction method for the monitoring data of railway turnouts has been introduced in our previous research [33]; in this paper, the resulting 21 features (e.g., maximum value, mean value, fluctuation factor, peak factor, etc.) are stored in the knowledge graph.
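As an illustration, a few of the statistical features named above can be computed from one phase of a turnout power curve as sketched below. This is not the full 21-feature set of [33]; the exact definitions of the fluctuation factor and peak factor here are illustrative assumptions.

```python
import math

def extract_features(curve):
    """curve: list of power readings for one operation phase."""
    n = len(curve)
    mean = sum(curve) / n
    peak = max(curve)
    # Root-mean-square level of the curve.
    rms = math.sqrt(sum(x * x for x in curve) / n)
    return {
        "max": peak,
        "mean": mean,
        "peak_factor": peak / rms,                        # peak relative to RMS (assumed definition)
        "fluctuation": (peak - min(curve)) / mean,        # relative swing (assumed definition)
    }

feats = extract_features([1.0, 1.2, 1.1, 2.0, 1.3])
```

In practice each monitored curve would be reduced to such a feature vector before being stored in the knowledge graph or fed to the classifier.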

The knowledge graph is a theoretical method to visualize the core structure or knowledge framework of a discipline or research field, and its representation may vary from one application to another. In this paper, the entity-relationship model is used to show the structure of the knowledge graph of the turnout monitoring data, where the basic elements are entities (squares), attributes (ellipses) and relationships (lines), as shown in Figure 5.

In reality, turnout systems are well maintained, and the fault samples are quite rare. According to the data from Guangzhou Railway, the abnormal samples account for less than 0.1% of the total. Therefore, the knowledge graph of normal samples and abnormal samples should be arranged in different ways.

As shown in Figure 5, for normal conditions, it is not necessary to put every single sample into the knowledge graph. Instead, the changing trend of the sample features is the most important characteristic. Therefore, the knowledge graph records the center, i.e., the most representative sample of all the normal samples. The other normal samples are only recorded as feature ranges, e.g., mean, maximum, etc. For abnormal conditions, the samples are divided into different fault types. For each fault type, all samples are recorded in the knowledge graph, including features, figures, data and information.
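A minimal sketch (not the paper's implementation) of how this asymmetric organization might look in code is given below: normal conditions keep only a representative center and per-feature ranges, while every abnormal sample is stored in full, grouped by fault type. All feature names and values are illustrative.

```python
def build_knowledge_graph(normal_samples, fault_samples_by_type):
    """normal_samples: list of dicts {feature_name: value};
    fault_samples_by_type: {fault_type: [sample dict, ...]}."""
    feature_names = normal_samples[0].keys()
    # Normal samples are summarized as per-feature ranges only.
    ranges = {}
    for f in feature_names:
        vals = [s[f] for s in normal_samples]
        ranges[f] = {"min": min(vals), "max": max(vals),
                     "mean": sum(vals) / len(vals)}
    return {
        # The center would be chosen by K-medoids; the first sample is a placeholder here.
        "normal": {"center": normal_samples[0], "feature_ranges": ranges},
        # Every abnormal sample is kept in full, grouped by fault type.
        "faults": {ftype: list(samples)
                   for ftype, samples in fault_samples_by_type.items()},
    }

normal = [{"max_power": 1.2, "mean_power": 0.8},
          {"max_power": 1.4, "mean_power": 0.9}]
faults = {"jammed_point": [{"max_power": 2.9, "mean_power": 2.1}]}
kg = build_knowledge_graph(normal, faults)
```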

As mentioned above, it is very important to choose the center of the normal samples and record it in the knowledge graph. After the feature set is formed, the next step is to choose the most representative sample, i.e., the center. To solve this problem, the K-medoids method (an adaptation of the K-means algorithm) [35], which requires the clustering center to be one of the samples rather than creating a new clustering center, is applied in this paper. The steps of the K-medoids algorithm are as follows:

1. Take the input feature set ${F}_{m\times n}$ ($n$ is the feature dimension) and choose any row as the original center ${F}^{0}$;
2. Define the cluster evaluation variable, the sum of absolute differences (SAD), of a candidate center ${F}^{k}$: $SAD({F}^{k})={\sum }_{i=1}^{m}{\sum }_{j=1}^{n}\left|{F}_{ij}-{F}_{j}^{k}\right|$;
3. Choose any sample except ${F}^{0}$ as ${F}^{k}$ and calculate the corresponding SAD;
4. If the SAD of ${F}^{k}$ is smaller than the SAD of ${F}^{0}$, update ${F}^{k}$ to be the new center;
5. When all the samples have been searched, the SAD of the current center reaches the minimum and the iteration is finished; otherwise, return to the third step.
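The center search above can be sketched for a single cluster as follows: the medoid is simply the sample that minimizes the SAD to all other samples. This exhaustive single-cluster variant is a sketch based on the steps listed in the text, not the paper's exact implementation.

```python
import numpy as np

def find_center(F):
    """F: (m, n) feature matrix. Returns the row index of the medoid."""
    best_idx, best_sad = 0, float("inf")
    for k in range(len(F)):
        # SAD of candidate center F[k] to every sample (sum over rows and columns).
        sad = np.abs(F - F[k]).sum()
        if sad < best_sad:
            best_idx, best_sad = k, sad
    return best_idx

F = np.array([[0.0, 0.0], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])
center = find_center(F)  # the sample with minimum SAD to all others
```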

#### 3.2. Imbalanced Data Preprocessing

Before incremental learning modeling, the imbalance problem of the dataset needs to be solved. In reality, fault samples are quite rare, which may hurt the classification accuracy. The class imbalance problem means that there is a class in the data set whose sample number is far larger or smaller than that of the others, which often leads to the failure of some machine learning models. In this study, SMOTE (synthetic minority oversampling technique) [41], an improved method based on random oversampling, is applied to deal with the imbalanced data.

SMOTE produces new samples for the minority class by selecting k-nearest neighbors and interpolating between them with random weights. The detailed process is as follows:

1. For each sample ${s}_{1}$ in the minority class, the Euclidean distance is taken as the standard to calculate its distance to all samples in the class so as to find its k-nearest neighbors;
2. A sampling ratio $N$ is set according to the sample imbalance ratio; for each minority sample ${s}_{1}$, several samples are randomly selected from its k-nearest neighbors, assuming that the chosen nearest neighbor is ${s}_{2}$;
3. For each chosen neighbor ${s}_{2}$, a new sample ${s}_{new}$ is constructed from the original sample ${s}_{1}$ according to the following formula:

${s}_{new}={s}_{1}+\mathrm{rand}(0,1)\times ({s}_{2}-{s}_{1})$
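The interpolation step above can be sketched as follows for numeric feature vectors. This is a minimal illustration; in practice a library implementation such as imbalanced-learn's SMOTE would normally be used.

```python
import numpy as np

def smote_oversample(minority, n_new, k=3, rng=None):
    """minority: (m, n) array of minority-class samples.
    Returns n_new synthetic samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = len(minority)
    new_samples = []
    for _ in range(n_new):
        s1 = minority[rng.integers(m)]
        # k-nearest neighbours of s1 within the minority class (excluding itself).
        dists = np.linalg.norm(minority - s1, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        s2 = minority[rng.choice(neighbours)]
        # s_new = s1 + rand(0,1) * (s2 - s1): a random point on the segment s1-s2.
        new_samples.append(s1 + rng.random() * (s2 - s1))
    return np.array(new_samples)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_oversample(minority, n_new=4)
```

Because each synthetic sample lies on a segment between two minority samples, the oversampled class stays inside the convex hull of the original minority data.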

#### 3.3. Bayesian Incremental Learning

The naïve Bayesian (NB) classification model is a pattern recognition method based on the Bayesian principle which combines the class prior probability and the conditional probability of each feature attribute to calculate the posterior probability.

Presume that the feature set $T$ consists of $\{{t}_{1},{t}_{2},\dots ,{t}_{m}\}$ and the $m$ feature attributes are independent of each other. The class variable is represented as $({c}_{1},{c}_{2},\dots ,{c}_{n})$. A single sample is represented as $<T,c>$.

According to the Bayesian principle, when the prior probability $P({c}_{i})$, the conditional probability $P(T|{c}_{i})$ and the marginal probability $P(T)$ are known, the posterior probabilities $P({c}_{1}|T),\text{}P({c}_{2}|T),\text{}\dots ,\text{}P({c}_{n}|T)$ corresponding to the $n$ classes can be calculated. The largest posterior probability indicates the class of the sample:

$P({c}_{i}|T)=\frac{P({c}_{i})P(T|{c}_{i})}{P(T)}$

The Laplace estimation is applied to obtain the prior probability and the conditional probability of each feature attribute:

$P({c}_{i})=\frac{\left|{D}_{{c}_{i}}\right|+1}{\left|D\right|+\left|C\right|},\phantom{\rule{2em}{0ex}}P({t}_{k}|{c}_{i})=\frac{\left|{D}_{{t}_{k}|{c}_{i}}\right|+1}{\left|{D}_{{c}_{i}}\right|+\left|{B}_{k}\right|}$

where $\left|{D}_{{c}_{i}}\right|$ is the number of samples of class ${c}_{i}$ in the training data, $\left|C\right|$ denotes the number of classes, $\left|D\right|$ denotes the number of training samples, $\left|{D}_{{t}_{k}|{c}_{i}}\right|$ is the number of training samples which have the attribute ${t}_{k}$ and belong to the class ${c}_{i}$, and $\left|{B}_{k}\right|$ is the number of possible values of the feature attribute $k$. Since the features are independent of each other, the conditional probability can be calculated as:

$P(T|{c}_{i})={\prod }_{k=1}^{m}P({t}_{k}|{c}_{i})$

The naïve Bayesian classifier has strong mathematical logic and high classification accuracy. However, the traditional naïve Bayesian classifier has two weaknesses. First, the classification accuracy of the classifier is closely related to the size and integrity of the training data set, which, in general, is difficult to guarantee. Second, when it is applied to turnout fault diagnosis, with the sample size gradually increasing, the class prior probability and the conditional probability of each characteristic attribute must be recalculated for each additional batch of sample data, which is very time consuming and cannot meet the real-time requirement of online fault diagnosis.
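The counting-based training described above can be sketched as follows for discrete feature attributes, using the Laplace-smoothed quantities $|D|$, $|C|$, $|{D}_{{c}_{i}}|$, $|{D}_{{t}_{k}|{c}_{i}}|$ and $|{B}_{k}|$. The feature values ("high", "low") are illustrative discretized levels, not the paper's actual features.

```python
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (features_tuple, class_label). Returns a classifier."""
    D = len(samples)                               # |D|
    class_counts = Counter(c for _, c in samples)  # |D_ci|
    C = len(class_counts)                          # |C|
    n_feat = len(samples[0][0])
    cond = [defaultdict(Counter) for _ in range(n_feat)]  # |D_tk|ci| counts
    values = [set() for _ in range(n_feat)]               # value sets, |B_k|
    for feats, c in samples:
        for k, t in enumerate(feats):
            cond[k][c][t] += 1
            values[k].add(t)

    # Laplace-smoothed prior: P(ci) = (|D_ci| + 1) / (|D| + |C|)
    prior = {c: (class_counts[c] + 1) / (D + C) for c in class_counts}

    def cond_prob(k, t, c):
        # P(tk|ci) = (|D_tk|ci| + 1) / (|D_ci| + |B_k|)
        return (cond[k][c][t] + 1) / (class_counts[c] + len(values[k]))

    def classify(feats):
        # Posterior up to the marginal P(T): prior times product of conditionals.
        scores = {c: prior[c] for c in prior}
        for c in scores:
            for k, t in enumerate(feats):
                scores[c] *= cond_prob(k, t, c)
        return max(scores, key=scores.get)

    return classify

data = [(("high", "low"), "fault_A"), (("high", "low"), "fault_A"),
        (("low", "high"), "normal"), (("low", "high"), "normal")]
classify = train_nb(data)
```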

However, combining the Bayesian algorithm with incremental learning can solve these problems well while maintaining its inherent strong logic and high diagnostic accuracy. When a fault occurs, the newly added sample is input into the Bayesian classification model to obtain the diagnosis result. If the result is reliable, this new sample can be used to update the model, which ultimately enlarges the sample data set and increases the accuracy of the model.

For the new sample, when the judgment conditions are met, it can be used to update the model, i.e., to update the class prior probability and the conditional probability of each characteristic attribute, which are defined as follows:

${P}^{*}({c}_{i})=\frac{\delta P({c}_{i})+I({c}^{\prime}={c}_{i})}{\delta +1},\phantom{\rule{2em}{0ex}}{P}^{*}({t}_{k}|{c}_{i})=\frac{\gamma P({t}_{k}|{c}_{i})+I({c}^{\prime}={c}_{i}\wedge {t}_{k}\in {T}^{\prime})}{\gamma +I({c}^{\prime}={c}_{i})}$

where ${c}^{\prime}$ and ${T}^{\prime}$ are the class and feature set of the new sample and $I(\cdot )$ is the indicator function.

where ${P}^{*}({c}_{i})$ is the updated class prior probability and ${P}^{*}({t}_{k}|{c}_{i})$ is the updated conditional probability. To be specific, $\delta =\left|C\right|+\left|D\right|$ and $\gamma =\left|{B}_{k}\right|+\left|{D}_{{c}_{i}}\right|$.

As for the judgment conditions, the posterior probability of the fault label ${c}_{i}{}^{\prime}$ must satisfy the following condition:

The overall incremental fault-diagnosis process is shown in Figure 6.
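Because both the prior and the conditional probabilities are ratios of counts, the incremental update of Section 3.3 reduces to incrementing a few counters per confidently diagnosed sample, with no retraining. The sketch below illustrates this count-based view; class and value names are illustrative.

```python
class IncrementalNB:
    """Count-based naive Bayes that absorbs new samples one at a time."""

    def __init__(self, n_classes, n_values_per_feature):
        self.C = n_classes                 # |C|
        self.D = 0                         # |D|
        self.class_counts = {}             # |D_ci|
        self.B = n_values_per_feature      # |B_k| for each feature position k
        self.cond_counts = {}              # |D_tk|ci|

    def update(self, feats, c):
        """Absorb one new sample <feats, c> by pure count updates."""
        self.D += 1
        self.class_counts[c] = self.class_counts.get(c, 0) + 1
        for k, t in enumerate(feats):
            self.cond_counts[(k, t, c)] = self.cond_counts.get((k, t, c), 0) + 1

    def prior(self, c):
        # P(ci) = (|D_ci| + 1) / (|D| + |C|), Laplace-smoothed.
        return (self.class_counts.get(c, 0) + 1) / (self.D + self.C)

    def cond(self, k, t, c):
        # P(tk|ci) = (|D_tk|ci| + 1) / (|D_ci| + |B_k|)
        return (self.cond_counts.get((k, t, c), 0) + 1) / (
            self.class_counts.get(c, 0) + self.B[k])

model = IncrementalNB(n_classes=2, n_values_per_feature=[2, 2])
model.update(("high", "low"), "fault_A")
p_before = model.prior("fault_A")
model.update(("high", "low"), "fault_A")  # one more confident diagnosis
p_after = model.prior("fault_A")          # prior for fault_A increases
```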

#### 3.4. Scalable Fault Recognition

Though the incremental learning method has been widely used in different fields and can reduce the training cost, a key problem that cannot be ignored is the recognition of new classes. Since the types of faults may increase along with the mechanical changes of the turnouts, new fault types are quite common in reality. Therefore, finding a method to recognize new faults and realize the automatic update of the knowledge graph is very important for field application.

The scalable fault recognition (SFR) problem can be abstracted as a novelty detection problem. A novelty detection model fits a rough boundary to define the contour of the initial samples, i.e., the normal samples. When a new sample appears, it is judged whether it lies within this boundary.

Local outlier factor (LOF) [42], an efficient novelty detection method, aims to detect abnormal data that are quite different from the characteristic attributes of normal data. One of the assumptions of LOF is that the normal data must be clustered, which puts forward certain requirements on the original data set. The turnout data samples, which were clustered according to the aforementioned data processing methods, basically meet this requirement.

The principle of LOF is to calculate the density around a certain point and to judge whether the point is in a sparse region according to the density of the data clusters of the dataset itself. A point in a sparse region is recognized as a novelty point. Due to its "local" characteristics, LOF can handle clusters with different densities correctly. The local outlier factor $LO{F}_{k}(p)$ is used to measure the novelty degree of a point $p$:

$r{d}_{k}(p,o)=\mathrm{max}\left({d}_{k}(o),d(p,o)\right),\phantom{\rule{1em}{0ex}}lr{d}_{k}(p)={\left(\frac{{\sum }_{o\in {N}_{k}(p)}r{d}_{k}(p,o)}{\left|{N}_{k}(p)\right|}\right)}^{-1},\phantom{\rule{1em}{0ex}}LO{F}_{k}(p)=\frac{{\sum }_{o\in {N}_{k}(p)}lr{d}_{k}(o)}{\left|{N}_{k}(p)\right|\cdot lr{d}_{k}(p)}$

where $d(p,o)$ is the distance between points $p$ and $o$; ${d}_{k}(p)$ is the distance between point $p$ and the ${k}_{th}$ nearest neighbor of $p$; $r{d}_{k}(p,o)$ is the reachability distance between points $p$ and $o$; $lr{d}_{k}(p)$ is the local reachability density, which measures the density of point $p$ among its $k$ neighbors ${N}_{k}(p)$ (the larger the reachability distance, the smaller the density); and $LO{F}_{k}(p)$ is the local outlier factor, which defines a concept of relative density, can deal with heterogeneous data and makes outlier detection more accurate.

According to the definition of the local outlier factor, $LO{F}_{k}(p)\approx 1$ means the local density of $p$ is similar to that of its $k$ neighbors; $LO{F}_{k}(p)<1$ means $p$ is in a high-density area, indicating that it is a normal point; and $LO{F}_{k}(p)\gg 1$ means the point $p$ is far away from the normal data clusters, i.e., it is very likely to be a novelty point.
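The quantities ${d}_{k}$, $r{d}_{k}$, $lr{d}_{k}$ and $LO{F}_{k}$ can be computed from scratch as sketched below for a single query point. This is only meant to illustrate the definitions; a library implementation such as scikit-learn's LocalOutlierFactor would normally be used.

```python
import numpy as np

def lof(points, p, k=2):
    """Local outlier factor of query point p w.r.t. a reference set."""
    points = np.asarray(points, dtype=float)
    p = np.asarray(p, dtype=float)

    def knn(i):
        # k nearest neighbours of reference point i among the other points.
        d = np.linalg.norm(points - points[i], axis=1)
        order = [j for j in np.argsort(d) if j != i]
        return order[:k], d

    def d_k(i):
        # d_k(i): distance from point i to its k-th nearest neighbour.
        idx, d = knn(i)
        return d[idx[-1]]

    def lrd(i):
        # lrd_k(i) = 1 / mean of rd_k(i, o), with rd_k(i, o) = max(d_k(o), d(i, o)).
        idx, d = knn(i)
        return 1.0 / np.mean([max(d_k(o), d[o]) for o in idx])

    # Neighbours of the query point p among the reference points.
    dp = np.linalg.norm(points - p, axis=1)
    nbrs = np.argsort(dp)[:k]
    lrd_p = 1.0 / np.mean([max(d_k(o), dp[o]) for o in nbrs])
    # LOF_k(p): mean neighbour density relative to the density of p.
    return float(np.mean([lrd(o) for o in nbrs]) / lrd_p)

cluster = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
near = lof(cluster, [0.5, 0.5])  # inside the cluster, LOF close to 1
far = lof(cluster, [5.0, 5.0])   # far from the cluster, LOF >> 1
```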

Since a single LOF model can only make a binary decision (known or novel), the final output should consider all the binary classification models, in the spirit of one-vs-rest multi-class classification. If all the binary classification models output −1, i.e., the sample is not similar to any known class, then it is identified as a new-class sample.
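The one-model-per-known-class combination described above can be sketched as follows. Each known class has its own novelty detector returning +1 (similar to that class) or −1 (novel); a sample flagged −1 by every detector is a candidate for a new fault class. For brevity the per-class detectors here are simple distance-threshold stand-ins for per-class LOF models, with illustrative centers and radii.

```python
def make_detector(center, radius):
    """Stand-in for a per-class LOF model: +1 inside radius, else -1."""
    def detect(x):
        dist = sum((a - b) ** 2 for a, b in zip(x, center)) ** 0.5
        return 1 if dist <= radius else -1
    return detect

def is_new_class(sample, detectors):
    # A new class is declared only if every known-class model outputs -1.
    return all(d(sample) == -1 for d in detectors)

detectors = [make_detector((0.0, 0.0), 1.0),   # known fault class 1
             make_detector((5.0, 5.0), 1.0)]   # known fault class 2
known = is_new_class((0.2, 0.1), detectors)    # inside class 1
novel = is_new_class((2.5, 2.5), detectors)    # far from both classes
```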

When the sample size is large enough, a clustering method is applied so that the unidentified samples can be labeled more easily. Finally, the densest N clusters among these unidentified samples are selected as the candidate sample sets of the new classes. If the density of a candidate sample set exceeds the specified threshold, it is determined to be the representative sample set of a new class for the subsequent knowledge graph update.