Imbalanced Data Fault Diagnosis Based on an Evolutionary Online Sequential Extreme Learning Machine

Abstract: To quickly and effectively identify an axle box bearing fault of high-speed electric multiple units (EMUs), an evolutionary online sequential extreme learning machine (OS-ELM) fault diagnosis method for imbalanced data was proposed. In this scheme, the resampling scale is first determined according to the resampling empirical formulation, the K-means synthetic minority oversampling technique (SMOTE) method is then used for oversampling the minority class samples, a method based on the Euclidean distance is applied for undersampling the majority class samples, and the complex data features are extracted from the reconstructed dataset. Second, the reconstructed dataset is input into the diagnosis model. Finally, the artificial bee colony (ABC) algorithm is used to globally optimize the combination of input weights, hidden layer bias, and the number of hidden layer nodes for the OS-ELM, allowing the diagnosis model to evolve. The proposed method was tested on the axle box bearing monitoring data of high-speed EMUs, on which the position of the axle box bearings was symmetrical. Numerical testing proved that, compared to other standard and classical algorithms, the proposed method detects faults faster and classifies the minority class data more accurately.


Introduction
In recent years, remarkable achievements have been made in the construction of high-speed electric multiple units (EMUs), with over 3500 sets of standard high-speed EMUs put into service and a railway operating distance reaching 29,000 km in China alone. Thus, the safety issue of high-speed EMUs is of great importance. The axle box bearing, which is one of the essential parts of a high-speed EMU, runs at a very fast speed under complicated and threatening environments and suffers a heavy load [1]. It poses a risk to railway safety when a train fully loaded with passengers is running at a high speed [2]. Therefore, it is of great importance to research techniques for bearing fault diagnosis [3][4][5][6] and the time is ripe for such a study, thanks to the development of big data, artificial intelligence, and the Internet.
As imbalanced as the axle box bearing states typically are, identifying the minority class faults accurately is much more crucial than identifying the normal ones in the majority class [7]. To shed light on this: when a faulted axle box bearing is diagnosed as normal, it might lead to derailment and death, causing great casualties, whereas regarding a normal-state axle box as a faulted one would probably only bring a temporary halt to the train. The first situation is clearly much more severe.

The rest of this paper is organized as follows: Section 2 describes the basic principles of the OS-ELM, the ABC algorithm, and the ABC-OSELM fault diagnosis method; Section 3 introduces the main calculation steps of the imbalanced data mixed resampling and the proposed imbalanced axle box bearing fault diagnosis based on the evolutionary OS-ELM; Section 4 shows the application of this method to an axle box bearing fault diagnosis and the results of several experiments; and Section 5 states a summary.

OS-ELM
The axle box bearing fault diagnosis model is not static but is continuously self-optimized through updated training data. Classical algorithms are very time- and resource-consuming because the new data and the past data have to be retrained together whenever new data arrive. An OS-ELM introduces the concept of time into the model training, where monitoring data are batched together; it can provide better generalization performance than other popular sequential learning algorithms [11] while keeping the advantages of a fast learning speed and simple parameter selection.

Initialization
An extreme learning machine (ELM) [28] constitutes the initialization phase of the OS-ELM and is an efficient algorithm for training a single-hidden-layer feedforward neural network (SLFN). It transforms the traditional task of iteratively training the network parameters into a linear least-squares computation by randomly assigning the weights between the input layer and the hidden layer, as well as the bias parameters of the hidden layer. The output weights are then derived using the minimum norm least-squares solution.
An ELM is a single-hidden-layer feedforward neural network, as Figure 1 shows, which consists of N input nodes, L hidden nodes, and M output nodes. $x_i$ represents the input vector of the ith input node, such as the axle box bearing information and working conditions, and $y_i$ represents the output vector of the ith output node.

A training dataset $D$ is given as $D = \{(x_i, y_i) \mid (x_i, y_i) \in \mathbf{R}^N \times \mathbf{R}^M\}$, where $x_i$ and $y_i$ denote an $N \times 1$ vector and an $M \times 1$ vector, respectively. For an ELM model with $L$ hidden nodes, the output can be given as Equation (1):

$$f_L(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x), \tag{1}$$

in which the output weight matrix is:

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times M} = \begin{bmatrix} \beta_{11} & \cdots & \beta_{1M} \\ \vdots & \ddots & \vdots \\ \beta_{L1} & \cdots & \beta_{LM} \end{bmatrix}, \tag{2}$$

where $\beta_{ij}$ denotes the weight from the $i$th hidden node to the $j$th output and $\beta_i$ denotes the output weight vector of the $i$th hidden node. $a_i$ is an input weight vector between the input nodes and the $i$th hidden node, $b_i$ is the bias weight of the $i$th hidden node, and $G(a_i, b_i, x)$ is the output of the $i$th hidden node as a function of the input $x$. The activation function $g(a, b, x)$ can be any bounded nonconstant piecewise continuous function.
The sigmoid function is usually chosen as the activation function and is shown as follows:

$$g(x) = \frac{1}{1 + e^{-x}}.$$

A feedforward neural network with an activation function $g(x)$ can produce a zero-error estimation of the outputs; therefore, there exist $a_i$, $b_i$, and $\beta_i$ that can be substituted into Equation (5):

$$\sum_{i=1}^{L} \beta_i g(a_i \cdot x_j + b_i) = y_j, \quad j = 1, \cdots, Q, \tag{5}$$

where $Q$ denotes the number of training samples. Then, Equation (5) can be rewritten as Equation (6):

$$H\beta = Y, \tag{6}$$

in which $H$ denotes the hidden layer output matrix of the ELM:

$$H(a_1, \cdots, a_L, b_1, \cdots, b_L, x_1, \cdots, x_Q) = \begin{bmatrix} g(a_1 \cdot x_1 + b_1) & \cdots & g(a_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(a_1 \cdot x_Q + b_1) & \cdots & g(a_L \cdot x_Q + b_L) \end{bmatrix}_{Q \times L}$$

and $Y$ is the matrix form of the target output:

$$Y = \begin{bmatrix} y_1^T \\ \vdots \\ y_Q^T \end{bmatrix}_{Q \times M}.$$

Due to the minute contribution that the random value assignment to the hidden layer parameters $a_i$ and $b_i$ makes to the model precision, the output weight matrix $\beta$ can be described as:

$$\hat{\beta} = H^{\dagger} Y, \tag{9}$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the hidden layer output matrix $H$, which can be solved using an orthogonal projection, iteration, singular value decomposition (SVD), etc. The orthogonal projection method can be used if $H^T H$ is nonsingular, such that $H^{\dagger}$ can then be written as $H^{\dagger} = (H^T H)^{-1} H^T$. To elevate the generalization performance and robustness of the solution, a regularization parameter $C$ is introduced; thus, Equation (9) is further rewritten as:

$$\hat{\beta} = \left( H^T H + \frac{I}{C} \right)^{-1} H^T Y.$$

SVD has prevailed in recent ELM studies when it comes to deriving $H^{\dagger}$. If the initial training data are given as $D_0 = \{(x_i, y_i)\}_{i=1}^{Q_0}$, then the initial hidden layer output matrix $H_0$ and output weight vector $\beta^{(0)}$ are given using an ELM, where $\beta^{(0)}$ can be written as:

$$\beta^{(0)} = P_0 H_0^T Y_0, \quad P_0 = \left( H_0^T H_0 \right)^{-1}.$$
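To make the initialization phase concrete, a batch ELM can be sketched as follows (our own minimal illustration, not the authors' code): a random sigmoid hidden layer is generated once, and the output weights are solved in closed form with the regularization parameter C.

```python
import numpy as np

def elm_train(X, Y, L, C=1e3, seed=0):
    """Batch ELM: random sigmoid hidden layer, output weights solved in
    closed form via the regularized least-squares solution."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, (X.shape[1], L))  # input weights a_i
    b = rng.uniform(-1.0, 1.0, L)                # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))       # hidden layer output matrix
    # beta = (H^T H + I/C)^-1 H^T Y
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ Y)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

Only the output weights `beta` are learned; the random hidden layer is never retrained, which is what makes the training a single linear solve.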

Online Sequential Learning
Given that the information of the $(k+1)$th batch of training data is known, it is utilized to calculate the hidden layer output matrix $H_{k+1}$; the output weight vector $\beta^{(k+1)}$ is then given by the recursive least-squares update:

$$P_{k+1} = P_k - P_k H_{k+1}^T \left( I + H_{k+1} P_k H_{k+1}^T \right)^{-1} H_{k+1} P_k,$$

$$\beta^{(k+1)} = \beta^{(k)} + P_{k+1} H_{k+1}^T \left( Y_{k+1} - H_{k+1} \beta^{(k)} \right).$$

The online sequential learning steps above are repeated, and the iterative hidden layer output matrix $H$ and output weight vector $\beta$ are updated. In this way, the neural network's classification and generalization performance are reinforced, as is the fault diagnosis accuracy.
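The initialization and sequential phases can be sketched together as follows (an illustrative implementation under our own naming, not the authors' code); `fit_initial` performs the ELM step on the first data chunk and `partial_fit` applies the recursive update for each arriving batch:

```python
import numpy as np

class OSELM:
    """Minimal OS-ELM sketch: ELM initialization followed by recursive
    least-squares updates for each arriving batch of monitoring data."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Input weights a_i and hidden biases b_i are assigned once, at random
        self.A = rng.uniform(-1.0, 1.0, (n_inputs, n_hidden))
        self.b = rng.uniform(-1.0, 1.0, n_hidden)
        self.P = None     # P_k = (H^T H)^-1, carried between batches
        self.beta = None  # output weight matrix beta^(k)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.A + self.b)))  # sigmoid g

    def fit_initial(self, X0, Y0):
        """Initialization: beta^(0) = P_0 H_0^T Y_0 with P_0 = (H_0^T H_0)^-1."""
        H0 = self._hidden(X0)
        self.P = np.linalg.inv(H0.T @ H0)
        self.beta = self.P @ H0.T @ Y0

    def partial_fit(self, Xk, Yk):
        """Sequential phase: update P and beta using batch k+1 only."""
        H = self._hidden(Xk)
        K = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (Yk - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta
```

After all batches have been seen, `beta` coincides (up to rounding) with the batch least-squares solution on the full dataset, which is exactly why retraining on the accumulated past data is unnecessary.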

Artificial Bee Colony
The ABC algorithm, which was put forward by Karaboga, is a swarm intelligence optimization algorithm that imitates the nectar-collecting behavior of a bee colony [29]. It is easy to implement and robust, with few control parameters. The evolution of the solution entails three types of bees: employed bees, onlooker bees, and scout bees. The ABC algorithm realizes multi-population collaborative optimization with high quality and reduced searching time. Meanwhile, the unique random probe mechanism implemented by the scout bees effectively prevents the search from becoming stuck at a local optimum. Two main phases are used to build ABC architectures: initialization and optimization.

Initialization
First, the scale of the colony N is established. One employed bee is assigned to each nectar source and equal amounts of onlooker bees and employed bees are maintained. Then, the initialization parameters are set: the maximum cycle number (MCN), the quit limit (limit), and the upper and lower bounds in the search space.

Optimization
Employed Bees

• The employed bees exploit nectar sources and find out the amounts of nectar in the food sources. Each nectar source represents a feasible solution to the problem, and the nectar amount represents the fitness value of the solution.

• The employed bees then locate a new candidate source in the neighborhood of the previous one, compare the nectar amounts of the two, and choose the richer one. The new candidate source location is generated from its predecessor using Equation (16):

$$v_{ij} = x_{ij} + \varphi_{ij}\left(x_{ij} - x_{kj}\right), \tag{16}$$

where $v_{ij}$ denotes the new location, $x_{ij}$ denotes the previous location, and $x_{kj}$ is an arbitrary source location in the neighborhood of the previous location, with the indices $k, i = 1, 2, \ldots, N$, where $N$ marks the colony scale, and $j = 1, 2, \ldots, D$, where $D$ marks the dimension of the problem. $\varphi_{ij}$ is a random number following a uniform distribution within $[-1, 1]$.
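The employed-bee search step can be sketched as follows (a minimal illustration with our own function name; following common ABC practice, only one randomly chosen dimension is perturbed per candidate):

```python
import numpy as np

def employed_bee_step(x, i, lb, ub, rng):
    """Candidate source v_i = x_i with one dimension j perturbed by
    phi * (x_ij - x_kj), as in Eq. (16); k is a randomly chosen other source."""
    n_sources, dim = x.shape
    k = rng.choice([s for s in range(n_sources) if s != i])  # neighbor k != i
    j = rng.integers(dim)                                    # one random dimension
    phi = rng.uniform(-1.0, 1.0)
    v = x[i].copy()
    v[j] = x[i, j] + phi * (x[i, j] - x[k, j])
    return np.clip(v, lb, ub)  # keep the candidate within the search bounds
```

In a full ABC loop, the candidate's fitness would then be compared against that of `x[i]` and the richer source kept (greedy selection).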

Onlooker Bees
After exploiting the nectar sources, the employed bees share the information of these sources with the onlooker bees. The onlooker bees then select a source with a probability related to its quality using Equation (17):

$$p_i = \frac{fit_i}{\sum_{n=1}^{N} fit_n}, \tag{17}$$

where $fit_i$ is the nectar amount of the $i$th source; $i = 1, 2, \ldots, N$; and $N$ marks the colony scale. The onlooker bees also produce a new candidate source location, calculate its nectar amount, and then select the better source out of the previous source and the candidate, as was mentioned above.
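The fitness-proportional choice of Equation (17) amounts to roulette-wheel selection, which can be sketched as follows (our own minimal illustration):

```python
import numpy as np

def onlooker_select(fitness, rng):
    """Roulette-wheel selection: source i is chosen with probability
    p_i = fit_i / sum(fit), as in Eq. (17)."""
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), p=p)
```

Richer sources therefore attract proportionally more onlooker bees, concentrating the search around promising solutions.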

Scout Bees
Provided that there is no further progress in a particular source after limit rounds of the employed bees' and onlooker bees' exploitation, the evolution of the solution is stuck at a local optimum, which means that this source (solution) should be removed from the employed bees' task set. In this situation, the corresponding employed bee then works as a scout bee in search of a new arbitrary source location to replace the inferior source and propel the evolution. When the abandoned source is the present optimum, its information is recorded. The new arbitrary source location is found using Equation (18):

$$x_{ij} = x_{\min,j} + \mathrm{rand}(0,1)\left(x_{\max,j} - x_{\min,j}\right), \tag{18}$$

where $x_{\max,j}$ and $x_{\min,j}$ mark the upper and lower bounds of the source $x_{ij}$ in the $j$th dimension, respectively. The optimization stage in step (2) does not cease until it meets the MCN, and thus, the best nectar source is obtained.
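The scout-bee reinitialization of Equation (18) can be sketched as follows (our own minimal illustration):

```python
import numpy as np

def scout_reset(lb, ub, rng):
    """Random reinitialization of an abandoned source:
    x_j = x_min,j + rand(0,1) * (x_max,j - x_min,j), as in Eq. (18)."""
    lb = np.asarray(lb, dtype=float)
    ub = np.asarray(ub, dtype=float)
    return lb + rng.uniform(0.0, 1.0, size=lb.shape) * (ub - lb)
```

Drawing the replacement uniformly over the whole search box is what gives the ABC its escape route from local optima.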

ABC-OSELM
It has been observed that the performance of the OS-ELM depends highly on the chosen set of input weights, hidden layer bias, and number of hidden layer nodes; the OS-ELM may perform worse with non-optimal parameters. In this study, the ABC algorithm was used to find the optimal set of input weights and hidden layer bias under different numbers of hidden layer nodes for the OS-ELM. The structure of the evolutionary OS-ELM method is shown in Figure 2 and works as follows:

1. Set the chunk size P and the maximum number of hidden layer nodes H_max.

2. Generate the initial population randomly. Each initial solution vector $u_i$ contains the input weights and hidden layer bias, i.e., $u_i = [a_{11}, \ldots, a_{nm}, b_1, \ldots, b_m]$, where n and m are the numbers of nodes in the input layer and the hidden layer, respectively. The initialization parameters, i.e., the MCN, limit, and the upper and lower bounds of the search space, are established depending on the number of hidden layer nodes.

3. For each individual, calculate the output weight matrix β.

4. Calculate the fitness value of each solution vector.

5. Record the best parameters and update the diagnosis model.
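A plausible fitness evaluation for one ABC solution vector might look as follows (a sketch under our own assumptions, not the authors' exact implementation: the vector is decoded into input weights and hidden biases, the output weights are solved in closed form, and held-out classification accuracy serves as the nectar amount):

```python
import numpy as np

def oselm_fitness(solution, n_inputs, n_hidden,
                  X_train, Y_train, X_val, Y_val, C=1e3):
    """Nectar amount of one ABC solution vector: decode it into input weights
    and hidden biases, solve the output weights in closed form, and score
    classification accuracy on held-out data (higher = richer source)."""
    A = solution[:n_inputs * n_hidden].reshape(n_inputs, n_hidden)
    b = solution[n_inputs * n_hidden:]

    def hidden(X):
        return 1.0 / (1.0 + np.exp(-(X @ A + b)))  # sigmoid hidden layer

    H = hidden(X_train)
    # Regularized least-squares output weights: (H^T H + I/C)^-1 H^T Y
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ Y_train)
    pred = hidden(X_val) @ beta
    return np.mean(np.argmax(pred, axis=1) == np.argmax(Y_val, axis=1))
```

The ABC loop would call this for every candidate source, so only the random hidden-layer parameters are evolved while the output weights remain a cheap linear solve.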

Imbalanced Data Mixed Resampling
The actual high-speed EMU axle box bearing fault data have an imbalanced character. While the evolutionary OS-ELM algorithm proposed in Section 2 improves the overall classification accuracy, its accuracy on the minority class remains low for highly imbalanced samples. In this study, a scheme comprising an undersampling method based on the Euclidean distance and K-means SMOTE oversampling based on the clustering distribution was established to manage the problem of data imbalance. The procedure is presented in Figure 3 and proceeds as follows:

1. The sample set D, made up of the historically monitored data of high-speed EMU axle box bearings, is classified according to the samples' properties. The minority class sample set is defined as S, and the majority class sample set is defined as M.

2. Using the resampling empirical Equation (20) proposed by Porwik et al. [18] for the imbalanced data dichotomy, the resampling scales are set, where c is the oversampling scale for the minority class (c ≥ 1) and d is the undersampling scale for the majority class (0 ≤ d ≤ 1). R denotes the ratio of the number of minority class samples to the number of majority class samples.

3. The minority class samples are oversampled using K-means SMOTE based on the clustering distribution. First, k clusters are generated using the K-means algorithm, and the clusters in which minority class samples are numerically dominant are filtered and saved. Then, the quantity of synthetic samples is distributed across the clusters, preferring the sparse ones. To prevent undesirable noise, only data relating to key indicators are oversampled, while the rest are kept in their original form to highlight the more informative and significant indicators.
Finally, new samples are generated in each selected cluster with SMOTE and added to the minority class sample set.
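The oversampling step can be sketched as follows (a simplified illustration under our own naming: a crude K-means, a sparsity-based allocation of synthetic samples, and SMOTE-style interpolation inside each cluster; the filtering of clusters by majority-class density is omitted for brevity):

```python
import numpy as np

def kmeans_smote(S, n_new, k=3, seed=0):
    """Sketch of K-means SMOTE over the minority set S: cluster the minority
    samples, allocate more synthetic samples to sparser clusters, then
    interpolate new samples between members of the same cluster."""
    rng = np.random.default_rng(seed)
    # --- crude K-means (a few iterations suffice for a sketch) ---
    centers = S[rng.choice(len(S), k, replace=False)]
    for _ in range(10):
        labels = np.argmin(((S[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = S[labels == c].mean(axis=0)
    # --- allocate synthetic samples, preferring sparse clusters ---
    sparsity = np.array([
        ((S[labels == c] - centers[c]) ** 2).sum(-1).mean()
        if (labels == c).sum() > 1 else 1.0
        for c in range(k)])
    quota = np.round(n_new * sparsity / sparsity.sum()).astype(int)
    # --- SMOTE interpolation within each cluster ---
    new = []
    for c in range(k):
        members = S[labels == c]
        if len(members) < 2:
            continue  # cannot interpolate inside a singleton cluster
        for _ in range(quota[c]):
            a, b = members[rng.choice(len(members), 2, replace=False)]
            new.append(a + rng.uniform() * (b - a))
    return np.array(new)
```

Because every synthetic point lies on a segment between two minority samples of the same cluster, the new samples stay inside the minority regions instead of bleeding into the majority class.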

Diagnosis Process
The mixed resampling online sequential extreme learning machine based on the artificial bee colony algorithm (MS-ABC-OSELM) extracts more important information from features with the mixed resampling techniques proposed in Section 3.1 and applies the evolutionary OS-ELM algorithm proposed in Section 2 to search for global optimum sets of the input weights, hidden layer bias, and the number of hidden layer nodes for the OS-ELM. It inherits stability and fast processing speed from the OS-ELM and optimization characteristics from the ABC algorithm, which guarantees high learning accuracy and ensures the least amount of training error.
The fault diagnosis with the MS-ABC-OSELM is modeled, as exhibited in Figure 4, and proceeds as follows:

1.
Resample the imbalanced data by extracting the required high-speed EMU axle box bearing data and dividing them into minority class samples and majority class samples. Determine the resampling scale of the imbalanced data and reconstruct the training dataset with mixed resampling methods to acquire new training samples.

2.
Use the evolutionary OS-ELM algorithm to search for the global optimum sets of the input weights and hidden layer bias under different numbers of hidden layer nodes for the OS-ELM, as discussed in Section 2.

3.
Report the best combination of variables and note down the optimized input weights and hidden layer bias with the number of hidden layer nodes.

4.
Establish the optimized diagnosis model.

5.
Update the fault diagnosis model using the accurately diagnosed historical data as the online sequential sample. These online sequential samples are then reconstructed to a balanced distribution. Compute H k+1 and β (k+1) , then update the diagnosis model using the OS-ELM. Repeat step (5) and optimize the high-speed EMU axle box bearing fault diagnosis model through adaptive learning.

Experiments and Analysis
All computation experiments were conducted in an identical system environment. A PC installed with a 2.60 GHz CPU, 8.00 GB RAM, Windows 10 system, and Python 3.7 was sufficient for the experiments in the study.

Dataset
Samples were taken from the routine monitoring data of 20 trains of the same type over 4 months of service, including the onboard data of the axle box bearings (manufactured by NTN) and ground-detected data. The selected data were representative in that the sampled trains' routes covered a broad span of districts and complicated environments. To monitor the axle box bearings' state, data on the trains, working conditions, and temperatures of the axle box bearings, comprising 10 dimensions as presented in Table 1, were analyzed. The temperature of each axle box bearing was recorded using two channels. The preorder temperature marks the tested bearing temperature 10 min prior.
For high-speed EMUs, each bogie has two axles and there is an axle box bearing on each end of each axle, so the installation positions of the axle box bearings on the EMU are symmetrical. Considering this symmetry, the temperature of the coaxial bearing, which is on the symmetrical side, was selected as one of the properties. Furthermore, the temperature of the bearing on the same side, which is on the other axle of the bogie, was also selected. Preconditioning included the noise reduction and normalization of the data. Class 1 was defined as the normal state and class 2 as the abnormal state. Samples in class 1 were numerous and centralized, thus belonging to the majority class, while samples in class 2 were sparse and naturally went into the minority class. The number ratio of the minority class samples to the majority class samples was 115:345, showing an obviously imbalanced distribution of the sample states. The majority class was divided into the original training set and testing set at a 1:1 ratio via random sampling, and the minority class was divided in the same way. Therefore, the training and testing sets had the same number of samples and the same imbalance ratio. (Table 1 includes, among its 10 monitored indicators: the temperature of the coaxial bearing, the temperature of the same-sided bearing, the preorder temperature, mileage, acceleration, and load.)

Assessment Criteria
Accuracy, which is the ratio of correctly classified samples to the total, cannot perfectly reflect the performance of a classification model on imbalanced data. Consequently, this study used a more suitable metric, namely, the G-mean, to assess the classification performance; the G-mean is calculated as the geometric mean of the minority class recall and the majority class recall. Furthermore, the F1-measure of the minority class was also used for verification. Table 2 exhibits the confusion matrix of the binary classification of imbalanced data, in which TP denotes the true classification of the positive samples, TN denotes the true classification of the negative samples, FP denotes the false classification of the positive samples, and FN denotes the false classification of the negative samples. Accuracy is defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

The G-mean is defined as follows:

$$\mathrm{G\text{-}mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}.$$

The F1-measure is defined as follows:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where Recall and Precision are defined as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}.$$
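The assessment criteria above can be computed directly from the confusion matrix, taking the minority (fault) class as the positive class (a minimal sketch with our own function name):

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Accuracy, G-mean, and minority-class F1 from the binary confusion
    matrix; label 1 is the positive (minority/fault) class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)       # minority-class recall (sensitivity)
    specificity = tn / (tn + fp)  # majority-class recall
    precision = tp / (tp + fp)
    g_mean = np.sqrt(recall * specificity)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, g_mean, f1
```

Unlike plain accuracy, the G-mean collapses to 0 as soon as the classifier misses the entire minority class, which is why it is the primary metric here.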

Analysis
A support vector machine (SVM), ELM, OS-ELM, ABC-OSELM, and MS-ABC-OSELM were contrastively tested and evaluated in the study concerning the function of imbalanced data classification. An SVM is robust regarding imbalanced data classification and was therefore chosen as a reference for comparison with the others. The kernel function of the SVM took the shape of a radial basis function (RBF) with the kernel parameter σ and penalty parameter g set as 0.8 and 21, respectively, through the cross-validation searching method. The sigmoid function was collectively chosen as an activation function for the ELM, OS-ELM, ABC-OSELM, and MS-ABC-OSELM models. The initial training dataset contained 100 samples and the batch number was set as 100 to simulate a continuous dataflow for the OS-ELM, ABC-OSELM, and MS-ABC-OSELM.

Algorithm Efficiency Analysis on the Original Dataset
To verify the classification effect of the ABC-OSELM, we compared it with other existing methods. The G-mean, F1-measure, and running time were used as the evaluation metrics to analyze the experimental results. Figure 5 demonstrates how the G-mean, F1-measure, and operating times were affected by the number of hidden layer nodes for the ELM, OS-ELM, and ABC-OSELM. Figure 5a,b show that as the number of hidden layer nodes increased, the classification performance increased to a certain level and then stabilized. The proposed ABC-OSELM method found the solution more quickly, and the solution was better than those of the other methods. Figure 5a shows that the G-mean of all three algorithms increased with the number of hidden layer nodes. It should be noted that the G-means of the ELM and OS-ELM were 0 when the number of hidden layer nodes was less than 10, while the ABC-OSELM identified minority class samples with six or more nodes. The G-means of the ELM and OS-ELM were always lower than that of the ABC-OSELM. The trend of the F1-measure was consistent with the G-mean, as shown in Figure 5b. This means that the ELM and OS-ELM needed more hidden layer nodes to identify the minority class samples; in other words, these two algorithms required more onboard computing resources and had higher energy consumption. From Figure 5c, we can see that the testing times of the three algorithms did not fluctuate significantly with the number of hidden layer nodes. This was because all three models were based on the ELM, which has a fast calculation speed. The testing time was mostly used for data processing, reading, and optimization. The calculation time of the hidden layer was so short that the number of hidden layer nodes had no significant effect on the training and testing times. The training time of the ABC-OSELM was much longer than those of the ELM and OS-ELM because of the optimization.
Since the training time did not affect the use of the model, this study did not consider the training time as a factor for comparing performance.
The performances of each classification model are given in Table 3. Using the original dataset for classification, the testing time of the ABC-OSELM was much shorter than those of the SVM and ELM, and was 1.17 times that of the OS-ELM. It had the highest classification performance compared to the other three machine learning algorithms, both on the training dataset and on the test dataset; thus, the ABC-OSELM method displayed high classification performance for the minority class data. Figure 6 compares the classification effects of the different algorithms more intuitively: the ABC-OSELM algorithm, using the best parameter set, had a higher classification performance and running efficiency than the other three classifiers on the original dataset.

Algorithm Analysis on the Mixed Resampling Dataset
To demonstrate the effectiveness of the imbalanced data mixed resampling approach, the training dataset was mixed resampled and the test dataset was left unchanged. The scale parameter of the mixed resampling training dataset was determined using the resampling empirical Equation (20), and the imbalance ratio of the mixed resampling training dataset was changed from 3:1 to 1.59:1. We carried out a series of experiments on the mixed resampling dataset. Figure 7 shows that as the number of hidden layer nodes increased, the G-mean and F1-measure of the MS-ABC-OSELM increased to a certain level and then stabilized, while the testing time fluctuated between 0.005 s and 0.007 s and did not vary significantly with the number of hidden layer nodes. The MS-ABC-OSELM identified minority class samples with only two hidden layer nodes, four fewer than the ABC-OSELM.

In the experiment, the SVM, ELM, OS-ELM, and ABC-OSELM served as the base learning algorithms separately. Tables 4 and 5 list the comparison of the G-means and F1-measures between the resampling dataset and the original dataset. The results show that the resampling dataset obtained significantly better G-means and F1-measures than the original dataset, indicating that the proposed imbalanced data mixed resampling method effectively improved the imbalance index. Comparing the four algorithms on the two datasets, we found that the MS-ABC-OSELM performed the best in terms of both the G-mean and the F1-measure. Its G-mean was 18.7%, 5.4%, 11.2%, and 6.9% better than those of the ABC-OSELM, MS-SVM, MS-ELM, and MS-OS-ELM, respectively, while its F1-measure outperformed the four algorithms by 22.7%, 6%, 12.7%, and 9.1%, respectively.
The numerical results show that the mixed sampling process could greatly improve the identification rate of the minority class data. From the performances above, it can be seen that the MS-ABC-OSELM model built in this study moderated the overfitting problems arising from the majority class samples and had a significant advantage when identifying minority class samples. Furthermore, the MS-ABC-OSELM still maintained a fast calculation speed with fewer hidden layer nodes, which meant lower energy consumption for the onboard computing of high-speed EMUs. Put simply, the MS-ABC-OSELM made the online axle box bearing fault diagnosis during operation more efficient and accurate.

Conclusions
Targeted at high-speed EMU axle box bearing state monitoring, in this study, we designed an evolutionary OS-ELM fault diagnosis model that specializes in imbalanced data. Considering that the axle box bearing monitoring data of high-speed EMUs in service are imbalanced, a mixed resampling method was employed to reconstruct the imbalanced data and to extract their characteristics. The ABC algorithm was utilized to globally optimize the input weights and hidden layer bias under different numbers of hidden layer nodes in the OS-ELM to establish an increment-based diagnosis model. Put together, these schemes (MS-ABC-OSELM) accomplished the goal of online axle box bearing state classification and adaptive model optimization. Testing on the historically monitored data of operating high-speed EMU axle box bearings demonstrated that the advantages of the MS-ABC-OSELM over other classical algorithms were that it detected faults faster and classified the minority class data more accurately. As a result, the proposed evolutionary MS-ABC-OSELM fault diagnosis model proved to be effective and can diagnose the axle box bearing states of high-speed EMUs online.

Conflicts of Interest:
The authors declare no conflict of interest.