Fault Diagnosis for High-Speed Train Axle-Box Bearing Using Simplified Shallow Information Fusion Convolutional Neural Network

Axle-box bearings are among the most critical mechanical components of the high-speed train. Vibration signals collected from axle-box bearings are usually nonlinear and nonstationary because of the complicated operating conditions. Due to the high reliability and real-time requirements of axle-box bearing fault diagnosis for high-speed trains, the accuracy and efficiency of deep-learning-based bearing fault diagnosis methods need to be enhanced. To identify axle-box bearing faults accurately and quickly, a novel approach using a simplified shallow information fusion-convolutional neural network (SSIF-CNN) is proposed in this paper. Firstly, time-domain and frequency-domain features were extracted from the training samples and testing samples before being input into the SSIF-CNN model. Secondly, the feature maps obtained from each hidden layer were transformed into corresponding feature sequences by the global convolution operation. Finally, the feature sequences obtained from the different layers were concatenated into a one-dimensional vector serving as the fully connected layer to achieve the fault identification task. The experimental results showed that the SSIF-CNN effectively compressed the training time and improved the fault diagnosis accuracy compared with a general CNN.


Introduction
Axle-box bearings, as key components of the high-speed train, have a significant impact on the security, stability and sustainability of railway vehicles [1]. If an axle-box bearing failure is not detected promptly, it may cause severe delays or even dangerous derailments, endangering human lives and imposing significant costs on railway managers and operators. Therefore, how to identify axle-box bearing faults accurately and quickly has become an urgent challenge. Currently, vibration analysis, acoustic analysis and temperature analysis are the three main approaches for axle-box bearing failure detection [2]. However, the temperature does not rise much for an early-stage bearing fault, and noises from wheel/rail contacts and the train drive system, as well as aerodynamic forces, may contaminate the signals acquired by acoustic arrays [3]. Due to its higher reliability, various fault diagnosis techniques based on vibration signal processing have been applied to keep axle-box bearings operating properly and reliably [4][5][6][7]. However, determining the type of bearing defect with conventional diagnosis methods is a time-consuming and labor-intensive task. Differing from the fault diagnosis of other industrial equipment, the fault diagnosis of railway transportation equipment has its own special characteristics. For high-speed trains, safety is the priority, so the fault diagnosis model should process the bearing data quickly and accurately to meet the stringent reliability and real-time requirements of axle-box bearing failure monitoring. Due to the complexity of the CNN, more layers mean more convolution kernels, and each neuron multiplies the input data with connection weights, so the number of parameters of the CNN can grow to tens or even hundreds of thousands.
The large number of parameters brings heavier computational burdens and longer training times, which can degrade the performance of the CNN. In addition, each layer of a CNN provides a different representation of the input data. However, only the outputs of the last layer are connected to the fully connected layer, and the shallow information in the other layers is neglected in a traditional CNN framework. Therefore, it is necessary to reduce the number of model parameters and to make use of the shallow information to improve the diagnosis performance of the CNN. Some research has been done to fill this gap. Fu et al. [30] proposed a multiscale comprehensive feature fusion-CNN (MCFF-CNN) based on residual learning for vehicle color recognition and achieved an improved recognition performance. Zhang et al. [31] proposed a compact convolutional neural network augmented with multiscale feature extraction to carry out diagnosis tasks with limited training samples and presented three cases to verify the effectiveness of the proposed method. Meng et al. [32] proposed a CNN-based framework for digital subtraction angiography cerebrovascular segmentation and obtained some results. Jun et al. [33] proposed a multiscale CNN model for bearings' remaining useful life prediction, in which the last convolutional layer and pooling layer were combined into a mixed layer before being connected to the fully connected layer. However, the performance of the methods mentioned above still needs to be improved.
Aiming to improve the computational efficiency and the diagnostic accuracy, a novel simplified shallow information fusion-convolutional neural network (SSIF-CNN) is proposed for vibration-based axle-box bearing fault diagnosis. The proposed method firstly converts the feature maps obtained from each pooling layer into a feature sequence by the global convolution operation. Then, the feature sequences obtained from the different pooling layers are concatenated into a one-dimensional vector before being connected to the classifier through the fully connected layer. The experimental results show that, compared to the traditional CNN, the SSIF-CNN improves the computing efficiency while preserving the accuracy of the fault diagnosis.
The contributions of this paper can be summarized as follows: (a) We employ an SSIF-CNN model structure to extract more identifiable features for the axle-box bearing fault diagnosis. By integrating the simplified shallow information, more informative features are maintained to enhance the network capacity while reducing the parameter dimension. (b) Due to the fewer fully connected layer parameters in the SSIF-CNN framework, the computational efficiency and fault diagnosis accuracy of the model are improved. (c) The proposed systematic approach integrates feature extraction and the SSIF-CNN into one framework, which realizes the goal of monitoring axle-box bearing conditions automatically.
The remaining parts of the paper are organized as follows: In Section 2, the modified procedure of the CNN is introduced. In Section 3, the diagnosis procedure using the modified method is proposed. In Section 4, the benchmark data and experimental data are described and analyzed. Finally, some conclusions are presented in Section 5.

Simplified Shallow Information Fusion CNN
As shown in Figure 1a, the CNN is a kind of multilayered feedforward neural network that mainly contains three parts: convolutional layers, pooling layers and a fully connected layer. The convolutional layer detects local conjunctions of features in the input data by the local convolution operation. The pooling layer merges similar features into one to reduce the size of the feature maps. In a general CNN architecture, only the outputs of the last layer are connected to the fully connected layer, and the shallow convolution information is neglected. In order to make use of the shallow information, the shallow features obtained from the shallow pooling layers are connected to the fully connected layer along with the features of the last layer. As shown in Figure 1b, the yellow circles in the fully connected layer represent the deep information extracted by the last pooling layer P_n, and the blue and green circles represent the shallow information extracted from pooling layer P_i and the first pooling layer P_1, respectively. Each line between the circles represents the connection weight of the neurons.
The calculations made by the neurons in the new fully connected layer can be expressed as:

fc(j) = f(ω_j · [P_1, P_2, ..., P_n] + b_j),  j = 1, 2, ..., m,

where fc(j) is the output of the jth neuron in the new fully connected layer, P_i = {p_i^k, k = 1, ..., K} is the outputs of the ith pooling layer, K is the number of outputs of the ith pooling layer, ω_j is the weight vector, b_j is the bias value, m is the number of neurons in the new fully connected layer, n is the number of pooling layers and f(·) represents the nonlinear activation function. The new fully connected layer contains more neurons due to the integration of the shallow information. The shallow information fusion-CNN model thus has a larger model parameter dimension, which can result in a much heavier computational burden and longer training times.
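To make the parameter growth concrete, a rough count can be sketched in a few lines of Python; the layer shapes below are illustrative assumptions, not the actual settings from Tables 2-4:

```python
# Fully connected input size when every pooling layer's feature maps are
# flattened and concatenated (shallow information fusion).
def fused_fc_input_size(pooling_shapes):
    """pooling_shapes: list of (num_feature_maps, map_length) per pooling layer."""
    return sum(k * length for k, length in pooling_shapes)

# Hypothetical shapes for three pooling layers P1, P2, P3.
shapes = [(8, 24), (16, 12), (32, 6)]
fc_inputs = fused_fc_input_size(shapes)  # 8*24 + 16*12 + 32*6 = 576 inputs
fc_weights = fc_inputs * 100             # with m = 100 neurons: 57,600 weights
print(fc_inputs, fc_weights)             # 576 57600
```

Each fully connected neuron carries one weight per concatenated input, so the weight count scales with the total number of pooled elements, which motivates the compression described next.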
In order to reduce the dimension of the model parameters after integrating the shallow information, the feature maps obtained from each pooling layer are transformed into a feature sequence by the global convolution operation before being input into the fully connected layer. As shown in Figure 1c, global convolution kernels with the same dimensions as the feature maps obtained from each pooling layer are used to convolve those feature maps, and the results extracted from the different pooling layers are concatenated into a 1D feature vector. Then, the 1D feature vector is taken as the new fully connected layer to achieve the pattern recognition task. The green, blue and yellow rectangles represent the feature sequences output by using the corresponding global convolution kernels to convolve the outputs of pooling layer P_1, pooling layer P_i and the last pooling layer P_n, respectively. The global convolution feature sequences obtained from the different pooling layers are concatenated as the new fully connected layer before being transmitted to the classification layer. The calculations made by a neuron in the new fully connected layer can be expressed as:

fc(j) = f(ω_j · [G_i^k ⊗ p_i^k | i = 1, ..., n; k = 1, ..., K] + b_j),  j = 1, 2, ..., m,

where fc(j) is the output of the jth neuron in the new fully connected layer, P_i = {p_i^k, k = 1, ..., K} is the outputs of the ith pooling layer, K is the number of outputs of the ith pooling layer, G_i^k is the corresponding global convolution kernel with the same dimension as p_i^k, ω_j is the weight vector, b_j is the bias value, m is the number of neurons in the new fully connected layer, n is the number of pooling layers, f(·) represents the nonlinear activation function and ⊗ represents the global convolution operator.
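The global convolution can be read as a learned global pooling: because the kernel has the same dimension as the feature map, a "valid" convolution collapses each map to a single value. A minimal pure-Python sketch (the map sizes and kernel values are illustrative assumptions):

```python
def global_conv(feature_map, kernel):
    # The kernel has the same length as the feature map, so a "valid"
    # convolution produces exactly one value per feature map.
    assert len(feature_map) == len(kernel)
    return sum(x * w for x, w in zip(feature_map, kernel))

# One pooling layer with K = 2 feature maps of length 4 (assumed sizes).
maps = [[1, 2, 3, 4], [1, 1, 1, 1]]
kernels = [[1, 0, 0, 0], [1, 1, 1, 1]]

# Each map collapses to one scalar; the fully connected layer now sees
# K values per pooling layer instead of K * map_length values.
sequence = [global_conv(m, g) for m, g in zip(maps, kernels)]
print(sequence)  # [1, 4]
```

Concatenating these per-layer sequences yields the compact 1D vector that replaces the flattened feature maps, which is where the parameter reduction of the SSIF-CNN comes from.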

Axle-Box Bearing Fault Diagnosis Method Based on SSIF-CNN
At present, there are two main technical approaches for machine-learning-based bearing failure diagnosis. As shown in Figure 2a, the traditional machine-learning-based method uses separate algorithms for feature extraction and fault classification to achieve the final fault classification. In contrast, in end-to-end deep-learning-based methods, feature extraction and classification are completed at the same time, as shown in Figure 2b.
As shown in Figure 3, the proposed method follows the pattern of feature extraction and deep feature learning rather than the end-to-end learning approach. The flowchart of the axle-box bearing fault diagnosis method based on the SSIF-CNN is shown in Figure 4. The fault diagnostic process follows the procedure of data acquisition, feature extraction, model training and fault classification. The algorithm process is illustrated as follows:
• Collect the vibration signals of the axle-box bearing from the acceleration sensors at a particular sampling frequency under various operating conditions.
• Segment the signals to build training samples and testing samples.

Data Augmentation and Feature Extraction
In order to avoid overfitting when sufficient training samples are unavailable, data augmentation is essential for improving the generalization and classification accuracy of the CNN. As shown in Figure 5, a sufficient number of training samples and testing samples can be obtained by segmenting the raw data with overlap at a specific step length. A vibration signal with 120,000 points can provide 400 training samples and 400 testing samples for the SSIF-CNN when the shift step is 144 and the length of each sample is 2048.
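The overlapping segmentation can be sketched as follows, using the window length 2048 and step 144 from above:

```python
def segment(signal, window=2048, step=144):
    """Split a 1-D signal into overlapping windows (data augmentation)."""
    n = (len(signal) - window) // step + 1
    return [signal[i * step : i * step + window] for i in range(n)]

samples = segment([0.0] * 120_000)
# A 120,000-point record yields 820 overlapping windows, from which the
# 400 training samples and 400 testing samples can be drawn.
print(len(samples), len(samples[0]))  # 820 2048
```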

Feature extraction is the first step for bearing fault classification. Time-domain features, which are intuitive and intelligible, describe the bearing running state directly from the raw data. Frequency-domain features describe the variations in the frequency band from the view of the signal spectrum and spectral energy distribution. In total, 29 time and frequency features (P1, P2, ..., P29) are calculated in this paper according to references [8] and [20], as illustrated in Table 1.
Since different features have different dimensional units, it is necessary to normalize the features so that each feature contributes to the CNN model, for example by min-max scaling:

P̃ = (P − P_min)/(P_max − P_min),

where P is the feature sequence and P_min and P_max are its minimum and maximum values.
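The per-feature normalization can be sketched in one function; min-max scaling to [0, 1] is assumed here:

```python
def minmax_normalize(p):
    """Scale a feature sequence to [0, 1] so that features with large
    physical units do not dominate the CNN input."""
    lo, hi = min(p), max(p)
    return [(v - lo) / (hi - lo) for v in p]

print(minmax_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```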
Table 1 (excerpt): frequency center, root mean square frequency, coefficient of skewness.
Note: x(i) = (x_1, x_2, ..., x_N) is the sequence of the time-domain signal, N is the length of the signal, s(k) = (s_1, s_2, ..., s_K) is the spectrum of the signal x(i), K is the total number of spectral lines and f_k is the frequency of the kth spectral line.
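As an illustration of how such features are computed, a few representative ones can be written directly from the definitions above (these are examples only, not the full 29-feature set of Table 1):

```python
import math

def rms(x):
    """Root mean square of a time-domain signal x(i)."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def frequency_center(s, f):
    """Frequency center: sum_k f_k*s_k / sum_k s_k, with s the spectrum
    amplitudes and f the frequencies of the K spectral lines."""
    return sum(fk * sk for fk, sk in zip(f, s)) / sum(s)

def rms_frequency(s, f):
    """Root mean square frequency: sqrt(sum_k f_k^2*s_k / sum_k s_k)."""
    return math.sqrt(sum(fk * fk * sk for fk, sk in zip(f, s)) / sum(s))

print(rms([3.0, -4.0]))                            # sqrt(12.5) ≈ 3.536
print(frequency_center([1.0, 1.0], [10.0, 30.0]))  # 20.0
```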

Design of CNNs
When the input data is one-dimensional, the convolutional kernels in the CNN are one-dimensional as well. Considering the limited length and depth of the extracted features, the convolution kernel should not be too large. Since the input feature vector, which has 29 feature values, is small, there is no need to pool the outputs of the convolution layers to reduce the data dimension. The SSIF-CNN used in this work therefore contains only three local convolution layers and three global convolution layers. The specific parameter settings of the three CNN models are shown in Tables 2-4.
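The shape bookkeeping for such a 1-D design can be sketched as follows; the kernel sizes are hypothetical placeholders, not the actual settings of Tables 2-4:

```python
def conv1d_out_len(in_len, kernel):
    """'Valid' 1-D convolution output length (stride 1, no padding)."""
    return in_len - kernel + 1

# Hypothetical kernel sizes for the three local convolution layers.
length = 29                      # the 29-element input feature vector
for kernel in (5, 5, 3):
    length = conv1d_out_len(length, kernel)
print(length)  # 19: length of each feature map entering the global convolution

# The matching global convolution kernel spans the whole map, so each
# feature map is reduced to a single value.
print(conv1d_out_len(length, length))  # 1
```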

Experimental Validation and Verification
In this section, two case studies are carried out to verify the effectiveness of the proposed model. Case 1 focuses on the benchmark data obtained from the Case Western Reserve University (CWRU) bearing data center, Cleveland, Ohio, USA. Case 2 is devoted to the axle-box bearing data of a high-speed train collected from laboratory experiments. The models are implemented on a computer with an Intel i7-4790K CPU and 16 GB of memory; the programming environments are MATLAB R2016 and Python 3.7. The learning rate is 0.01, and the maximum number of iterations is 2000.

Data Description
The experimental bearing fault datasets from the CWRU rolling bearing data center are analyzed to validate the diagnosis performance of the modified model. In this experiment, batches of rolling bearings were processed by electrical discharge machining to simulate different fault types, including the ball fault (BF), inner race fault (IRF) and outer race fault (ORF). The raw vibration datasets, obtained from the drive-end bearing at 1797 rpm and sampled at 12 kHz by accelerometers, are all chosen to recognize the fault patterns. Table 5 shows the information of the benchmark datasets. The depths of the defects are 0.18 mm, 0.36 mm, 0.54 mm and 0.72 mm, while the data of the outer race fault (ORF) with 0.72 mm is not available. More specifications of the rolling element bearing data acquisition can be found on the website [34]. Due to the limited data points of the benchmark data, 120,000 data points were picked for each bearing condition in our experiments. Each bearing condition has 400 training samples and 400 testing samples, and each sample contains 2048 data points. The total number of training samples is 4800 (400 × 12), and that of testing samples is also 4800 (400 × 12).

Effect of Sample Size for Training
In order to avoid overfitting and enhance the generalization ability of the SSIF-CNN model, a sufficient number of training samples is needed. Figure 6 shows the effect of the training sample size on the SSIF-CNN performance. To verify the stability of the SSIF-CNN, ten training trials were carried out for each training sample size. The mean value and boxplot of the accuracies of the ten training trials are shown in Figure 6a. As the number of training samples increases, the classification accuracy gradually increases. Even if the training sample size is relatively small, the SSIF-CNN can still achieve a high classification accuracy. Figure 6b shows the average time spent on training by the SSIF-CNN with different sizes of training samples. As the number of training samples increases, the average time required for the SSIF-CNN to process one sample gradually decreases. When the sample size exceeds 840, the modified model only needs about 0.02 s to diagnose a sample, which shows that the SSIF-CNN can meet the real-time requirements of fault diagnosis.

Diagnostic Results
After mixing up the training samples, the whole batch of training samples is input into the training models for ten repeated experiments, and the results of the first trial are shown in Figure 7. As shown in Figure 7, the general CNN converges after 1672 iterations, with an accuracy of about 89.5%. The accuracy of the SIF-CNN reaches 98.75% after 1416 iterations. However, due to its fewer model parameters, the training accuracy of the SSIF-CNN reaches 100% after only 642 iterations, which is much faster than the general CNN and the SIF-CNN.
The fault classification accuracies of the three models are listed in Tables 6 and 7 in detail. Table 6 shows the classification accuracy for the training samples, while Table 7 shows that for the testing samples. In the CNN training process, the accuracy for bearing conditions 7 and 10 only reaches 52% and 42%, whereas all the accuracies of the SIF-CNN model stay above 89%. The SSIF-CNN classifies all the training samples with an accuracy of 100%. In the testing process, the accuracy for bearing conditions 7 and 10 only reaches 31% and 37.5% with the CNN, respectively, and the SIF-CNN model has a classification accuracy of at least 81%. The SSIF-CNN classifies all the testing samples with an accuracy of 100%.

Figure 8 shows the confusion matrices, which thoroughly record the diagnosis classification results of the different bearing conditions, including both the classification information and the misclassification information. The ordinate axis of the confusion matrix represents the actual label of each bearing condition, and the horizontal axis represents the predicted label. Therefore, the elements on the main diagonal of the multiclass confusion matrix represent the diagnosis classification accuracy of each condition.
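A confusion matrix of this kind can be built directly from the label pairs; a minimal sketch with made-up labels:

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows: actual label; columns: predicted label (as in Figure 8)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(actual, predicted, 3)
print(cm)  # [[2, 0, 0], [0, 1, 1], [0, 0, 2]]

# Per-class accuracy = diagonal element / row sum.
acc = [cm[i][i] / sum(cm[i]) for i in range(3)]
print(acc)  # [1.0, 0.5, 1.0]
```

The off-diagonal entries (here the one sample of class 1 predicted as class 2) carry the misclassification information discussed above.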
As shown in Figure 8a,b, the CNN fails to classify bearing condition 7 and bearing condition 10; the lowest training accuracy occurs for condition 10, and the lowest testing accuracy occurs for condition 7. It can be seen from Figure 8c,d that the lowest accuracy for the SIF-CNN occurs for condition 10 in training and for condition 5 in testing. As shown in Figure 8e,f, the proposed method classifies all the fault types accurately. The t-distributed stochastic neighbor embedding (t-SNE) technique [35] is adopted to visualize the extracted features, and the two-dimensional scatterplot distributions are given in Figure 9. From Figure 9d, the features of the different fault types can be clearly separated; the SSIF-CNN can effectively extract the features of datasets with different fault categories and different fault depths. In order to further illustrate the ability of the proposed model in bearing fault diagnosis, two additional commonly used intelligent methods are applied here as comparative studies. The training samples and testing samples are input into the SVM and BPNN. The parameter descriptions of the SVM and BPNN are as follows [25]: (1) SVM: RBF kernel, penalty factor equal to 7 and kernel radius equal to 0.1; (2) BPNN: 50 units in the hidden layer; the learning rate follows a discrete staircase schedule that halves it every 200 iterations; the initial learning rate is 0.2, the solver type is "SGD" and the momentum is 0.1. The specific parameter settings of the BPNN model are shown in Table 8. Due to the random initialization of the weights, the classification performance of the same model differs across training processes.
Hence, ten repeated trials based on a randomly selected samples strategy are carried out. The training accuracies and testing accuracies of the ten trials are shown in Figure 10. Table 9 shows the classification performance of the different models over the ten repeated experiments.
As shown in Figure 10, the training accuracy of the proposed model reaches 100% in six of the ten trials, and the testing accuracy reaches 100% in four of the ten trials. All the accuracies of the proposed model stay above 95%, which shows that the proposed model performs well not only in classification accuracy but also in classification stability. As illustrated in Table 9, the normal CNN has an average training accuracy of 91.8% and an average testing accuracy of 91.6%, with an average training time of about 167.3 s. The average training time of the SIF-CNN is about 192.8 s, its average training accuracy is 99.1% and its average testing accuracy is 98.2%. The average training time of the SSIF-CNN is about 124.2 s, its average training accuracy is 99.5% and its average testing accuracy is 98.6%. The SVM and BPNN show poorer classification performances and need more time to complete the model training. Due to the fewer fully connected layer parameters in the SSIF-CNN framework, the model training speed is faster, and the test accuracy is slightly improved. The SSIF-CNN not only controls the complexity of the machine learning but also has a faster convergence speed and a higher identification ratio.


Data Description
The axle-box bearing fault data is obtained with a train rolling test rig from the National Engineering Laboratory for High-Speed Trains. As shown in Figure 11, the test rig consists of load motors, drive wheels, acceleration sensors, speed sensors and a National Instruments (NI) data acquisition system. The load motor drives the driving wheels to rotate, and the high-speed train wheel, as the driven wheel, is driven to simulate the actual working conditions of the high-speed train. Eight rolling bearings, collected from locomotive maintenance, are mounted on the test train, and their conditions are listed in Table 10. Several experiments are carried out under different working conditions, and the vibration data is collected at a sample rate of 20 kHz. The datasets acquired from an experiment running at a speed of 200 km/h are used to identify and classify the axle-box bearing faults. In this experiment, the axle rotation speed is 1233 rpm, and five kinds of defects are included: BF+ORF, IRF+ORF and ORF with three different sizes. Table 11 shows the information of the experimental datasets. Each bearing condition has 200 training samples and 200 testing samples, and each sample contains 2048 data points. The total number of training samples is 1200 (200 × 6), and that of testing samples is also 1200 (200 × 6).


Diagnostic Results
After shuffling the samples, the whole batch of samples is input into the training models for ten repeated experiments, and the results of the first trial are shown in Figure 12. The general CNN converges after 546 iterations, with an average training accuracy of 95.8%. The accuracy of the SIF-CNN reaches 100% after 392 iterations. The convergence of the SSIF-CNN, at 258 iterations, is much faster than that of the general CNN and the SIF-CNN.
The fault classification accuracies of the three models are listed in detail in Tables 12 and 13. Table 12 shows the classification accuracy for the training samples, while Table 13 shows that for the testing samples. In the CNN training process, the accuracy of bearing conditions 4 and 5 only reaches 93% and 86%, whereas all the accuracies of the SIF-CNN model remain above 92%. The SSIF-CNN classifies all the training samples with an accuracy of 100%. In the testing process, the accuracy of bearing conditions 4 and 5 only reaches 85% and 82% with the CNN, respectively, while the SIF-CNN model achieves a classification accuracy of at least 90%. The SSIF-CNN classifies all the testing samples with an accuracy of 100%. Figure 13 shows the confusion matrices, which thoroughly record the diagnosis classification results of the different bearing conditions, including both the correct classifications and the misclassifications.
As shown in Figure 13a,b, the CNN fails to classify bearing conditions 4 and 5, with the lowest accuracy occurring in condition 5 for both training and testing. Figure 13c,d shows that the lowest accuracy of the SIF-CNN likewise occurs in condition 5 for both training and testing. As shown in Figure 13e,f, the proposed method classifies all the fault types accurately.
Figure 13. Confusion matrix of the three models for the first trial.
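A confusion matrix of the kind shown in Figure 13 can be built directly from the true and predicted condition labels. The sketch below uses plain numpy and toy labels (chosen to mimic the CNN's confusion between conditions 4 and 5); it is an illustration, not the paper's evaluation code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows: true bearing condition; columns: predicted condition."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels for six bearing conditions; conditions 4 and 5 are
# partially confused, as observed for the general CNN.
y_true = np.array([0, 1, 2, 3, 4, 5, 4, 5])
y_pred = np.array([0, 1, 2, 3, 4, 5, 5, 4])

cm = confusion_matrix(y_true, y_pred, n_classes=6)
per_class_acc = cm.diagonal() / cm.sum(axis=1)  # diagonal = correct classifications
print(per_class_acc)  # [1.  1.  1.  1.  0.5 0.5]
```

The off-diagonal entries give the misclassification counts plotted in the figure, and the per-class accuracies correspond to the condition-wise values in Tables 12 and 13.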
The feature representation of the fully connected layer of the SSIF-CNN is reduced to a two-dimensional distribution by t-SNE. As shown in Figure 14, the features of the different fault types can be clearly separated, which indicates that the SSIF-CNN is an effective approach for high-speed train axle-box bearing fault classification.
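The t-SNE reduction behind Figure 14 can be sketched as below, assuming scikit-learn is available. The fully connected layer activations here are synthetic stand-ins (class-separated Gaussian blobs); the feature dimension, sample count and t-SNE parameters are assumptions for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical fully connected layer activations: 60 samples, 32-dim
# features, 6 bearing conditions (the real features would come from the
# trained SSIF-CNN).
rng = np.random.default_rng(42)
labels = np.repeat(np.arange(6), 10)
features = rng.standard_normal((60, 32)) + labels[:, None] * 3.0

# Embed into two dimensions for visualisation, as in Figure 14.
embedding = TSNE(n_components=2, perplexity=9.0, init="pca",
                 random_state=0).fit_transform(features)
print(embedding.shape)  # (60, 2)
```

Scattering `embedding` coloured by `labels` yields a plot analogous to Figure 14; well-separated clusters indicate that the learned features discriminate the fault types.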
The training accuracies and testing accuracies of the ten trials are shown in Figure 15. The training accuracy of the proposed model reaches 100% in seven of the ten trials, and the testing accuracy reaches 100% in three of the ten trials. All the accuracies of the proposed model remain above 95%, which shows that the proposed model performs well not only in classification accuracy but also in classification stability. The classification performances of the different models, obtained from ten repeated experiments with a maximum of 2000 iterations, are listed in Table 14. Owing to its fewer parameters, the average training time of the SSIF-CNN, at 64.8 s, is much shorter than those of the general CNN and SIF-CNN. Compared to the result of case 1, the SIF-CNN has more parameters, but its training time is shorter than that of the general CNN, which shows that fusing shallow information into the fully connected layer can improve the training efficiency.
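The ten-trial summary above can be reproduced from per-trial accuracies as follows. The accuracy values here are hypothetical, chosen only to mirror the reported pattern (training at 100% in seven trials, testing in three, nothing below 95%); they are not the measured results.

```python
import numpy as np

# Hypothetical accuracies over the ten repeated trials.
train_acc = np.array([1.0, 1.0, 1.0, 0.99, 1.0, 1.0, 0.98, 1.0, 1.0, 0.99])
test_acc  = np.array([1.0, 0.97, 0.99, 0.96, 1.0, 0.98, 0.97, 1.0, 0.96, 0.98])

def summarize(acc):
    """Mean accuracy, worst trial, and number of trials reaching 100%."""
    return acc.mean(), acc.min(), int(np.sum(acc == 1.0))

mean_tr, min_tr, perfect_tr = summarize(train_acc)
mean_te, min_te, perfect_te = summarize(test_acc)
print(f"train: mean={mean_tr:.3f}, min={min_tr:.2f}, 100% in {perfect_tr}/10 trials")
print(f"test:  mean={mean_te:.3f}, min={min_te:.2f}, 100% in {perfect_te}/10 trials")
```

Averaging the per-trial training times in the same way gives the 64.8 s figure reported for the SSIF-CNN in Table 14.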

Conclusions
For the security and stability of high-speed trains, the failure monitoring of axle-box bearings has stringent reliability and real-time requirements. To address this challenge, this paper proposed a fault diagnosis method for the axle-box bearings of high-speed trains based on a novel CNN model to improve both the computational efficiency and the diagnostic accuracy. The proposed approach takes advantage of shallow information while reducing the number of parameters of the CNN model, which shortens the training time and improves the accuracy of the fault diagnosis. Two case studies were carried out to verify the effectiveness of the proposed model, and the results show that the SSIF-CNN has a higher recognition accuracy and a faster convergence speed.
In future work, the axle-box bearing fault diagnosis method based on the SSIF-CNN needs further optimization. Sensitive feature selection can be carried out to improve the diagnostic efficiency. The structure of the SSIF-CNN, such as the number of layers and the number of neurons in each layer, needs to be optimized to achieve better adaptability. In addition, an SSIF-CNN-based end-to-end deep-learning method should be developed to detect axle-box bearing faults directly from the raw acceleration signal rather than from extracted signal features.