Development of Compound Fault Diagnosis System for Gearbox Based on Convolutional Neural Network

Gear transmissions are widely used in mechanical equipment. In practice, if a gearbox is damaged, it not only affects the yield rate but can also damage other parts of the machine, increasing the cost and difficulty of maintenance. With the advancement of technology, the concept of unmanned factories has been proposed, and an automatic diagnosis system for the health management of gearboxes becomes necessary. In this paper, a compound fault diagnosis system for gearboxes based on a convolutional neural network (CNN) is developed. Specifically, three-axis vibration signals measured by accelerometers are used as the input of a one-dimensional CNN, which directly outputs the existence and type of the fault. In testing, the model achieved nearly 100% accuracy on the fault samples we captured. Experimental evidence also shows that frequency-domain data provide better diagnostic results than time-domain data owing to the stable characteristics of the frequency spectrum. For practical usage, we demonstrate a remote fault diagnosis system through a local area network on an embedded platform. Furthermore, optimization of the number of convolution kernels was investigated: moderately reducing the number of convolution kernels does not affect the diagnostic accuracy but greatly reduces the training time of the model.


Introduction
In modern society, various kinds of products are produced by machines in which gear mechanisms rely on the teeth on the rim to mesh with other gears and transmit torque. Due to their high transmission efficiency, accurate transmission ratio, large power range, and ability to change the speed or direction of motion, gears are widely used in machinery. In recent years, with the advancement of technology and the concept of unmanned factories, using machines to replace workers can not only reduce labor costs but also reduce human errors and dangers caused by long working hours, thereby increasing the production rate and stability.
Research on the fault diagnosis of rotor systems can be traced back to Randall [1], where some common types of gear faults and their frequency-spectrum characteristics were introduced. Generally, accelerometers or laser vibrometers are most commonly used for signal detection, and the spectrum distribution of the signal can be interpreted by experienced technicians to determine the type of fault. However, human judgment sometimes introduces errors. In recent years, with the progress of science and technology, the computing speed of computers has increased dramatically, so machine learning methods that require high-speed computing, such as convolutional neural networks (CNN) and support vector machines (SVM) [2], have once again attracted widespread attention.

Brief Overview of CNN
The main difference between a CNN and a traditional MLP lies in the addition of the convolutional layer and the pooling layer, as shown in Figure 1. The emergence of CNNs has enabled rapid progress in machine learning for vision. A CNN mainly uses the concept of masks from machine-vision studies: convolution operations are performed in the convolutional layer, and the amount of calculation is reduced through the introduction of the pooling layer. Finally, a large number of complex neuron connections, as in an MLP, assist the machine in classification and learning [22,23]. The calculation methods are described below.

Convolutional Layer
The convolutional layer contains several convolution kernels. Each convolution kernel performs a convolution operation on a partial area of the input data; a bias term is then added, and the output is obtained after the activation function. The convolution operation helps extract different features of the input, such as edges or lines, which is why CNNs are good at handling image problems. Here, $K_i^l$ and $b_i^l$ denote the $i$th convolution kernel and bias term in layer $l$, respectively; $x^l(j)$ denotes the $j$th part of the input data in layer $l$; $g$ denotes the activation function; and $y_i^l(j)$ is the output of the $i$th convolution kernel for the $j$th partial data in layer $l$. The calculation of the convolutional layer is as follows:

$$y_i^l(j) = g\left(K_i^l * x^l(j) + b_i^l\right)$$
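The per-kernel operation above can be sketched in plain NumPy. This is a minimal illustration of the equation (one output value per kernel and per partial area), not the paper's implementation; the kernel values and signal are arbitrary.

```python
import numpy as np

def conv1d_layer(x, kernels, biases, activation=np.tanh):
    """Valid 1D convolution: y[i, j] = g(K_i . x[j:j+k] + b_i).

    x: input signal, shape (N,)
    kernels: shape (num_kernels, k)
    biases: shape (num_kernels,)
    """
    n_k, k = kernels.shape
    out_len = len(x) - k + 1
    y = np.empty((n_k, out_len))
    for i in range(n_k):          # i-th convolution kernel K_i^l
        for j in range(out_len):  # j-th partial area x^l(j)
            y[i, j] = activation(np.dot(kernels[i], x[j:j + k]) + biases[i])
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
K = np.array([[1.0, -1.0]])           # one kernel of width 2
b = np.array([0.0])
y = conv1d_layer(x, K, b, activation=lambda z: z)  # identity activation
print(y)  # [[-1. -1. -1.]]
```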

Pooling Layer
The pooling layer is generally placed after the convolutional layer. It retains the output characteristics of the convolutional layer while reducing the size of the output data, thereby reducing the overall computation of the CNN. The most commonly used pooling method is max-pooling, which takes the largest value in a region as the representative output:

$$P_i^{l+1}(j) = \max_{t \in W} q_i^l(t)$$

where $q_i^l(t)$ represents the $t$th data point in the output of the $i$th convolution kernel in layer $l$, $W$ represents the pooled area, and $P_i^{l+1}(j)$ represents the output value of layer $l+1$.
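A compact NumPy sketch of non-overlapping max-pooling over each kernel's output channel (window size and data are illustrative):

```python
import numpy as np

def max_pool1d(q, w):
    """Non-overlapping max-pooling with window w over each channel of q.

    q: shape (channels, length); length is truncated to a multiple of w.
    """
    c, n = q.shape
    n = (n // w) * w
    return q[:, :n].reshape(c, -1, w).max(axis=2)

q = np.array([[1.0, 3.0, 2.0, 5.0, 4.0, 0.0]])
p = max_pool1d(q, 2)
print(p)  # [[3. 5. 4.]]
```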

Flatten Layer
The flatten layer is an important design that connects the pooling layer and the MLP. Its function is to convert the output data of the convolution operations into a one-dimensional form before the data are fed into the MLP.
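Flattening is just a reshape. In the sketch below, the pooled output shape (32 feature maps of 75 points) is a hypothetical size chosen for illustration, not the paper's actual dimensions:

```python
import numpy as np

# Hypothetical pooled output: 32 feature maps of 75 points each.
pooled = np.arange(32 * 75, dtype=float).reshape(32, 75)
flat = pooled.reshape(-1)   # one-dimensional vector handed to the MLP
print(flat.shape)           # (2400,)
```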


Multilayer Perceptron
Multilayer perceptron (MLP) consists of three parts, i.e., the input layer, hidden layer, and output layer, as shown in Figure 2. The calculation of the activations $a^{[l]}$ of layer $l$ is as follows:

$$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = g^{[1]}\left(z^{[1]}\right)$$
$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = g^{[2]}\left(z^{[2]}\right) = \hat{y}$$

where $W$ represents the weighting matrix, $b$ represents the bias term, and $g$ represents the activation function.
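The layer-by-layer recursion above can be sketched as a plain NumPy forward pass. The weights here are random placeholders; ReLU and softmax are common choices for the hidden and output activations but are assumptions, not the paper's stated functions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def mlp_forward(a0, params):
    """Forward pass: z[l] = W[l] @ a[l-1] + b[l], a[l] = g[l](z[l])."""
    a = a0
    for W, b, g in params:
        a = g(W @ a + b)
    return a

rng = np.random.default_rng(0)
a0 = rng.standard_normal(8)                               # flattened features
params = [
    (rng.standard_normal((4, 8)), np.zeros(4), relu),     # hidden layer
    (rng.standard_normal((3, 4)), np.zeros(3), softmax),  # output layer -> y_hat
]
y_hat = mlp_forward(a0, params)
print(y_hat.sum())  # softmax output sums to 1
```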

CNN Architecture of This Study
In this study, the adopted CNN architecture refers to the work of Jiao et al. [19], with the size of the convolution kernels of the convolutional layers modified. The detailed parameter counts and the architecture of the CNN are shown in Table 1 and Figure 3.
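A Keras sketch of a 1D CNN in this spirit is shown below. The kernel counts (16 and 32) and the 240-neuron fully connected layer come from the paper; the kernel widths, pool sizes, dropout rate, input length (10 kHz x 1 s, three axes), and the 15-class output are illustrative assumptions rather than the exact Table 1 values.

```python
import tensorflow as tf  # tested against TF 2.x

def build_model(input_len=10000, channels=3, fc_neurons=240, num_classes=15):
    """1D-CNN sketch in the spirit of Table 1 (layer sizes partly assumed)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_len, channels)),
        tf.keras.layers.Conv1D(16, 64, activation="relu"),  # 16 kernels, layer 1
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(32, 3, activation="relu"),   # 32 kernels, layer 2
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(fc_neurons, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```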


Description of Experimental System
The experiment in this study uses a servo motor to drive helical gearboxes, as shown in Figure 4, where three accelerometers are mounted on the base to capture the vibration waveform of the gearbox. The base is made of aluminum and measures about 1000 mm × 200 mm × 25 mm. The structure of the gearbox is shown in Figure 5. The system development is based on Python 3.7.6, and TensorFlow 2.1 is used to construct the CNN. The computer used for network training has an i7-8700 CPU, 24 GB of memory, and an NVIDIA RTX 2060 (6 GB) GPU. The sampling frequency of the vibration signal is 10 kHz and the time span of each sample is 1 s. The captured signals are then fed into the CNN model for fault classification.


In this study, five types of gearboxes with different faults were produced by hand, including wear, broken teeth, loosening, misalignment of the input shaft, and eccentricity of the gear shaft, as shown in Figure 6. The faults are described as follows:
• Wear: gear C and gear D are worn by carborundum;
• Broken tooth: one tooth of gear C is ground out;
• Loosening: no key is installed between the gear and the shaft, but they are tightly matched, making slipping possible;
• Input shaft with misalignment: the input shaft bearing seat is enlarged by 0.3 mm, and the input shaft is padded so that it is not concentric with the shaft of gear A;
• Gear shaft with eccentricity: the bearing inner diameters of gear B and gear C are enlarged by 2 mm.
In order to diagnose compound faults, a coupling is used to connect two faulty gearboxes to simulate a compound fault, as shown in Figure 7. Data are acquired at three speeds of 1800, 2400, and 3000 rpm, with 500 data records for each speed. The ratio of the training set to the test set is 7:3; that is, in each fault combination, the training set has 350 data records and the test set has 150 data records.
In general, the characteristics of a fault may appear in the time domain or the frequency domain. In this study, in addition to using the time-domain and the frequency-domain data as training input for the CNN model, the time-domain and frequency-domain data are also combined as new training input. Due to the large difference in amplitude between time-domain and frequency-domain data, each part is normalized before splicing. In addition, in mechanical applications, high-frequency data are often filtered out because they are considered to contain a lot of noise; therefore, this study also compares the results of filtered and unfiltered data. Five data formats are thus covered: full-time (FT) data (Figure 8), low-frequency (LF) data (Figure 9), full-frequency (FF) data (Figure 10), full-time combining low-frequency (FTCLF) data with normalization (Figure 11), and full-time combining full-frequency (FTCFF) data with normalization (Figure 12). The waveforms in Figures 8–12 were captured under the condition of wear with broken tooth.
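The five formats can be derived from one 10 kHz, 1 s record roughly as follows. The FFT magnitude and min-max normalization steps follow the description above; the 1000 Hz low-frequency cutoff and the normalization scheme are illustrative assumptions (the paper does not state them here).

```python
import numpy as np

FS = 10_000        # sampling frequency (Hz)
T = 1.0            # record length (s)

def to_formats(x, lf_cutoff_hz=1000):
    """Build the five input formats from one vibration record x of len FS*T.

    lf_cutoff_hz is an assumed value for the low-frequency band.
    """
    ft = x                                        # full-time (FT)
    ff = np.abs(np.fft.rfft(x))                   # full-frequency (FF) magnitude
    lf = ff[: int(lf_cutoff_hz * T) + 1]          # low-frequency (LF) part
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    ftclf = np.concatenate([norm(ft), norm(lf)])  # FT combining LF, normalized
    ftcff = np.concatenate([norm(ft), norm(ff)])  # FT combining FF, normalized
    return ft, lf, ff, ftclf, ftcff

t = np.arange(int(FS * T)) / FS
x = np.sin(2 * np.pi * 120 * t)                   # synthetic 120 Hz tone
ft, lf, ff, ftclf, ftcff = to_formats(x)
print(len(ff), len(lf))  # 5001 1001
```

With a 1 s record the FFT bin spacing is 1 Hz, so the synthetic 120 Hz tone peaks exactly at bin 120.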

Remote Diagnosis
In a factory, multiple gearboxes are often used in the transmission mechanism, and they may be scattered widely. Usually, an accelerometer is placed next to each gearbox to measure the vibration signal. To facilitate the collection of signals, wireless transmission can be used: a remote site builds a receiving center to collect the signals and perform individual fault diagnosis, and the results are then uploaded to the control center, realizing instant fault detection on the production line. To verify this approach, the Raspberry Pi platform [24] is used for remote fault diagnosis. When training of the CNN model is completed, the model is stored in the Raspberry Pi and the receiving program is executed. When new vibration data are captured, they are sent to the Raspberry Pi via Wi-Fi using the "Socket" API of the computer. The Raspberry Pi performs pre-processing such as FFT conversion of the vibration data, feeds the data into the CNN model for diagnosis, and then displays the diagnosis results. The diagnosis process is shown in Figure 13.

From Figure 13, the personal computer is the client and the Raspberry Pi is the server. On the server, the port number, the location of the vibration data, and the selected CNN model are required as input. When the data are transmitted to the Raspberry Pi, the diagnosis is executed, and the result is displayed on the screen of the Raspberry Pi when the calculation is complete. It only takes about 1 to 2 s for each data record to be transmitted and diagnosed.
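A minimal client/server sketch of this transfer using Python's standard `socket` module is shown below. The length-prefixed framing, the port number, and the stub `diagnose` function (standing in for FFT pre-processing plus the CNN) are illustrative assumptions, not the paper's actual code.

```python
import socket
import struct
import threading
import time

def serve_once(port, diagnose):
    """Receiver sketch (Raspberry Pi side): accept one connection, read a
    length-prefixed payload, reply with the diagnosis string."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    (n,) = struct.unpack("!I", conn.recv(4))
    buf = b""
    while len(buf) < n:
        buf += conn.recv(4096)
    conn.sendall(diagnose(buf).encode())
    conn.close()
    srv.close()

def send_record(port, payload):
    """Client sketch (measurement PC side): send one vibration record as
    raw bytes and wait for the diagnosis string."""
    for _ in range(50):                      # retry until the server is up
        try:
            cli = socket.create_connection(("127.0.0.1", port), timeout=2)
            break
        except OSError:
            time.sleep(0.05)
    cli.sendall(struct.pack("!I", len(payload)) + payload)
    result = cli.recv(1024).decode()
    cli.close()
    return result

# Demo: the lambda stands in for FFT pre-processing plus the CNN model.
t = threading.Thread(target=serve_once,
                     args=(50007, lambda b: "wear+broken tooth (%d bytes)" % len(b)))
t.start()
result = send_record(50007, b"\x00" * 1024)
t.join()
print(result)
```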

Experimental Results
Here, the five faulty gearboxes described in Section 3.1 and a normal gearbox are used to perform compound fault diagnosis. Considering the operating condition of combining two gearboxes (as shown in Figure 7), there are 15 different combinations. Two situations under compound faults are discussed as follows.

Case (a): Comparison of Various Input Data Types and a Different Number of Neurons in a Fully Connected Layer
Since the initial random values of the parameters in the training process affect the results of the CNN, this study trains each model 10 times, with a different random sequence for each training, thereby reducing the impact of randomness. The model is then represented by the best performance among the 10 training results; the hyperparameters are shown in Table 2.
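The best-of-10 protocol above can be sketched as follows. The `train_once` function is a dummy stand-in returning a fake accuracy; a real run would reseed the framework, re-initialize the model, and train it fully.

```python
import random

def train_once(seed):
    """Stand-in for one CNN training run; returns a dummy test accuracy.
    A real run would reseed TensorFlow/NumPy and retrain the model."""
    rng = random.Random(seed)
    return 0.95 + 0.05 * rng.random()   # dummy accuracy in [0.95, 1.0)

# Train the same model 10 times with different random sequences and
# represent the model by its best result.
results = [train_once(seed) for seed in range(10)]
best = max(results)
print(f"best of 10 runs: {best:.4f}")
```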


As described in Section 3.1, 30% of the data records are reserved for testing or validation and are not used in model training. The overall accuracy and loss trends with each epoch for the training set and the cross-validation set are shown in Figures 14 and 15, respectively, where FF data are used as input and 240 neurons are adopted in the fully connected layer. From Figure 14, the accuracy of the training set gradually increases with successive training, while the accuracy of the cross-validation set follows the trend and eventually approaches 100%. The accuracy gap between the two curves is related to the dropout mechanism: due to the software configuration of the library, dropout is enabled in the training process, which reduces the reported training accuracy, whereas dropout is turned off in the validation process. Another reason for the gap is that the fault signals of the gearboxes are not diverse enough due to limited fault samples.
In order to find the most suitable number of neurons in the fully connected layer, the number of neurons starts from 40 and increases by 40 each time up to 240. After training, the accuracy results of the test set are shown in Table 3 and Figure 16.
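The sweep over fully connected layer widths can be sketched as below. The `evaluate_model` function here is a hypothetical stand-in with a dummy accuracy curve peaking at a moderate width, echoing the observation that too few neurons underfit and very many risk overfitting; the real evaluation trains and tests a full model per width.

```python
def evaluate_model(n_neurons):
    """Hypothetical stand-in: train/test a model with n_neurons in the FC layer.
    Dummy curve only; not measured data."""
    return 1.0 - abs(n_neurons - 160) / 2000.0

neuron_counts = list(range(40, 241, 40))        # 40, 80, ..., 240
scores = {n: evaluate_model(n) for n in neuron_counts}
best_n = max(scores, key=scores.get)
print(best_n)  # 160 with this dummy curve
```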
From Figure 16, it is obvious that the accuracy in the FT case is the worst, while the FF data type obtains the best result. This is because the starting point of the time-sequential data is not fixed, which implies that the characteristics of the time-domain data are not stable. Since a CNN is not good at handling sequential problems, it is hard to extract sufficient and stable features from the time-domain data, so the accuracy is the lowest. On the other hand, the frequency-domain data are obtained by converting the time-domain data through the FFT; since the frequency-domain data do not depend on the starting point of each sampling, they provide more stable characteristics than the time-domain data. Therefore, the diagnostic accuracy in the FF case is the best. For the same reason, the results of FTCLF and FTCFF being worse than FF can be explained by the time-domain portion of FTCLF and FTCFF deteriorating the characteristic information.

Next, comparing the FF and LF cases, the FF method performs better than the LF method, indicating that the high-frequency signal contains fault characteristics. If the high-frequency signal is discarded, the diagnostic accuracy naturally suffers.
Then, comparing the results for different numbers of neurons in the fully connected layer, a moderate number of neurons gives a more stable result. With only 40 neurons, the network cannot effectively respond to various inputs due to insufficient capacity; with as many as 240 neurons, there may be a risk of overfitting.

Case (b): Optimization of the Number of Convolution Kernels
From Table 1, the numbers of convolution kernels used in the first and second convolutional layers are 16 and 32, respectively. When observing the output of the pooling layer, we find that only a few convolution kernels produce outputs, while the others are almost zero. This may indicate that the scale of this CNN is too large for this diagnostic system, so that only a small number of variables are actually used and many convolution kernels are driven toward zero during training. A more suitable number of convolution kernels can therefore be chosen by observing the kernel outputs. After checking the output values, we find that only two convolution kernels (the 1st and 13th) are required in the first convolutional layer and only one convolution kernel (the 5th) is required in the second convolutional layer. After changing the number of convolution kernels in the two convolutional layers, the number of internal variables in the model is greatly reduced, as shown in Table 4. After the revised CNN model is trained, the accuracy results of the test set are shown in Table 5 and Figure 17.
Comparing Figures 16 and 17, the overall performance of the revised CNN is significantly improved, because the chance of each variable of the revised model being trapped in a local minimum during training is reduced. Except for the FT data type, when the number of neurons in the fully connected layer is 80 or more, the diagnostic accuracy of each model reaches more than 95%, and the accuracy in the FF and FTCFF cases is almost 100%.
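The kernel-selection step above can be sketched as follows: rank kernels by the magnitude of their pooled outputs and keep only those that are not almost zero. The threshold and the toy data are illustrative assumptions.

```python
import numpy as np

def active_kernels(pool_out, threshold=1e-3):
    """Indices of convolution kernels whose pooled output is not almost zero.

    pool_out: shape (num_kernels, length) — pooling-layer output per kernel.
    """
    energy = np.abs(pool_out).max(axis=1)
    return np.flatnonzero(energy > threshold)

# Toy pooling output: 16 kernels, only kernels 0 and 12 carry signal
# (mirroring the finding that the 1st and 13th kernels sufficed).
pool_out = np.zeros((16, 100))
pool_out[0] = 0.8
pool_out[12] = 0.5
print(active_kernels(pool_out))  # [ 0 12]
```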

In order to compare the training time before and after the revision, each model is trained for 2 epochs and the required time is recorded, as shown in Table 6.

It can be seen from Table 6 that the training time of the revised model is effectively reduced by 27.42%.

Conclusions
In this paper, a remote compound fault diagnosis system for gearboxes based on a one-dimensional CNN is developed. In testing, the model achieved nearly 100% accuracy on the fault samples we captured. The main contributions of this work are the comparison of various input data types and the optimization of the number of convolution kernels. In our results, the frequency-domain data provide better diagnostic accuracy than the time-domain data because of the stable characteristics in the frequency spectrum. The results also show that the high-frequency signal retains useful information for fault diagnosis. In addition, appropriate selection of the number of kernels can not only reduce training time but also improve accuracy. For practical usage, a measured vibration signal can be transmitted via a local area network; ideally, it takes only about 1 to 2 s for data transmission and processing, thus achieving rapid diagnosis. Finally, it should be noted that the number of faulty gearboxes used in this paper is limited; in actual applications, if more on-site data can be used to increase the diversity of the training data, the neural network model will be more complete.

Conflicts of Interest:
The authors declare no conflict of interest.