With the development of intelligent manufacturing, mechanical equipment has become increasingly sophisticated and production processes increasingly complex. The components are ever more tightly coupled, so a failure at any point can trigger a chain reaction with serious consequences. Condition monitoring and diagnosis of mechanical equipment is therefore indispensable for ensuring safe and reliable operation. Only when technicians know the health status of the equipment in time can hidden faults be found and eliminated effectively, thereby improving production efficiency and reducing the economic loss of the enterprise. Mechanical equipment is composed of transmission parts such as shafts, bearings, gears, and belts. As the most typical key component of rotating equipment, gears operate at high speed and under high load for long periods, are the most prone to failure, and are widely representative [1]. Therefore, this study investigates fault diagnosis methods for gears.
The traditional fault diagnosis method relies on the vibration signal measured by an acceleration sensor. However, in high-temperature, highly corrosive, or toxic environments, contact measurement is limited. Moreover, different fault types manifest differently: in some cases the vibration signal characterizes the fault better than the sound signal, and in other cases the opposite is true. Therefore, acquiring the sound signal by non-contact measurement is particularly important. Sound is a wave generated by the vibration of an object, transmitted through the air and sensed by a microphone, so non-contact measurement can also reflect the running state of equipment. On the production line, experienced maintenance technicians can judge a fault from abnormal sound during the operation of equipment. The vibration and sound signals of the equipment are thus complementary and mutually reinforcing. Lu [2] proposed a sound-field feature extraction method based on acoustic information, which was combined with a support vector machine to establish the relationship between sound field characteristics and bearing state and to realize fault classification. Zhao [3] proposed a combined acoustic and vibration diagnosis method using acoustic sensors, which effectively improved the correctness and practicability of fault diagnosis of high-voltage circuit breakers. Khazaee [4] proposed an effective method for fault diagnosis of planetary gearboxes based on vibration data fusion, which achieved 98% accuracy. Moosavian [5] analyzed the sound and vibration signals of automotive spark plugs and achieved a fault diagnosis accuracy of 98.56% based on evidence theory. Othman [6] analyzed different vibration and sound signal processing methods for bearings; the experimental results show that combining vibration with sound gives better diagnosis results than either signal source alone.
According to the characteristics of machine vibration signals, data-driven methods, mathematical models, and pattern recognition methods are widely applied to fault diagnosis [7]. The convolutional neural network (CNN) is a deep feedforward artificial neural network [10]. A CNN automatically learns and extracts features from the training set through convolution and pooling and thereby classifies or predicts the test set, which avoids dependence on manual expertise and fully reflects the inherent relationships in the data. CNNs are widely used in various fields. In the ImageNet competition, Krizhevsky [12] used a deep CNN to classify images and achieved the best results, which set off a wave of CNN research. Sermanet [13] used a CNN for house number recognition, using the pooling operation to adjust feature weights so that strong features are emphasized and weak ones suppressed. This method obtained 95.1% classification accuracy on the Street View House Numbers (SVHN) dataset. Simard [14] used a CNN for handwriting recognition and augmented the data with elastic deformation. In 2015, He [15] designed the ResNet network based on the idea of residual convolution, which for the first time achieved a recognition rate higher than that of the human eye in the field of computer vision. In 2016, Aytar [16] proposed the well-known SoundNet network, which uses 2D and 1D CNNs to extract video and audio features, respectively, and greatly improved the sound recognition rate. In the same year, Silver's [17] AlphaGo defeated the Go champion Lee Sedol, attracting worldwide attention and demonstrating the highest level of artificial intelligence at the time; at its core, it still relies on CNNs. In 2017, Esteva [18] designed a CNN for diagnosing skin cancer, which reached the level of human experts for the first time and was published in Nature. These achievements show that CNNs have produced fruitful results in various fields.
The preceding research shows that effectively fusing the vibration and sound of the workpiece has a positive effect on the diagnosis of equipment failure. Combining this with the strong recognition capacity of CNNs, this study takes the gear, which is the most common part of mechanical equipment, as the research object and introduces the sound signal alongside the vibration signal to form complementary multi-source information. A method that fuses vibration and sound signals based on improved DS (IDS) theory and convolutional neural networks is proposed for gearbox fault diagnosis (IGFD-CNN). First, the gearbox fault diagnosis platform is built in a semi-anechoic chamber, and vibration and sound signals under different working conditions are collected. Second, an adaptive stacked convolutional neural network (ASCNN) is proposed for the vibration signal. In [19], the authors proposed an approach based on the continuous wavelet transform (CWT) for broken bar diagnosis and obtained good results. Following this conventional feature extraction approach, the vibration signal is transformed from the time domain to the time-frequency domain by wavelet transform in this paper, and the time-frequency diagram is sent to the ASCNN model for diagnosis. Third, an end-to-end stacked convolutional neural network (ESCNN) is proposed for the sound signal, which avoids the dependence of manual feature extraction on background knowledge. Feature extraction and fault classification are combined into one model and completed adaptively: the original sound signal is sliced and then sent directly to the ESCNN model. Finally, because a single signal source cannot fully reflect the information of the measured object, the multi-sensor fusion algorithm, IDS theory, is used to fuse the diagnosis outputs of the vibration and sound models and obtain a more accurate and reliable estimate of the equipment operation state.
2. Establishing a Diagnostic Model
The proposed IGFD-CNN diagnosis model, which is based on vibration and acoustic signals, consists of four parts. In the first part, the original vibration and sound data are acquired, as described in Section 3. In the second part, an ASCNN model for gear fault diagnosis based on the time-frequency diagram of the vibration signal is proposed. The model includes an input layer, three convolutional layers, two sub-sampling layers (pooling layers), a fully connected layer and an output layer. To extract as many local features as possible, a small-scale convolution kernel is used to filter the time-frequency map in the convolutional part. Adaptive feature extraction and dimensionality reduction of the time-frequency map are achieved by stacking the convolution and pooling layers, which completes the pattern recognition of gear faults. In the third part, an ESCNN model for gear fault diagnosis based on the acoustic signal is proposed. The model includes an input layer, four convolutional layers, two sub-sampling layers, a fully connected layer, and an output layer. Acoustic signals are fed directly into the model, which removes the need for manual feature extraction; feature extraction is completed by the first two convolutional layers. The parameters of the ASCNN and ESCNN models are randomly initialized at the beginning of training. During training, the error between the predicted and true values is calculated, and error back propagation is used to correct the parameters until the termination condition is satisfied. In addition, each convolutional layer is followed by a rectified linear unit (ReLU) activation function. A batch normalization layer is used to accelerate and improve the learning process of the models, and a 50% dropout probability is applied to the fully connected layer to prevent overfitting. After training, the testing dataset is sent to the model, and the gear fault diagnosis accuracy is obtained by comparing the real and predicted labels of the samples. In the fourth part, the outputs of the ASCNN and ESCNN models are fused using the improved DS theory (for further theoretical details, please refer to [20]). The diagnosis from each single signal is taken as evidence, and a further fusion decision is made to improve the stability and reliability of the diagnosed running state of the gear. The architecture of our proposed IGFD-CNN model is shown in Figure 1.
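As an illustration of how stacking convolution and pooling layers reduces the dimensionality of the time-frequency map, the following minimal sketch traces the feature-map side length through three convolutional and two pooling layers. The 64 × 64 input, the 3 × 3 kernels, the 2 × 2 non-overlapping pooling windows, the unit stride without padding, and the layer ordering are assumptions made only for illustration; they are not the ASCNN parameters, which are not specified in this section.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Side length after a square convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Side length after non-overlapping pooling (remainder is dropped)."""
    return size // window


def trace_sizes(input_size, layers):
    """Return the side length after the input and after every layer."""
    sizes = [input_size]
    size = input_size
    for kind, k in layers:
        size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
        sizes.append(size)
    return sizes


# Hypothetical layer stack: three convolutions and two pooling layers.
HYPOTHETICAL_LAYERS = [
    ("conv", 3), ("pool", 2),
    ("conv", 3), ("pool", 2),
    ("conv", 3),
]

print(trace_sizes(64, HYPOTHETICAL_LAYERS))  # [64, 62, 31, 29, 14, 12]
```

Under these assumed settings, a 64 × 64 map shrinks to 12 × 12 before the fully connected layer, which is the kind of dimensionality reduction the stacking operation provides.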
3. Experimental Setup
To verify the effectiveness of the proposed model, experiments were conducted on a gearbox fault experimental platform in a semi-anechoic room. Figure 2 shows the composition of the experimental platform and the positions of the vibration and acoustic sensors. The platform is composed of a two-stage gearbox, a variable frequency motor, a frequency converter, a magnetic brake component, a tension controller, sensors, and a data acquisition card and system [21]. The entire experimental platform is located below the microphone array rack. The 4189-A-021 free-field microphone is fixed on the array frame to collect the acoustic signals of the different gear states. The CY1010L piezoelectric accelerometer is mounted horizontally on the side of the gearbox to collect the vibration signals. The vibration and acoustic signals are saved on the terminal through the data acquisition card for analysis. The top-left corner of Figure 2 shows the internal configuration of the gearbox. In this study, the big gear is selected as the faulty gear. Three fault conditions, namely pitting, broken teeth and wear, were produced by electrical discharge machining (EDM), as shown in Figure 3. During the experiment, the entire gear transmission system was driven by a motor. The motor speed was set to 900, 1800 and 2700 r/min to simulate different working conditions. The load was controlled by the magnetic powder brake, and two states were set: load and no-load. The experiment assumed that the interference from other parts of the gearbox was small, so the measured vibration and acoustic signals were considered to originate only from the gears. The sampling frequencies of the vibration and acoustic signals are 12 kHz and 16 kHz, respectively. The sampling interval is 5 min and the sampling duration is 60 s.
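To give a sense of the data volume implied by these acquisition settings, the sketch below computes the number of points in one 60 s record at each sampling frequency and cuts a record into fixed-length, non-overlapping segments. The segment length of 4096 points and the helper names are hypothetical; the slicing scheme actually used is not stated here.

```python
def points_per_record(sampling_hz, duration_s):
    """Number of samples in one record of the given duration."""
    return sampling_hz * duration_s


def slice_signal(signal, segment_length):
    """Cut a 1-D signal into non-overlapping segments, dropping the remainder."""
    n_segments = len(signal) // segment_length
    return [
        signal[i * segment_length:(i + 1) * segment_length]
        for i in range(n_segments)
    ]


vibration_points = points_per_record(12_000, 60)  # 12 kHz for 60 s
acoustic_points = points_per_record(16_000, 60)   # 16 kHz for 60 s

# Placeholder record of zeros standing in for a measured acoustic signal.
acoustic_record = [0.0] * acoustic_points
SEGMENT_LENGTH = 4096  # hypothetical slice length
segments = slice_signal(acoustic_record, SEGMENT_LENGTH)

print(vibration_points, acoustic_points, len(segments))  # 720000 960000 234
```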
A total of 10 different operating conditions that correspond to the three speeds of the motor were simulated to ensure the diversity of the samples. For a deep learning diagnostic model, the training set is used to train the model, and the testing set is used to test its performance. The experimental data were obtained from the raw data by random sampling so that the performance of the proposed model could be evaluated objectively. Three datasets (vibration, acoustic and hybrid), all from the no-load condition, were built. The vibration dataset contains 200 samples for each of the 10 operating conditions. A total of 75% of the samples were randomly selected as the training set, and the rest were used as the testing set. Thus, the vibration dataset has a total of 2000 samples, including 1500 training samples and 500 testing samples. The acoustic dataset was created in the same way. The hybrid dataset was constructed by mixing the vibration and acoustic datasets and has a total of 4000 samples, including 3000 training samples and 1000 testing samples. Different datasets are selected for training and testing in different models and methods. The description of the datasets is shown in Table 1.
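The sample counts above follow from 10 operating conditions × 200 samples and a 75%/25% split. A minimal sketch of such a random split is given below; pooling all samples before shuffling (rather than splitting each condition separately), the fixed seed, and the function name are assumptions, since the exact sampling procedure is not described.

```python
import random


def build_split(n_conditions=10, per_condition=200, train_fraction=0.75, seed=0):
    """Randomly split (condition, sample index) pairs into training and testing sets."""
    samples = [
        (condition, index)
        for condition in range(n_conditions)
        for index in range(per_condition)
    ]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(samples)
    n_train = int(len(samples) * train_fraction)
    return samples[:n_train], samples[n_train:]


train_set, test_set = build_split()
print(len(train_set), len(test_set))  # 1500 500
```

The hybrid dataset counts follow in the same way: mixing the vibration and acoustic datasets doubles each figure to 3000 training and 1000 testing samples.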
6. Diagnostic Performance Analysis of IGFD-CNN
The sum of the output probabilities of each single-source model at the SoftMax layer was exactly 1, which satisfied the requirement of evidence theory that the basic probability assignments (BPA) sum to 1. Therefore, the 10 operating states of the gearbox were used as the identification framework for evidence theory. The output of ASCNN was used as the first evidence (m1), and the output of ESCNN was used as the second evidence (m2). The results of the two models were then fused by IDS theory to obtain an accurate fault identification of the gearbox. To validate the proposed IGFD-CNN model, we compared its diagnostic results with those of ASCNN and ESCNN. To reduce the impact of randomness on the results, we ran each model 10 times. The precision of each run was recorded and drawn as a box plot, as presented in Figure 19.
Figure 19 shows that ASCNN achieves its highest diagnostic rate on the eighth type of fault and its lowest on the second type. Each of these has an outlier, as does the sixth type. The overall performance of the model is spread between 93% and 98%. The box of each type of fault is relatively long, which shows that the 10 diagnosis results are widely dispersed. Except for the sixth type, the median lines of the other nine boxes lie in the upper part of the box, close to the maximum of each category. ESCNN also achieves its highest diagnostic rate on the eighth type of fault and its lowest on the second type. Outliers appear in the second, fourth, fifth, sixth, and eighth types of faults, whose diagnosis rates span the range from high to low within ESCNN. The overall performance of the model is spread between 92% and 98%, which is relatively low and scattered. Compared with ASCNN, the box of each type of fault is shorter, which indicates that the 10 diagnosis results are more tightly clustered. Although the median line of the fifth type of fault is lower, the median lines of the other nine types lie near the middle of the box, indicating stable behavior. IGFD-CNN likewise achieves its highest diagnostic rate on the eighth type of fault and its lowest on the second type, so the three models agree on the fault categories with the highest and lowest diagnostic rates. The diagnosis rate of the eighth type of fault reached 100%, and outliers appeared in the second, third and fifth types of faults, all close to their boxes. The overall performance of the model was spread between 95% and 99%, which is higher and more concentrated than that of ASCNN and ESCNN. The box of the second type of fault is longer, whereas the boxes of the other nine types are shorter and their median lines lie in the upper part of the box, indicating that the 10 diagnosis results are well converged and concentrated. Among the three models, the IGFD-CNN model has the shortest boxes and the smallest fluctuation range for every type of fault except the second. Therefore, the IGFD-CNN model is more effective than ASCNN and ESCNN.
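For reference, the classical Dempster combination rule for two such pieces of evidence can be sketched as follows. Because each SoftMax output assigns mass only to single fault classes, the rule reduces to an element-wise product followed by normalization by the non-conflicting mass. This is the traditional DS rule, not the improved IDS rule used in IGFD-CNN (see [20] for the latter), and the three-class BPAs are invented purely for illustration.

```python
def dempster_combine(m1, m2):
    """Combine two BPAs defined on the same singleton classes.

    Returns the fused BPA and the conflict factor K.
    """
    if len(m1) != len(m2):
        raise ValueError("both BPAs must cover the same classes")
    agreement = [a * b for a, b in zip(m1, m2)]  # mass where both sources agree
    total_agreement = sum(agreement)
    if total_agreement == 0:
        raise ValueError("evidence is in total conflict; the rule is undefined")
    conflict = 1.0 - total_agreement
    fused = [x / total_agreement for x in agreement]
    return fused, conflict


# Invented example: m1 stands in for an ASCNN output, m2 for an ESCNN output.
m1 = [0.7, 0.2, 0.1]
m2 = [0.6, 0.3, 0.1]
fused, conflict = dempster_combine(m1, m2)
print([round(x, 3) for x in fused], round(conflict, 2))
# approximately [0.857, 0.122, 0.02] with conflict 0.51
```

In this example the class favored by both sources gains belief (from 0.7 and 0.6 to about 0.86), which is the reinforcing effect that evidence fusion is intended to provide.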
To further validate the IGFD-CNN model, we compared it with other fusion methods: median voting fusion (MVF) [29], proportional conflict redistribution rule 5 (PCR5) [30], and traditional evidence theory (DSCNN). Similarly, to reduce random error, we ran each model 10 times and took the average of the 10 experimental results as the final result of data fusion, as shown in Figure 20. Figure 20 shows that single and multiple sources produce different diagnostic results. With a single source, the accuracy of fault identification is as low as 92% and as high as 98%; the large error and wide fluctuation range arise from sensor accuracy, installation position, environment and other factors, so a single sensor cannot accurately and comprehensively reflect the health state of the gear. Combining multiple signal sources at multiple levels is necessary to obtain a consistent interpretation and description of the tested object. Comparing the results of the four fusion methods, we find that the IGFD-CNN model has the highest average fault recognition rate (97.7%). The PCR5 fusion rule has the second highest (97%), and DSCNN, which is the traditional evidence theory fusion method, is the third (96.1%). The MVF fusion method is the lowest (95.6%). The improved fusion algorithm in this study uses the weights of the evidence and of the sensor to modify the BPA of the evidence, and selects the fusion rule for the modified evidence according to the relationship between the threshold and the conflict factor. As a result, the belief in evidence with a high confidence level is progressively reinforced and the belief in evidence with a low confidence level is progressively reduced, and the ideal diagnosis rate is obtained. The PCR5 fusion rule is mainly applicable to the case of complete conflict of evidence. The evidence in this section is not completely in conflict, so PCR5 cannot show its advantages. Its BPA is allocated according to the original confidence level, which is relatively conservative; thus, PCR5 has the second highest diagnostic rate. The diagnosis rate of the traditional DS fusion method is higher than that of a single signal, which demonstrates the advantage of fusion. The principle of the median voting algorithm is “voting, majority passing”. This algorithm is simple and fast, requires no complex operations, and can be completed in the shortest time, but its diagnostic rate is the lowest of the four methods.
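The remark that PCR5 is mainly suited to highly conflicting evidence can be illustrated with a small sketch. The function below implements the two-source PCR5 rule for BPAs that assign mass only to single classes, as SoftMax outputs do: agreeing mass is kept, and each partial conflict m1(A)·m2(B) is redistributed to A and B in proportion to the masses that produced it. The example BPAs are invented, and this is a sketch under those assumptions, not the implementation used in the experiments.

```python
def pcr5_combine(m1, m2):
    """Two-source PCR5 fusion for BPAs defined on the same singleton classes."""
    if len(m1) != len(m2):
        raise ValueError("both BPAs must cover the same classes")
    n = len(m1)
    fused = [m1[a] * m2[a] for a in range(n)]  # mass where both sources agree
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            partial_conflict = m1[a] * m2[b]  # source 1 says a, source 2 says b
            if partial_conflict == 0:
                continue
            denominator = m1[a] + m2[b]
            # Redistribute the conflicting mass proportionally to its origins.
            fused[a] += m1[a] * partial_conflict / denominator
            fused[b] += m2[b] * partial_conflict / denominator
    return fused


# Partially conflicting, invented evidence: the result still sums to 1.
partial = pcr5_combine([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])

# Completely conflicting evidence, where the classical DS rule is undefined.
total = pcr5_combine([1.0, 0.0], [0.0, 1.0])
print(total)  # [0.5, 0.5]
```

For completely conflicting sources PCR5 still returns a valid BPA by splitting the conflict, whereas normalization in the classical rule would divide by zero; for mildly conflicting evidence such as the first example, this advantage does not come into play.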