A Study on Gear Defect Detection via Frequency Analysis Based on DNN

Abstract: In this paper, we introduce a gear defect detection system using frequency analysis based on deep learning. Existing defect diagnosis systems using acoustic analysis feed spectrogram, scalogram, and MFCC (Mel-Frequency Cepstral Coefficient) images into convolutional neural network (CNN) models to diagnose defects. However, using visualized acoustic data as input to CNN models requires a lot of computation time. Although computing power has improved, low-performance processors are still used for reasons such as cost-effectiveness. In this paper, only the sums of frequency bands are used as input to a deep neural network (DNN) model to diagnose gear faults. The system diagnoses defects using only a few specific frequency bands, so it ignores unnecessary data, and it does not require high performance because it uses a relatively simple deep learning model for classification. We evaluate the performance of the proposed system through experiments and verify that real-time diagnosis of gears is possible, in contrast to the CNN model. The result showed 95.5% accuracy for 1000 test data and took 18.48 ms per datum, which verified the capability of real-time diagnosis in a low-spec environment. The proposed system is expected to be effectively used to diagnose defects in various sound-based facilities at a low cost.


Introduction
Since the advent of the third industrial revolution, automation of various manufacturing plants has been achieved [1][2][3]. An automated system refers to a system that does not require manpower but uses equipment such as computers and robots to operate the entire process [4]. A smart factory is an intelligent factory that can efficiently produce products by integrating elements from the entire process, such as planning, design, production, distribution, and sales, with Cyber Physical Systems (CPS), the Internet of Things (IoT), robots, 3D printing, and big data [5]. Automated plants have been applied to many industrial sites because of their potential to improve productivity and reduce labor costs. Therefore, smart factories have been highlighted and studied intensively [6,7].
Automated machines can reliably increase production, but when defects or failures occur, it is difficult to find the cause of the problem due to the complex production process and system [8]. This is especially true when one has to go through a complicated process of disassembling and inspecting equipment, such as piping and assembled machines [9]. Real-time fault diagnosis of automated machines is an important technology that can prevent both economic and human damage. Although periodic failure inspections of such automation equipment are required for stable operation of the automation process, these inspections require a lot of manpower and cost.
Acoustic analysis refers to analyzing sound signals collected through sensors such as microphones. It is widely used because target data can be obtained with inexpensive sensors without dismantling the target [10,11]. The analysis of acoustic signals identifies time, amplitude, and frequency components, and extracts characteristics of interest by applying various techniques according to the purpose. Above all, frequency analysis of acoustic signals makes it easy to analyze the periodicity of signals and to filter noisy signals. It is also widely used because it can extract the frequency characteristics of specific signals well [12]. Spectrograms [13], scalograms [14], and mel-frequency cepstral coefficients (MFCC) [15] are representations that show changes in frequency intensity over time by converting acoustic signals into the time-frequency domain. Each of these time-frequency image-based methods is specialized in analyzing a frequency change pattern over time or a change in the dominant frequency range of an acoustic signal.
Frequency analysis of acoustic signals is widely used to detect various defects, such as determining the degree of wear of a machine or detecting defects in bearings. Before deep learning was studied intensively, numerical, analytical, and experimental research was performed [16][17][18]. After the application of artificial intelligence became active, various studies to detect mechanical faults based on deep learning were introduced. Research has been conducted to diagnose gear failures using vibration signals based on fuzzy neural networks [19]. Research has been conducted on a mechanical defect diagnostic convolutional neural network (CNN) model that uses acoustic signals as input to make the model robust to the changing sound of the domain [20,21]. Research has been conducted to visually detect defects in gears using image-based, region-based convolutional neural networks (R-CNN) [22]. Studies have been conducted to diagnose gear fitting failures using both vibration and acoustic emission signals with a CNN and a gated recurrent unit (GRU) [23]. A study used spectrogram images of ball and roller bearings as input to a CNN model to diagnose defects [24]. When diagnosing defects in rolling bearings, studies compared the performance of spectrogram, scalogram, and Hilbert spectrum images [25]. Based on the CNN model, a study on a failure diagnosis technique for automated machines using spectrogram images was conducted [26]. Based on unsupervised learning, a study was conducted to detect failures using spectrogram images [27]. Another study diagnosed failures by using spectrogram images of acoustic data with ambient noise filtered out as input to neural network models [28], and yet another addressed transfer-learning-based fault diagnosis and analysis of facilities using spectrogram images [29]. The advantage of using a spectrogram image to diagnose defects in a machine is that the frequency change over time can be checked, so a more accurate diagnosis is possible. However, there is a disadvantage in that the process of converting sound signals into spectrogram images is added, and images are used as input to artificial neural networks, which increases the computation volume and requires high performance, making real-time diagnosis difficult. Although computing power has increased rapidly, there are situations in which low-performance hardware is used for cost-effectiveness.
In this paper, the spectral data of recorded sounds are used as input to a deep learning model to diagnose defects in gears. Raw spectral data are not suitable for real-time monitoring, because using the full spectrum as input requires a large amount of computation over a wide frequency range. Therefore, the sums of the frequency bands that represent the characteristics of rotating gears are used as input to a deep neural network (DNN) model. By selecting frequency bands, it is also possible to detect faults at several different gear RPMs. We note that similar work has already been shown for bearings [30]. The model is trained in advance by collecting the acoustic signals of rotating gears by type. For defect diagnosis, the sound of the rotating gear is converted into spectral data, and the sums of the frequency bands are calculated and used as input to a pre-trained deep-learning-based classifier model to determine the current state [31].

Materials and Methods
In this paper, acoustic data are analyzed in the frequency domain and used as input to a pre-trained deep learning model to diagnose gear defects. Figure 1a shows the setup of this system, and Figure 1b shows the types of gear states pre-trained for the classification of defect types. From the top left to the bottom right, there are four types in order: 'normal', 'one tooth broken', 'four teeth broken', and 'all worn out'.
Machines 2022, 10, x FOR PEER REVIEW
Figure 2 shows the schematic diagram of the operation of the system proposed in this paper. The acoustic data of the gear are converted into the frequency domain through the Fast Fourier Transform (FFT). After that, the amplitudes in the preset frequency bands of interest are summed respectively and used as features for diagnosing the type of defect. The features are used as input to a pre-trained deep learning model that outputs the defect diagnosis of the gear.

Sound Data Collection for Acoustic Analysis
Sound data are collected in real time using a microphone to diagnose the gear defects of the system. In order to collect gear sound data, we installed a microphone at the center of the rotating gear. We used a condenser microphone to collect sound data. The condenser microphone has the advantages of high sensitivity and a wide range of polar patterns compared to a dynamic microphone. A condenser microphone has a high risk of howling caused by the sound of other surrounding speakers, but we used one because howling was unlikely to occur in the environment of this paper [32]. The condenser microphone used in this paper has a 100~16,000 Hz frequency band and a sensitivity of −47 ± 4 dB.
We used the 'pyaudio' [33] library for sound data collection with a 44,100 Hz sampling rate. According to the Nyquist theorem, the maximum frequency of sound data that can be sampled is 22,050 Hz, so the gear sound around the 7000 Hz band generated in this paper can be collected [34]. The collected data are used to diagnose gear defects in real time through deep-learning-based spectrum analysis.
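As a rough check of the sampling parameters above, the Nyquist limit and per-window sample count can be computed directly. The constant names below are illustrative, not from the paper's code:

```python
# Capture parameters described in the text; names here are illustrative.
SAMPLE_RATE = 44_100      # Hz, sampling rate used with the 'pyaudio' stream
WINDOW_SECONDS = 1.0      # each diagnosis uses 1 s of recorded sound

nyquist_hz = SAMPLE_RATE / 2                            # 22,050 Hz by the Nyquist theorem
samples_per_window = int(SAMPLE_RATE * WINDOW_SECONDS)  # 44,100 samples per diagnosis

# The ~7000 Hz gear noise band lies well below the Nyquist limit,
# so the microphone stream can represent it without aliasing.
assert 7000 < nyquist_hz
```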

Sound Data Pre-Processing
If we apply sound data without pre-processing to an artificial intelligence model for diagnosing gear defects, the size of the input data increases unnecessarily, and the amount of computation and thus the processing time increase. In addition, through pre-processing, the train data increase and better performance can be expected. Therefore, feature extraction through acoustic spectrum analysis is performed so that the features of each defect type are prominent. We also use data augmentation to increase the train data and improve robustness to external factors [35].

Data Augmentation
Data augmentation is one of the representative methods used in artificial intelligence models to improve robustness to external factors such as noise and distortion [36]. Data augmentation transforms existing data to increase the amount of data and uses it as train data. It can improve the performance of artificial intelligence models by increasing the amount of train data, and appropriate methods are selected according to the type and characteristics of the data.
Techniques for sound data augmentation include volume control, stretching, white noise, flip, reverb, and overlap. In this paper, the volume control and stretching methods, which change amplitudes and frequency components, are not suitable because the state is diagnosed using the sums of amplitudes in specific frequency bands obtained by collecting gear sounds and spectrally analyzing the sound data. Therefore, we used the white noise, flip, reverb, and overlap methods for data augmentation.
For white noise, we added a random signal with 1/20 of the maximum amplitude of the signal. Adding white noise makes the system robust to ambient noise. For flip, the graph was flipped by inverting the raw data. This simple method provides more information to the model because the flipped data still contain the acoustic information. For reverb, we used 'reverb' in the 'pedalboard' [37] library and set the room size to 0.25. This method augments the data by adding reverberation so that the model is robust to environments in which reverberation occurs. For overlap, when slicing the raw data, the overlapping part was set to 0.44 s. Overlap is a widely used method to augment a small amount of sound data.
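A minimal sketch of three of these augmentations (white noise at 1/20 of the maximum amplitude, flip, and 0.44 s overlap slicing) might look as follows. The reverb step is omitted because it depends on the 'pedalboard' library, and the function names are our own:

```python
import numpy as np

def add_white_noise(x, ratio=1/20, rng=None):
    """Add uniform white noise scaled to 1/20 of the signal's max amplitude."""
    if rng is None:
        rng = np.random.default_rng(0)
    amp = np.max(np.abs(x)) * ratio
    return x + rng.uniform(-amp, amp, size=x.shape)

def flip(x):
    """Invert the waveform; the flipped signal keeps the same spectral magnitude."""
    return -x

def slice_with_overlap(x, sr=44_100, win_s=1.0, overlap_s=0.44):
    """Cut 1 s windows that overlap by 0.44 s, as in the overlap augmentation."""
    win = int(sr * win_s)
    hop = int(sr * (win_s - overlap_s))
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]
```

For a 3 s recording, `slice_with_overlap` yields four 1 s windows instead of three non-overlapping ones, which is how overlap multiplies a small dataset.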

Acoustic Spectral Analysis
Sound data collected through the microphone must have features extracted for each type through spectral analysis for the training of the classification model. Figure 3 shows a flowchart of the sound spectrum analysis for classification model training. First, the collected sound data are converted into spectral data through the Fourier transform, and then the frequency domain of interest is extracted according to the gear noise characteristics of the system. The number of gear teeth used in this paper is 16, operating at RPMs of 140, 280, and 420 [38]. We selected three RPMs, since in a real system several different gears or gear RPMs will be used simultaneously. We checked the spectra of the four types of gears for the corresponding frequency bands and set the frequency bands such that each type can be distinguished as the frequency domain of interest. After that, the sum of the amplitudes in each section of the frequency domain of interest is calculated. The sums of the amplitudes calculated in this way have a different distribution for each type and are used as the train and input data for the artificial intelligence models.
The collected time-series sound data are converted into the frequency domain using the Fourier transform for spectral analysis. For low computation and high speed, the FFT was used; specifically, the 'fft' function of the 'numpy' library [39]. Note that the FFT is adopted because directly computing the Discrete Fourier Transform (DFT) is too demanding for low-spec hardware. According to the Nyquist theorem, since the sampling rate is 44,100 Hz, the frequency band of the sound data is 0 to 22,050 Hz.
Figure 4a shows 150 samples of acoustic data in terms of frequency for each case. To minimize the input size, we have integrated the spectrum over a few regions. The equation is as follows:

F_n = Σ_{f = f_start,n}^{f_end,n} A(f)

where F_n is the nth input, A(f) is the spectral amplitude at frequency f, and f_start,n and f_end,n constitute the nth range of interest. Figure 4b shows the averaged spectrum of the sound data. We used the frequency bands corresponding to the peaks in the spectrum graph for the analysis. The figure shows an example of frequency band selection. In this case, five bands are chosen as the bands of interest: 200~700 Hz, 1000~1500 Hz, 1700~2200 Hz, 2200~2700 Hz, and 3500~4500 Hz. In other words, n ranges from 1 to 5, and, for example, f_start,1 is 200 Hz and f_end,1 is 700 Hz. We summed the amplitude of each frequency band and used it as the feature for each defect type. When the gear is defective, the characteristics of the noise differ, and the ratio of the summed frequency amplitudes also differs, so we can use this difference as the feature of each defect type. These features can be used as train data for deep-learning-based classifiers.
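Under the five bands listed above, the feature extraction can be sketched with numpy's real FFT. The function name is ours, and the exact windowing details of the paper are assumed:

```python
import numpy as np

# Bands of interest from the paper (frequency selection case 1), in Hz.
BANDS = [(200, 700), (1000, 1500), (1700, 2200), (2200, 2700), (3500, 4500)]

def band_sum_features(signal, sr=44_100):
    """Return F_1..F_5: the summed FFT amplitude inside each band of interest."""
    amp = np.abs(np.fft.rfft(signal))                 # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)    # frequency of each FFT bin
    return np.array([amp[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BANDS])
```

As a sanity check, a 1 s pure tone at 1200 Hz concentrates almost all of its energy in the second band (1000~1500 Hz).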
Various peaks are observed in Figure 4b. To select the ranges and regions of interest for the model input, we checked three cases of frequency selection. Table 1 shows the regions of the three selection cases. Note that the frequency band of each region has been selected to cover the full width at half maximum (FWHM) of each peak. We trained the model using these three cases, and the training accuracy in terms of epochs is shown in Figure 5. It is clear that case 2 and case 3 show low accuracy, and the loss oscillates. Therefore, we select the bands of interest as frequency selection case 1 in Table 1. As these results show, a wrong selection of frequency bands cannot distinguish the faults correctly. Moreover, since some noise is frequency-independent while other noise forms a band [40], a correct selection of frequency bands is needed for proper detection.

Train Dataset
We used the sums of the amplitudes in each frequency band of interest, obtained through sound data augmentation and acoustic spectral analysis, as the train dataset for the deep-learning-based classifier. Sound data used for training were collected at a sampling rate of 44,100 Hz, and sound data with a length of 1 s were converted into the frequency domain through the FFT. The dataset contains four classes: 'normal', 'one tooth broken', 'four teeth broken', and 'all worn out'. A total of 14,486 data were used, with 10,775 train data, 2707 validation data, and 1000 test data. The datasets were divided randomly.
The sums of the amplitudes of the frequency bands have a very large deviation in value. When training the model, if the deviation of the train data is large, the trained weights are prone to overfitting. Therefore, the train data are normalized to suppress overfitting of the weights [41]. In this paper, we used the MinMaxScaler of the 'sklearn' library, which normalizes the data to between 0 and 1 [42].
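The normalization step is equivalent to the following numpy sketch: as with sklearn's MinMaxScaler, the minimum and maximum are fitted on the train data only and reused at inference time. The helper names are ours:

```python
import numpy as np

def minmax_fit(train):
    """Learn the column-wise min and max from the training features only."""
    return train.min(axis=0), train.max(axis=0)

def minmax_transform(x, lo, hi):
    """Scale features to [0, 1] using the training statistics."""
    return (x - lo) / (hi - lo)
```

Reusing the training statistics at inference matters: scaling each incoming sample by its own min/max would destroy the amplitude ratios between bands that the classifier relies on.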

Training
Table 2 and Figure 6 show the DNN model architecture for defect diagnosis in this paper. The model architecture includes three dense layers and two dropout layers, as well as a classifier for classification. The dropout layers are applied to reduce overfitting. Without the dropout layers, the model showed 96.44% accuracy on the validation and test datasets, as shown in Figure 7; the accuracy of the validation set does not converge without the dropout layers. It is also notable that hyper-parameters such as the number of nodes and the number of hidden layers are optimized. The inputs are a total of five values: the sums of the amplitudes of each section in the frequency band of 200~4500 Hz. There are four types of output as the classification result: '0: normal', '1: one tooth broken', '2: four teeth broken', and '3: all worn out'. Therefore, as shown in Figure 6, the model has 5 inputs and 4 outputs.
Dropout was added between each layer to prevent overfitting to the train data. As the activation function of the classifier, Softmax is used to output the probability of each class so that multiple classes can be classified.
To train the model, the 14,486 datasets were trained for 1000 epochs with a batch size of 32, using the DNN architecture above. We used the model with the highest accuracy reached during the 1000 epochs of training. Stochastic Gradient Descent (SGD) was used as the optimizer, and Categorical Cross Entropy (CCE) was used for the loss function.
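The Softmax output and CCE loss named above can be written out explicitly. This is a generic numpy sketch of the two functions, not the paper's training code:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis; outputs class probabilities."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """CCE loss averaged over the batch; y_true is one-hot (4 classes here)."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=-1))
```

For the 4-class problem here, a completely uninformative output (all logits equal) gives probabilities of 0.25 each and a loss of ln 4 ≈ 1.386, which is the baseline SGD starts from.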
Figure 8 shows the accuracy and loss on the training and validation sets during the training process. The figure shows that the loss converges well, with an accuracy of 99.97% and a loss of 0.0015. Even with a small number of training epochs, the model shows high accuracy.
To verify the robustness of our model, we used K-Fold cross validation. The K value was chosen to be 5, and the training-validation set was divided into five sections, each of which was used once as the validation set. The accuracies were 99.97%, 95.78%, 98.09%, 96.52%, and 99.61%, respectively, with an average of 97.99%, which confirms the robustness of our model.
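The 5-fold procedure described above amounts to the following index split. The helper is our own; sklearn's KFold would produce an equivalent partition:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle n indices into k folds; each fold serves once as the validation set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```

Each of the five iterations trains on four folds and validates on the held-out fold, so every sample is validated exactly once across the run.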

Experiment Environment
In the experiment, we used the noise of working gear, and the calculation was performed using a laptop computer without an external GPU to verify the efficiency and compactness of the proposed system.
In the experiment, sound data recorded while gears with various defect types were operating were used. In order to verify efficiency, we experimented with a low-power laptop CPU without an external GPU; the hardware specifications are shown in Table 3. Experiments were carried out with gear sounds that were never used for training and validation, and various defect types were used. A total of 1000 test data were used for the experiment, with 250 data for each type.

Result of Experiment with Test Dataset
Figure 9 shows the confusion matrix of the proposed method. The overall classification accuracy was 95.5%. Note that normal gears and defective gears could be perfectly separated. However, 'one tooth broken' and 'all worn out' could not be classified exactly. Among the 250 samples of 'one tooth broken', 45 samples were predicted as 'all worn out', which seems to be due to the similarity between the spectra of the two cases. From Figure 1, the 'one tooth broken' case and the 'all worn out' case look similar; therefore, the model misclassified 45 samples. However, we note that the model still shows 100% accuracy in distinguishing normal gears from broken gears.
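The confusion matrix and overall accuracy reported above can be reproduced from predicted labels with a short helper. The names are illustrative; the class order follows Figure 9:

```python
import numpy as np

LABELS = ['normal', 'one tooth broken', 'four teeth broken', 'all worn out']

def confusion_matrix(y_true, y_pred, n_classes=4):
    """Rows are true classes, columns are predictions; [i, j] counts samples."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Overall accuracy is the diagonal (correct) count over the total count."""
    return np.trace(cm) / cm.sum()
```

In the paper's result, entry [1, 3] of this matrix would be 45 (the 'one tooth broken' samples predicted as 'all worn out'), while rows 0 and the normal/defective split remain perfect.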

Comparison between CNN Classifiers

We compare the proposed method with an existing CNN classifier with regard to gear defect classification accuracy and computation time. Figure 10 shows the process of the comparison. The same data augmentation method as in Section 2.1 was used for the sound data, and the corresponding data were converted into spectrogram images. The architecture of the CNN classifier used for comparison is shown in Table 4 and Figure 11 [43]. Figure 12a shows the classification accuracy by class for the test dataset. The 'one tooth broken' accuracy was 82%, which was lower in the proposed method than in the CNN. However, both methods could separate normal gears from defective gears.
The computation time from the moment the sound data are converted to the output of the diagnostic result is shown in Table 5. One thousand data were diagnosed, and the computation time required per datum was calculated. We note that the computation time includes the data processing time, such as the Fourier transform. In the case of the CNN, a 2D image is the input, so the acoustic data must be converted into a spectrogram image. The STFT used at this stage adds conversion time because it performs the FFT repeatedly over short windows. In addition, compared to the DNN, whose input is a 1D vector of size 5, the CNN has a much larger 30 × 30 2D image input, as well as convolution layers, which increase the computation time. In general, a CNN uses 2D input; therefore, the complexity of the model, the number of parameters, and the calculation process are much greater than for the DNN model. As expected, the results showed that the proposed method using the DNN took an average of 18.48 ms of computation time per datum, versus 0.80 s when spectrogram images were used as input to the CNN model. All data recorded the gear sound for 1 s; for the CNN model, real-time diagnosis was difficult, whereas it was verified that the proposed method is sufficiently capable of real-time diagnosis.

Conclusions
In this paper, we propose a system for diagnosing gear defects through frequency analysis based on a DNN. In the acoustic data, only the sums of the frequency bands of interest are used as features that distinguish the type of defect, and these features are used as inputs to a simple DNN model to reduce computation. Compared to CNN-based methods, unnecessary data are not used for defect diagnosis, so the computation can be reduced enough to diagnose gear defects in real time. Although the existing defect diagnosis method using the CNN model struggled to run in real time in a computational environment without a GPU due to its high computational volume, the proposed system can diagnose defects in real time without difficulty even with only a low-performance CPU.
The performance of the proposed system was evaluated using the sound of gears operating in real time as the test data. In addition, we verified the classification accuracy and the real-time-capable processing speed by comparing against the conventional sound-based defect diagnosis method, which uses spectrogram images as inputs to CNN models. The system showed 95.5% accuracy for 1000 test data, and it took 18.48 ms, about 40 times faster than the CNN model, to diagnose one second of gear sound, enabling real-time diagnosis in a low-spec environment.
The proposed system has the limitation that it cannot classify new defect types, since training and experiments were conducted only on a limited set of defect types. However, it was shown to have sufficient performance to separate normal and defective gears. In addition, the model successfully classified normal gears at different RPMs with minimal computational resources.
The system proposed in this paper is expected to diagnose defects in real time at relatively low cost, so it can be effectively applied to diagnose various sound-based facilities in real time. As future work, we plan to study a defect diagnosis system that is robust to the noise of surrounding equipment.

Figure 1. Setting for a gear's defect diagnosis: (a) hardware for defect diagnosis; (b) type of normal gear and defect gears.

Figure 2. The overall flow of the gear's defect diagnosis system.

Figure 2 shows the schematic diagram of the system proposed in this paper. The acoustic data of the gear are converted into the frequency domain through the Fast Fourier Transform (FFT). After that, the amplitudes in each preset frequency band of interest are summed to form features for diagnosing the type of defect. These features are used as input to a pre-trained deep learning model, which outputs the diagnosed defect type of the gear.
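The feature extraction step described above (FFT, then per-band amplitude sums) can be sketched as follows. The sample rate, test tone, and the five band edges are illustrative assumptions for the example, not the bands of interest selected in the paper:

```python
import numpy as np

def band_sum_features(signal, fs, bands):
    """Convert a 1-D audio signal into a small feature vector by
    summing FFT magnitudes over preset frequency bands of interest."""
    spectrum = np.abs(np.fft.rfft(signal))            # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)  # bin centre frequencies in Hz
    # Sum the amplitudes that fall inside each band (lo, hi) in Hz.
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in bands])

# Example: 1 s of a 440 Hz tone sampled at 8 kHz, five hypothetical bands.
fs = 8000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)
bands = [(0, 200), (200, 600), (600, 1200), (1200, 2400), (2400, 4000)]
features = band_sum_features(signal, fs, bands)
print(features.shape)  # (5,) -- a 5-dimensional input vector for the DNN
```

Because the model only ever sees this 5-dimensional vector, the classifier's input is orders of magnitude smaller than a spectrogram image.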


Figure 3. Flow chart of sound spectrum analysis for classifier model training.

Figure 4. (a) A total of 150 samples of acoustic data in terms of frequency for each case; (b) averaged frequency data for each gear case and a frequency band selection example.

Figure 5. The accuracy of training and validation sets for different selections of frequency bands: (a) case 1; (b) case 2; and (c) case 3.

Figure 7. The accuracy of training and validation sets without the dropout layer.

Figure 8. Accuracy and loss function within 1000 training steps: (a) the train accuracy and the validation accuracy; (b) the train loss function and the validation loss function.

Figure 9. Confusion matrix of the test set calculated by the trained model.
Figure 10 shows the process of the comparison. The same data augmentation method as in Section 2.1 was used for the sound data, and the corresponding data were converted into spectrogram images. The architecture of the CNN classifier used for comparison is shown in Table 4 and Figure 11 [43].
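A minimal numpy-only sketch of turning one second of sound into a 30 × 30 spectrogram "image" for the CNN baseline might look as follows. The windowing scheme and frequency-bin averaging here are illustrative assumptions; the paper's exact STFT parameters and image rendering are not reproduced:

```python
import numpy as np

def spectrogram_image(x, size=30):
    """Reduce a 1-D signal to a size x size magnitude-spectrogram array
    usable as CNN input (illustrative sketch, not the paper's exact STFT)."""
    win = len(x) // size                              # hop == win, giving `size` frames
    frames = x[:win * size].reshape(size, win) * np.hanning(win)
    spec = np.abs(np.fft.rfft(frames, axis=1))        # shape: (size, win // 2 + 1)
    # Average neighbouring frequency bins down to `size` columns.
    edges = np.linspace(0, spec.shape[1], size + 1).astype(int)
    return np.stack([spec[:, a:b].mean(axis=1)
                     for a, b in zip(edges, edges[1:])], axis=1)

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)      # 1 s test tone
img = spectrogram_image(x)
print(img.shape)  # (30, 30)
```

Even before any convolution, this input carries 900 values per datum versus the 5 band sums of the proposed method, which is the root of the computation-time gap reported in Table 5.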

Figure 10. Comparison flow chart with the existing classifier.

Figure 11. The visualized architecture of the CNN model.

Figure 12. Comparison of the proposed DNN model and the CNN model: (a) accuracy on the test dataset; (b) computation time per datum.

Table 1. Frequency band of interest in each case (Hz).

Table 4. The details of the CNN model.

Table 5. Comparison of computation time per datum.