Constructing Condition Monitoring Model of Harmonic Drive

Featured Application: Detecting machine faults at an early stage reduces losses due to damage. This paper proposes a method for detecting machinery anomalies from operation sounds that combines the wavelet transform with a state-of-the-art neural network architecture and can be used in an intelligent factory.
Abstract: The harmonic drive is an essential industrial component. In industry, the efficient and accurate determination of machine faults has always been a significant problem. Therefore, this research proposes an anomaly detection model that can detect whether the harmonic drive has a gear failure from the sound recorded by a microphone. The factory manager can thus detect the fault at an early stage and reduce the losses caused by the fault. In this research, a multi-layer discrete wavelet transform was used to de-noise the sound samples, the Log Mel spectrogram was used for feature extraction, and finally, these data were entered into the EfficientNetV2 network. To assess the model performance, this research used the DCASE 2022 dataset for model evaluation, and the area under the receiver operating characteristic curve (AUC) was 5% higher than that of the DCASE 2022 baseline model. The model achieved 0.93 AUC for harmonic drive anomaly detection.


Introduction
As the world moves towards Industrial Revolution 4.0, integrating information technology into the manufacturing industry will promote industrial upgrading and allow industry to maintain competitiveness. The development of intelligent manufacturing technology is vital for the manufacturing industry; enabling machinery to have smart functions such as failure prediction is the current focus of industrial development.
A harmonic drive is a special gearbox device consisting of only three essential components: the circular spline, flex spline, and wave generator. Harmonic drives are often used in the aerospace field, medical equipment, and industrial robots, and have the advantages of small size, high transmission efficiency, and low noise. They are crucial components of the six-axis robot, which shows the importance of harmonic drives for industrial development.
When a machine is in use, it often has to be stopped so that an inspector can check its mechanical condition, which costs human resources and time. The current mainstream inspection method still relies on the inspector's experience; therefore, the detection accuracy is not high. For harmonic drive fault detection, G. Yang et al. [1] proposed fusing data from multiple acceleration sensors, processing the samples with the Fast Fourier transform, and finally inputting these data into a Convolutional Neural Network (CNN). This process achieved a fault detection accuracy of 96.79%, but the research did not include noise processing.
However, in a real industrial environment, a fault detection model must offer both high accuracy and noise processing. Therefore, there is a need for an efficient and accurate method to solve this problem.
To solve the problems discussed above, this paper proposes a method that combines machine learning and signal processing to handle sound samples. The proposed model achieved an AUC of up to 0.93 for fault detection.
The proposed model will allow factory managers to swiftly detect any possible issue in the harmonic drive without stopping the machine, and arrange personnel for maintenance and repair in time to avoid subsequent accidents or losses.
This paper is organized as follows: Section 2 discusses discrete wavelet transform (DWT), Log Mel spectrogram, neural network architecture, and the literature on anomaly detection. Section 3 introduces the entire architecture of the proposed monitoring model. Section 4 introduces the experimental results. Finally, Section 5 concludes this research.

Log Mel Spectrogram
The Log Mel spectrogram, commonly used for speech emotion detection, acoustic scene analysis [2,3], and machinery anomaly detection tasks [4], is a spectrogram based on the human ear's perception of sound rather than on linear frequencies.
After the sound sample is processed by the Short Time Fourier Transform (STFT), the spectrum is obtained; the spectrum energy is then passed through the Mel filters to obtain the Mel spectrum, in which the frequency f is converted to the Mel frequency, as shown in Equation (1). Finally, the logarithm of the result is taken to give the Log Mel spectrogram.
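Equation (1) is not reproduced in this copy of the text. As a hedged illustration only, the sketch below assumes the widely used HTK-style conversion m = 2595 log10(1 + f/700); the constants used in the paper may differ. The helper names (hz_to_mel, mel_to_hz, mel_band_edges) are hypothetical and chosen for this example.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale (HTK-style formula, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_filters):
    """Edge frequencies (Hz) for n_filters triangular Mel filters.

    The n_filters + 2 edges are equally spaced on the Mel axis, so the
    bands become wider in Hz as the frequency increases.
    """
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_filters + 2)]


if __name__ == "__main__":
    # 128 filters up to the Nyquist frequency of a 44,100 Hz recording.
    edges = mel_band_edges(0.0, 22050.0, 128)
    print(len(edges), edges[1] - edges[0], edges[-1] - edges[-2])
```

Because the edges are equally spaced in Mel rather than in Hz, low frequencies are resolved more finely than high frequencies, which is the perceptual property the section describes.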

Wavelet Transform
The wavelet transform uses rapidly decaying or finite-length waveforms to express the original signal through scaling and translation, which avoids the inaccurate frequency-domain or time-domain analysis results caused by fixed-length window functions. According to Equation (2), the wavelet transform uses the wavelet function ψ(t) and the scaling function φ(t) to generate each sub-wavelet function ψ_{a,b}(t), thereby fitting the original waveform, where the wavelet function is called the Mother Wavelet, and the scaling function is called the Father Wavelet.
Wavelet transforms can be categorized into discrete or continuous wavelet transforms, depending on whether the scaling parameter a and the translation parameter b in Equation (2) take discrete values. Discrete wavelet transforms were used in this research.
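Equation (2) is not reproduced in this copy of the text. For orientation, the standard textbook definition of the scaled and translated sub-wavelet is given below, together with the dyadic sampling of a and b that is commonly used for the discrete wavelet transform. This is an assumed form; the normalization and notation in the paper may differ.

```latex
% Sub-wavelet generated from the mother wavelet \psi(t)
% by scaling (a) and translation (b):
\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,
                \psi\!\left(\frac{t - b}{a}\right), \qquad a \neq 0

% Discrete (dyadic) case: a = 2^{j},\; b = k\,2^{j},\; j,k \in \mathbb{Z}
\psi_{j,k}(t) = 2^{-j/2}\,\psi\!\left(2^{-j} t - k\right)
```

Substituting a = 2^j and b = k 2^j into the first expression gives the second, which is why each additional decomposition level halves the analyzed frequency band.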

EfficientNetV2
In a CNN, a larger input image resolution, wider network layers, and a deeper network all improve accuracy, but at a higher computing cost. Although a few studies [5,6] have discussed the influence of resolution, width, and depth, most have only discussed one or two aspects. The research by Tan et al. [7] simultaneously explored the influence of all three aspects on computing speed and accuracy.
Tan et al. proposed the EfficientNet family of CNNs, which can achieve an accuracy of 84.3% on the ImageNet Dataset [8]. Their research used Neural Architecture Search to find a network structure with rapid computing speed and high prediction accuracy, and then found the best scaling factors by adjusting the resolution, width, and depth of the discovered network.
In 2021, Tan et al. proposed the EfficientNetV2 [9] neural network family, an improved version of the EfficientNet networks. The EfficientNetV2 networks can achieve 87.3% accuracy on the ImageNet Dataset; they have reached the state-of-the-art level and have higher prediction accuracy than the previous generation, as shown in Table 1.
Appl. Sci. 2022, 12, 9415

Anomaly Detection
The fault detection task of the harmonic drive can be regarded as a kind of rotating machinery fault detection task. The fault in the rotating machinery can be detected through temperature, vibration, image, and sound. An industrial environment requires a fast and accurate detection method with noise processing.
In terms of rotating machinery fault detection, the current mainstream research direction is to use a Support Vector Machine (SVM) or a CNN with various preprocessing methods for training. The SVM method can obtain an excellent model but requires a long computing time [10], while the CNN has excellent feature mining ability but requires more samples to train a high-performance model [11].
Sahoo et al. [12] built a scaled-down wind turbine model and categorized blade failures into three types. Data were collected through an accelerometer, and 12 statistical features, such as the standard deviation, root mean square, and kurtosis, were computed from the vibration signal and used as the input parameters of the prediction model. The team compared different rotational speeds and different architectures: the data for each rotational speed were passed through a Decision Tree, K-Nearest Neighbors (KNN), and an SVM for blade fault detection. Sahoo et al. found that the accuracy of the fault detection model increased with rotational speed. Since noise was not processed, the model accuracy in that study was estimated to be 87%, which was slightly lower than that of other studies.
Chen et al. [10] preprocessed the samples by using an improved version combining wavelet transforms and Empirical Mode Decomposition, and used Particle Swarm Optimization to select hyper-parameters of SVM; the accuracy of the fault detection for many types of bearings exceeded 99%.
Yang et al. [1] proposed a harmonic drive fault detection model using the multiscale convolutional neural network (MSCNN). MSCNN includes the coarse-grained layer, the classification layer, the multiscale feature learning layer, and the multisensor data fusion layer. After the fusion of multiple sensor data, the original signal data were decomposed into four layers, and the processed data were subjected to feature learning through a multiscale feature learning layer. Their research obtained 96.79% accuracy for the normal-anomaly binary classification. Their method did not include noise processing but reduced the negative impact of noise by using data from multiple sensors.

The Proposed Approach
This section introduces the architecture and processing flow of the proposed monitoring model.

Monitor Model Architecture Diagram
The process of the proposed monitoring model is shown in Figure 1. In this study, two different datasets were used for model training and evaluation: the DCASE 2022 task 2 Dataset [13], referred to below as the DCASE 2022 Dataset, and the Harmonic Drive Dataset, which contains harmonic drive operation sounds.
For the DCASE 2022 Dataset, data preprocessing was performed first. For each type of machine operating sound sample, the multi-layer Discrete Wavelet Transforms (DWT) were used to remove noise; then, the logarithmic Mel spectrogram was used to extract the features, and finally, the feature data were entered into the deep neural network for training. For the Harmonic Drive Dataset, each sound sample was cut to increase the number of samples first, and the same process was subsequently applied.

DCASE 2022 Dataset
The DCASE 2022 Dataset is a combination of two datasets, namely the TOYADMOS2 Dataset established by Harada et al. [14] and the MIMII DG Dataset established by Dohi et al. [4]. The TOYADMOS2 Dataset includes sounds of toy trains and toy cars, two different types of industrial machinery, and about 7200 operating sounds recorded under normal and abnormal conditions. The MIMII DG Dataset includes sounds of bearings, fans, gearboxes, slides, and valves. About 18,000 recordings of operating sounds under normal and abnormal conditions were collected from the five different industrial machines.
The TOYADMOS2 Dataset was recorded using a SHURE SM11-CN dynamic microphone and a TOMOCA EM-700 condenser microphone. Each sample was of 10 s duration. The team damaged the machine parts, categorized the damage into low, medium, and high levels, and finally added factory noises. In addition, the sound samples in the TOYADMOS2 Dataset were categorized as the source domain and the target domain. The difference between the two domains lies in the noise types, signal-to-noise ratios (SNRs), microphone arrangements, and mechanical operating speeds.
The MIMII DG Dataset was recorded using the TAMAGO-03 microphone. Each sample is of 10 s duration. The abnormal types include fan blade damage, gearbox gear damage, valve blockage, etc. Much like the TOYADMOS2 Dataset, the noises of the factory environment were added, and the samples were also categorized as the source domain and the target domain.

Harmonic Drive Dataset
The model of the harmonic drive used in this study was Liming DSF17-100. The recording device was an Adafruit I2S SPH0645 omnidirectional microphone, placed 3 cm from the harmonic drive. The recordings were stored as 32-bit floating-point audio files with a sample rate of 44,100 Hz, and the SNR was 60 dB (Lin).
The Harmonic Drive Dataset was labeled by experts and contained sound files of the same model of machine at different rotation speeds, while the abnormal type only included gear failure. Since the length of each original sample varied, and also to avoid overfitting problems caused by few samples, this research cut the original sound samples into segments of fixed length. After cutting a sound file, any remaining segment shorter than the specified cutting length was discarded. The number of normal and abnormal samples is shown in Tables 2 and 3.

Table 2. The number of original samples.
                                     Normal   Abnormal
The number of samples                7        4

Table 3. The number of cutting samples.
                                     Normal   Abnormal
Number of samples in one second      285      171
Number of samples in three seconds   95       57
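The cutting rule described above (fixed-length segments, with any shorter remainder discarded) can be sketched as follows. The function name cut_into_segments and the use of a plain Python list of samples are assumptions made for this illustration; the paper does not state how the cutting was implemented.

```python
def cut_into_segments(samples, sample_rate, segment_seconds):
    """Cut one recording into consecutive fixed-length segments.

    A trailing piece shorter than the requested length is discarded,
    as described for the Harmonic Drive Dataset.
    """
    seg_len = int(round(sample_rate * segment_seconds))
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    n_full = len(samples) // seg_len  # number of complete segments
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]


if __name__ == "__main__":
    sample_rate = 44100
    # Hypothetical 3.5 s recording of silence, used only to show the counts.
    recording = [0.0] * int(3.5 * sample_rate)
    print(len(cut_into_segments(recording, sample_rate, 1.0)))  # 1 s segments
    print(len(cut_into_segments(recording, sample_rate, 3.0)))  # 3 s segments
```

Under this rule, cutting into three-second segments yields at most one third as many samples as cutting into one-second segments, which is consistent with the counts in Table 3.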

Data Preprocessing
The sound samples may have contained unnecessary sounds, such as environmental noise and factory noise; therefore, the method of separating the sound of machinery from the original samples was an essential step in the monitoring model. In this study, discrete wavelet transforms were used to process sound samples and discard some audio components to remove undesirable noises.
The filter used in this research was the Sym10 wavelet shown in Figure 2, which belongs to the Symlet wavelet family. The Symlet wavelet has the advantages of fast calculation, exact reversibility without edge-effect problems, and low memory use [15].
The process of the wavelet transforms is shown in Figure 3. In the first step, the sample was decomposed by a 15-level discrete wavelet decomposition to obtain 15 detail coefficients and 1 approximation coefficient, representing the higher-frequency audio components and the lowest-frequency audio component, respectively.
In the second step, after the wavelet decomposition was completed, the coefficients generated in the previous step were reconstructed through the wavelet reconstruction. The reconstructed sounds were the corresponding audio component of the original sound sample in each frequency interval.
The wavelet decomposition process is shown in Figure 4. For example, in a two-level one-dimensional wavelet decomposition, the original audio signal X is a one-dimensional input signal that is passed through a low-pass filter g[k] of length K and a high-pass filter h[k], thus separating the low-frequency and high-frequency components of the signal. Down-sampling the filtered signals gives the high-frequency detail coefficient cD1 and the low-frequency approximation coefficient cA1. Then, cA1 is used as the next input, and the same decomposition steps are performed to obtain the detail coefficient cD2 and the approximation coefficient cA2. The corresponding equations are Equations (3)-(6).
This research used 15-level wavelet decomposition to obtain the coefficient set S = {cD1, cD2, cD3, cD4, cD5, cD6, cD7, cD8, cD9, cD10, cD11, cD12, cD13, cD14, cD15, cA15}.
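Equations (3)-(6) are not reproduced in this copy of the text. The filter-and-down-sample step they describe can be sketched as below. To keep the example dependency-free, it substitutes the two-tap Haar filters (K = 2) for the Sym10 filters used in the paper and wraps around at the signal boundary; both choices, the indexing convention, and the function names are assumptions made for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)
# Haar analysis filters (stand-ins for the Sym10 filters of the paper).
HAAR_LOW = [1.0 / SQRT2, 1.0 / SQRT2]    # low-pass filter g[k]
HAAR_HIGH = [1.0 / SQRT2, -1.0 / SQRT2]  # high-pass filter h[k]


def dwt_step(x, g=HAAR_LOW, h=HAAR_HIGH):
    """One decomposition level: filter, then keep every second output.

    Returns (cA, cD), each half as long as x. Indexing wraps around the
    end of the signal (periodic boundary, assumed).
    """
    n = len(x)
    if n % 2:
        raise ValueError("signal length must be even")
    cA, cD = [], []
    for i in range(0, n, 2):  # step of 2 performs the down-sampling
        cA.append(sum(g[k] * x[(i + k) % n] for k in range(len(g))))
        cD.append(sum(h[k] * x[(i + k) % n] for k in range(len(h))))
    return cA, cD


def wavedec(x, levels):
    """Multi-level decomposition; returns [cD1, cD2, ..., cD_L, cA_L].

    In this simplified sketch len(x) must be divisible by 2**levels.
    """
    coeffs = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = dwt_step(approx)  # cA_j feeds the next level
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs


if __name__ == "__main__":
    x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    cD1, cD2, cA2 = wavedec(x, 2)
    print(cD1, cD2, cA2)
```

With 15 levels, the same loop would return the 16 arrays of the set S; only the filters and the boundary handling would change for Sym10.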
Since this research only required the audio component of each coefficient, all coefficients other than the one slated for reconstruction were replaced by zero arrays. The audio components were obtained after one or more reconstruction steps. Taking the cD1 audio component as an example, the zero array and the high-frequency coefficients cD1 were passed through the low-pass reconstruction filter g*[k] and the high-pass reconstruction filter h*[k], respectively, and the two resulting signals were added to obtain the cD1 audio component, as shown in Figure 5.
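The reconstruction of a single audio component can be illustrated with a short sketch: every coefficient array except the one of interest is replaced by zeros, and the inverse transform is applied. As a simplification, the example uses the Haar wavelet in place of Sym10, and the names component and denoise are hypothetical. Keeping cD1 and cD2, as listed for the harmonic drive, corresponds to keep_indices=(0, 1) here.

```python
import math

S = math.sqrt(2.0)


def haar_dwt(x):
    """One Haar analysis step: returns (cA, cD)."""
    cA = [(x[i] + x[i + 1]) / S for i in range(0, len(x), 2)]
    cD = [(x[i] - x[i + 1]) / S for i in range(0, len(x), 2)]
    return cA, cD


def haar_idwt(cA, cD):
    """One Haar synthesis step: up-sample, filter, and add both branches."""
    x = []
    for a, d in zip(cA, cD):
        x.append((a + d) / S)
        x.append((a - d) / S)
    return x


def wavedec(x, levels):
    """Returns [cD1, ..., cD_L, cA_L]; len(x) must divide by 2**levels."""
    coeffs, approx = [], list(x)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs


def waverec(coeffs):
    """Inverse of wavedec."""
    approx = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        approx = haar_idwt(approx, detail)
    return approx


def component(coeffs, keep):
    """Audio component of coeffs[keep]; all other arrays are zeroed."""
    masked = [c if i == keep else [0.0] * len(c) for i, c in enumerate(coeffs)]
    return waverec(masked)


def denoise(x, levels, keep_indices):
    """Sum of the selected audio components (e.g. cD1 and cD2)."""
    coeffs = wavedec(x, levels)
    parts = [component(coeffs, i) for i in keep_indices]
    return [sum(vals) for vals in zip(*parts)]


if __name__ == "__main__":
    x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    print(component(wavedec(x, 2), 0))    # cD1 audio component
    print(denoise(x, 2, keep_indices=(0, 1)))
```

Because the transform is linear, the components of all coefficient arrays add up to the original signal, so selecting a subset amounts to discarding the remaining frequency intervals.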
After obtaining the audio components in each frequency interval, this research adopted different audio components for each machinery type. Removing unnecessary noise improves the prediction performance of the neural network. The audio components selected for each category are shown in Tables 4 and 5.

Type             Audio Components
Harmonic Drive   cD1, cD2

Next, the logarithmic Mel spectrogram was used as the audio feature extraction method, as shown in Figure 6. The sound sample was processed through the short-time Fourier transform (STFT) shown in Equation (7), with the Hamming window function shown in Equation (8) as the window function w; the STFT frame size was 64 ms, and the frame hop size was 32 ms. The spectrum was obtained after STFT processing.
w(n) = 0.5 (1 − cos(2πn / …))    (8)
Here, x(n) is the audio signal of length N. The power spectrum is obtained by squaring the spectrum, as shown in Equation (9); the power spectrum was then processed through 128 Mel filters, and finally, the Power To Decibel (PTB) operation was performed, as in Equation (10), to obtain the Log Mel spectrogram, which was the input data for the deep neural network.
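The framing and decibel steps can be sketched in a few lines. Two points are assumptions rather than statements from the paper: Equation (8) as it survives here uses a coefficient of 0.5, which matches the Hann window, while the text names the Hamming window, so both are shown with a symmetric N − 1 denominator; and the decibel conversion follows the common 10 log10(P / ref) form with a small floor, since Equation (10) is not reproduced. All function names are hypothetical.

```python
import math


def frame_params(sample_rate, frame_ms=64.0, hop_ms=32.0):
    """Frame and hop lengths in samples for the stated 64 ms / 32 ms STFT."""
    return int(sample_rate * frame_ms / 1000), int(sample_rate * hop_ms / 1000)


def n_frames(n_samples, frame_len, hop_len):
    """Number of complete frames when no padding is applied (assumed)."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len


def hann(length):
    """Hann window: 0.5 * (1 - cos(2*pi*n / (length - 1)))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (length - 1)))
            for n in range(length)]


def hamming(length):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def power_to_db(power, ref=1.0, amin=1e-10):
    """Convert a power value to decibels, flooring at amin to avoid log(0)."""
    return 10.0 * math.log10(max(power, amin) / ref)


if __name__ == "__main__":
    frame_len, hop_len = frame_params(44100)
    print(frame_len, hop_len, n_frames(44100, frame_len, hop_len))
    print(hamming(5))
    print(power_to_db(100.0))
```

The two windows differ mainly at the edges: the Hann window tapers to zero, while the Hamming window keeps a small non-zero value there.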

Deep Neural Network Architecture
The EfficientNetV2S [9] network was used to build the monitoring model. The network architecture is shown in Figure 7. The EfficientNetV2S network was first proposed by Tan et al.; their team used the Fused-MB convolution layer in the early stages of the network and the MB convolution layer in the later stages to enhance the training efficiency and model performance.
The architectures of Fused-MB convolution and MB convolution are shown in Figures 8 and 9, respectively. The primary difference is whether the layer contains the depthwise convolution structure or the traditional convolution structure.
In this study, an image of 64 × 128 was the input of the deep neural network, the Adam optimizer was used for optimization, and the learning rate was 0.001.

Methods
This section introduces the experimental process and results.

The Experimental Process
For the DCASE 2022 Dataset, this research first compared the monitoring model constructed from the Log Mel spectrogram, wavelet transforms, and the EfficientNetV2S network with the other two proposed monitoring models, and then compared it with the two baseline models provided by the organizers of the DCASE 2022 challenge. The hyper-parameters used by the model are shown in Table 6; mixed precision was used to speed up the computation. The model with the lowest loss on the validation set was saved and examined in the final test phase. Adam was used for optimization in the training phase.
Table 6. Hyperparameters of the monitoring model.

For the Harmonic Drive Dataset, the monitoring model with the best performance in the DCASE 2022 Dataset experiment was used as the anomaly monitoring model of the harmonic drive. The test process was the same as that described above.
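Model performance throughout the experiments is reported as AUC. As a minimal sketch, assuming that each test sample receives a scalar anomaly score and a binary label (1 for abnormal, 0 for normal), the AUC can be computed as the fraction of abnormal/normal pairs that are ranked correctly. The function name roc_auc is hypothetical, and the paper does not state which implementation was used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen abnormal sample has a
    higher anomaly score than a randomly chosen normal sample; ties
    count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of samples, which is acceptable for test sets of the size used here; a sort-based version would be preferable for large datasets.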

The Experimental Results
The harmonic drive is a type of gearbox, so the gearbox audio samples could be used to evaluate the anomaly detection model for the harmonic drive. For the sample source of the gearbox, the DCASE 2022 Dataset Gearbox audio samples were used.
The model with the highest AUC was selected by comparison through experiments. Three models were proposed in this study for gearbox machinery anomaly detection. The AUC of the three models is shown in Table 7. Model 1, which combined the logarithmic Mel spectrogram, wavelet transforms, and the EfficientNetV2S network architecture, achieved the best prediction performance. Thus, this research used Model 1 for subsequent comparisons with methods proposed in other studies.
The proposed model was compared with the two baseline models provided by the organizers of the DCASE 2022 Challenge. The results in Table 8 show that the average AUC of the proposed model was about 6% higher than those of the two baseline models, suggesting that the proposed model performs better than the baselines for gearbox anomaly detection tasks in real factory scenarios.
Further, the model's fault detection performance was examined for the various types of machinery in the DCASE 2022 Dataset to assess whether the proposed model could detect general machinery anomalies. The results in Table 9 suggest that the proposed model is capable of detecting anomalies in various types of machinery. The overall AUC was 5% higher than that of the baseline models on average, and the AUC of the Slider category was nearly 20% higher.
After the proposed model was evaluated on the DCASE 2022 Dataset, the Log Mel spectrogram, discrete wavelet transforms, and the EfficientNetV2S network were used to build the harmonic drive anomaly monitoring model. From the normal and abnormal samples of the Harmonic Drive Dataset, 60% of the data were randomly selected as the training set, 20% as the validation set, and the remaining 20% as the test set.
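The 60/20/20 random split can be sketched as follows. The example splits the normal and abnormal samples separately so that each subset keeps both classes, which is one reading of the description above; the function names, the fixed seed, and the rounding rule are assumptions for illustration.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle items and split them into train/validation/test parts."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(round(0.6 * len(shuffled)))
    n_val = int(round(0.2 * len(shuffled)))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remaining samples
    return train, val, test


def stratified_split(normal, abnormal, seed=0):
    """Apply the same split to each class and attach labels (0 / 1)."""
    parts = {"train": [], "val": [], "test": []}
    for label, items in ((0, normal), (1, abnormal)):
        for name, subset in zip(("train", "val", "test"),
                                split_60_20_20(items, seed)):
            parts[name].extend((item, label) for item in subset)
    return parts


if __name__ == "__main__":
    # Placeholder identifiers standing in for the one-second samples.
    normal = ["normal_%d" % i for i in range(285)]
    abnormal = ["abnormal_%d" % i for i in range(171)]
    parts = stratified_split(normal, abnormal)
    print({name: len(subset) for name, subset in parts.items()})
```

Splitting after cutting means that segments from the same original recording can fall into different subsets; whether the paper split before or after cutting is not stated.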
In this research, samples of 1 s duration were used for the experiments. The proposed model was compared with other models for rotating machinery anomaly sound detection, including the method that uses the Fast Kurtogram combined with deep convolution to predict bearing anomalies proposed by Prosvirin et al. [16] and our previous research [17] that uses the wavelet transforms combined with the fully connected network to predict the wind turbine blade anomalies. The results are shown in Table 10.

                        Accuracy   AUC
This paper              0.901      0.9302
Prosvirin et al. [16]   0.91       0.911
Kuo et al. [17]         0.858      0.89

This research also explored the impact of sample duration and noise on prediction performance. Based on duration, the samples were categorized into one-second and three-second samples.
The prediction results for the different sample durations are shown in Table 11. The prediction AUC of the monitoring model for the three-second category was lower than that for the one-second category, possibly because the model needs a larger number of samples to detect the pattern of the anomaly sounds, and the three-second cut yields only one third as many samples (Table 3).
Regarding noise, additive white Gaussian noise was added to the 1 s sound samples at two SNRs, 20 dB (Lin) and 10 dB (Lin). The prediction AUC results for the different SNRs are shown in Table 12. Table 12 shows that the noise intensity and the prediction AUC of the model are related. Even with added noise, the model maintained a good prediction performance, indicating that the proposed model has noise processing ability and can perform the anomaly detection task of the harmonic drive even in the presence of background noise.
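The noise experiment can be illustrated with a short sketch that adds white Gaussian noise at a target SNR. It assumes that the SNR is the ratio of mean signal power to mean noise power expressed in decibels; the exact meaning of the "dB (Lin)" notation is not defined in the text, and the function names are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def add_awgn(signal, snr_db, seed=None):
    """Return signal plus white Gaussian noise at the requested SNR (dB)."""
    rng = random.Random(seed)
    noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)  # standard deviation of the noise
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """SNR actually obtained, computed from the clean and noisy signals."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(noise))


if __name__ == "__main__":
    sample_rate = 44100
    # Hypothetical 1 s test tone standing in for a harmonic drive recording.
    clean = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate)
             for n in range(sample_rate)]
    for snr in (20.0, 10.0):
        noisy = add_awgn(clean, snr, seed=0)
        print(snr, measured_snr_db(clean, noisy))
```

Lowering the SNR from 20 dB to 10 dB multiplies the noise power by ten, which is why the two settings probe noticeably different noise intensities.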

Conclusions
This research proposes a harmonic drive anomaly detection model by combining discrete wavelet transforms, the Log Mel spectrogram, and the EfficientNetV2S network architecture. The model uses wavelet transforms to separate the original sample audio into audio components representing each frequency interval, then uses the Log Mel spectrogram to extract features, and finally enters the features as inputs into the neural network for training. The detection model exhibited an excellent prediction performance for the DCASE 2022 Dataset and the Harmonic Drive Dataset.
The proposed detection model uses only the sound of mechanical operation for anomaly judgment. If data such as vibration information or thermal energy are added to the model, the prediction performance of the model may be further improved.
The parameter settings of the denoising algorithm were adjusted manually, and the best combination was found by comparison through experiments. The audio pre-processing efficiency may be improved if an optimization algorithm is used to adjust each parameter, thus further enhancing the prediction capability.