Acoustic Resonance Testing of Small Data on Sintered Cogwheels

Cogwheels are indispensable parts in manufacturing, and we present acoustic resonance testing (ART) of small data on sintered cogwheels for quality control in the context of non-destructive testing (NDT). Given the lack of extensive studies of cogwheel data by means of ART in combination with machine learning (ML), we perform a time-frequency domain feature analysis and apply ML algorithms to the obtained feature sets in order to detect damaged samples in two ways: one-class and binary classification. In each case, despite the small data, our approach delivers robust performance: all damaged test samples reflecting real-world scenarios are recognized by two one-class classifiers (also called detectors), and only one intact test sample is misclassified by the binary ones. This demonstrates the usefulness of ML combined with time-frequency domain feature analysis for ART on a sintered cogwheel dataset.


Introduction
Since the industrial age, cogwheels (the term "cogwheels" here refers to gears, but not to gearboxes or bearing systems) have been indispensable components in manufacturing, e.g., in the textile and automotive industries, and they still play a significant role in the information age, e.g., in robotics and aerospace. This makes developing reliable and cost-effective non-destructive testing (NDT) methods an integral part of quality control (QC).
The field concerned with cogwheels is vast, and yet most work in the literature has been performed in the context of gearboxes or bearing systems [1][2][3][4][5][6][7], i.e., systems in which many gears are mounted and attached to each other. For such systems, the main focus of fault detection lies on system failure, dealing with condition monitoring over the lifespan; failures occur mainly due to malfunctioning components suffering from wear, abrasion and contamination, e.g., by sand or lubricant.
However, research necessarily takes a different direction when the structural health diagnosis of cogwheels in a manufacturing process, e.g., sintering, comes into focus. Moreover, this often encounters small data problems simply due to the lack of data on defective parts; see, e.g., [4]. Here, small data problems refer specifically to situations in which not enough data are available for training machine learning (ML) algorithms, which poses difficulties in various fields, as can be seen, e.g., in [8].
Although work related to gearboxes or bearing systems has made progress [5][6][7], usually by means of modern deep learning (DL) [9], the employed methods are not always directly applicable to the problem considered here.

Table 1. Comparison of gearbox-, bearing- and cogwheel-related work (Qu et al. [1], Haidong et al. [2], Oh et al. [3], Saufi et al. [4], Usman et al. [5], among others), where NM stands for "not mentioned". The symbols "✓" and "-" denote "being affirmative" and "not applicable", respectively. [Table body not reproduced here.]

When it comes to NDT, other than signal-based approaches, there also exist image-based methods, and they have made considerable progress since modern DL-based algorithms became part of the mainstream across most research disciplines [13,14], due mainly to the work by Krizhevsky et al. [15]. However, in this study, we focus solely on signal-based methods on the grounds that image-based approaches are not as cost-effective as signal-based ones [16], and such methods become futile when defects are invisible in images, as in our case. In addition, among different ML algorithms, we call the employed approaches modern when they involve DL; otherwise, we call them classical.
Hence, given the summary of the matter in Table 1, apart from gearboxes or bearing systems, to the best of our knowledge, there has been no extensive study on sintered cogwheel small data using acoustic resonance testing (ART) [17] with the help of classical and modern ML methods in the context of NDT.
Our Contributions: In this work, we address the aforementioned issues and intend to bridge the gap: We collect a small dataset on cogwheels and perform time-frequency domain feature analysis.
Afterwards, we apply not only classical ML algorithms but also modern DL-based ones to the obtained feature sets in the way of one-class as well as binary classification. In this way, in spite of the small data, our approach is able to achieve robust performance: all defective test samples reflecting real-world scenarios are recognized by two one-class classifiers (also called detectors), and only one intact test sample is misclassified in binary classification. This suggests that ART can be an attractive tool for cogwheel data in QC when taking advantage of the combination of ML algorithms and time-frequency domain feature analysis.
Paper Organization: The paper is organized as follows: After we give a brief exposition on data acquisition and feature analysis in Section 2, we provide information on the training of ML algorithms in Section 3. Then, we present the results of our experiments in Section 4. Finally, the paper closes with our concluding remarks.

Test Objects
In the experiment, five cogwheels (chain wheels) are examined. They are made of sintered iron and inductively hardened in surface layers. The weight of the cogwheels is approximately 140 g, the outer diameter is 79 mm, and the thickness amounts to 7-9 mm.

Examination Setup of Objects
The testing station for cogwheels, including a lifting device, was developed at Fraunhofer IKTS in Dresden, Germany. It is equipped with a three-point mounting system and is pneumatically controlled. In order to guarantee repeatable and reproducible placement, the cogwheel is fixed once it is placed: it is raised with compressed air onto three tip points and thus distanced from the test bench. At these three tip points, one transmitter and two receivers (channel 1 and channel 2) are mounted; see Figure 1a.

Measurement Method of Signal
For the measurement of signals, a multi-channel acoustic measurement system (MAS) was used: four channels, analog input amplifiers with digitization of the measurement signals, an output amplifier stage for driving acoustic transducers, and a CAN interface to a PC. In addition, two preamplifiers (40 dB; 10-500 kHz), one ultrasonic piezo actuator (transmitter) and two ultrasonic piezo sensors (receivers) are used. Each actuator and sensor is equipped with a hard metal tip. The operating software for the MAS provides the following functionality: configuring measurement channels, generating and sending excitation functions, and recording and storing the measured signals in a time-synchronous way.

Measurement on Cogwheels
For collecting data, the aforementioned five sintered cogwheels are used. Four of them are in intact condition, and one has defects. The defects were introduced by a company specializing in this area and are designed in such a way that real-world scenarios are reflected, making them almost indistinguishable from real defects.
For more details, we refer to [18][19][20] and the references therein. For each gear wheel, the raw acoustic response signal passing through a preamplifier was recorded by the two receivers (channels 1 and 2) at a sampling rate of 1041.67 kHz for ten different positions: although the receivers are mounted in fixed positions, the measurements of structural vibrations are obtained at different positions by rotating the wheel, which makes the data acquisition process less biased with respect to the receiver positions. The reference point for positioning the gear is rotated counterclockwise every four teeth of the gear wheel, and the positions are marked P00 to P09; see Figure 1b. Moreover, each observation is labeled either "OK" for intact samples or "UNK" for defective ones. The dataset is organized with respect to three excitation signals:
As described in Tables 2 and 3, the resulting datasets are referred to as Sinc-150k, Crp1k-200k and RC2-75k. Concerning sensor fusion, a late fusion approach is adopted: pseudo probability scores obtained from models trained on each channel are averaged to make a final prediction, incorporating the threshold of the equal error rate (EER); see Figure 3. We provide more details on how these pseudo probability scores are obtained for each deployed ML algorithm in Section 4. Figure 3. A schematic view of the workflow in our approach. Channel 1 and channel 2 are abbreviated as ch1 and ch2, respectively. P1 and P2 denote pseudo probability scores obtained from models trained on channel 1 and channel 2, respectively.
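The late fusion step can be sketched as follows; this is a minimal illustration, assuming the pseudo probability scores are oriented so that values near 1 indicate intact samples (the function name and example values are ours, not from the paper):

```python
import numpy as np

def late_fusion_predict(p_ch1, p_ch2, threshold):
    """Average per-channel pseudo probability scores and compare the
    fused score against a threshold (e.g., derived from the EER).
    Scores near 1 indicate "OK" (intact), near 0 "UNK" (defective)."""
    fused = (np.asarray(p_ch1) + np.asarray(p_ch2)) / 2.0
    return np.where(fused >= threshold, "OK", "UNK")

# Two samples: one clearly intact, one clearly defective.
labels = late_fusion_predict([0.9, 0.2], [0.8, 0.1], threshold=0.5)
```

Averaging at the score level (rather than concatenating features) keeps the two channel models independent, so a single faulty channel degrades the decision gracefully instead of corrupting the feature space.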

Training of Classifiers
Given the dataset, the main goal of our experiments is to investigate which combinations of ML methods and feature sets are appropriate for recognizing real-world defects. To this end, we first considered one-class-based methods, as applied in anomaly detection, in order to deal with the limited sample size and the imbalance of the acquired dataset:
• hidden Markov models (HMMs),
• support-vector machines (SVMs),
• isolation forest (IF), and
• autoencoders of bottleneck type (AE-BNs).
Moreover, we also applied the following methods in the way of binary classification:
• feed-forward neural networks (FFNNs), and
• convolutional neural networks (CNNs).
Although NN-based methods, such as CNNs, are well known to be useful for constructing feature maps from raw signals [9], this comes at the price of requiring a large dataset for training [21], which is often not a viable option, as in our situation.
On this account, we restrict ourselves to the PFA and SFA feature sets for training.

Configuration of Experiments
The dataset is prepared in such a way that there is no overlap between training and test sets. Stratified five-fold cross-validation (CV) is employed in all experiments to ensure a good representation of all classes in the training and test folds. For one-class classification, this strategy is realized in such a way that training is performed only on intact samples, excluding a designated fold, and testing is performed on all damaged samples together with the reserved fold, as illustrated in Figure 4. The reasoning behind this is to circumvent overfitting as much as possible by exploiting a common property of small datasets, i.e., few damaged samples compared to intact ones.
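The one-class evaluation protocol described above can be sketched as follows; this is our own minimal illustration (function name and toy data are hypothetical): intact samples are split into five folds, training uses only intact samples outside the reserved fold, and each test set combines the reserved intact fold with all damaged samples.

```python
import numpy as np
from sklearn.model_selection import KFold

def one_class_folds(X, y, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs for one-class evaluation:
    training uses intact ("OK") samples only; each test fold combines
    the held-out intact samples with all damaged ("UNK") ones."""
    y = np.asarray(y)
    ok_idx = np.where(y == "OK")[0]
    unk_idx = np.where(y == "UNK")[0]
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_ok, test_ok in kf.split(ok_idx):
        train_idx = ok_idx[train_ok]                    # intact only
        test_idx = np.concatenate([ok_idx[test_ok], unk_idx])
        yield train_idx, test_idx

# Toy labels: 8 intact samples, 2 damaged ones.
y = ["OK"] * 8 + ["UNK"] * 2
folds = list(one_class_folds(np.zeros((10, 3)), y))
```

Because the damaged samples never enter training, every fold tests the detector against all of them, which matches the skewed production-line scenario the section describes.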

Hidden Markov Models
HMMs can be viewed as an extension of a mixture model in which the mixture component for each observation is not selected independently but depends on the component chosen for the previous observation; this is the Markov property [22]. Since HMMs are useful for dealing with sequential data, they are widely used in speech recognition [23] and natural language processing [24].
They have also been successfully applied in advanced NDT [25]. Although long short-term memory (LSTM) networks are known to handle variable-length sequential data well [26], we instead use the simpler HMM, considering that our PFA and SFA feature sets have fixed dimensions. Our HMM is designed in such a way that ten hidden states emit observations corresponding to our acquired dataset via one Gaussian probability density function with a full covariance matrix per state. To detect anomalies, we use the interquartile range of a score characterizing how well our model describes an observation. The experiments are conducted by means of the dLabPro package [27], and the model parameters are estimated with the Baum-Welch algorithm [28].
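The interquartile-range decision step can be sketched independently of the HMM itself (the paper's experiments use dLabPro; the fence constant k = 1.5 and the synthetic scores below are our assumptions). The idea is to learn an upper outlier fence from the model scores, e.g., negative log likelihoods, of intact training samples and flag test observations above it:

```python
import numpy as np

def iqr_outlier_threshold(train_scores, k=1.5):
    """Upper outlier fence Q3 + k * IQR computed from the scores
    (e.g., negative log likelihoods) of intact training samples."""
    q1, q3 = np.percentile(train_scores, [25, 75])
    return q3 + k * (q3 - q1)

def detect(test_scores, threshold):
    """Flag observations whose score exceeds the fence as anomalous."""
    return np.asarray(test_scores) > threshold

# Synthetic NLL-like scores standing in for an HMM's output.
rng = np.random.default_rng(0)
train_nll = rng.normal(100.0, 2.0, size=200)   # intact samples
thr = iqr_outlier_threshold(train_nll)
flags = detect([101.0, 150.0], thr)            # second score is anomalous
```

A quantile-based fence is attractive here precisely because of the small data: it requires no distributional fit beyond the empirical quartiles of the intact class.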

Support-Vector Machines
The SVM is a generalization of the maximal margin classifier; it classifies data points by constructing a separating hyperplane that distinguishes one class from the others [29]. SVMs are powerful ML algorithms for various classification problems: not only are they less prone to overfitting due to large margins, but they are also relatively tractable due to the convex nature of the underlying optimization problem. Moreover, they are known to be effective in high-dimensional feature spaces, particularly when the number of features greatly exceeds the number of training samples, by making use of the kernel trick for nonlinear classification problems.
Our experiments were implemented using the scikit-learn [30] interface relying on the LIBSVM library [31]. SVM models were trained using the radial basis function (RBF) kernel, and the following parameters were tuned on about 20% of the training set to obtain optimal results: (1) the regularization parameter C (from 10^-5 to 10^7); (2) γ, which defines how far the influence of a single sample reaches (from 10^-10 to 10^-1); and, if necessary, (3) ν, which controls the number of support vectors (from 10^-3 to 1).
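A minimal sketch of this tuning step with scikit-learn is given below; the toy data and grid resolution are our assumptions, while the parameter ranges for C and γ mirror those stated in the text:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for a feature set (the paper uses PFA/SFA features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(3, 1, (40, 5))])
y = np.array([0] * 40 + [1] * 40)

# RBF-kernel SVM with C and gamma tuned on a coarse log grid.
param_grid = {
    "C": np.logspace(-5, 7, 7),        # 10^-5 ... 10^7
    "gamma": np.logspace(-10, -1, 5),  # 10^-10 ... 10^-1
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
best = search.best_estimator_
```

For the one-class variant, `sklearn.svm.OneClassSVM` exposes the ν parameter mentioned above in place of C.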

Isolation Forest
Isolation forest belongs to the family of ensemble methods and is a tree-based anomaly detection algorithm that isolates observations as outliers based on an anomaly score obtained by recursively splitting a randomly selected feature at a random split value between the minimum and maximum of that feature [32,33]. It has been a useful technique in a wide range of fields, e.g., finding anomalies in hyperspectral remote sensing images [34], detecting anomalous taxi trajectories from GPS traces [35], and analyzing partial discharge signals of power equipment [36].
Our experiments are realized with scikit-learn [30]: the minimum split number is set to 2, and the maximum depth of each tree is given by ⌈log2 n⌉, where n denotes the number of samples used to build the tree.
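A minimal scikit-learn sketch of this detector is shown below; the toy data and parameter values are illustrative only (scikit-learn's `IsolationForest` applies the ⌈log2 n⌉ depth limit internally):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 4))           # intact samples only
X_test = np.vstack([rng.normal(0, 1, (5, 4)),  # intact-like points
                    rng.normal(8, 1, (5, 4))]) # clearly anomalous points

# Fit on intact data; predict() returns +1 for inliers, -1 for outliers.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
pred = forest.predict(X_test)
```

Anomalies tend to be isolated after only a few random splits, so their average path length over the trees is short, which is exactly what the anomaly score measures.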

Autoencoder of Bottleneck Type
An autoencoder (AE) is a type of ANN that aims at approximating the original input signal in an unsupervised way [37] and is composed of two parts: encoding and decoding layers. The encoding layers are responsible for finding an efficient representation of the input vectors by learning useful features, and the decoding layers attempt to reconstruct the input signal as closely as possible from the encoded information. Since AEs are capable of generating compact representations of input data, which is extremely useful for feature learning, they have enormous potential for various problems, such as anomaly detection [38], image denoising [39] and shape recognition [40].
Our experiments were performed by leveraging Keras [41] with TensorFlow [42], and the following feed-forward bottleneck-type architecture is employed: input-512-64-512-output. As shown in Figure 5, the input and output sizes are equal to the dimensions of the vectorized feature sets, i.e., 19,560 for PFA and 15,648 for SFA, respectively.
All layers are fully connected and activated by the leaky rectified linear unit (LReLU) to overcome vanishing gradients [43]. In addition, to deal with internal covariate shift, batch normalization (BNorm) is applied to each layer [44].
Moreover, as countermeasures against overfitting, which in our case is of grave concern particularly due to the small data, random dropout with a rate of 0.5 in the internal layers [45] and an early stopping strategy with a patience of 25 are used [46], where the patience specifies the number of epochs with no improvement in the loss function after which training is halted [41]. Given a maximum of 500 epochs in our experiments, the early stopping criterion comes into play between epochs 132 and 445, depending on the folds in the datasets. Our AE-BNs have about 20 million parameters, and for training, adaptive moment estimation (Adam) [47] is used along with L1 regularization to obtain sparse solutions. Hyperparameter optimization using grid search is conducted on about 20% of the training set in pre-training stages to obtain suitable parameter values, such as the training batch size of 512 and the aforementioned dropout rate of 0.5.
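A minimal Keras sketch of this bottleneck architecture is given below. The layer ordering (Dense, BNorm, LReLU, dropout), the L1 strength and the MSE reconstruction loss are our assumptions where the text does not pin them down; the small input dimension is for illustration only:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ae_bn(input_dim):
    """Bottleneck autoencoder input-512-64-512-output: fully connected
    layers with batch normalization, LReLU activation and dropout 0.5,
    L1-regularized weights, trained with Adam on a reconstruction loss."""
    inp = keras.Input(shape=(input_dim,))
    x = inp
    for units in (512, 64, 512):
        x = layers.Dense(units, kernel_regularizer=keras.regularizers.l1(1e-5))(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        x = layers.Dropout(0.5)(x)
    out = layers.Dense(input_dim)(x)
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

# The paper's actual input sizes are 19,560 (PFA) and 15,648 (SFA);
# a small dimension keeps this sketch lightweight.
ae = build_ae_bn(128)
```

The early stopping described above corresponds to passing `keras.callbacks.EarlyStopping(patience=25)` to `model.fit`.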

Deep Learning for Binary Classification
DL may be defined as a class of ML algorithms that typically make use of multilayer NNs in order to progressively extract different levels of representation from the input data, corresponding to a hierarchy of features [48]. While the input data are processed in multiple layers, each layer reveals additional features of the input in such a way that higher-level features are described in terms of lower-level ones to help understand the data. As in [49], this can be illustrated with an example from image classification: given an image of a dog as input, pixel values are detected in the first layer; edges are identified in the second layer; combinations of edges and other complex features based on the edges from the previous layer are identified in the next several layers; and finally the input image is recognized as a dog in the output.
Apart from the different levels of abstraction, due to their capability of nonlinear information processing, DL-based approaches have recently become popular in many fields, including, but not limited to, image processing, computer vision, speech recognition, and natural language processing [50]. As in the case of AE-BN, our DL routines were realized with Keras [41] using TensorFlow [42], and the following architectures were employed: For the FFNN, three hidden layers are stacked and fully connected, see Figure 6. These hidden layers comprise 600, 300 and 100 nodes and are activated by the LReLU function. In addition, BNorm and a dropout rate of 0.5 are employed in each layer. The other configurations are similar to those of AE-BN: the Adam optimizer along with L1 regularization, early stopping with a patience of 25, a batch size of 256 and a maximum of 200 epochs are used. With one node in the output layer, binary classification is realized using the binary cross-entropy loss by mapping "UNK" to 0 and "OK" to 1. In the case of the CNN, three 2-D convolution layers with a kernel size of 3 × 3 are employed, which have 16, 32 and 64 feature maps, respectively, and are downsampled with a stride of 2 × 2. The LReLU activation function, BNorm, a dropout rate of 0.75 and a 2-D max pooling layer of 2 × 2, which is another way to deal with overfitting, are applied to each layer. The result is then flattened and fed into a fully connected layer with 50 nodes activated by LReLU, where BNorm and the dropout rate of 0.75 are also used.
As can be noticed, a relatively high dropout rate is chosen to reduce model complexity in light of overfitting, owing to the small number of defective samples. Compared to the FFNN, the other training configurations remain unchanged except for the maximum number of epochs, which is 300. The architecture of the CNN is provided in Table 4. Binary classification is implemented in the same way as for the FFNN. Our CNN has approximately sixty thousand parameters.
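The CNN described above can be sketched in Keras as follows; the layer ordering within each block and the input shape are our assumptions (the paper does not state the 2-D input dimensions of the feature sets):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape):
    """Three 3x3 conv layers with 16/32/64 feature maps and stride 2,
    each followed by LReLU, batch normalization, dropout 0.75 and
    2x2 max pooling; then a 50-node dense layer and a sigmoid output
    for binary labels ("UNK" -> 0, "OK" -> 1)."""
    inp = keras.Input(shape=input_shape)
    x = inp
    for filters in (16, 32, 64):
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU()(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.75)(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(50)(x)
    x = layers.LeakyReLU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.75)(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

cnn = build_cnn((64, 64, 1))  # input shape chosen for illustration
```

With stride-2 convolutions plus 2 × 2 pooling, each block shrinks the spatial extent by a factor of four, which keeps the parameter count low, consistent with the roughly sixty thousand parameters reported above.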

Evaluation Metrics
In order to evaluate the different classification algorithms, we provide the following performance metrics: the balanced accuracy rate (BAR) along with the corresponding 95% confidence interval (CI) [51], the area under the curve (AUC), the Matthews correlation coefficient (MCC) [52], and histograms of the scores computed by the one-class classifiers along with a classification margin (CM) if the classes are clearly separable, i.e., if the EER equals 0.
Since the scores are close to 0 and 1 for the defective and intact classes, respectively, CM is defined by

CM = (min(S_OK) - max(S_UNK)) / (max(S_OK) - min(S_UNK)),

where S_UNK and S_OK denote the scores of the classes "UNK" and "OK", and max(·) and min(·) stand for the maximum and minimum of the scores of the designated class, respectively. This measure represents the ratio of the maximum margin between the classes to the whole spectrum of scores from both classes, where the maximum margin is computed by subtracting the maximum score of the defective class from the minimum score of the intact class.
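This definition of the margin can be sketched numerically as follows; the function name and the toy score values are ours, and the computation follows the textual definition above (margin over the full score spectrum of both classes):

```python
import numpy as np

def classification_margin(s_unk, s_ok):
    """Ratio of the maximum margin between the classes to the whole
    spectrum of scores from both classes; meaningful only when the
    classes are separable (EER = 0)."""
    s_unk, s_ok = np.asarray(s_unk), np.asarray(s_ok)
    margin = s_ok.min() - s_unk.max()
    spectrum = max(s_ok.max(), s_unk.max()) - min(s_ok.min(), s_unk.min())
    return margin / spectrum

# Defective scores near 0, intact scores near 1.
cm = classification_margin([0.0, 0.1, 0.2], [0.7, 0.9, 1.0])  # -> 0.5
```

A CM of 0.5 means half of the observed score range lies between the two classes, i.e., the threshold can be shifted considerably without causing misclassifications.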
To make an inference of a class c ∈ {OK, UNK} for a test set D_test, the aforementioned scores for each detector are defined and computed based on [53] in the following way:
• HMM: the score is based on the negative log likelihood (NLL) of the observations under the trained model, normalized over the test set D_test, where | · | denotes the cardinality of a set.
• SVM: score(x) denotes the distance from x to the separating hyperplane.
• IF: score(x) is defined as in [33].
• AE-BN: score(x) denotes the mean squared error (MSE) of the reconstruction loss.

One-Class Classification
When it comes to one-class classification, despite the small and imbalanced data, SVM and AE-BN perform equally well in terms of BAR across all feature sets and all three excitation functions, see Table 5. Note that the given BAR and CI are based on the beta distribution, which necessarily leads to asymmetric CIs and slightly lower BAR values than the conventional accuracy, although all test samples are correctly classified in the case of SVM and AE-BN.
In contrast to SVM and AE-BN, HMM has some difficulties with SFA in all three datasets. Moreover, when PFA is combined with either HMM or IF in the Sinc-150k dataset, two misclassifications occur: intact samples are recognized as damaged ones, which is less severe than the opposite situation in a production line. The BAR results are consistent with the MCC, see Tables 5 and 6. However, the AUC scores tend to be higher, particularly for binary classification, in spite of the occurrence of one misclassification, see Tables 5 and 7.
As shown in Figures 7-9, the juxtaposed histograms of scores, together with the CM, allow us to further investigate how well the classifiers behave with respect to feature sets, thresholds and robustness. The CM is available as long as the classes do not overlap.
From Figures 7d, 8d and 9d, one can notice that, among all combinations of classifiers and feature types, SVM with SFA delivers the best performance in terms of CM, followed by SVM with PFA, AE-BN with SFA and AE-BN with PFA in each dataset. It can also be recognized that IF performs better with SFA than with PFA in all datasets, which, however, is not the case for HMM or AE-BN. From the perspective of excitation functions, more classifiers are able to recognize all test sets correctly in the datasets Crp1k-200k and RC2-75k than in Sinc-150k.
The results of our approach suggest that one-class classification allows for reliable anomaly detection even though training is performed only on intact samples. Moreover, our proposed method delivers robust performance, showing fairly large CMs not only with classical methods but also with modern DL-based ones, e.g., 46% for SVM with SFA and 40% for AE-BN with SFA, as shown in Figures 7d and 9h. This is an important point of our contribution, since real-world scenarios of data skewness in a production line, i.e., numerous intact samples but few damaged ones, are considered.

Binary Classification
While the one-class-based experiments show different results depending on the combination of classifier and feature set in each dataset, the binary classification experiments yield one misclassification in all cases: an intact sample is misclassified as a damaged one, see Table 5. It should be noted that binary classification, in contrast to the one-class case, uses not only intact samples but also defective ones for training.
Since the number of flawed samples is much smaller than that of flawless ones, the models obtained from training are prone to overfitting, which forces us to take various countermeasures, such as less complex NN architectures, high dropout rates and a higher weight on regularization. Although FFNN and CNN deliver solid performance in our case, it should be noted that dealing with small data may sometimes be difficult.
To improve the overall performance of binary classification, it is therefore desirable to provide more data on faulty samples. In this context, data augmentation that takes the physical properties of cogwheels into account, e.g., via numerical simulation, may be a possible approach to deal with these difficulties.

Conclusions
In this article, we presented an ART approach on small data of sintered cogwheels, utilizing not only classical ML algorithms but also modern ones. In consideration of the data imbalance, our experiments were performed in two ways: one-class classification and binary classification. Our experimental results, showing a large classification margin, demonstrated that one-class classifiers (detectors) have considerable potential to serve as an effective and thereby attractive tool in a reliable anomaly detection system in NDT. In addition, the binary classification experiments support that these classifiers are still able to deliver robust performance in spite of small data. This shows the usefulness of ML along with time-frequency domain feature analysis on the cogwheel dataset in ART for QC.