Acoustic-Sensing-Based Attribute-Driven Imbalanced Compensation for Anomalous Sound Detection without Machine Identity

Acoustic sensing provides crucial data for anomalous sound detection (ASD) in condition monitoring. However, building a robust acoustic-sensing-based ASD system is challenging due to the unsupervised nature of the training data, which contain only normal sound samples. Recent discriminative models based on machine identity (ID) classification have shown excellent ASD performance by leveraging strong prior knowledge such as machine ID. However, such strong priors are often unavailable in real-world applications, limiting the applicability of these models. To address this, we propose utilizing the imbalanced and inconsistent attribute labels from acoustic sensors, such as machine running speed and microphone model, as weak priors to train an attribute classifier. We also introduce an imbalanced compensation strategy to handle extremely imbalanced categories and ensure model trainability. Furthermore, we propose a score fusion method to enhance anomaly detection robustness. The proposed algorithm was applied in our DCASE2023 Challenge Task 2 submission, which ranked sixth internationally. By exploiting acoustic sensor data attributes as weak prior knowledge, our approach provides an effective framework for robust ASD when strong priors are absent.


Introduction
Acoustic-sensing-based anomalous sound detection (ASD) has become an increasingly important technique for predictive maintenance and condition monitoring in industrial environments, especially with the emergence of Industry 4.0. ASD aims to detect anomalous noises in acoustic signals that may indicate a fault or deterioration in mechanical equipment. When machinery begins to degrade, the vibrations and sounds emitted often change subtly before failure occurs. By identifying these anomalous acoustic patterns, ASD systems can provide early warning of impending faults, enabling proactive maintenance to avoid catastrophic breakdowns. Traditional manual acoustic monitoring is labor-intensive and prone to human variability. The emergence of automated ASD systems addresses these limitations, reducing personnel costs and providing more consistent machine health assessment.
In most real-world scenarios, abnormal samples cannot be obtained by damaging the machine, and the complex engineering environment introduces much noise into the sound samples. The operating settings of different machines are also diverse. Therefore, the main challenge of the acoustic-sensing-based ASD task is to detect anomalous sounds when only normal sound samples are provided as training data [1][2][3].
In addition to acoustic sensing data, anomaly detection and fault diagnosis methods designed for other data types are also worth considering as references. In [4], the authors innovatively utilized event-based cameras for anomaly detection, collecting vibration signals of machines in a contactless manner and providing a new perspective for machine condition monitoring. Facing similar challenges of imbalanced data and dynamic operations, the authors in [5] combined a self-supervised anomaly detector based on a local outlier factor (LOF) and a deep Q-network (DQN) supervised reinforcement learner to classify interturn short-circuit, local demagnetization, and mixed faults. Additionally, in the context of small datasets in industrial settings, the authors in [6] optimized the friction-drilling process through model ensembling in order to cope with incomplete information. Feature engineering is also crucial for rotary machine monitoring. The authors in [7] proposed a novel feature extraction method called weighted multi-scale fluctuation-based dispersion entropy for detecting faults in planetary gearboxes. In [8], permutation entropy was integrated with a flexible analytical wavelet transform for bearing defect detection. These real-world practices in industrial scenarios provide valuable references for ASD work.
To drive the development of acoustic-sensing-based ASD technology, a sub-challenge (Task 2) on 'Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring' has been run in the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) since 2020 [9]. In previous editions of the DCASE Challenge Task 2, the AutoEncoder (AE)-based anomaly detection system [9], built on a generative model, was widely used due to its simple design and efficient inference. In addition to acoustic sensing tasks, AEs have also been extensively utilized as unsupervised anomaly detectors in many other application domains [10,11]. However, AE-based anomaly detection relies on the assumption that anomalies are difficult to reconstruct. Given the inherent denoising characteristics of the AE [12], enhancing its representation capacity may inadvertently cause it to treat anomalies as noise, thus constraining AE-based anomaly detection performance. In addition to the AE, other generative models, such as IDNN [13], Efficient GAN [14], and Glow_Aff [15], detect anomalies by modeling the distribution of normal sounds and determining whether the sound under test falls within the range of the normal sound distribution. However, due to the complexity of anomalous sounds, it is difficult to model a stable distribution for anomaly detection [16], which limits the generative model.
Therefore, in order to better model the characteristics of normal sounds, systems based on discriminative models [17] were designed and achieved excellent performance. These models utilize powerful deep learning feature extraction networks, such as ResNet [18], MobileNetV2 [19], and STgram [20], for self-supervised classification tasks related to machine ID. During inference, abnormal samples are exposed due to the difficulty in classifying them, allowing for effective anomaly detection. Undoubtedly, in previous DCASE Task 2 evaluations, the competitive anomaly detection systems were also based on machine ID classification. The algorithm's success is attributed to the high-quality classification boundaries established by leveraging strong prior knowledge of machine ID. However, in practical applications, obtaining such high-quality prior knowledge of machine ID is often unfeasible. This raises an important question: how can we adapt anomaly detection algorithms based on discriminative models to operate effectively under limited prior knowledge conditions?
Instead, we need to design anomaly detection algorithms under weak prior knowledge conditions. According to the task setting of DCASE2023 Task 2, we cannot obtain high-quality prior knowledge such as machine ID, but we can obtain the attribute information of each audio clip, such as microphone number, machine running speed, machine load status, etc. We define the attribute information of the audio clips as weak prior knowledge. Unfortunately, there is no free lunch: intuitively, these attribute labels are extremely imbalanced, form complex categories, and cannot establish clear classification boundaries, but they are more accessible in the real world.
In this paper, we propose the attribute-driven imbalanced compensation (AIC) method, aiming to overcome the disadvantages of weak prior knowledge and use attribute labels to build discriminative models for anomaly detection. Our main contributions are as follows: (1) we propose an attribute classifier using the weak prior knowledge, making the application of discriminative models possible when machine ID labels are limited; (2) we propose the imbalanced compensation strategy to solve the common problem of extreme sample imbalance in attribute labels; (3) we propose a score fusion method based on AIC to enhance the robustness of the model.

Proposed Method
The AIC framework we propose contains an imbalanced compensation module, M attribute classifiers, and an ensemble attribute anomaly detector. An overview of the overall framework is shown in Figure 1. The overall framework consists of training and testing stages. First, the raw data are augmented via imbalanced compensation separately for each attribute. In the training stage, a classifier is trained on the augmented data of each attribute with the cross-entropy loss. In the testing stage, the augmented data are fed into the trained classifiers to obtain embeddings, with each attribute corresponding to one embedding space. Then, the embedding of each test sample is extracted through the trained classifiers, and KNN is used to calculate the score in the ensemble attribute anomaly detector.

Attribute Classifier
Although previous work using machine ID for classification achieved good results [1,2], machine ID is often unavailable in practical applications, for example, when only one machine is working. In this case, strong prior knowledge such as machine ID cannot be used. Nevertheless, machines still have weak prior knowledge that is easy to obtain, such as the attributes in Table 1; both 'ToyCar' and 'ToyTrain' have three types of weak attributes. Therefore, in this study, we propose to train the attribute classifier using such weak prior knowledge in the form of attributes. In real-world applications, different attributes of a machine may work under different operating statuses. These statuses can be easily collected and labeled. For example, as illustrated in Table 2, in the ToyCar dataset provided by the DCASE2023 Challenge, the attribute 'Mic' has two types of operating status: '1' and '2'. These status labels can be used to train an anomaly classifier, allowing it to distinguish between different operating statuses for each attribute. This approach helps the anomaly classifier learn the details of the training dataset more comprehensively and deeply, similar to observing the same object from different perspectives. The operating status information provided by DCASE [21,22] naturally accompanies machine operation and is readily accessible. Similar to previous machine ID classifiers [1], in our attribute classifier, we adopt the cross-entropy loss to classify each operating status of each attribute of each machine. As shown in the training classifier step of Figure 1, and in Tables 1 and 2, this can be described as training M classifiers for M attributes, where each classifier performs a $K_m$-class classification task and $K_m$ represents the total number of distinct operating statuses for the m-th attribute. We use ResNet18 [23] as the backbone encoder of the classifier to obtain the embedding of each attribute. The proposed m-th attribute classifier (AC) is trained with the cross-entropy loss function

$$\mathcal{L}_{AC}^{m} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k_m=1}^{K_m} y_{ik_m}\log p_{\theta_E}(x_{ik_m}),$$

where N denotes the total number of input samples for the m-th attribute, $x_{ik_m}$ and $y_{ik_m}$ represent the i-th input sample and its k-th operating status in attribute m, and $p_{\theta_E}(\cdot)$ is the softmax output of the encoder with parameters $\theta_E$. As depicted in the training detector step of Figure 1, following the independent training of the M attribute classifiers, we correspondingly learn M anomaly detectors, with each detector associated with one of the M trained attribute classifiers. We utilize KNN as the anomaly detector, with the embeddings extracted by the trained classifiers used as the training data. During testing, the test sample is passed through each of the M trained classifiers to obtain M embeddings, which are then fed into their corresponding M anomaly detectors to produce anomaly scores. The final aggregated anomaly score is obtained by taking the harmonic mean of the individual scores from each detector. However, classifiers trained solely on operating status information for machine attributes often struggle to converge. Taking the ToyCar dataset as an example, in Table 2, the information of the training samples is expressed in the form of 'Category: #samples', and we can see that different attributes have varying numbers of operating status categories. For example, 'Car model' has 10 categories from A1 to E2, 'Speed' has 5 categories, controlled by voltage levels from 2.8 V to 4.0 V, and 'Mic' includes 2 categories, 1 and 2.
Additionally, the number of samples for each operating status category is highly unbalanced. These factors make machine attributes weaker prior knowledge compared to machine IDs and pose several challenges when using operating status alone for training attribute classifiers: (1) the inconsistency in the number of attributes and operating status categories across different types of machines makes it difficult to establish consistent classification boundaries; (2) the severe sample imbalance within the same attribute and operating status category affects the classifier's ability to accurately characterize normal samples; (3) during testing, the machine attributes and their corresponding operating statuses are unknown, further complicating the classification process. As a result, classifiers trained only on operating status are prone to misclassifying normal unseen samples as anomalies. To address this issue, we need a method that strengthens the weak attribute knowledge as prior information in anomaly detection models.
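The per-attribute cross-entropy objective above can be sketched as follows. This is a minimal NumPy illustration in which random logits stand in for the ResNet18 encoder output; the batch size and the per-attribute class counts are illustrative assumptions, not the actual dataset statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    # numerically stable softmax over the class axis
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def attribute_ce_loss(logits, labels):
    """Cross-entropy for one attribute classifier.

    logits: (N, K_m) encoder outputs for the m-th attribute
    labels: (N,) operating-status index in [0, K_m)
    """
    p = softmax(logits)
    n = len(labels)
    return -np.mean(np.log(p[np.arange(n), labels] + 1e-12))

# M = 3 attributes, each with its own number of operating-status classes K_m
# (e.g. ToyCar: 'Car model', 'Speed', 'Mic')
K = [10, 5, 2]
losses = []
for K_m in K:
    logits = rng.normal(size=(32, K_m))    # stand-in for encoder features
    labels = rng.integers(0, K_m, size=32)
    losses.append(attribute_ce_loss(logits, labels))
print([round(float(l), 3) for l in losses])
```

In the actual system, one such loss is minimized independently per attribute, yielding M trained encoders whose embeddings feed the M KNN detectors described above.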

Imbalanced Compensation
To solve the problem of severely unbalanced samples among different operating statuses shown in Table 2, in this section, we propose an imbalanced compensation module to enhance the training of the proposed attribute classifier. The module mainly includes two parts: (1) maximum expansion uniform sampling, and (2) robust data transformation. In Figure 2, different symbol shapes denote different categories with originally imbalanced numbers of samples. After maximum expansion uniform sampling, the categories are balanced in terms of sample counts. Boundary shape changes following robust data transformation signify altered data distributions. In the first step, we identify the category with the maximum number of samples and expand all other categories to this level via oversampling. While this balances the quantities, the original data distributions remain unchanged, impeding effective training of the attribute classifier. Therefore, we subsequently apply robust data transformations, randomly augmenting the balanced data with 4 different techniques to alter the data distribution and simulate varied recording conditions. This enables successful training of the attribute classifier and enhances model robustness. In summary, our proposed pipeline tackles data imbalance through expansion and synthesizes robustness via data transformation, enabling learning from skewed real-world data.
The detailed algorithm of our proposed imbalanced compensation module is presented in Algorithm 1. Given an unbalanced dataset with N samples for one attribute of one machine, $\{x_{ik_m}\}_{i=1,k_m=1}^{N,K_m}$, where $N_{k_m}$ samples are classified into the $k_m$-th operating status category of the attribute, satisfying $N = \sum_{k_m=1}^{K_m} N_{k_m}$, the module outputs the final dataset $\{x_{ik_m}^{IC}\}_{i=1,k_m=1}^{R,K_m}$ after imbalanced compensation. Maximum Expansion Uniform Sampling: As shown in Figure 2, we first introduce maximum expansion uniform sampling to expand the original dataset into a balanced one. When we apply maximum expansion uniform sampling, we take the maximum value of $N_{k_m}$ and set it to $T = \max_{k_m} N_{k_m}$. Then, for the $N_{k_m}$ samples in each operating status, we copy the data to increase the number of samples by $\Delta_{k_m} = T - N_{k_m}$. At this point, the total number of samples is expanded to $N^* = K_m \times T$, and the number of samples among the operating statuses is balanced. The original samples $x_{ik_m}$ are represented as $x_{ik_m}^{MEUS}$ after applying maximum expansion uniform sampling. Therefore, maximum expansion uniform sampling solves the problem of severe imbalance in the training samples, allowing the classifier training to converge.
For example, the machine ToyCar has three attributes: 'Car model', 'Speed', and 'Mic'. We train three separate classifiers for this machine. For the 'Car model' attribute, there are 10 categories 'C1'-'E1' with extremely imbalanced quantities, as shown in Table 2. With imbalanced compensation, we first apply maximum expansion uniform sampling. Specifically, we take the number of samples in the largest category 'C1', which is 215. Then, we resample each category to have 215 samples, making the number of samples balanced across categories. Similarly, we apply the same procedure to the other two attributes of ToyCar, balancing the number of samples for each category within every attribute.
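The maximum expansion uniform sampling step can be sketched as follows. This is a minimal sketch over dummy clip identifiers; the category names and the counts of the two smaller categories are illustrative assumptions (only the largest count, 215, comes from the text).

```python
import numpy as np

rng = np.random.default_rng(0)

def maximum_expansion_uniform_sampling(samples_by_category):
    """Oversample every category up to the size of the largest one.

    samples_by_category: dict mapping operating-status label -> list of samples
    Returns a dict in which every category holds T = max category size samples.
    """
    T = max(len(v) for v in samples_by_category.values())
    balanced = {}
    for label, samples in samples_by_category.items():
        delta = T - len(samples)  # number of extra copies needed
        extra = [samples[i] for i in rng.integers(0, len(samples), size=delta)]
        balanced[label] = list(samples) + extra
    return balanced

# reduced illustration of an imbalanced 'Car model' attribute
data = {"C1": list(range(215)), "D1": list(range(40)), "E1": list(range(5))}
balanced = maximum_expansion_uniform_sampling(data)
print({k: len(v) for k, v in balanced.items()})  # every category now has 215
```

Only the quantities are equalized here; the distribution within each category is unchanged, which is why the robust data transformation step follows.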
Robust Data Transformation: Based on the data balanced by maximum expansion uniform sampling, we then perform robust data transformation to augment the environmental robustness of the training data and improve the generalization ability of the resulting attribute classifier model. When we apply robust data transformation, we first sample R times from the data after maximum expansion uniform sampling; according to the law of large numbers, when R is large enough, the distribution of samples after R samplings matches the balanced sample distribution after maximum expansion uniform sampling. At the same time, each sampling is accompanied by 4 data augmentations, which are

• AddGaussianNoise: directly adding a noise signal obeying a zero-mean Gaussian distribution to the original audio signal in the time domain. In practical environments, many background noises can be regarded as additive noise. After such noise is added to the audio signal, the data capture the varied and complicated acoustic characteristics of real environments.
• TimeStretch: changing the playback speed of the audio signal without altering its pitch.
• PitchShift: shifting the pitch of the audio signal without changing its duration.
• TimeShift: shifting the audio signal forward or backward along the time axis.

The above augmentations are represented by $T_1$, $T_2$, $T_3$, and $T_4$, respectively. Specifically, these four data augmentations are applied to each sampled sample with a 50% probability, distorting the distribution after sampling. To some extent, robust data transformation simulates unknown samples and enhances the robustness of the classifier, making it less prone to errors when classifying completely unknown samples in the test set. In summary, the samples after robust data transformation can be expressed as $x_{ik_m}^{IC} = T_4(T_3(T_2(T_1(x_{ik_m}^{MEUS}))))$, where each $T_j$ is applied with a 50% probability. Still taking ToyCar as an example, after applying maximum expansion uniform sampling, the number of samples is balanced across categories within each attribute. However, merely having a balanced quantity does not mean the data distribution is suitable for training classifiers. Therefore, we apply robust data transformation to transform the data. For each sample, there is a 50% chance of being applied with each of the transformations 'AddGaussianNoise', 'TimeStretch', 'PitchShift', and 'TimeShift', which can be combined. After applying robust data transformation to every sample, we discard the original samples. This completes imbalanced compensation.
In summary, the application of maximum expansion uniform sampling balanced the extremely imbalanced data across operating statuses. After maximum expansion uniform sampling, robust data transformation was applied, and each sampling was accompanied by 4 types of audio time-domain transformations, which simulated various noises in real situations to some extent and improved the robustness of the model at the data level. In addition, the oversampling technique increased the sample size and achieved class balance, which solved the problem that the classifier was difficult to train.
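The four augmentations and their 50% application probability can be sketched with crude NumPy stand-ins. Note that real implementations (e.g. the audiomentations toolkit) use phase vocoding and proper resampling; the linear-interpolation versions below, as well as all noise levels, stretch rates, and shift ranges, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, noise_std=0.01):            # T1
    return x + rng.normal(0.0, noise_std, size=x.shape)

def time_stretch(x, rate):                            # T2 (crude stand-in)
    n_out = int(len(x) / rate)
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def pitch_shift(x, semitones):                        # T3 (crude stand-in)
    rate = 2.0 ** (semitones / 12.0)
    y = np.interp(np.linspace(0, len(x) - 1, int(len(x) / rate)),
                  np.arange(len(x)), x)               # resample...
    return np.interp(np.linspace(0, len(y) - 1, len(x)),
                     np.arange(len(y)), y)            # ...then restore length

def time_shift(x, max_frac=0.25):                     # T4
    limit = int(len(x) * max_frac)
    return np.roll(x, rng.integers(-limit, limit + 1))

def robust_data_transformation(x, p=0.5):
    """Apply each of T1..T4 independently with probability p."""
    transforms = (
        add_gaussian_noise,
        lambda s: time_stretch(s, rate=rng.uniform(0.9, 1.1)),
        lambda s: pitch_shift(s, semitones=rng.uniform(-2, 2)),
        time_shift,
    )
    for t in transforms:
        if rng.random() < p:
            x = t(x)
    return x

clip = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
augmented = robust_data_transformation(clip)
```

Because each transform fires independently with probability p, roughly 1 in 16 samples passes through unchanged, while the rest receive one of 15 possible augmentation combinations.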

Ensemble Attribute Anomaly Detector
Currently, in the field of anomaly detection, probability-based confidence methods [24,25] have been widely used. Specifically, this kind of method trains a classifier on normal samples for classification. During testing, normal samples will be classified into known categories by the classifier, while abnormal samples are difficult to distinguish. Intuitively, abnormal samples will receive a lower confidence score. However, incorrectly classifying samples during testing has a catastrophic impact on the performance of anomaly detection [26]. In addition, this method relies on the quality of model training; in practical use, the model often overfits, which also affects the performance of anomaly detection.
Instead, we propose the ensemble attribute anomaly detector. The key is to combine the traditional machine learning algorithm KNN with the classifier obtained from deep learning to improve the fault tolerance and robustness of the model for anomaly detection. The detailed algorithm of our proposed module, KNN for anomaly detection from the perspective of the m-th attribute, is presented in Algorithm 2. Utilizing the data after imbalanced compensation, which are also used to train the classifiers, we train M separate KNN models. Specifically, the embeddings extracted from the M trained classifiers are leveraged as quality training data for each KNN. After training a KNN search tree for each model, test embeddings are extracted by passing the test sample through the corresponding classifier. The trained m-th KNN search tree is then utilized to find the topK nearest neighbors of the test embedding, constructing the set $T_k(e_{test})$. Subsequently, the Euclidean distance $d(e_{test})$ between $e_{test}$ and the samples in $T_k(e_{test})$ is computed to obtain the distance matrix $D_{test}$. The anomaly score is calculated as the maximum value in the distance matrix. Finally, we take the mean of the results obtained for each attribute in the score domain to obtain the final ensemble anomaly score. Although KNN is a classic machine learning method, it is prone to the curse of dimensionality when dealing with high-dimensional data, such as the audio in this work. Naturally, we thought of using the outstanding feature extraction capability of deep neural networks to reduce the dimension of the audio data to a low-dimensional space that KNN can characterize. Therefore, we use the attribute classifier mentioned above as a proxy task for the anomaly detection task and obtain supervision by distinguishing different operating statuses. After the training is completed, in the latent space, the samples of each operating status will gather together, while abnormal samples will be exposed because they are
difficult to distinguish. Notably, our work trains multiple classifiers, one for each attribute, which enables each classifier to distinguish abnormal samples from a different attribute perspective. Such an operation improves the fault tolerance of anomaly detection. Even if one classifier makes a mistake, the results of the other classifiers can compensate for the errors. Taking ToyCar as an example, after applying the imbalanced compensation module, we pretrain three separate classifiers for the three attributes, respectively. Meanwhile, using the training data after imbalanced compensation, three different sets of embeddings are extracted via the three classifiers, which we term embedding spaces. During testing, a test sample is fed into the three pretrained classifiers to obtain three test embeddings, each corresponding to one embedding space. Then, we apply the KNN algorithm to retrieve the topK nearest neighbors for each test embedding in its embedding space. The Euclidean distances between the test embedding and its topK neighbors are calculated. After obtaining the three sets of Euclidean distances, we take their average as the final anomaly score.
Therefore, for anomaly detection, the three different attributes provide three distinct detection perspectives on the same test sample. Fusing their scores allows the three perspectives to complement each other. Meanwhile, the imbalanced compensation technique enables classifier training and enhances classifier robustness through data augmentation. The resulting high-quality embeddings, together with the proposed ensemble attribute anomaly detector, boost anomaly detection performance.
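The ensemble attribute anomaly detector described above can be sketched in a few lines of NumPy. Random Gaussian embeddings stand in for the classifier outputs, and the embedding dimension and sample counts are illustrative assumptions; the scoring itself (max distance to the topK nearest neighbors per attribute, then averaged across attributes) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_anomaly_score(train_emb, test_emb, topk=5):
    """Max Euclidean distance from the test embedding to its topK
    nearest training embeddings, i.e. the score for one attribute."""
    d = np.linalg.norm(train_emb - test_emb, axis=1)
    return np.sort(d)[:topk].max()

def ensemble_score(train_spaces, test_embs, topk=5):
    """Average the per-attribute KNN scores into one anomaly score."""
    scores = [knn_anomaly_score(tr, te, topk)
              for tr, te in zip(train_spaces, test_embs)]
    return float(np.mean(scores))

# three attribute embedding spaces (e.g. ToyCar: Car model / Speed / Mic)
train_spaces = [rng.normal(size=(200, 16)) for _ in range(3)]
normal_test = [rng.normal(size=16) for _ in range(3)]           # in-distribution
anomalous_test = [rng.normal(size=16) + 6.0 for _ in range(3)]  # shifted away

s_norm = ensemble_score(train_spaces, normal_test)
s_anom = ensemble_score(train_spaces, anomalous_test)
print(s_norm < s_anom)  # the anomalous sample receives the larger score
```

A production version would use a KD-tree or ball-tree search (as in the PyOD KNN detector the paper cites) rather than brute-force distances, but the scores are identical.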

Datasets
We evaluate our proposed approach on the development dataset provided for the DCASE2023 Challenge Task 2 [3], which contains two subsets: ToyADMOS2 [21] and MIMII DG [22]. The development dataset includes normal and anomalous operating sounds of seven types of machines recorded in single-channel format: Fan, Gearbox, Bearing, Slide rail (Slider), Valve, ToyCar, and ToyTrain. For each of the 7 machine types, the dataset provides (1) 990 normal sound clips of 10 s length, downsampled to 16 kHz, for training in the source domain, (2) 10 normal sound clips for training in the target domain, and (3) 100 clips each of normal and anomalous sounds for testing. The source/target domain labels and attribute labels are provided for each sample in the training dataset but not in the test dataset. An overview of the datasets is shown in Figure 3. This work focuses on attribute labels, which are mentioned in Section 2. Unlike machine IDs, which may be unavailable, attribute labels are extracted from metadata. Attributes such as operating speed, operating voltage, etc., necessarily accompany the operation of the machine, so obtaining such labels is feasible in practice. Unfortunately, there is no free lunch: such labels are extremely imbalanced and inconsistent, posing challenges to the design of our anomaly detection system. The various labels in the dataset are shown in Figure 4.

Evaluation Metrics
To evaluate the performance of the proposed model, we adopt the area under the curve (AUC) of the receiver operating characteristic (ROC) as the evaluation metric [9]. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The AUC measures the entire two-dimensional area underneath the ROC curve, which represents the degree of separability between normal and anomalous instances. An AUC of 1 represents a perfect classifier, while an AUC of 0.5 represents a worthless classifier. Compared with metrics such as accuracy, AUC provides a more comprehensive evaluation of the model's performance in imbalanced scenarios. Therefore, AUC is more suitable for evaluating anomaly detection methods where negative samples dominate. Furthermore, the pAUC is also used in this work, which is calculated as the AUC over a low FPR range [0, p]. The AUC and pAUC are defined as

$$\mathrm{AUC} = \frac{1}{N_- N_+}\sum_{i=1}^{N_-}\sum_{j=1}^{N_+} H\big(S(x_j^+) - S(x_i^-)\big),$$

$$\mathrm{pAUC} = \frac{1}{\lfloor pN_- \rfloor N_+}\sum_{i=1}^{\lfloor pN_- \rfloor}\sum_{j=1}^{N_+} H\big(S(x_j^+) - S(x_i^-)\big),$$

where $\lfloor \cdot \rfloor$ is the flooring function, $S(x_i^-)$ and $S(x_j^+)$ denote the anomaly scores of normal and anomalous test clips (with the normal scores sorted in descending order), and $N_-$ and $N_+$ represent the numbers of normal and anomalous clips, respectively. The function $H(x)$ returns 1 when $x > 0$ and 0 otherwise. In practical acoustic-sensing-based ASD scenarios, a lower FPR is required. Therefore, we set p = 0.1.
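The pairwise AUC and pAUC definitions above translate directly into NumPy. The example scores are invented purely for illustration; with perfectly separated scores both metrics reach 1.0.

```python
import numpy as np

def auc_pauc(scores_normal, scores_anom, p=0.1):
    """AUC and pAUC from the pairwise definition used in DCASE Task 2.

    scores_normal: anomaly scores S(x_i^-) of normal test clips
    scores_anom:   anomaly scores S(x_j^+) of anomalous test clips
    """
    s_n = np.sort(scores_normal)[::-1]  # descending: low-FPR region first
    s_a = np.asarray(scores_anom)
    n_minus = len(s_n)
    # H(S(x_j^+) - S(x_i^-)) for every (normal, anomalous) pair
    H = (s_a[None, :] > s_n[:, None]).astype(float)
    auc = H.mean()
    k = int(np.floor(p * n_minus))      # normal clips inside the [0, p] FPR range
    pauc = H[:k].mean() if k > 0 else 0.0
    return float(auc), float(pauc)

normal = np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.12, 0.22, 0.18, 0.28])
anom = np.array([0.90, 0.80, 0.95, 0.85])
auc, pauc = auc_pauc(normal, anom)
print(auc, pauc)  # perfect separation -> 1.0 1.0
```

Sorting the normal scores in descending order means the first $\lfloor pN_- \rfloor$ rows of the pairwise matrix correspond exactly to the low-FPR region that pAUC restricts itself to.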

Implementation Details
For data preprocessing, we first converted the raw audio signals to log-mel-spectrograms using a short-time Fourier transform (STFT) with a window size of 1024 and a hop length of 512. A mel filterbank with 128 filters was applied, and the magnitude of the STFT was converted to decibels. In this study, we use the 128-dimensional log-mel-spectrogram as the input feature for the classifier. We adopt ResNet18 [23] as the backbone of our classifier. The model is optimized using the Adam [27] optimizer with a learning rate of 0.0001 and trained for 15 epochs with a batch size of 128. In addition, the data augmentation used in the IC module uses the audiomentations [28] toolkit, and the KNN training uses the PyOD toolkit [29], with topK = 5. In this work, the number of samples R for the imbalanced compensation module is set to 4096.
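The preprocessing pipeline (STFT with window 1024 and hop 512, a 128-filter mel filterbank, and a decibel conversion) can be sketched without audio libraries. This is a simplified NumPy stand-in for what a toolkit such as librosa computes; the Hann window, the triangular filter construction, and the dB floor are implementation assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr=16000, n_fft=1024, hop=512, n_mels=128):
    # frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))      # (n_frames, n_fft//2 + 1)

    # triangular mel filterbank between 0 Hz and sr/2
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)  # rising edge
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)  # falling edge

    mel_spec = mag @ fb.T
    return 20.0 * np.log10(np.maximum(mel_spec, 1e-10))  # to decibels

clip = np.random.default_rng(0).normal(size=160000)  # a 10 s clip at 16 kHz
feat = log_mel_spectrogram(clip)
print(feat.shape)  # (311, 128)
```

Each 10 s clip thus becomes a (frames × 128) feature matrix, which is the 128-dimensional input the ResNet18 backbone consumes.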

Experimental Results
In this section, we present a comprehensive analysis of the experimental results. We conducted experiments on the development dataset released for DCASE2023 Challenge Task 2 and compared the experimental results with the official baseline AE. In addition, to verify the effectiveness of the imbalanced compensation module, we applied the module to the AE baseline, called AEIC, and obtained competitive performance. Finally, we ensembled AEIC with our AIC model, whose performance ranked sixth in DCASE2023 Challenge Task 2.

Signal Analysis
Considering the abstract nature of audio signals, we are unable to analyze them through direct observation of the waveforms. Furthermore, the audio in acoustic-sensing-based ASD tasks cannot be distinguished as normal or anomalous by the human ear. Therefore, we transform the signals into the frequency domain through the STFT and observe them on the Mel scale, namely as the mel-spectrogram.
As shown in Figure 5, we plot the mel-spectrograms of the audio signals from the seven machines. Based on the mel-spectrograms, we can roughly classify the audio signals of these seven machines into stationary and non-stationary signals. It is noteworthy that the signal of the Slider machine appears to be non-stationary but is actually a periodic stationary signal [30]. The specific classification is as follows: ToyCar, Bearing, Fan, and Slider produce stationary signals, while ToyTrain, Gearbox, and Valve produce non-stationary signals.

Results
The proposed AIC model, as illustrated in Figure 1, utilizes the weak attribute labels to train a classifier without machine ID and embeds the data with the classifier to expose anomalies. The experimental results are shown in Table 3, where AE-MSE and AE-MAHA are the two official baselines. Both utilize autoencoders trained solely on normal samples with an MSE loss function. The difference lies in the testing phase, where AE-MSE uses the MSE as the anomaly score, while AE-MAHA employs the Mahalanobis distance. AC is the result of direct attribute classification without the IC module.
For the evaluation metrics, AUCs and AUCt denote the AUC of the model on the source domain data and target domain data, respectively. pAUC represents the AUC at a low FPR, as mentioned in Section 3. To measure the model performance under these three metrics of AUCs, AUCt, and pAUC, we take their harmonic mean, denoted by 'hmean'. Similarly, to measure the model performance across the seven machines, we take the harmonic mean of each metric across the seven machines. This is also denoted by 'hmean' for consistency. By taking the harmonic mean of the metrics on each subset, we summarize the performance across machines into a single representative value. The experimental results show that the proposed AIC model is competitive compared with the two AE-based baselines, demonstrating the potential of discriminative models based on weak attribute labels for anomaly detection tasks. Notably, the attribute classifier performing direct classification of the raw attribute labels achieved poor performance, indicating that the extreme sample imbalance and other weak-label issues of the attributes make the classifier untrainable. However, the AIC model with the imbalanced compensation module gained significant performance improvements over the attribute classifier. This shows that the proposed imbalanced compensation module effectively alleviates the weak-label problem of the attributes. Moreover, as observed in the baseline results and other previous findings in the literature [32], it is interesting to find that the experimental results of the seven machines show inconsistent patterns. For example, the best-performing model on ToyCar is not the same as that on ToyTrain. This is mainly because the seven machines have different acoustic characteristics, which makes it difficult to design a universally applicable model.

Imbalanced Compensation Module Analysis
As analyzed above, the proposed imbalanced compensation module plays a crucial role in the entire anomaly detection model. Therefore, this section further explores the effectiveness of the imbalanced compensation module and how to determine the number of samples R in the imbalanced compensation module.
First, the imbalanced compensation module is applied to the two official baselines, AE-MSE and AE-MAHA. Although the AE itself does not utilize attribute labels, it has source domain and target domain labels, with imbalanced scenarios similar to those of attribute labels. Therefore, the imbalanced compensation module is applied according to the domain labels. As shown in Table 4, AEIC-MSE and AEIC-MAHA are the two baselines with the imbalanced compensation module applied, which achieved significant improvements over the original baselines. This demonstrates the universal effectiveness of the proposed imbalanced compensation module for both generative and discriminative models.
Furthermore, the performance changes of the AIC model with different numbers of imbalanced compensation samples R ∈ {1024, 2048, 4096, 8192} are explored. The harmonic mean of the AUC on the seven machines is used as the evaluation metric. We finally chose 4096 samples. As shown in Figure 6, surprisingly, the model performance on the four stationary-signal machines, ToyCar, Bearing, Fan, and Slider, increases with the increasing number of samples. In contrast, the performance on the three non-stationary-signal machines, ToyTrain, Gearbox, and Valve, cannot be improved with more samples. This could be because the time-domain data transformation of non-stationary signals causes distortion, while that of stationary signals helps to improve the model's robustness.

Visualization
To better demonstrate the performance of AIC, t-SNE [33] is used to visualize the training and test sets, as shown in Figure 7. Comparing Figure 7a,b, taking Fan as an example, we find that, for the attribute classifier without the imbalanced compensation module, a small portion of normal samples are misclassified into areas close to abnormal samples, which damages anomaly detection performance [26]. For AIC with imbalanced compensation applied, however, these misclassified normal samples disappear. This shows that the proposed AIC model alleviates the problem of normal-sample misclassification.
Comparing Figure 7c,d, taking Slider as an example, we find that, compared with the attribute classifier, AIC with imbalanced compensation applied forms a more compact data distribution. In anomaly detection tasks, the more compact the distribution of normal samples, the lower the density of abnormal samples, which is conducive to detecting them [32]. This shows that the proposed AIC model helps normal samples form a more compact distribution.

Ensemble
Finally, anomaly detection is performed by fusing the scores of the AEIC-MAHA and AIC models through a model ensemble. As shown in Table 5, this fusion of generative and discriminative models significantly improves anomaly detection performance, ranking sixth internationally in the DCASE2023 Challenge Task 2. The fused score can be expressed as

S_ensemble = S_AEIC + λ · S_AIC (7)

By sweeping the value of λ, we obtained the optimal performance of the ensemble model; in this work, we chose λ = 0.3. Figure 8 shows the relationship between λ and the AUC of the seven machines, where AUC refers to the hmean of AUCs, AUCt, and pAUC. The experiments show that the AIC model is highly complementary to the AE model.
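The λ sweep of Eq. (7) can be sketched as follows. The per-clip scores below are synthetic placeholders (not the paper's scores), and AUC is computed as the empirical probability that a random anomaly outscores a random normal clip; the sketch only illustrates that an intermediate λ can beat either model alone when the two score sets are complementary.

```python
def auc(normal_scores, anomaly_scores):
    """Empirical AUC: probability a random anomaly outscores a random normal."""
    wins = sum(
        (a > n) + 0.5 * (a == n)
        for a in anomaly_scores
        for n in normal_scores
    )
    return wins / (len(normal_scores) * len(anomaly_scores))

# Synthetic per-clip scores (illustrative only): each entry is
# (S_AEIC, S_AIC) for one test clip.
normal  = [(0.2, 0.9), (0.3, 0.2), (0.1, 0.4), (0.4, 0.1)]
anomaly = [(0.8, 0.3), (0.5, 0.9), (0.9, 0.8), (0.3, 0.95)]

def fuse(pairs, lam):
    # Eq. (7): S_ensemble = S_AEIC + lambda * S_AIC
    return [s_aeic + lam * s_aic for s_aeic, s_aic in pairs]

best = max(
    (auc(fuse(normal, lam), fuse(anomaly, lam)), lam)
    for lam in [0.0, 0.1, 0.3, 0.5, 1.0]
)
print(best)  # (best AUC, best lambda)
```

On this toy data, neither λ = 0 (AEIC alone) nor λ = 1 is optimal, mirroring the complementarity observed in Figure 8.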

Figure 1. The framework of the proposed AIC.
Figure 2 illustrates the schematic diagram of the effects of imbalanced compensation on acoustic-sensing-based ASD training data.

Figure 2. A schematic diagram of the effects of imbalanced compensation on data. 'Category' means the category of operating status.

Each operating status k_m contains N_k_m samples; as k_m varies, N_k_m takes different values, resulting in an imbalance of data within each attribute.

Algorithm 1. Proposed imbalanced compensation method in the m-th attribute.
Input: An unbalanced dataset of all N samples in the m-th attribute, {x_ik_m}, i = 1..N, k_m = 1..K_m
Output: A balanced dataset of R samples after imbalanced compensation (IC) in the m-th attribute, {x_ik_m^IC}, i = 1..R, k_m = 1..K_m
1: Find the maximum count of operating status, T = max_k N_k_m
2: Calculate the sample increment Δ_k_m = T − N_k_m for each operating status
3: Expand the sample number of each operating status to N_k_m + Δ_k_m (= T)
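The counting logic of this balancing step can be sketched as follows. The `augment` callable here is a stand-in for the paper's audio transforms; the function names and toy data are illustrative, not from the paper.

```python
import random
from collections import Counter, defaultdict

def imbalanced_compensation(samples, labels, augment, seed=0):
    """Sketch of the balancing step: for each operating status k,
    add T - N_k augmented copies, where T = max_k N_k.
    `augment` is a placeholder for the actual audio transforms."""
    rng = random.Random(seed)
    by_status = defaultdict(list)
    for x, k in zip(samples, labels):
        by_status[k].append(x)
    counts = Counter(labels)
    T = max(counts.values())                 # step 1: T = max N_k
    out_x, out_y = list(samples), list(labels)
    for k, xs in by_status.items():
        delta = T - counts[k]                # step 2: increment Delta_k
        for _ in range(delta):               # step 3: expand count to T
            out_x.append(augment(rng.choice(xs)))
            out_y.append(k)
    return out_x, out_y

# Toy data: status "A" has 4 clips, "B" only 1.
xs = [[0.1], [0.2], [0.3], [0.4], [9.0]]
ys = ["A", "A", "A", "A", "B"]
bx, by = imbalanced_compensation(xs, ys, augment=lambda x: [v * 1.01 for v in x])
print(Counter(by))  # both statuses now have 4 samples
```

After compensation every operating status reaches the count T of the majority status, so a classifier no longer collapses onto the dominant category.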

• TimeStretch: Changes the speed of audio without altering its pitch by a pre-defined rate. Here, we randomly applied rates in the range [0.8, 1.25].
• PitchShift: Randomly increases or decreases the original pitch. Here, we vary the pitch by pre-defined semitones in the range [−4, 4].
• TimeShift: Shifts the entire audio signal forward or backward. Here, the shift range was [−0.5, 0.5] of the total signal length.
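Two of these transforms can be approximated in plain Python as a minimal sketch. Note the caveats: a true pitch-preserving TimeStretch needs a phase vocoder, so the linear-interpolation resampling below is only a crude stand-in that changes both speed and pitch, and PitchShift is omitted entirely because it requires proper DSP; the function names and ranges mirror the list above but the implementation is illustrative.

```python
import random

def time_stretch(x, rate):
    """Resample to len(x)/rate samples via linear interpolation.
    rate > 1 shortens (speeds up), rate < 1 lengthens (slows down).
    Crude stand-in: unlike the real TimeStretch, this also shifts pitch."""
    n_out = max(1, int(round(len(x) / rate)))
    out = []
    for i in range(n_out):
        pos = i * (len(x) - 1) / max(1, n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append(x[lo] * (1 - frac) + x[hi] * frac)
    return out

def time_shift(x, fraction):
    """Circularly shift the signal by a fraction of its length, in [-0.5, 0.5]."""
    k = int(fraction * len(x)) % len(x)
    return x[-k:] + x[:-k] if k else list(x)

def random_augment(x, rng):
    x = time_stretch(x, rng.uniform(0.8, 1.25))   # TimeStretch rate range
    x = time_shift(x, rng.uniform(-0.5, 0.5))     # TimeShift fraction range
    return x

rng = random.Random(0)
clip = [float(i) for i in range(16)]
aug = random_augment(clip, rng)
print(len(aug))
```

In practice an audio augmentation library would be used instead; the sketch just makes the parameter ranges in the list above concrete.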

Figure 4. The taxonomy of various labels in the dataset.

Figure 6. The relationship between the number of samples R in the imbalanced compensation module and model performance.
Figure 7. In each panel, dots of different colors represent, respectively: normal samples of the source domain in the training set, normal samples of the target domain in the training set, normal samples of the source domain in the test set, normal samples of the target domain in the test set, abnormal samples of the source domain in the test set, and abnormal samples of the target domain in the test set. The embeddings of the attribute classifier and AIC are extracted separately for visualization. Since the model's training objective is attribute classification, abnormal samples are hard to classify into any category, thereby exposing them: abnormal samples lie far from normal samples, forming lower-density areas in the visualization.

Figure 8 .
Figure 8. System ensemble performance with the varying of score fusion weight λ.

Table 1. Attributes of different machines.
Algorithm 2. Anomaly score computation in the m-th attribute.
Input: Balanced training data {x_ik_m^IC}, i = 1..R, k_m = 1..K_m; test data x_test; trained m-th classifier
Output: Anomaly score S_m for test sample x_test
1: Extract embeddings {e_ik_m^IC} by the m-th classifier from {x_ik_m^IC}
2: Extract the embedding e_test of x_test by the m-th classifier
3: Build a search tree TREE on the training embeddings
4: Find the topK nearest neighbors of e_test using TREE
5: Let T_k(e_test) be the set comprising the topK nearest neighbors of e_test
6: Compute the distance d(e_test) between e_test and the samples in T_k(e_test)
7: Obtain the score set D_test of test distances
8: return Anomaly score S_m = max(D_test)
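The nearest-neighbor scoring steps can be sketched in a few lines. This sketch uses a brute-force distance scan instead of the tree index, and the 2-D embeddings are toy values chosen for illustration.

```python
import math

def knn_anomaly_score(train_embs, test_emb, topk=2):
    """Sketch of the scoring steps: brute-force nearest-neighbour search
    (a tree index would be used at scale), then the maximum distance
    over the topK nearest neighbours as the anomaly score."""
    dists = sorted(math.dist(test_emb, e) for e in train_embs)
    d_test = dists[:topk]          # distances to the topK nearest neighbours
    return max(d_test)             # anomaly score S_m = max(D_test)

# Normal embeddings cluster near the origin; an anomaly sits far away.
train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
print(knn_anomaly_score(train, (0.05, 0.05)))   # small: close to normal cluster
print(knn_anomaly_score(train, (3.0, 3.0)))     # large: far from all normals
```

Taking the maximum over the topK neighbours (rather than the single nearest distance) makes the score less sensitive to one accidentally close training embedding.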

Table 5. Performance of the system ensemble. 'hmean' represents the harmonic mean of AUCs, AUCt, and pAUC.