Semi-Supervised Machine Condition Monitoring by Learning Deep Discriminative Audio Features

Abstract: In this study, we aim to learn highly descriptive representations for a wide set of machinery sounds and exploit this knowledge to perform condition monitoring of mechanical equipment. We propose a comprehensive feature learning approach that operates on raw audio, by supervising the formation of salient audio embeddings in latent states of a deep temporal convolutional neural network. By fusing the supervised feature learning approach with an unsupervised deep one-class neural network, we are able to model the characteristics of each source and implicitly detect anomalies in different operational states of industrial machines. Moreover, we enable the exploitation of spatial audio information in the learning process, by formulating a novel front-end processing strategy for circular microphone arrays. Experimental results on the MIMII dataset demonstrate the effectiveness of the proposed method, reaching a state-of-the-art mean AUC score of 91.0%. Anomaly detection performance is significantly improved by incorporating multi-channel audio data in the feature extraction process, as well as training the convolutional neural network on the spatially invariant front-end. Finally, the proposed semi-supervised approach allows the concise modeling of normal machine conditions and accurately detects system anomalies, compared to existing anomaly detection methods.


Introduction
Mechanical equipment usually operates while exposed to hazardous or otherwise challenging working environments, which happen to affect its reliability and can cause system breakdowns with significant safety and economic impact [1,2]. Continuous monitoring and periodic manual inspections are essential practices to prevent any potential issues and ensure the proper maintenance of the equipment, facilitating the operational continuity of industrial production [3]. Automatic machine condition monitoring has long attracted the interest of researchers and engineers, anticipating the development of intelligent and generic methods to promptly detect and diagnose faults in mechanical equipment [4].
Audio signals encompass a substantial amount of machinery information and play a key role in manual maintenance procedures, implying that the presence of anomalous sounds might indicate a mechanical malfunction. As such, audio is a viable source and worthy of consideration in automated machine condition monitoring (CM) and anomaly detection (AD) [5,6]. Real-world industrial conditions pose great challenges to automatic failure detection, as surrounding industrial noise may lead to a low signal-to-noise ratio and eventually impair the performance of audio-driven CM systems [7]. Intelligent signal analysis along with the exploitation of spatial audio information (signals captured by multiple microphones) are essential strategies to address the emerging need for robust and stable condition monitoring. Improvements in automatic machine condition monitoring can be expected, due to the significant progress demonstrated by data-driven and deep learning methods in application areas that can generate massive amounts of data [8,9].
Data-driven AD can be categorized into supervised, semi-supervised, and unsupervised approaches [10]. In supervised approaches, an exhaustive set of normal and anomalous samples is known in advance. Hence, the task is equivalent to a binary classification problem, where anomalous and normal sample representations are separated, under the assumption that anomalous test samples are drawn from the same distribution as in training. Although this can be convenient in some scenarios, it is considered an unrepresentative and unsuitable case for real-world applications of AD, due to the difficulty in obtaining thorough data structures for anomalous conditions.
In unsupervised approaches, available data consists only of normal samples, making it equivalent to a one-class classification task [11]. In such a problem, the goal is to find a concise approximation of the underlying distribution. During inference, the samples that deviate from this profile are considered anomalous. That is, the construction of an unsupervised normality model is beneficial in scenarios where many regular instances are available [12]. Contrarily, the lack of counterexamples in the development dataset poses major differences between statistical and typical methods used for event classification and detection [13].
Semi-supervised approaches lie between supervised and unsupervised AD. They incorporate knowledge from diverse sources in order to precisely model the normal class distribution [14]. At inference, the abnormality of a novel instance is determined using a similarity measure between the training data distribution and the corresponding instance representation. There are also variants of this scenario, in which a small subset of irregular samples might be available, to further refine the detection boundary [15]. Compared to fully-supervised and unsupervised approaches, we argue that semi-supervised AD methods hold great potential in the era of deep learning, as the amount of available data highly affects the detection performance [16]. Semi-supervised methods also allow the exploitation of diverse and large datasets, since they make no assumption about the anomaly class patterns [17]. Hence, generalization to novel anomalies is encouraged by not over-fitting to labeled anomalies [14].
AD methods can be roughly divided into statistical [18,19], neighbor-based [1], and reconstruction-based methods [20]. Statistical methods determine the probability that an object is anomalous based on its statistical properties. Namely, they assume that low-density areas of the normal class distribution indicate a high probability of representing abnormal conditions. Neighbor-based methods typically determine the abnormality of a novel instance based on an arbitrary number of nearest neighbors, assuming that the normal class samples might not be tightly clustered [21]. Lastly, reconstruction-based methods consider a compression-decompression model trained on normal-class data. Anomalous patterns are discovered by decompressing the latent representation of a sample at inference and computing the residual error between input and output distributions.
In this paper, we introduce a two-stage approach based on deep neural networks for audio-driven anomaly detection, which consists of (a) supervised embedding learning, and (b) class modeling. The first stage is fully supervised and can also be interpreted as a dynamic feature extraction method, which can be adapted to different audio recognition tasks [22]. The second stage consists of a one-class classifier that explicitly processes samples that correspond to the normal operating condition of a specific machine. The decision module does not consider out-of-distribution samples, but it does incorporate knowledge from the previous fully-supervised learning stage. For this reason, we classify our approach as being semi-supervised.
The main contributions of this study are summarized as follows:
• We formulate a novel method for semi-supervised audio-driven AD, which is solely based on deep neural networks. The proposed method exploits data from distinct sources using a modified objective function to train deeper neural networks;
• We demonstrate the effectiveness of one-dimensional deep convolutional neural networks to learn useful descriptions of real-world machine equipment from their emitted sound by processing raw audio directly;
• We explore the use of multi-channel audio recordings to exploit spatial audio information and propose a naive front-end training strategy that enables the network to effectively learn spatial and spectro-temporal audio features;
• We show that by jointly supervising a latent state of the deep convolutional neural network and the corresponding classification output, the model elicits highly discriminative features. This approach is applicable to a wide range of audio recognition tasks in the context of transfer learning.
In most of the above research, a feature extraction stage is first employed to capture the most important temporal, spectral, and cepstral signal properties in a low-dimensionality space [31]. Mainly, these features are selected to reflect the particular conditions of the equipment, imposing the framework to be either machine-specific or machine type-specific [32]. Although this approach reconciles the system performance and interpretability, it lacks generalizability and hinders further practical applications. Thus, there is a shortage of data-driven condition monitoring methods in recent literature that can be considered general, in the sense that they can be applied to a wide scope of machinery with no or minimal modifications.
Second, a decision module is employed to detect out-of-distribution samples, which can be regarded as a one-class classifier [33,34]. Studies with one or an ensemble of support vector machines (SVMs) have been recently conducted in attempts to model the distribution of machine operating conditions for fault assessment and condition monitoring [35]. However, these methods are limited due to the SVM's sensitive hyper-parameters and susceptibility to noise.
With the surge in deep learning and semantic audio analysis [36-38], recent studies have focused on the fault detection task through machine operating sounds using neural networks [12,39,40]. Recently, ref. [39] employed a neural network with an autoencoder structure to detect abnormalities in the emitted sound of a surface-mount device. Moreover, ref. [40] proposed an objective function based on the Neyman-Pearson lemma to train an autoencoder, formulating the AD task as a statistical hypothesis test. Ref. [12] provided an ensemble of convolutional autoencoders for audio-driven anomaly detection, which follows a cross-mapping strategy between different parts of the frequency spectrum. In the above approaches, the autoencoders are trained to reconstruct regular samples by learning an efficient representation of the input vector. Then, the model reconstruction residual error is used as a similarity metric to detect machine malfunctions. In these approaches, the role of time-frequency audio data pre-processing and feature extraction is a crucial factor for system performance and generalization [41].
A new method has recently been introduced for neural network-based anomaly detection: the deep support vector data description (Deep SVDD) [10,42]. In Deep SVDD, a neural network is trained to extract representations of the input data that satisfy a one-class classification objective. This can be interpreted as minimizing the volume of a hypersphere that encloses the training data feature representations [43]. This way, the network is forced to extract the common factors of variation since it must closely map the data points to a hypersphere.
In the field of similarity learning, the use of embeddings has been explored as a method to map objects into specific groups of similar properties and features [44,45]. Unlike clustering, this approach benefits from supervising both the embedding extraction stage and the cluster formation in a joint training framework. Hence, the training aims to extract salient features from the input data that support the formation of class-determined clusters based on their corresponding similarity [46]. Moreover, there is no need for employing a separate optimization algorithm for clustering, since the embedding extraction stage is part of the unified model structure and is efficiently trained through stochastic gradient descent. Depending on the task, different similarity metrics can be exploited for adapting the clusters to an auxiliary target distribution [47].
A similarity function has been proposed by [48] to detect anomalous sounds using an attention-based feature extractor for measuring similarity in embedded space. The advantage of this approach is that it is robust against changes in time-frequency structure (i.e., absorbing time-frequency stretching in the normal-class modeling).

Overview and Motivation
Our approach to audio-driven anomaly detection can be divided into two stages: feature learning and class modeling. Instead of using a direct approach to anomaly detection, which is to model a one-class classifier on normal class samples, we introduce a two-stage method that provides the one-class classifier with dynamically extracted feature vectors. First, we aim to learn highly descriptive representations for a wide set of machine sounds in a classification framework. Second, we use this knowledge as a feature extraction stage to achieve concise normality modeling for each class. We propose a comprehensive learning approach that leverages information from other classes by enabling the formation of distinct clusters for each machine in an arbitrary low-dimensional space. Hence, the intermediate vector space should apparently be interpreted as a description of deviating examples, by enclosing the target anomalies.
The latter stage consists of class modeling for individual machines. The proposed one-class classifier consists of a deep neural network that takes as inputs the normal-class latent embeddings for a specific machine and maps them to an arbitrary low-dimensional vector space, so that the output distribution density is maximized. In this case, the Euclidean distance and cosine similarity can be effectively used as similarity metrics [49,50].
The proposed semi-supervised anomaly detection method is depicted in Figure 1. The following sections describe the data corpus used for the experiments, the proposed model architecture for learning discriminative embeddings from multi-channel raw audio (RawdNet), and the deep one-class classifier based on the SVDD premise.

Figure 1. RawdNet: deep neural embeddings from raw audio (panels: Training, Inference).

Discriminative Features from Multi-Channel Raw Audio
For a microphone array of $C \geq 2$ microphones, we first split the audio signal $x^{(i)}$ of each microphone into non-overlapping segments of length $L$, as:
$$x^{(i)}_t = x^{(i)}[tL : (t+1)L],$$
where $t$ is the time index and the operator $x[a : b]$ selects the values of $x$ between the indices $a$ and $b$. Thus, the model input for each iteration consists of the tensor $x_t$, as:
$$x_t = \left[x^{(1)}_t, x^{(2)}_t, \ldots, x^{(C)}_t\right] \in \mathbb{R}^{L \times C}.$$
The model processes the input tensor $x_t$ using a feed-forward architecture of convolutional blocks, as shown in Figure 2. The core of each convolutional block consists of a temporal convolutional layer (Conv1D), a normalization layer (Layer Norm), and the rectified linear unit (ReLU) activation function $g(x) = \max(x, 0)$ [51]. In general, a temporal convolutional layer [52] is mathematically defined as:
$$\mathrm{Conv1D}(x)_c = w_c * x + b,$$
where $w_c$ is a two-dimensional tensor with learnable parameters and $*$ is the convolution operator. In our implementation, we set the bias term $b \in \mathbb{R}$ to zero, as it is negated by the following normalization layer. Then, a downsampling operation is performed through a max-pooling layer [53]. The max-pooling operator passes forward the maximum activation over non-overlapping rectangular regions of size $P = 4$:
$$\mathrm{MaxPool}(x)[n] = \max_{0 \le p \le P-1} x[nP + p], \qquad n = 0, \ldots, \lceil L_{\mathrm{in}} / P \rceil - 1,$$
where $L_{\mathrm{in}}$ is the length of the pooling input and $\lceil \cdot \rceil$ is the ceiling operator. Depending on the depth of the network, convolutional blocks can comprise more than one core unit before the downsampling operation, to enable deeper training [54]. The final model consists of 5 convolutional blocks with a total of 10 convolutional layers with learnable parameters. In the proposed architecture, the five convolutional blocks include 32, 32, 64, 128, and 256 convolving kernels, respectively. The first convolutional layer includes kernels of size 81 with a stride of 4 samples. The remaining convolutional layers share the same configuration, where small kernels of size 3 and unit striding are employed. Layer normalization [55] is a critical component of the RawdNet architecture, computing the normalization statistics separately for each channel.
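The segmentation and pooling steps described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, signals are stored channel-first as nested lists (the text uses an $L \times C$ layout), and dropping an incomplete trailing segment is an assumption.

```python
import math

def split_segments(x, seg_len):
    """Split one channel into non-overlapping segments of seg_len samples.
    Assumption: trailing samples that do not fill a segment are dropped."""
    n = len(x) // seg_len
    return [x[t * seg_len:(t + 1) * seg_len] for t in range(n)]

def stack_channels(channels, seg_len):
    """Return inputs[t][i]: the t-th segment of microphone i."""
    per_channel = [split_segments(ch, seg_len) for ch in channels]
    n = min(len(s) for s in per_channel)
    return [[per_channel[i][t] for i in range(len(channels))] for t in range(n)]

def max_pool(x, pool=4):
    """Non-overlapping max pooling; a final partial window is kept,
    so the output length is ceil(len(x) / pool)."""
    n_out = math.ceil(len(x) / pool)
    return [max(x[n * pool:(n + 1) * pool]) for n in range(n_out)]
```

For a 2 s segment, `seg_len` would be twice the sampling rate; the pooling helper only illustrates the ceiling behaviour of the output length.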
Then, mean-pooling is applied to the output of the last convolutional block, followed by two linear layers with no activation function. Moreover, dropout with a probability of 0.2 is applied before each linear layer during training. At inference, the model output $y_t$ can be formally defined as:
$$z_t = W_z^{\top}\, \mathrm{MeanPool}\big(f_\theta(x_t)\big) + b_z, \qquad y_t = W_y^{\top} z_t + b_y,$$
where $f_\theta : \mathbb{R}^{L \times C} \to \mathbb{R}^{K_5 \times L_5}$ denotes the CNN temporal encoder function parameterized by $\theta$, $\mathrm{MeanPool}$ averages over the $L_5$ time steps, the length $L$ of each segment corresponds to a 2 s audio clip, and $K_5 = 256$ and $L_5 = 31$ denote the number of kernels and the signal length at the output of the fifth convolutional block. The matrices $W_z \in \mathbb{R}^{K_5 \times K}$, $W_y \in \mathbb{R}^{K \times N}$ are the learnable weights and $b_z \in \mathbb{R}^{K}$, $b_y \in \mathbb{R}^{N}$ are the bias terms of the projection head layers, respectively.

Training Objective
In a supervised setting, we attempt to classify the training samples to their corresponding machine ID label. Samples are drawn only from normal operating conditions. We mainly focus on the latent representation z to obtain the discriminative embeddings, while the model output y assigns each sample to the ground truth label of one of the N machines. By obtaining the embeddings in a latent state of the network and not from the model output, the embedding dimensionality can be arbitrarily chosen based on the task complexity. A determining factor of this architecture is the absence of non-linear activation functions, such as ReLU, between the latent representation and the model output. Namely, the model output y corresponds to a linear transformation of the embeddings z, which prevents over-training of the projection head and assists the learning of prominent features by the convolutional layers.
The training objective for a multi-class classification problem usually relies on minimizing the cross-entropy loss function for each class. When the training converges, both the model predictions and the latent representations should be separable to the greatest possible extent. However, neither output is discriminative enough to provide meaningful information for further processing, since significant intra-class variability in the Euclidean sense is present [56,57]. To remedy this, we focus on minimizing the intra-class distances of the model projection on semantic labels [58].
Center loss enables the model to form qualitative clusters of the target classes into a continuous vector representation, by penalizing the distances between the latent features and their corresponding class centers. The center loss function can be expressed as:
$$L_C = \frac{1}{2} \sum_{i=1}^{m} \lVert z_i - c_{y_i} \rVert_2^2,$$
where $c_{y_i}$ denotes the class prototype of the i-th sample (referred to as c in Figure 1), $m$ denotes the length of the mini-batch, and $z_i$ and $x_i$ denote the encoder and projection head outputs for the i-th sample, respectively. The standard cross-entropy objective function $L_{CE}$ with the SoftMax function is employed to supervise the model output y, as:
$$L_{CE} = -\sum_{i=1}^{m} \log \frac{\exp\big(x_i[\hat{y}_i]\big)}{\sum_{j=1}^{N} \exp\big(x_i[j]\big)},$$
where $\hat{y}_i$ denotes the ground truth label for the i-th sample. Cross-entropy loss and center loss can be used to jointly supervise the training process. The resulting loss function can be written as:
$$L_{CEC} = L_{CE} + \lambda L_C,$$
where $\lambda$ is a scalar used for balancing the two loss functions. The $L_{CEC}$ loss function considers both the intra-class compactness of the latent representation, which is encouraged by the center loss, and the inter-class separability, which is enforced by the cross-entropy term in the linear mapping of z. Hence, discriminative embeddings for each machine ID are obtained. The parameter $\lambda$ is considered a network hyperparameter, which can be changed during training according to some schedule. For this task, we found that joint training with static and equal weighting of the class separability and intra-class compactness objectives ($\lambda = 1$) results in faster convergence.
Ideally, $c_{y_i}$ would represent the class centers of the training data. However, computing this quantity over the entire dataset would be computationally expensive. Thus, we randomly initialize $c_{y_i}$ and update it in every batch using the stochastic gradient descent (SGD) optimization algorithm with respect to $L_{CEC}$. Moreover, the model parameters and class centers are updated with different learning rates ($l_r = 0.0001$, $l_c = 0.01$) to achieve robustness to sample perturbation and address potential scalability problems [59].
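The joint supervision and the batch-wise center update can be illustrated with a small, dependency-free sketch. It assumes the usual center-loss form (half the squared Euclidean distance to the class center), sums both terms over the mini-batch, and applies one plain SGD step to the centers; the function names and the summation convention are assumptions, not details taken from the text.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def center_loss(z, center):
    """Half the squared Euclidean distance between embedding and class center."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(z, center))

def joint_loss(batch_logits, batch_z, labels, centers, lam=1.0):
    """Cross-entropy plus lam times center loss, summed over the mini-batch."""
    l_ce = sum(cross_entropy(y, k) for y, k in zip(batch_logits, labels))
    l_c = sum(center_loss(z, centers[k]) for z, k in zip(batch_z, labels))
    return l_ce + lam * l_c

def update_centers(batch_z, labels, centers, lr_c=0.01):
    """One SGD step on the centers; the gradient of the center loss
    with respect to center k is the sum of (c_k - z_i) over its samples."""
    grads = [[0.0] * len(c) for c in centers]
    for z, k in zip(batch_z, labels):
        for d in range(len(z)):
            grads[k][d] += centers[k][d] - z[d]
    return [[c[d] - lr_c * g[d] for d in range(len(c))]
            for c, g in zip(centers, grads)]
```

In a real training loop the encoder weights would be updated by the same loss with their own, smaller learning rate, which is why the center step is kept as a separate function here.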

Data Augmentation for Spatial Invariance
Circular microphone arrays are quite common for recording multichannel audio, as they encourage the exploitation of spatial information contained in complex acoustic environments [60]. Techniques, such as independent component analysis, adaptive filtering, and beamforming, have long demonstrated the power of spatially-aware systems in localizing sound sources [61,62] and detecting audio events [63]. However, the majority of spatial filtering techniques require knowledge of the recording setup and are usually based on statistical assumptions that are not always met in real-world conditions, especially when multiple sound sources are present. Considering this, we investigate the efficacy of both single-channel and multi-channel audio in providing useful embeddings for the task of anomaly detection, by selecting the appropriate number of input channels to the RawdNet model, as it was mentioned in Section 3.2.
Regarding the multi-channel approach, our concern lies in the static location of each machine in both the training and testing recording setups, as the system is not adaptively trained to exploit the spatial information of the acoustic scene. So, if the spatial distribution of sound sources is slightly altered at inference, the system performance could degrade, revealing spatial over-fitting.
To address this problem, we formulate a front-end processing strategy that offers spatial invariance in circular microphone arrays, to avoid over-fitting issues arising from the static location of sound sources. In detail, we apply a randomized rotation of the microphone array in the model input, implemented by the roll operator $R$ as:
$$R(x_t) = \left[x^{(C)}_t, x^{(1)}_t, \ldots, x^{(C-1)}_t\right].$$
So, the model input $x_t$ is transformed to $\tilde{x}_t$, as:
$$\tilde{x}_t = R^{a}(x_t),$$
where $a \in \mathbb{Z}$ is a uniformly distributed random variable and $R^{n+1} = R \circ R^{n}$. In this way, we enable the learning of directionally-independent spatial features in the deep neural network by maintaining the inter-channel correlations under a rotation permutation scheme and simulating the random rotation of the microphone array.
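The random array rotation amounts to a circular shift along the channel axis. The sketch below is a minimal illustration under two assumptions: channels are stored in the physical order of the circular array, and the shift is drawn uniformly from {0, ..., C-1}. The function names are hypothetical.

```python
import random

def roll_channels(x, shift):
    """Circularly shift the channel axis of x (a list of C channel signals)."""
    c = len(x)
    shift %= c
    return x[-shift:] + x[:-shift] if shift else list(x)

def random_rotation(x, rng=random):
    """Apply the roll a times, with a drawn uniformly from {0, ..., C-1}."""
    a = rng.randrange(len(x))
    return roll_channels(x, a), a
```

Because only the channel order changes, the relative ordering of neighbouring microphones is preserved, which is what lets the network keep using inter-channel cues while no longer relying on an absolute channel position.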

Deep One-Class Classification
The support vector data description [64] is a method proposed for one-class classification that is closely related to the OC-SVM approach. A hypersphere is calculated to enclose the given data samples and eventually to separate inliers from outliers. This objective can be used to train a neural network and be applied to the learned network representation, comprising the unsupervised Deep SVDD method, as described by [42].
Let $\phi_W : \mathbb{R}^{K} \to \mathbb{R}^{v}$ be a neural network mapping function with parameters $W$. The goal is to estimate the optimal parameters $W$ so that (a) a hypersphere encloses the feature representation of the input data distribution assigned by $\phi_W$ and (b) the volume of this hypersphere in the output space is minimized. At inference, the distance from the center of the hypersphere is employed as the anomaly score of a sample. Consequently, feature representations that lie outside the learned hypersphere are considered anomalous. Alternatively, various similarity metrics can be used to calculate soft anomaly scores.
The Deep SVDD objective function $L_{SVDD}$ is defined as:
$$L_{SVDD} = r_v^2 + \frac{\lambda_v}{m} \sum_{i=1}^{m} \max\left\{0,\ \lVert \phi_W(z_i) - c_v \rVert_2^2 - r_v^2\right\},$$
where $m$ denotes the length of the mini-batch, and the sensitivity trade-off between class representation volume and penalty of outliers is controlled by the hyper-parameter $\lambda_v \in \mathbb{R}^{*}_{+}$. The parameters $c_v \in \mathbb{R}^{v}$ and $r_v \in \mathbb{R}^{*}_{+}$ represent the normality center and radius, respectively.
Similarly to Section 3.2.1, we avoid computing the center and radius parameters over the whole dataset. Instead, $c_v$ and $r_v$ are randomly initialized and are jointly updated through SGD optimization in every mini-batch iteration, using a high and controllable learning rate ($l_c = 0.5$). Thus, the sensitivity of the anomaly detection classifier is determined by the upper bound of the fraction of training errors and the lower bound of the fraction of support vectors [65].
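As a rough illustration of the one-class objective and the distance-based anomaly score, the following sketch assumes a soft-boundary form: the squared radius plus a weighted mean of the squared distances that exceed it. The exact weighting used by the authors is not recoverable from the text, so the placement of `lam_v` and all function names are assumptions.

```python
def sq_dist(u, c):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, c))

def svdd_loss(outputs, center, radius, lam_v):
    """Assumed soft-boundary form:
    r^2 + lam_v * mean(max(0, ||u - c||^2 - r^2))."""
    penalties = [max(0.0, sq_dist(u, center) - radius ** 2) for u in outputs]
    return radius ** 2 + lam_v * sum(penalties) / len(outputs)

def anomaly_score(u, center):
    """Distance of a mapped sample from the hypersphere center."""
    return sq_dist(u, center) ** 0.5

def is_anomalous(u, center, radius):
    """Hard decision: the sample lies outside the learned hypersphere."""
    return anomaly_score(u, center) > radius
```

The soft score from `anomaly_score` is what a threshold-free metric such as AUC would consume; `is_anomalous` shows the hard decision implied by the learned radius.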
The deep one-class classification (DOC) neural network takes as input an aggregated feature vector that concatenates the embedding feature representations z for a decision-level audio segment. In the case of the MIMII dataset [66], the decision-level segments have a duration of 10 s. Thus, the DOC input vector for the i-th sample is given by:
$$\bar{z}_i = z_{i,1} \oplus z_{i,2} \oplus \cdots \oplus z_{i,5},$$
where $\oplus$ denotes the concatenation operator. That is, $\bar{z}_i \in \mathbb{R}^{125}$ is the concatenated vector of the five 25-dimensional feature representations, each corresponding to the embeddings for a two-second segment. The architecture comprises four fully-connected layers with no bias term and the ReLU activation function after all but the last layer. The four layers consist of 63, 32, 32, and $v = 16$ neurons for the given input dimensionality. The DOC model was trained using the Adam optimizer with a learning rate $l_r = 0.001$ on embedding batches of size $m = 128$ for all SNR conditions (6, 0, −6 dB) of a specific machine ID.
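The input aggregation and the bias-free fully connected stack can be sketched as follows. This is an illustrative toy, assuming weights are given as lists of rows (one row per output neuron); the function names are hypothetical and the tiny weights in the usage note are not the trained model.

```python
def concat_embeddings(segment_embeddings):
    """Concatenate per-segment embeddings (e.g. five 25-dimensional vectors)
    into a single DOC input vector."""
    out = []
    for z in segment_embeddings:
        out.extend(z)
    return out

def doc_forward(x, weights):
    """Bias-free fully connected layers with ReLU after all but the last layer.
    weights[k] holds one row of input weights per output neuron."""
    last = len(weights) - 1
    for k, w in enumerate(weights):
        x = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        if k != last:
            x = [max(0.0, v) for v in x]
    return x
```

For example, `doc_forward([2.0, 1.0], [[[1.0, -1.0], [-1.0, 1.0]], [[1.0, 1.0]]])` passes a two-dimensional input through one hidden layer and a single output neuron; the layer sizes quoted in the text would simply use larger weight matrices.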

Experimental Setup
In this section, we describe the experiments conducted to evaluate the proposed approach and provide the essential details of the experimental setup. Experiments were conducted on the malfunctioning industrial machine inspection and investigation (MIMII) dataset [66]. The MIMII dataset includes multichannel recordings of twenty-eight industrial machines, which fall into four machine type categories (valve, pump, fan, slide rail). For each machine type, recordings of four individual machines (ID: 0, 2, 4, and 6) are available. A single label is assigned to each audio segment depending on the condition of the machine, namely normal or abnormal. Recordings are mixed at variable signal-to-noise ratios (6, 0, and −6 dB) in simulated industrial environments and are provided in decision-level segments of 10 s. For a certain signal-to-noise ratio (SNR) γ dB, the noise-mixed data of each machine were created according to the following equation [66].
where t is the time index, i is the channel index, and s and u are the clean target machine and background noise 10-s segments, respectively.
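A minimal sketch of mixing a clean segment with background noise at a target SNR is given below. It assumes a single scalar gain applied to the noise and powers estimated over the whole segment; the exact power estimation used in [66] may differ, and the function names are hypothetical.

```python
import math

def power(x):
    """Mean power of a signal segment."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(s, u, snr_db):
    """Scale noise u so that 10*log10(P_s / P_scaled_u) equals snr_db,
    then add it to the clean signal s. Returns the mixture and the gain."""
    gain = math.sqrt(power(s) / (power(u) * 10 ** (snr_db / 10)))
    return [a + gain * b for a, b in zip(s, u)], gain
```

For a multi-channel recording, the same gain would typically be applied to every channel so that inter-channel level differences of the noise are preserved; that choice is also an assumption here.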
The sound recordings were obtained by a circular array of eight microphones (C = 8); each sample contains eight separate channels for each audio segment. The recorded machines were spatially separated in the recording setup, making it useful for evaluating both single-channel and multi-channel-based approaches. In this study, we investigate the effectiveness of both single-channel and multi-channel approaches and propose a spatial invariance front-end for processing multi-channel raw audio using deep CNNs.
The MIMII dataset was split into training and test sets using stratified linear sampling (no shuffling). The development set, consisting of training and validation sets, includes 70% and 10% of the normal data samples of each machine ID, respectively. The rest of the normal samples are used for testing along with all the abnormal samples of the dataset.
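The split can be sketched as below, assuming that "linear" means taking the clips of each machine ID in their stored order. The function names, the dictionary layout, and the 0/1 label encoding are hypothetical.

```python
def linear_split(samples, train_pct=70, val_pct=10):
    """Split in stored order (no shuffling): first 70% for training,
    next 10% for validation, remainder for testing."""
    n = len(samples)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

def split_by_machine(normal_by_id, abnormal_by_id):
    """Stratify by machine ID; all abnormal clips go to the test set.
    Test items are (clip, label) pairs with label 1 for abnormal."""
    splits = {}
    for mid, normal in normal_by_id.items():
        train, val, test_normal = linear_split(normal)
        test = [(c, 0) for c in test_normal]
        test += [(c, 1) for c in abnormal_by_id.get(mid, [])]
        splits[mid] = {"train": train, "val": val, "test": test}
    return splits
```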
Effectiveness of RawdNet embeddings. To evaluate the proposed RawdNet model in extracting useful embeddings for the task of anomaly detection, a standard one-class SVM (OC-SVM) is employed along with the DOC classifier described in Section 3.3. Moreover, the OC-SVM model is used as a baseline to evaluate the performance of the proposed DOC and demonstrate the benefits of employing a neural network architecture as the back-end anomaly detector.
Effects of multi-channel audio. We consider the effectiveness of both single- and multi-channel approaches, to examine the potential of one-dimensional CNNs in extracting useful spatial features. For the single-channel approach, the first audio channel was employed as the model input, while for the multi-channel approach, all eight channels were employed.
Effects of the spatial invariance front-end. In Section 3.2.2, we propose to train the multi-channel RawdNet model on a front-end that aims to achieve spatial invariance. This is achieved by inter-changing the configuration of audio channels, simulating the rotation of circular arrays. Hence, the model performs the spatial filtering before extracting the latent embeddings, to reduce the dependence on a static microphone configuration.

Results
The proposed approach is objectively evaluated using the area under the receiver operating characteristics curve (AUC) metric on the soft anomaly scores of each classifier. The performance of the models is validated against two unsupervised anomaly detection models from recent works, which operate on the same dataset and configuration. The first is an autoencoder (AE) neural network model provided as a baseline model by the authors of the MIMII dataset [66]. The latter is a deep convolutional autoencoder (Conv. AE) with a dense-bottleneck structure from our previous work [12]. In an ablation study experiment, different configurations of the embedding extraction model (RawdNet) and one-class classifier (OC-SVM and DOC) are evaluated for their performance contribution, as described in Section 3.4.
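For reference, the AUC of a set of soft anomaly scores can be computed directly from its rank interpretation: the probability that a randomly chosen anomalous clip receives a higher score than a randomly chosen normal clip. The sketch below uses the quadratic pairwise form for clarity; the function name is hypothetical and ties are counted as one half.

```python
def auc(scores_normal, scores_anomalous):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for a in scores_anomalous:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(scores_anomalous) * len(scores_normal))
```

A value of 1.0 means every anomalous clip scores above every normal clip, while 0.5 corresponds to chance-level ranking.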
The results for each machine type are shown in Table 1. Specifically, four individual machines with IDs of 0, 2, 4, 6 are given for each machine type (Valve, Pump, Fan, Slider). AUC values are averaged over the individual machines and are provided in a single value per SNR condition to deliberately demonstrate the robustness of each method.
The single-channel approach, denoted by RawdNet(S), yielded improved (mean) AUC scores using both the DOC (82.4%) and standard OC-SVM (79.3%) back-end classifiers, compared to the autoencoder-based models (73.2% and 77.1%). Accordingly, significant improvements over all SNR conditions are observed for the Valve (+25.4%) and Pump (+11.6%) machine types by the RawdNet(S)-DOC model over existing methods, while the indicated performance on the Fan (+3.4%) and Slider (−3.6%) types is comparable to the unsupervised methods. Hence, the effectiveness of the proposed approach was demonstrated in this scenario, substantially improving the AD performance in cases where existing unsupervised deep learning methods struggle.
Table 1. [...] 0, 2, 4, 6). The proposed single-channel (S) and multi-channel (M) convolutional neural embedding systems are combined with the classical OC-SVM algorithm and the DOC back-end for anomaly detection. The incorporation of the spatial-invariance front-end is denoted by the (R) indication.
The multi-channel approach, denoted as RawdNet(M), demonstrated the potential of exploiting all available audio channels, noting a mean AUC increase of 4.3%. However, the mean AUC difference is mainly affected by the Slider class, where the multi-channel approach outperformed the RawdNet(S) model (+20.3%). The RawdNet(M) model achieved slightly lower performance than the single-channel model variant for the Valve, Pump, and Fan machine types. Moreover, the control decision model (OC-SVM) achieved a slightly lower (78.8%) AD score than that of the single-channel approach.

It is worth noting that the model did not provide the expected performance increase for the amount of information supplied, indicating that it could not utilize the spatial properties of the audio signals. One possible explanation is that the architecture of the CNN is incapable of capturing the intended spatial features, inevitably leading to high input redundancy. Another explanation is the emergence of potential overfitting issues due to the higher input dimensionality. To remedy this, we investigated a front-end input processing strategy based on the circular microphone array configuration used in the recording of the MIMII dataset.
The RawdNet(M/R) model consists of the same multi-channel encoder architecture that was trained on the spatial invariance front-end. This approach improved the performance of the latter model by 4.3% for the DOC approach, reaching a mean AUC score of 91.0%. Nevertheless, RawdNet(M/R) showed a 7.6% mean increase compared to the RawdNet(M) model and achieved significantly better performance than the other model variants in the majority of machine IDs, as shown in Figure 3. The model proved to be exceptionally robust to noisy environments, outperforming competitor models at −6 dB SNR (87.3%). Additionally, the effect of noise was less evident in the RawdNet(M/R) model performance, resulting in lower performance reduction and variance for different SNR conditions.
Class-dependent anomaly detection performance is illustrated in Figure 4, where different error types are considered. The parametric plots of the false negative rate (FNR) and false positive rate (FPR) are given by:
$$\mathrm{FNR}(\tau) = \int_{-\infty}^{\tau} h_0(s)\, ds, \qquad \mathrm{FPR}(\tau) = \int_{\tau}^{\infty} h_1(s)\, ds,$$
where $\tau$ is the decision threshold and $h_0$ and $h_1$ denote the genuine and impostor match score distributions of the anomaly class predictions, respectively. The spatial invariance front-end also contributes to lower error rates for the valve, pump, and fan classes, while no significant difference is observed between the RawdNet(M/R) and RawdNet(M) models for the slider class.
Furthermore, a comparison between the obtained results and those presented by [48] was conducted. Ref. [48] proposed a novel similarity function for AD (SPIDERnet) and validated it against three existing methods on a subset of the MIMII dataset, including three individual machines (Fan, Pump, Slider) at 0 dB SNR. The baseline methods include an autoencoder neural network (AE) as used by [40], a mean-squared error (MSE) similarity function that memorizes known anomalous functions [67], and a prototypical network-based (PROTOnet) AD framework [68]. According to the experiments, the SPIDERnet architecture achieved state-of-the-art AD performance.
The authors employed the singlechannel audio spectrogram coefficients as the input features to all models. Table 2 shows that the proposed approach significantly outperforms all existing methods for the two out of the three tested machines. The AD performance in terms of the AUC metric is increased by up to 7% and 5.3% for the Pump (ID:06) and Slider (ID:02) machines, respectively. The proposed method did not perform comparatively for the Fan (ID:02) class. Although it provided better performance than AE and PROTOnet methods, SPIDERnet and MSE similarity functions achieved significantly better performance (+7.8% and +12.4%, respectively). These results are consistent with those of Table 1, in which the Conv. AE and AE models performed adequately on the Fan class at 0 and 6 dB SNRs, using a spectrogram representation as input features. Table 2. AUC scores. Anomaly detection performance of the proposed method compared to those proposed by [48]. IDs 02, 06, and 02 of Fan, Pump, and Slider classes, respectively. All results correspond to the 0 dB SNR condition.
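To make the error measures used in this section concrete, the following minimal Python sketch computes the AUC and the FNR/FPR pair at a given threshold from per-clip anomaly scores. It assumes that larger scores indicate more anomalous clips and that the anomalous class is treated as the positive class. The function names, score values, and threshold are illustrative and are not taken from the original experiments.

```python
def error_rates(normal_scores, anomalous_scores, threshold):
    """Return (FNR, FPR) when a clip is flagged as anomalous if score > threshold.

    Assumption: larger scores mean "more anomalous".
    """
    # False negatives: anomalous clips that are not flagged.
    fnr = sum(s <= threshold for s in anomalous_scores) / len(anomalous_scores)
    # False positives: normal clips that are flagged.
    fpr = sum(s > threshold for s in normal_scores) / len(normal_scores)
    return fnr, fpr


def auc(normal_scores, anomalous_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))


# Hypothetical anomaly scores, for illustration only.
normal = [0.10, 0.20, 0.35, 0.40]
anomalous = [0.30, 0.60, 0.70, 0.90]

fnr, fpr = error_rates(normal, anomalous, threshold=0.37)
print(f"AUC = {auc(normal, anomalous):.3f}, FNR = {fnr:.2f}, FPR = {fpr:.2f}")
# prints "AUC = 0.875, FNR = 0.25, FPR = 0.25"
```

Sweeping the threshold over all observed scores and plotting FNR against FPR yields a parametric error curve of the kind discussed for Figure 4.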

The effectiveness of RawdNet discriminative embeddings is demonstrated in Figure 5. In this experiment, we reduce the dimensionality of the embeddings and train the RawdNet model on the same data but with different objectives. It is apparent that the center loss term of the training objective forces even the challenging two-dimensional embeddings of each class to converge to the same point in the Euclidean sense and to exhibit significant inter-class discriminability, compared to the SoftMax loss.

The enhanced performance of the model in most conditions can possibly be attributed to the extraction of more salient spatial features by the first convolutional layer. To demonstrate this, Figure 6 illustrates the spectral and spatial characteristics of the trained filters of the first convolutional layer, including the frequency and phase response of an exemplar multi-channel filter. Most of the thirty-two filters feature a narrow bandwidth around one or more spectral regions. Intra-kernel frequency deviations of multi-channel filters are rare or absent, in contrast to the deviations in the phase spectrum. Thus, it can be inferred that the first convolutional layer is trained to exploit spatial information by emulating the responses of a multi-phase filterbank that performs spectral and spatial analysis. For this reason, we attempt to visually interpret the spatial response patterns of the thirty-two filters of the first RawdNet(M/R) convolutional layer, by simulating the recording setup of the dataset with a sensor array of the same configuration [69]. The polar patterns of the initial layer show that spatial filtering is performed with different directivity patterns, corresponding to specific spectral regions.
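The observation that the channels of a multi-channel filter share a frequency response but differ in phase can be illustrated with a toy example. The sketch below evaluates the magnitude and phase response of each channel of a synthetic two-channel kernel by a direct discrete-time Fourier sum. The kernel (a Hann-windowed cosine with a one-sample delay on the second channel), its length, and the number of frequency bins are assumptions chosen for illustration; they are not trained RawdNet filters.

```python
import cmath
import math


def fir_response(taps, n_bins=64):
    """Magnitude and phase of an FIR filter at n_bins frequencies in [0, pi)."""
    magnitudes, phases = [], []
    for k in range(n_bins):
        omega = math.pi * k / n_bins  # normalized angular frequency (rad/sample)
        h = sum(t * cmath.exp(-1j * omega * n) for n, t in enumerate(taps))
        magnitudes.append(abs(h))
        phases.append(cmath.phase(h))
    return magnitudes, phases


def wrap_phase(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


# Hypothetical two-channel kernel: a Hann-windowed cosine centred at pi/4,
# with the second channel delayed by one sample.
length = 31
base = [
    (0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1)))
    * math.cos(math.pi / 4.0 * n)
    for n in range(length)
]
kernel = [base + [0.0], [0.0] + base]

mag0, ph0 = fir_response(kernel[0])
mag1, ph1 = fir_response(kernel[1])

# Locate the pass-band of the first channel and compare both channels there.
peak = max(range(len(mag0)), key=mag0.__getitem__)
omega_peak = math.pi * peak / len(mag0)
phase_shift = wrap_phase(ph1[peak] - ph0[peak])

print(f"peak frequency: {omega_peak:.3f} rad/sample")
print(f"magnitude difference at peak: {abs(mag0[peak] - mag1[peak]):.2e}")
print(f"inter-channel phase shift at peak: {phase_shift:.3f} rad")
```

In this toy case both channels have the same magnitude response, while the one-sample delay appears only as a frequency-dependent phase shift, which is the kind of inter-channel cue a spatial filter can exploit.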

Discussion
In this study, we emphasize the importance of temporal sound characteristics in determining the condition of a machine via deep learning. Experiments are conducted on a large real-world benchmark dataset, and each component of the proposed approach is individually evaluated in terms of its contribution to the overall AD performance. Visual insights into the proposed method are also provided, through the illustration of the spatial and spectral properties of the learned convolutional filters and the demonstration of exemplar cluster formation of the network's embeddings.
We process raw audio directly by employing deep CNNs, which have recently demonstrated vast potential in modeling high-dimensional data. We also show that multi-channel audio can be exploited to extract valuable spatial features, without the need to specify the exact microphone configuration. Although this increases the data redundancy in the network, it also enables the search for particular short-duration temporal patterns of target sounds.
The architecture of the RawdNet model primarily consists of convolutional and downsampling layers, and no dropout strategy was found to improve training. The use of layer normalization instead of batch normalization played a significant role in the model performance, drastically reducing overfitting in the initial experiments. Layer normalization seems to better stabilize the input of each hidden convolutional layer compared to batch normalization and prevents learning under distribution shifts [70]. The embedding learning process employed cross-entropy combined with the recently introduced center loss objective. In contrast to [58], we propose to use center loss in a latent state of the network. The combination of the two training objectives in different network states is a vital step in obtaining compact and discriminative embeddings along with stable training.
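The combination of the two objectives can be sketched as follows. This is a minimal, dependency-free illustration, assuming that cross-entropy is computed on the output logits and that the center loss is computed on a latent embedding. The weighting factor `lam`, the fixed class centers, and all numeric values are hypothetical; in practice the centers would be learned or updated during training, and batch averaging is one common convention rather than a detail confirmed by the text.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def center_loss(embedding, center):
    """Half the squared Euclidean distance between an embedding and its class center."""
    return 0.5 * sum((e - c) ** 2 for e, c in zip(embedding, center))


def joint_loss(batch, centers, lam=0.1):
    """Batch mean of cross-entropy plus lam times center loss.

    batch: list of (latent_embedding, output_logits, label) triples.
    lam: hypothetical weighting factor, not a value from the paper.
    """
    total = 0.0
    for embedding, logits, label in batch:
        total += cross_entropy(logits, label)
        total += lam * center_loss(embedding, centers[label])
    return total / len(batch)


# Hypothetical two-class example with two-dimensional latent embeddings.
centers = {0: [0.0, 0.0], 1: [4.0, 4.0]}
compact = [([0.1, -0.1], [2.0, 0.0], 0), ([3.9, 4.1], [0.0, 2.0], 1)]
spread = [([1.5, -1.5], [2.0, 0.0], 0), ([2.5, 5.5], [0.0, 2.0], 1)]

print(f"compact embeddings: {joint_loss(compact, centers):.4f}")
print(f"spread embeddings:  {joint_loss(spread, centers):.4f}")
```

With identical logits in both batches, the cross-entropy terms are equal, so the difference between the two printed values comes only from the center loss term, which penalizes embeddings that lie far from their class center.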
The experiments in Section 4 demonstrate that the proposed two-stage approach enables the accurate detection of unknown anomalies and is robust under adverse noise conditions. Previous research in this field mainly employed mel-scaled or linear spectrogram coefficients as input features for deep learning-based anomaly detection [7,39,48,67,71,72]. Here, the enhanced performance in most conditions can possibly be attributed to the extraction of more salient spatial and spectro-temporal features by the one-dimensional CNN.
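The two-stage idea (embeddings first, one-class normality modeling second) can be illustrated with a deliberately simplified stand-in. The sketch below does not implement the deep one-class network used in this work; it replaces the second stage with a centroid-distance score, assuming that embeddings of normal clips are already available. All embedding values, names, and the threshold rule are hypothetical.

```python
import math


def centroid(embeddings):
    """Component-wise mean of a list of equal-length embedding vectors."""
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]


def anomaly_score(embedding, center):
    """Euclidean distance from the center of the normal embeddings."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(embedding, center)))


# Stage 1 (assumed done elsewhere): an encoder maps audio clips to embeddings.
# The vectors below are hypothetical embeddings of normal clips of one machine.
normal_train = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9], [1.0, 1.2]]

# Stage 2 stand-in: model normality by the centroid of the normal embeddings
# and use the largest training distance as a simple decision threshold.
center = centroid(normal_train)
threshold = max(anomaly_score(e, center) for e in normal_train)

for name, clip in [("normal-like", [1.05, 1.0]), ("anomalous-like", [2.5, -0.5])]:
    score = anomaly_score(clip, center)
    print(f"{name}: score = {score:.3f}, flagged = {score > threshold}")
```

Because only normal data are needed to fit the second stage, compact and discriminative embeddings from the first stage directly determine how well unseen anomalies separate from the normal cluster.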
The superiority of the proposed architecture for detecting faults in Valve and Slider machine types indicates that one-dimensional CNNs are capable of capturing particular short-duration temporal patterns of target sounds. The performance gain is less evident or absent in cases where spectral patterns are more important for detecting anomalies (temporal modulation patterns are absent or not relevant) and high SNR conditions are expected at inference. In these cases (e.g., Fan, Pump), the AD task is better addressed by spectral analysis.
One limitation of the proposed method is that, to perform the normality modeling for a novel machine, the model must be trained with all the available data, which leads to a time-consuming training process. This can potentially be addressed by training a large model on a dataset with numerous classes and assessing the performance in the AD task for a new machine without retraining. In addition, since the two stages of the proposed approach are independent, the AD performance cannot be easily monitored during the training of the RawdNet model. Practically, this implies that a reduction in the proposed loss of the RawdNet model does not necessarily lead to a direct performance increase.
Future studies on audio-driven AD should explore the potential of end-to-end models for semi-supervised AD, as well as the unsupervised discrimination between different conditions of a machine in clustering-free approaches. Furthermore, adaptive front-ends and trainable spatial filtering methods for deep learning-based audio recognition should be further investigated.

Conclusions
In this study, we investigate the extraction of discriminative embeddings for a wide set of machinery sounds from multi-channel raw audio. Machine embeddings are learned by a deep convolutional neural network and are transferred to a deep one-class neural network to detect faults on individual machines. Experimental results show that the proposed approach can consistently model the normal conditions of various machines and accurately detect system faults. The proposed RawdNet model outperforms state-of-the-art audio-driven fault detection methods in most tested cases and is significantly more robust in noisy environments. Additionally, one-dimensional convolutional neural networks proved capable of extracting valuable spatial and spectro-temporal information from multi-channel audio, which had the effect of substantially improving the robustness of the latent discriminative embeddings. Finally, the proposed training objective of the two neural networks can account for the solid performance of a one-class classifier, by jointly maximizing the similarity and density of the normal data distribution.

Data Availability Statement:
The code for reproducing the experiments and the detailed experimental results are available at a dedicated online repository https://github.com/jthois/semi-supervisedaudio-based-machine-condition-monitoring (accessed on 30 September 2021).