Article

Anomalous Sound Detection by Fusing Spectral Enhancement and Frequency-Gated Attention

Zhongqin Bi, Jun Jiang, Weina Zhang and Meijing Shan
1 Faculty of Artificial Intelligence, Shanghai University of Electric Power, Shanghai 201306, China
2 Institute of Information Science and Technology, East China University of Political Science and Law, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(3), 530; https://doi.org/10.3390/math14030530
Submission received: 29 December 2025 / Revised: 27 January 2026 / Accepted: 31 January 2026 / Published: 2 February 2026

Abstract

Unsupervised anomalous sound detection aims to learn acoustic features solely from the operational sounds of normal equipment and identify potential anomalies based on these features. Recent self-supervised classification frameworks based on machine ID metadata have achieved promising results, but they still face two challenges in industrial acoustic scenarios. First, Log-Mel spectrograms tend to weaken high-frequency details, leading to insufficient spectral characterization. Second, when normal sounds from different machine IDs are highly similar, classification constraints alone struggle to form clear intra-class structures and inter-class boundaries, resulting in false positives. To address these issues, this paper proposes FGASpecNet, an anomaly detection model integrating spectral enhancement and frequency-gated attention. For feature modeling, a spectral enhancement branch is designed to explicitly supplement spectral details, while a frequency-gated attention mechanism highlights key frequency bands and temporal intervals conditioned on temporal context. Regarding loss design, a joint training strategy combining classification loss and metric learning loss is adopted, in which multi-center prototypes enhance intra-class compactness and inter-class separability, improving detection performance in scenarios with similar machine IDs. Experimental results on the DCASE 2020 Challenge Task 2 for anomalous sound detection demonstrate that FGASpecNet achieves 95.04% average AUC and 89.68% pAUC, validating the effectiveness of the proposed approach.

1. Introduction

With the rapid advancement of industrial technologies, manufacturing is accelerating its transition toward intelligent, networked, and digital operations. The stable operation and efficient production of industrial equipment place higher demands on routine maintenance [1,2]. Against this backdrop, sound-based detection has been widely used for equipment condition monitoring, fault early warning, and predictive maintenance tasks. Its advantages include lower cost, flexible deployment, coverage of visually inaccessible areas such as obstructed zones and equipment interiors, and heightened sensitivity to subtle operational changes—all contributing to enhanced operational reliability and efficient production support [3]. Anomalous sound detection (ASD) [4] aims to automatically identify potential faults by analyzing acoustic differences between normal and abnormal equipment states, leveraging sound signal processing and machine learning methods. Relevant tasks in the DCASE challenge have further advanced ASD research and application in industrial settings [5], attracting significant attention from both academia and industry.
From a machine learning perspective, anomaly detection techniques are typically categorized into supervised and unsupervised learning. Supervised learning methods rely on labeled normal and anomalous samples for training, whereas unsupervised approaches primarily leverage unlabeled data or datasets containing only normal samples to model normal patterns and identify anomalies based on deviation metrics. In real industrial scenarios, collecting and labeling anomalous samples is often costly, especially as fault states during equipment operation are difficult to reproduce. Consequently, recent research focus has gradually shifted from supervised methods reliant on large volumes of labeled anomalous samples toward unsupervised methods trained solely on normal data. This trend aligns more closely with the practical demands of industrial applications.
In unsupervised ASD research, early approaches predominantly relied on autoencoders (AEs) and their variants [6,7,8,9,10]. Typical methods include interpolation deep neural networks (IDNNs) [6] and ID-based conditional autoencoders [7]. These approaches train solely on normal data, minimizing the reconstruction error of normal samples and directly using this error as the anomaly score during detection. However, this paradigm has limitations. First, training exclusively on normal samples makes the model prone to overfitting the training data, resulting in insufficient sensitivity to unseen anomaly patterns and thus compromised generalization. Second, in practice, models must be trained individually for each machine ID, increasing the system's training and maintenance costs. Recent studies have also begun to emphasize model lightweighting and computational efficiency while maintaining detection performance, thereby improving usability in resource-constrained scenarios [11].
To overcome the limitations of reconstruction methods, recent research has increasingly focused on utilizing audio file metadata—primarily machine IDs—to construct feature learning and self-supervised classification frameworks [12,13,14,15,16,17,18,19,20,21,22]. These approaches treat metadata as pseudo-labels, training classification networks to distinguish between different machines. During inference, the classifier's output category probabilities or derived anomaly scores are used to determine whether unseen sounds deviate from normal patterns. Meanwhile, recent work has begun addressing the issue of inconsistent anomaly scores across domains from a domain generalization perspective, enhancing comparability and stability between domains through score calibration [23]. Related studies indicate that these metadata-based self-supervised methods generally outperform traditional unsupervised methods based on autoencoders [5]. However, existing work still has shortcomings in two aspects. First, most methods adopt the Log-Mel spectrogram as the fundamental acoustic representation. Although this feature approximates human auditory perception and is simple to implement, the Mel filter bank is dense at low frequencies and sparse at high frequencies [24]. This may weaken the high-frequency details associated with anomalies, making it difficult to fully characterize the spectral information. Second, mainstream loss functions are predominantly based on Softmax and its angular-margin variants [15,18,19,25,26] or contrastive learning approaches [21,22]. While these improve the distribution structure of the embedding space and enhance feature discrimination to some extent, they struggle to learn clear and robust decision boundaries when normal samples from different machine IDs exhibit high similarity.
To address the aforementioned issues, this paper proposes a frequency-gated enhancement network, FGASpecNet, based on a machine ID self-supervised classification framework. Building upon baseline methods, the model achieves synergistic improvements at two levels: feature modeling and loss design. At the feature level, it incorporates a spectral enhancement branch SpecNet and a frequency-gated attention module to supplement spectral information and highlight key frequency bands and temporal intervals. At the loss level, it combines the Noisy-ArcMix classification loss with the SoftTriple metric learning loss [27] to enhance intra-class compactness and inter-class separability in the embedding space. This clarifies the category structure, thereby improving anomaly detection performance for machines with highly similar acoustic characteristics. Experimental results demonstrate that the proposed model outperforms existing baseline models in typical ASD scenarios, exhibiting robust detection performance and generalization capabilities.
In summary, FGASpecNet has several innovations as follows:
  • SpecNet spectral enhancement branch. Built upon existing Log-Mel spectrograms and machine ID self-supervised classification frameworks, a dedicated spectral enhancement branch SpecNet is developed to explicitly model spectral features and enrich discriminative information in spectral representations.
  • Frequency-Gated Attention (FGA) Module. The FGA module adaptively adjusts the weights of Log-Mel spectrograms and SpecNet spectral features across time–frequency units based on temporal context provided by TAgram. This highlights critical frequency bands and temporal intervals associated with anomalies, yielding more discriminative spectral representations.
  • SoftTriple metric learning loss. SoftTriple metric learning loss is jointly optimized with Noisy-ArcMix classification loss. This further compresses sample embeddings within the same machine ID through multi-center prototype constraints while widening the gap between different machine IDs. This clarifies the category structure in the embedding space, enhancing the ability to distinguish machines with highly similar acoustic features.
  • Comprehensive evaluation. Systematic experiments conducted on the DCASE 2020 Task 2 dataset demonstrate that the proposed FGASpecNet achieves 95.04% average AUC and 89.68% pAUC, representing improvements of 0.97% and 0.72% over the baseline, respectively, validating the effectiveness of the proposed approach.

2. Materials and Methods

This section details the proposed ASD model FGASpecNet, as illustrated in Figure 1. The model comprises three components: an acoustic feature enhancement module, a frequency-gated attention module, and a joint loss optimization module. The model takes the raw audio signal x as input and follows the three processing paths of the baseline approach: first, TgramNet extracts the time-domain feature Tgram; second, x undergoes a short-time Fourier transform (STFT), followed by Mel filtering and a logarithmic transformation, to obtain the Log-Mel spectrogram; finally, the temporal attention branch TAgram is constructed from the Log-Mel spectrogram. This paper then introduces the spectral enhancement branch SpecNet, which generates the spectrogram Specgram through spectral enhancement convolutions and residual connections. Building upon this, a frequency-gated attention module is proposed, which performs time–frequency gating on the Log-Mel spectrogram and Specgram conditioned on the temporal context provided by TAgram; after weighted fusion, the FGA-Spec feature is obtained. Finally, the four feature streams are concatenated along the channel dimension and fed into MobileFaceNet to learn embedding representations. During training, the model adds the SoftTriple metric learning loss on top of the Noisy-ArcMix classification loss to further enhance intra-class compactness and inter-class separation. Subsequent sections elaborate on the structure and implementation details of each module.

2.1. Acoustic Feature Enhancement

The acoustic feature enhancement module aims to model raw audio signals across both time and frequency domains, providing comprehensive and complementary acoustic representations for subsequent feature fusion and discriminative learning. To achieve this, this paper employs a four-branch feature modeling scheme, including Tgram, Log-Mel, TAgram, and Specgram. Among these, the Log-Mel spectrogram serves as a widely adopted foundational time–frequency representation, delivering robust spectral descriptions; Tgram supplements temporal structural information from the time domain; TAgram highlights key temporal segments through a temporal attention mechanism and provides temporal context; Specgram explicitly enhances spectral details to compensate for potential loss of fine-grained information due to frequency compression in the Log-Mel spectrogram. Subsequently, the four feature streams are fused through a frequency-gated attention module for embedding learning, thereby enhancing the model’s ability to distinguish different machine IDs and improve anomaly detection performance. Specifically, Section 2.1.1, Section 2.1.2 and Section 2.1.3 first introduce the three feature branches used in the baseline, while Section 2.1.4 focuses on the structural design of SpecNet, the spectral enhancement branch designed in this paper.

2.1.1. Tgram

TgramNet [16] is a temporal feature extraction network based on convolutional neural networks. It takes the raw audio signal x as input to extract temporal features, with its structure shown in Figure 2 and Table 1 (where c, k, s, p, and n denote the number of channels, the convolutional kernel size, the stride, the padding size, and the number of layers, respectively). First, TgramNet applies a one-dimensional convolution layer with a large kernel; its number of channels, kernel size, and stride are set to match the number of Mel filter banks used in Log-Mel spectrogram computation, the STFT window length, and the frame shift, respectively. The network then stacks three identical convolutional blocks, each consisting of layer normalization, a Leaky ReLU activation, and a one-dimensional convolution with a smaller kernel; these blocks preserve the output dimension of the temporal features. The final temporal feature Tgram can be expressed as follows:
$$ F_T = \mathrm{TN}(x) $$
Here, $\mathrm{TN}(\cdot)$ denotes the TgramNet used for extracting temporal features, and $F_T \in \mathbb{R}^{M \times T}$ has the same dimensionality as the Log-Mel spectrogram.
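To make the configuration in Table 1 concrete, the following PyTorch sketch assembles TgramNet from the listed layers. It is a minimal sketch rather than the authors' released implementation: applying LayerNorm over the time axis assumes the fixed frame count T = 313 used in this paper's setup (Section 3.1.1), and the Leaky ReLU slope is left at the PyTorch default.

```python
import torch
import torch.nn as nn

class TgramNet(nn.Module):
    def __init__(self, n_mels=128, n_frames=313, n_blocks=3):
        super().__init__()
        # Large-kernel Conv1D matched to the STFT settings
        # (c = 128 Mel bins, k = 1024 window length, s = 512 hop, p = 512).
        self.front = nn.Conv1d(1, n_mels, kernel_size=1024, stride=512, padding=512)
        # Three identical blocks: LayerNorm, Leaky ReLU, small-kernel Conv1D.
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.LayerNorm(n_frames),
                nn.LeakyReLU(),
                nn.Conv1d(n_mels, n_mels, kernel_size=3, stride=1, padding=1),
            )
            for _ in range(n_blocks)
        ])

    def forward(self, x):                  # x: (B, 1, L) raw waveform
        return self.blocks(self.front(x))  # F_T: (B, 128, T)

F_T = TgramNet()(torch.randn(2, 1, 160000))  # 10 s at 16 kHz -> (2, 128, 313)
```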

2.1.2. Log-Mel

The Log-Mel spectrogram is a widely used acoustic feature representation that transforms one-dimensional time-domain signals into a time–frequency representation aligned with human auditory perception. It finds extensive application in tasks such as ASD. This feature not only captures the energy distribution across different frequencies but also smooths the frequency axis based on the Mel scale, demonstrating strong robustness against noise and environmental variations [28].
To extract the Log-Mel spectrogram, the input audio signal $x \in \mathbb{R}^{1 \times L}$, a single-channel signal of length $L$, is first divided into frames. A window function is applied to each frame, followed by the STFT to obtain $\mathrm{STFT}(x) \in \mathbb{C}^{B \times T}$, where $B$ denotes the number of frequency bins and $T$ the number of time frames. Given an audio sequence $x[m]$, the STFT is computed as follows:
$$ X[n,k] = \sum_{m=n-(N_w-1)}^{n} x[m]\, \omega[n-m]\, e^{-j 2\pi k m / N} $$
Here, $m$ denotes the temporal sample index; $n$ is the temporal frame index, indicating the current frame's position on the time axis; $N_w$ is the window length; $\omega[\cdot]$ is the window function, typically a Hann window; $k$ is the frequency index; and $N$ denotes the fast Fourier transform (FFT) length. We denote this process as $\mathrm{STFT}(x)$, which maps the audio signal from the time domain to the frequency domain. The magnitude of the resulting complex spectrogram is squared to obtain the power spectrogram $|\mathrm{STFT}(x)|^2$, which serves as input to the Mel filter bank. The Mel filter bank maps linear frequency to the Mel frequency scale. Let $f$ denote the linear frequency; its corresponding Mel frequency can be expressed as follows:
$$ \mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) $$
A filter bank consisting of 128 triangular Mel filters performs a weighted summation of the power spectrogram along the frequency axis. Denoting this Mel filter bank as $H_M$, the resulting Log-Mel spectrogram $F_{LM} \in \mathbb{R}^{M \times T}$ is calculated as follows:
$$ F_{LM} = \log\!\left(H_M \cdot |\mathrm{STFT}(x)|^2\right) $$
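This front end is available directly in torchaudio. Below is a minimal sketch under the settings of Section 3.1.1 (window length 2048, hop 512, 128 Mel filters) and the 16 kHz sampling rate of the DCASE 2020 recordings; the small epsilon inside the logarithm is an illustrative numerical-stability choice, not specified in the text.

```python
import torch
import torchaudio

# Mel front end matching the paper's STFT/Mel settings.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=2048, hop_length=512, n_mels=128, power=2.0
)

def log_mel(x: torch.Tensor) -> torch.Tensor:
    """x: (1, L) waveform -> (128, T) Log-Mel spectrogram."""
    return torch.log(mel_transform(x) + 1e-8).squeeze(0)

x = torch.randn(1, 160000)   # 10 s clip at 16 kHz
F_LM = log_mel(x)            # -> roughly 128 x 313, as in Section 3.1.1
```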

2.1.3. TAgram

As shown in Figure 3, to further highlight key temporal regions [19] in the Log-Mel spectrogram, a temporal attention module processes the features. This module takes the Log-Mel spectrogram $F_{LM} \in \mathbb{R}^{M \times T}$ as input. After transposition, it applies max pooling and average pooling along the frequency axis, yielding a one-dimensional temporal attention vector:
$$ s_t(F_{LM}) = \mathrm{AvgPool}(F_{LM}^{\mathsf{T}}) \oplus \mathrm{MaxPool}(F_{LM}^{\mathsf{T}}) $$
Here, $\oplus$ denotes element-wise addition, summing the two time series obtained from average pooling and max pooling at the same temporal positions to capture the importance of each frame. Subsequently, $s_t(F_{LM}) \in \mathbb{R}^{T \times 1}$ is replicated along the frequency dimension to obtain the temporal attention map, which is then normalized via the sigmoid function. Finally, the normalized temporal attention map is transposed back to the dimensions of the Log-Mel spectrogram and multiplied element-wise with $F_{LM}$ to obtain the feature $F_{TA}$ incorporating temporal attention:
$$ F_{TA} = \sigma\!\left(s_t(F_{LM}^{\mathsf{T}})\right) \otimes F_{LM} $$
Here, $\sigma(\cdot)$ denotes the sigmoid function, $\otimes$ represents element-wise multiplication, and $F_{TA} \in \mathbb{R}^{M \times T}$ denotes the TAgram feature enhanced by temporal attention.
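Because the pooling, gating, and broadcasting steps above reduce to a few tensor operations, the TAgram branch can be sketched directly in PyTorch. This is a minimal sketch; the batched shapes are assumptions for illustration.

```python
import torch

def tagram(F_LM: torch.Tensor) -> torch.Tensor:
    """F_LM: (B, M, T) Log-Mel -> temporally re-weighted (B, M, T)."""
    # Average and max pooling over the Mel (frequency) axis, summed element-wise.
    s_t = F_LM.mean(dim=1) + F_LM.amax(dim=1)    # (B, T)
    # Sigmoid gate, broadcast back over the frequency dimension.
    gate = torch.sigmoid(s_t).unsqueeze(1)       # (B, 1, T)
    return gate * F_LM                           # F_TA

F_TA = tagram(torch.randn(2, 128, 313))
```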

2.1.4. Specgram

Although the Log-Mel spectrogram and Tgram can characterize sound features from the overall and temporal dimensions, respectively, their high-frequency details may still be weakened, making it difficult to fully capture spectral features. To address this, this paper designs a convolutional neural network branch called SpecNet on top of them, explicitly modeling spectral features, as shown in Figure 4. Given an input Log-Mel spectrogram $F_{LM} \in \mathbb{R}^{M \times T}$, a two-dimensional fast Fourier transform (2D-FFT) is first performed along both the frequency and time dimensions to obtain a complex spectrum:
$$ \hat{F} = \mathcal{F}(F_{LM}) $$
Here, $\mathcal{F}(\cdot)$ denotes the 2D-FFT operator, and $\hat{F} \in \mathbb{C}^{M \times T}$ represents the complex-valued spectrum. By the Fourier convolution theorem [29], time-domain convolution can be equivalently transformed into frequency-domain pointwise multiplication, namely,
$$ \mathcal{F}(h * F_{LM})(u,v) = \mathcal{F}(h)(u,v) \cdot \mathcal{F}(F_{LM})(u,v) $$
Here, $h$ denotes the convolution kernel, $(u,v)$ represents the two-dimensional frequency coordinates, $*$ indicates two-dimensional convolution, and $\cdot$ denotes pointwise multiplication. This property shows that an operation equivalent to time-domain convolution can be achieved through frequency-domain pointwise multiplication.
In SpecNet, $\hat{F}$ is enhanced by a residual block built from frequency-domain convolution and frequency-domain normalization layers. Let $C_1(\cdot)$ and $C_2(\cdot)$ denote the two FourierConv2D frequency-domain convolution layers, and let $\mathrm{SpectralBN}(\cdot)$ denote the frequency-domain normalization layer, which batch-normalizes the real and imaginary parts separately. The frequency-domain residual term $\hat{S}$ can then be expressed as follows:
$$ \hat{S} = C_2\!\left(\tanh\!\left(\mathrm{SpectralBN}\!\left(C_1(\hat{F})\right)\right)\right) $$
The tanh function serves as the activation function. The residual term S ^ and the original spectrum F ^ are then element-wise added in the frequency domain. This sum is converted back to the time–frequency domain via an inverse Fourier transform, yielding the enhanced spectral feature F S :
$$ F_S = \mathcal{F}^{-1}(\hat{F} + \hat{S}) $$
Here, $\mathcal{F}^{-1}(\cdot)$ denotes the 2D-IFFT operator, and $F_S \in \mathbb{R}^{M \times T}$ represents the spectrally enhanced feature from the Specgram branch introduced in this paper, maintaining consistent dimensions with the other branches.
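The text does not fully pin down the FourierConv2D layout, so the sketch below assumes an FFC-style realization [29]: 1 × 1 convolutions over stacked real and imaginary channels, with BatchNorm2d applied per channel so that real and imaginary parts are normalized separately. It is a minimal sketch of the residual structure in the two equations above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpecNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Frequency-domain convs assumed as 1x1 convs over (real, imag) channels.
        self.conv1 = nn.Conv2d(2, 2, kernel_size=1)
        self.bn = nn.BatchNorm2d(2)          # normalizes real/imag channels separately
        self.conv2 = nn.Conv2d(2, 2, kernel_size=1)

    def forward(self, F_LM):                 # F_LM: (B, M, T), real-valued
        F_hat = torch.fft.fft2(F_LM)         # 2D-FFT over (M, T)
        z = torch.stack([F_hat.real, F_hat.imag], dim=1)          # (B, 2, M, T)
        S = self.conv2(torch.tanh(self.bn(self.conv1(z))))        # residual term
        S_hat = torch.complex(S[:, 0], S[:, 1])                   # back to complex
        return torch.fft.ifft2(F_hat + S_hat).real                # F_S: (B, M, T)

F_S = SpecNet()(torch.randn(2, 128, 313))
```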
Thus, four complementary time–frequency feature representations, $F_T$, $F_{LM}$, $F_{TA}$, and $F_S$, are obtained from the raw waveform, the Log-Mel spectrogram, and the temporal and spectral enhancement branches.

2.2. Frequency-Gated Attention

In the aforementioned feature extraction module, both the Log-Mel spectrogram $F_{LM}$ and the spectral enhancement feature $F_S$ characterize the energy distribution of acoustic signals across the time–frequency domain, while the TAgram emphasizes critical temporal segments through temporal attention. However, simply concatenating or adding the Log-Mel and Specgram assumes equal importance for all frequency bands at every time position, making it difficult to adjust for specific frequency bands and temporal intervals. In practice, anomaly information typically appears only within limited frequency bands and short time intervals. Relying more heavily on the spectral enhancement feature in these regions, while preserving the original Log-Mel spectrogram elsewhere, may therefore yield more robust discriminative features.
Based on the above considerations, this paper designs a frequency-gated attention module, as shown in Figure 5. It utilizes the TAgram features as global contextual information while modeling attention vectors separately along the frequency and time axes. These two vectors are then combined into a two-dimensional gated map to adaptively weight the Log-Mel and Specgram representations within each time–frequency cell, yielding a more reasonably distributed spectral representation.
The input to this module is the TAgram feature $F_{TA} \in \mathbb{R}^{M \times T}$, where $M$ and $T$ represent the number of Mel bands and the number of time frames, respectively. First, max pooling and average pooling are applied to $F_{TA}$ along the time axis, yielding a statistical vector in the frequency dimension:
$$ s_f(F_{TA}) = \mathrm{Concat}\!\left(\mathrm{MaxPool}_t(F_{TA}),\ \mathrm{AvgPool}_t(F_{TA})\right) $$
Here, $\mathrm{Concat}(\cdot)$ denotes vector concatenation, while $\mathrm{MaxPool}_t(\cdot)$ and $\mathrm{AvgPool}_t(\cdot)$ represent max pooling and average pooling along the time axis, respectively. Subsequently, this vector is fed into a linear mapping layer followed by a sigmoid function to obtain the spectral gating vector:
$$ g_f = \sigma\!\left(\mathrm{Linear}(s_f(F_{TA}))\right) $$
Here, $\sigma(\cdot)$ denotes the sigmoid function, and $\mathrm{Linear}(\cdot)$ represents a fully connected linear mapping layer.
On the other branch, $F_{TA}$ undergoes average pooling along the frequency axis to yield a statistical vector in the temporal dimension. This vector then passes through a one-dimensional convolution and a sigmoid mapping to produce the temporal gating vector:
$$ g_t = \sigma\!\left(\mathrm{Conv1D}(\mathrm{AvgPool}_f(F_{TA}))\right) $$
Here, $\mathrm{Conv1D}(\cdot)$ denotes one-dimensional convolution, while $\mathrm{AvgPool}_f(\cdot)$ represents average pooling along the frequency axis.
Finally, $g_f$ and $g_t$ are combined via an outer product to form a two-dimensional gate map $G = g_f \otimes g_t$. This map is used for adaptive fusion between the Log-Mel spectrogram $F_{LM}$ and the spectral enhancement feature $F_S$, yielding the enhanced spectral representation:
$$ F_{S\_enh} = (1 - \lambda G) \odot F_{LM} + \lambda G \odot F_S $$
Here, $\otimes$ denotes the outer product operation, $\odot$ represents element-wise multiplication, and $\lambda \in [0,1]$ is the fusion weighting coefficient used to control the relative proportion of the two features in the output. When $G$ takes a large value at a specific time–frequency position, the fusion result assigns a higher weight to the spectral enhancement feature $F_S$ at that position; conversely, when $G$ is small, more of the original Log-Mel spectral information is retained.
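The gating pipeline can be sketched as follows. This is a minimal sketch: the hidden size of the linear layer, the Conv1D kernel width, and the default value of λ are illustrative assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn

class FrequencyGatedAttention(nn.Module):
    def __init__(self, n_mels=128, lam=0.5):
        super().__init__()
        self.lam = lam                                # fusion coefficient lambda
        self.linear = nn.Linear(2 * n_mels, n_mels)   # maps [max; avg] stats to M gates
        self.conv = nn.Conv1d(1, 1, kernel_size=7, padding=3)

    def forward(self, F_TA, F_LM, F_S):               # all (B, M, T)
        # Frequency gate: pool over time, concatenate max/avg statistics.
        s_f = torch.cat([F_TA.amax(dim=2), F_TA.mean(dim=2)], dim=1)   # (B, 2M)
        g_f = torch.sigmoid(self.linear(s_f))                           # (B, M)
        # Temporal gate: pool over frequency, 1-D conv along time.
        s_t = F_TA.mean(dim=1, keepdim=True)                            # (B, 1, T)
        g_t = torch.sigmoid(self.conv(s_t)).squeeze(1)                  # (B, T)
        # Outer product -> 2-D gate map G in [0, 1]^{M x T}.
        G = torch.einsum('bm,bt->bmt', g_f, g_t)
        # Adaptive fusion of Log-Mel and Specgram.
        return (1 - self.lam * G) * F_LM + self.lam * G * F_S           # F_S_enh

fga = FrequencyGatedAttention()
F_S_enh = fga(torch.randn(2, 128, 313), torch.randn(2, 128, 313), torch.randn(2, 128, 313))
```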
The aforementioned modules characterize the time–frequency properties of acoustic signals from different perspectives: $F_{LM}$ is the Log-Mel spectrogram, reflecting the fundamental energy distribution; $F_T$ denotes the temporal convolutional features extracted by TgramNet; $F_{TA}$ is the representation enhanced by temporal attention; and $F_{S\_enh}$ is the spectral enhancement feature obtained by fusing the SpecNet output through the FGA module. This paper concatenates the four features along the channel dimension to obtain the joint time–frequency representation $F_{fusion} \in \mathbb{R}^{4 \times M \times T}$ for the final classifier input, denoted as follows:
$$ F_{fusion} = \mathrm{Concat}\!\left(F_{LM},\ F_T,\ F_{TA},\ F_{S\_enh}\right) $$
This representation retains the stable structure of the traditional Log-Mel spectrogram while incorporating temporal convolutional features, temporal attention, and discriminative information from the spectral enhancement branch. It provides a richer and complementary feature foundation for subsequent classifier training and metric learning loss.
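Since all four streams share the shape (M, T), the fusion itself reduces to stacking along a new channel axis. A minimal sketch with placeholder tensors:

```python
import torch

# Four (B, M, T) feature streams stacked as channels for MobileFaceNet.
F_LM = F_T = F_TA = F_S_enh = torch.randn(2, 128, 313)   # placeholders
F_fusion = torch.stack([F_LM, F_T, F_TA, F_S_enh], dim=1)  # (B, 4, M, T)
```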

2.3. Joint Loss Optimization

In the machine ID self-supervised ASD framework, the loss function directly influences the discriminative structure of the embedding space. Relying solely on classification loss for optimization makes it challenging to fully constrain the distribution structure within the same machine ID. To address this, this paper introduces the SoftTriple metric learning loss to jointly optimize the embedding space based on the Noisy-ArcMix classification loss. Section 2.3.1 first introduces the Noisy-ArcMix classification loss employed in the baseline method, while Section 2.3.2 further details the SoftTriple metric learning loss introduced in this paper.

2.3.1. Noisy-ArcMix

In this work, the classification branch employs the Noisy-ArcMix loss [19], which builds upon ArcFace and incorporates a mixup mechanism. By dynamically adjusting decision boundaries through weighting two ArcFace losses from the same batch of samples with a mixing coefficient, it further enhances the model’s ability to distinguish unseen samples.
The ArcFace loss [30] achieves intra-class compactness and inter-class separation by introducing angular boundaries between feature vectors and class weight vectors, thereby explicitly compressing intra-class distributions and widening inter-class gaps. Its loss function can be expressed as follows:
$$ \mathcal{L}_{\mathrm{AF}}(x, y) = -y^{\mathsf{T}} \log \frac{e^{s \cos(\theta + m y)}}{\sum_{k=1}^{K} e^{s \cos(\theta_k + m y_k)}} $$
Here, $y = [y_1, \ldots, y_K]^{\mathsf{T}}$ represents the one-hot encoded label, and $\theta = [\theta_1, \ldots, \theta_K]^{\mathsf{T}}$, where $\theta_k = \arccos(w_k^{\mathsf{T}} h)$ denotes the angle between the feature embedding $h$ and the weight vector $w_k$ of the $k$-th category. $m$ is a preset angular margin, and the scaling factor $s$ amplifies the modified logits, compressing the intra-class distribution.
In ASD scenarios, only normal samples are used during training and the number of machine ID classes is limited; under this setup, the ArcFace loss may still yield insufficient intra-class compactness, with the angular distributions of normal and anomalous samples intermixed, preventing complete separation between normal and anomalous cases. To mitigate this, the Noisy-ArcMix loss combines ArcFace with mixup [31], linearly mixing samples within the same batch:
$$ x_{ij} = \lambda x_i + (1 - \lambda) x_j $$
Here, $i \in \{1, \ldots, B\}$, $j \in \mathrm{Shuffle}\{1, \ldots, B\}$, and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, where $\alpha$ is typically set to 0.5 to generate new feature samples near the normal data distribution. Samples with larger weights are treated as target classes, while those with smaller weights serve as noise classes. Building upon this, the two ArcFace losses are weighted according to the mixup coefficient to yield the Noisy-ArcMix loss:
$$ \mathcal{L}_{\mathrm{NAMix}}(x_{ij}, y_i, y_j) = \lambda\, \mathcal{L}_{\mathrm{AF}}(x_{ij}, y_i) + (1 - \lambda)\, \mathcal{L}_{\mathrm{AF}}(x_{ij}, y_j) $$
Here, $\lambda \in [0,1]$ represents the mixing coefficient. When $\lambda$ is large, the loss function more closely approximates the original ArcFace effect, causing generated samples to favor the target category. This facilitates further compression of the distribution of normal samples in angular space, enhancing intra-class compactness. Conversely, when $\lambda$ is small, the angular margin is primarily applied to the noise category with lower weight, increasing the angle between this noise category and the class center. This encourages the model to learn clearer inter-class separation near the decision boundary, enhancing its ability to distinguish unknown outlier samples.
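A minimal sketch of the mixing step follows. Here, arcface_loss stands for an ArcFace criterion with the paper's settings ($s = 30$, $m = 0.7$, per Section 3.1.1) and is assumed to be defined elsewhere; only the Noisy-ArcMix sampling and weighting are shown.

```python
import torch

def noisy_arcmix_loss(arcface_loss, x, y, alpha=0.5):
    """x: (B, ...) inputs, y: (B,) machine-ID labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient
    perm = torch.randperm(x.size(0))              # shuffle indices within the batch
    x_mix = lam * x + (1 - lam) * x[perm]         # mixed samples x_ij
    # Weighted sum of the two ArcFace losses (target and noise classes).
    return lam * arcface_loss(x_mix, y) + (1 - lam) * arcface_loss(x_mix, y[perm])
```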

2.3.2. SoftTriple

Although the Noisy-ArcMix loss improves intra-class compactness and inter-class separation to some extent, it remains, fundamentally, a classification loss based on machine ID labels. When the distribution of normal sounds within an ID is not sufficiently compact, containing samples far from the main dense region, relying solely on this classification branch still makes such dispersed samples prone to being misclassified as anomalies during inference. To mitigate this issue, this paper introduces the SoftTriple metric learning loss [27] alongside the Noisy-ArcMix classification loss. This approach constrains the embedding space structure through relative distances between samples: it compresses the embeddings of samples within the same ID, enabling them to cluster compactly in the feature space, while simultaneously widening the distances between different IDs. This enhances inter-class separation from the perspective of positive–negative sample pairs, resulting in a clearer category structure within the embedding space.
SoftTriple loss evolves from Softmax loss, achieving a constraint effect similar to triplets without explicitly constructing triplets. Softmax loss is essentially equivalent to a smoothed triplet loss where each category corresponds to only one center. However, in practical scenarios, a category often contains multiple local clusters rather than a single center. For example, the ToyConveyor model in the dataset used in this paper exhibits significant intra-class diversity. In such cases, single-center Softmax loss struggles to capture complex intra-class structures.
Building upon this insight, the SoftTriple loss maintains $K$ class prototypes $\{w_c^k\}_{k=1}^{K}$ for each category $c$ in the embedding space. This approach employs multiple centers to capture the potential existence of multiple sub-cluster structures within the same category. Given a sample feature $x_i$, its soft weight $q_k$ relative to the $k$-th prototype of category $c$ is computed as follows:
$$ q_k = \frac{\exp\!\left(\frac{1}{\gamma} x_i^{\mathsf{T}} w_c^{k}\right)}{\sum_{k'=1}^{K} \exp\!\left(\frac{1}{\gamma} x_i^{\mathsf{T}} w_c^{k'}\right)} $$
where $\gamma$ is the temperature coefficient controlling the smoothness of the weight distribution, and $k'$ indexes the $K$ centers in the denominator sum. These weights are then used to form a weighted sum of the similarity scores between the sample and each prototype, yielding the overall similarity between the sample and category $c$:
$$ S_{i,c} = \sum_{k=1}^{K} q_k\, x_i^{\mathsf{T}} w_c^{k} $$
After obtaining the similarity scores $S_{i,c}$ for all categories, these are fed into the angular-margin Softmax formulation to derive the single-sample SoftTriple loss:
$$ \mathcal{L}_{\mathrm{ST}}(x_i) = -\log \frac{\exp\!\left(\lambda (S_{i, y_i} - \delta)\right)}{\exp\!\left(\lambda (S_{i, y_i} - \delta)\right) + \sum_{c \neq y_i} \exp\!\left(\lambda S_{i,c}\right)} $$
Here, $y_i$ denotes the category label of sample $x_i$, $\lambda$ is the scaling factor, and $\delta$ represents the angular margin imposed on the true category. This loss function simultaneously pulls samples of the same category toward their respective cluster centers through multiple prototypes, while expanding the inter-class gap by imposing margins and comparing against other categories.
Finally, the SoftTriple loss is jointly optimized with the aforementioned Noisy-ArcMix classification loss to yield the total loss:
$$ \mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{NAMix}} + \lambda_{\mathrm{ST}}\, \mathcal{L}_{\mathrm{ST}} $$
Here, $\lambda_{\mathrm{ST}}$ balances the relative contributions of the two loss terms to the overall training.
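A minimal sketch of the SoftTriple branch under the hyperparameters of Section 3.1.1 ($\gamma = 0.1$, $\lambda = 20$, $\delta = 0.02$, $K = 3$) follows. Embeddings and prototypes are assumed to be L2-normalized (so inner products are cosine similarities), and the $\tau = 0$ center regularizer is omitted accordingly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTripleLoss(nn.Module):
    def __init__(self, dim, n_classes, K=3, gamma=0.1, scale=20.0, delta=0.02):
        super().__init__()
        self.K, self.gamma, self.scale, self.delta = K, gamma, scale, delta
        # K learnable prototypes per class.
        self.prototypes = nn.Parameter(torch.randn(n_classes, K, dim))

    def forward(self, emb, labels):                    # emb: (B, D), labels: (B,)
        emb = F.normalize(emb, dim=1)
        w = F.normalize(self.prototypes, dim=2)        # (C, K, D)
        sim = torch.einsum('bd,ckd->bck', emb, w)      # per-prototype similarities
        q = torch.softmax(sim / self.gamma, dim=2)     # soft assignment over K centers
        S = (q * sim).sum(dim=2)                       # (B, C) class similarity S_{i,c}
        # Subtract margin delta from the true class, scale, then cross-entropy,
        # which reproduces the angular-margin Softmax form of L_ST.
        margin = F.one_hot(labels, S.size(1)) * self.delta
        return F.cross_entropy(self.scale * (S - margin), labels)

crit = SoftTripleLoss(dim=128, n_classes=41)           # 41 machine IDs in this dataset
loss_st = crit(torch.randn(8, 128), torch.randint(0, 41, (8,)))
```

The joint objective then adds this term to the classification loss with weight $\lambda_{\mathrm{ST}} = 0.15$.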

2.4. Dataset

This paper utilizes the development dataset from DCASE 2020 Task 2 as its primary data source. This dataset comprises two sub-datasets: MIMII [32] and ToyADMOS [33], while also incorporating the official supplementary dataset to enhance the experiments. The MIMII dataset includes four machine types: Fan, Pump, Slider, and Valve. The ToyADMOS dataset comprises two machine types: ToyCar and ToyConveyor. The ToyConveyor dataset consists of six distinct machine IDs, while each of the other machine types comprises seven distinct machine IDs. Consequently, the overall dataset contains 41 distinct machine IDs.
Figure 6 presents the number of audio samples for each machine type used in training and testing. Each sample is a 10 s audio clip containing both the target machine's operational sound and ambient noise. The training portions of the development and supplementary datasets were combined for model training, yielding a total of 36,283 normal audio samples. The evaluation phase used the test portion of the development dataset, comprising 5399 normal and 5469 anomalous audio samples.

3. Results

3.1. Experimental Setup

3.1.1. Implementation Details

The proposed model was trained on 10 s audio clips, with a single model covering all machine types. First, Log-Mel spectrograms were extracted using torchaudio with a window length of 2048, a frame shift of 512, and 128 Mel filter banks. This yielded a spectrogram of size 128 × 313, from which the four input features $F_T$, $F_{LM}$, $F_{TA}$, and $F_S$ were constructed, each sharing the same dimensions as the original spectrogram.
The classification branch employs the Noisy-ArcMix loss with hyperparameters $\alpha = 0.5$, $m = 0.7$, and $s = 30$. The metric learning branch uses the SoftTriple loss with hyperparameters $\gamma = 0.1$, $\lambda = 20$, $\delta = 0.02$, $K = 3$, $\tau = 0$, and a loss weight $\lambda_{\mathrm{ST}} = 0.15$. The network is trained with the AdamW optimizer at an initial learning rate of 0.0001 for 300 epochs with a batch size of 64. All experiments are conducted on a system equipped with an NVIDIA RTX 4090 GPU, using PyTorch 2.8.0, CUDA 12.8, Python 3.12, and torchaudio 2.7.1.

3.1.2. Anomaly Score Computation

During inference, the model outputs an embedding vector for each input audio clip $x$, along with a logit vector $z(x) \in \mathbb{R}^{K}$ from the classification head, where $K$ represents the number of machine ID categories. The posterior probability of the $k$-th machine ID category is computed using the Softmax function:
$$ p_\theta(k \mid x) = \frac{\exp(z_k(x))}{\sum_{j=1}^{K} \exp(z_j(x))} $$
Under the machine ID self-supervised ASD setting, the true machine ID label y for test samples can be directly obtained from metadata. Therefore, this paper defines the anomaly score as the negative log-likelihood of the posterior probability for that machine ID:
$$ A_\theta(x) = -\log p_\theta(y \mid x) $$
Intuitively, the more a sample conforms to the normal pattern of its machine ID, the higher its posterior probability and the lower its anomaly score. Conversely, anomalous samples often deviate from the normal pattern, resulting in a lower posterior probability and, thus, a higher anomaly score.
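Both quantities reduce to a log-softmax lookup; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def anomaly_score(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """logits: (B, K) classifier outputs; y: (B,) true machine-ID labels.
    Returns the negative log posterior -log p_theta(y | x) per clip."""
    log_p = F.log_softmax(logits, dim=1)
    return -log_p.gather(1, y.unsqueeze(1)).squeeze(1)
```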

3.1.3. Evaluation Metrics

The evaluation metrics employed are AUC and pAUC. AUC represents the area under the ROC curve, serving as a performance indicator to measure the quality of a learner. pAUC assesses binary classification performance within the false positive rate interval $[0, p]$, making it more suitable for anomaly detection scenarios where low false positive rates are prioritized. Here, $p$ is set to 0.1.
$$ \mathrm{AUC} = \frac{1}{N_{-} N_{+}} \sum_{i=1}^{N_{-}} \sum_{j=1}^{N_{+}} H\!\left(A_\theta(x_j^{+}) - A_\theta(x_i^{-})\right) $$
$$ \mathrm{pAUC} = \frac{1}{\lfloor p N_{-} \rfloor\, N_{+}} \sum_{i=1}^{\lfloor p N_{-} \rfloor} \sum_{j=1}^{N_{+}} H\!\left(A_\theta(x_j^{+}) - A_\theta(x_i^{-})\right) $$
Here, $N_{+}$ and $N_{-}$ denote the numbers of anomalous and normal samples, respectively; $x_j^{+}$ and $x_i^{-}$ represent the $j$-th anomalous sample and the $i$-th normal sample; $A_\theta(x)$ is the anomaly score output by the model for sample $x$; $H(\cdot)$ is the Heaviside function, which returns 1 when the input is greater than 0 and 0 otherwise; and $\lfloor p N_{-} \rfloor$ denotes the number of normal samples used to compute pAUC.
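In practice these metrics are commonly computed with scikit-learn, as in the DCASE 2020 baseline; note that roc_auc_score with max_fpr returns the McClish-standardized partial AUC over $[0, p]$, which differs by a monotone rescaling from the raw formula above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_pauc(y_true: np.ndarray, scores: np.ndarray, p: float = 0.1):
    """y_true: 1 for anomalous, 0 for normal; scores: anomaly scores A_theta(x)."""
    auc = roc_auc_score(y_true, scores)
    pauc = roc_auc_score(y_true, scores, max_fpr=p)  # partial AUC, FPR in [0, p]
    return auc, pauc
```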

3.2. Performance Comparison

Table 2 presents the detection performance of FGASpecNet and various ASD methods across six machine types. The proposed model achieves the best results in both average AUC and pAUC, reaching 95.04% and 89.68%, respectively, representing improvements of 0.97% and 0.72% over the baseline model TASTgram. Across all machine types, FGASpecNet demonstrates consistent performance improvements with only minor fluctuations on individual machine types, indicating that the overall gains come with minimal trade-offs in detection accuracy for any single machine type. This demonstrates the effectiveness of the proposed SpecNet spectral enhancement branch, frequency-gated attention, and SoftTriple metric learning loss.

3.3. Ablation Analysis

To further validate the effectiveness of key components within the model, Table 3 presents ablation results for SpecNet, the FGA module, and SoftTriple metric learning loss.
First, introducing the SpecNet spectral enhancement branch on top of the baseline significantly improved both AUC and pAUC, demonstrating that this branch effectively mitigates deficiencies in spectral information.
Adding the FGA module further enhanced all metrics, indicating that an adaptive fusion mechanism in the frequency dimension helps highlight critical frequency bands, thereby strengthening anomaly detection. Notably, with the "SpecNet + SoftTriple" combination alone, i.e., without the FGA module, the model's performance fell below both FGASpecNet and the baseline model. This shows that imposing metric learning constraints without this module actually weakens feature discriminability, highlighting the critical role of the FGA module.
Finally, under the full configuration integrating SpecNet, the FGA module, and SoftTriple loss, model performance improved to 95.04% and 89.68%. Specifically, SoftTriple’s multi-center structure more comprehensively captures inter-class separability between different machine IDs, significantly enhancing the model’s discriminative capability. Experimental results demonstrate that all three components play important and complementary roles in ASD tasks, effectively boosting overall detection performance.

3.4. Hyperparameter Analysis

Table 4 illustrates the impact of the SoftTriple hyperparameters on model performance. The temperature coefficient $\gamma$ exerts the least influence; adjusting it from 0.1 to 0.05 results in only a marginal decline in performance. Changes in the scaling factor $\lambda$ and the margin hyperparameter $\delta$ also yield relatively moderate effects, with limited fluctuations observed in both AUC and pAUC. In contrast, the number of centers $K$ and the loss weight $\lambda_{\mathrm{ST}}$ exert more pronounced effects: increasing $K$ to 4 or decreasing $\lambda_{\mathrm{ST}}$ to 0.12 degrades performance, indicating that excessive centers or an overly weak metric weight diminishes feature discriminability. The regularization coefficient $\tau$ exhibits the strongest influence: introducing non-zero regularization significantly reduces both AUC and pAUC, suggesting that additional center shrinkage constraints are unnecessary for this task. Based on this analysis, the default SoftTriple configuration in FGASpecNet is set as $\gamma = 0.1$, $\lambda = 20$, $\delta = 0.02$, $K = 3$, $\tau = 0$, and $\lambda_{\mathrm{ST}} = 0.15$.

3.5. Representation Analysis and Visualization

To visually demonstrate the representation capability of the proposed model in the latent feature space, Figure 7 presents t-distributed stochastic neighbor embedding (t-SNE) visualizations of the pump model test set. Different colors denote distinct machine IDs, with circles representing normal samples and crosses indicating anomalous samples. In Figure 7a, the red-circled region reveals significant overlap among samples from different IDs in the central area, where normal and anomalous samples are intermingled. Within the blue-circled region, in the id_00 cluster, both normal and anomalous samples are distributed throughout the same striped structure, making differentiation challenging. In contrast, the feature distribution of FGASpecNet in Figure 7b is more orderly. The red-circled central region primarily clusters a small number of difficult samples and abnormal samples from various IDs, while the majority of normal samples from each ID have been pushed to distinct peripheral clusters. The blue-circled region exhibits a more compact striped structure, with most normal samples concentrated within the main cluster while anomalous samples are predominantly distributed along the cluster’s periphery. This phenomenon indicates that the proposed method possesses stronger discriminative capability in the embedding space, simultaneously enhancing intra-class compactness and inter-class separation, thereby improving the model’s anomaly detection performance.

3.6. Parameter Count Analysis

Table 5 compares the parameter count and detection performance of different models. For some compared methods, the parameter counts are adopted from [34]. The results show that while traditional lightweight models have fewer parameters, they significantly lag behind acoustic-feature-based modeling methods in both AUC and pAUC. The proposed FGASpecNet exhibits a slight increase in parameter count compared to TASTgram, yet achieves further improvements in AUC and pAUC, reaching 95.04% and 89.68%, respectively—the best results among all methods in the table. This demonstrates that incorporating the SpecNet spectral enhancement branch, gated attention, and SoftTriple loss results in only a slight increase in model parameter count while significantly enhancing overall performance, achieving a favorable balance between performance and model complexity.

4. Conclusions

This paper focuses on anomalous sound detection in industrial scenarios and proposes FGASpecNet to address the limitations of existing methods in spectral information utilization, characterization of key frequency-band importance, and the expressiveness of discriminative structures in the embedding space. The proposed method introduces a spectral enhancement branch, SpecNet, to complement the spectral details of Log-Mel representations, and employs a frequency-gated attention mechanism for adaptive time–frequency fusion, enabling the model to focus on regions relevant to anomalies in the time–frequency representation. Meanwhile, by combining a classification loss and a metric learning loss, FGASpecNet enhances intra-class compactness and inter-class separability, leading to a clearer class structure in the embedding space. Experimental results on the DCASE 2020 Task 2 dataset show that, with only a slight increase in parameter count, FGASpecNet outperforms several representative baseline methods in terms of AUC and pAUC, validating the effectiveness of the proposed approach. Future work will target more challenging real-world industrial settings, including additional evaluations of inference efficiency at the hardware level, improved robustness under noise perturbations and domain generalization, and the exploration of more efficient inference and adaptation strategies to meet practical condition monitoring requirements.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, and writing—original draft preparation: J.J.; writing—review and editing, supervision, and project administration: Z.B., W.Z., and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Project of Shanghai Science and Technology Committee (no. 23010501500).

Data Availability Statement

The data used in this paper are publicly available from the DCASE 2020 Challenge Task 2 (Unsupervised Detection of Anomalous Sounds) at https://dcase.community/challenge2020/task-unsupervised-detection-of-anomalous-sounds (accessed on 20 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

The following abbreviations are used in this manuscript:
ASD: Anomalous sound detection
Machine ID: Machine identifier
Log-Mel: Log-Mel spectrogram
FGASpecNet: The proposed model fusing spectral enhancement and frequency-gated attention
DCASE: Detection and Classification of Acoustic Scenes and Events
AUC: Area under the receiver operating characteristic (ROC) curve
pAUC: Partial area under the ROC curve
STFT: Short-time Fourier transform
FFT/IFFT: (Inverse) fast Fourier transform
2D-FFT/2D-IFFT: Two-dimensional (inverse) fast Fourier transform
MIMII: Malfunctioning Industrial Machine Investigation and Inspection (dataset)
ToyADMOS: Toy Acoustic Anomaly Detection in Machine Operating Sounds (dataset)

References

  1. Thoben, K.D.; Wiesner, S.; Wuest, T. “Industrie 4.0” and smart manufacturing—A review of research issues and application examples. Int. J. Autom. Technol. 2017, 11, 4–16. [Google Scholar] [CrossRef]
  2. Zhou, J.; Li, P.; Zhou, Y.; Wang, B.; Zang, J.; Meng, L. Toward new-generation intelligent manufacturing. Engineering 2018, 4, 11–20. [Google Scholar] [CrossRef]
  3. Tran, M.Q.; Doan, H.P.; Vu, V.Q.; Vu, L.T. Machine learning and IoT-based approach for tool condition monitoring: A review and future prospects. Measurement 2023, 207, 112351. [Google Scholar] [CrossRef]
  4. Nunes, E.C. Anomalous sound detection with machine learning: A systematic review. arXiv 2021, arXiv:2102.07820. [Google Scholar] [CrossRef]
  5. Koizumi, Y.; Kawaguchi, Y.; Imoto, K.; Nakamura, T.; Nikaido, Y.; Tanabe, R.; Purohit, H.; Suefusa, K.; Endo, T.; Yasuda, M.; et al. Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring. arXiv 2020, arXiv:2006.05822. [Google Scholar] [CrossRef]
  6. Suefusa, K.; Nishida, T.; Purohit, H.; Tanabe, R.; Endo, T.; Kawaguchi, Y. Anomalous sound detection based on interpolation deep neural network. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 271–275. [Google Scholar] [CrossRef]
  7. Kapka, S. ID-conditioned auto-encoder for unsupervised anomaly detection. arXiv 2020, arXiv:2007.05314. [Google Scholar] [CrossRef]
  8. Alam, J.; Boulianne, G.; Gupta, V.; Fathan, A. An ensemble approach to unsupervised anomalous sound detection. In Proceedings of the 5th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Tokyo, Japan, 2–4 November 2020; pp. 2–4. [Google Scholar]
  9. Hayashi, T.; Yoshimura, T.; Adachi, Y. Conformer-based id-aware autoencoder for unsupervised anomalous sound detection. In DCASE2020 Challenge; Technical Report; DCASE Community: Barcelona, Spain, 2020. [Google Scholar]
  10. Park, J.; Yoon, S.; Yoo, S. Unsupervised detection of anomalous machine sound using various spectral features and focused hypothesis test in the reverberant and noisy environment. In Proceedings of the Detection and Classification of Acoustic Scenes and Events, DCASE2020, Tokyo, Japan, 2–4 November 2020. [Google Scholar]
  11. Wang, Y.; Zhang, Q.; Zhang, W.; Zhang, Y. A lightweight framework for unsupervised anomalous sound detection based on selective learning of time-frequency domain features. Appl. Acoust. 2025, 228, 110308. [Google Scholar] [CrossRef]
  12. Giri, R.; Tenneti, S.; Cheng, F.; Helwani, K.; Isik, U.; Krishnaswamy, A. Self-supervised classification for detecting anomalous sounds. In Proceedings of the Detection and Classification of Acoustic Scenes and Events, DCASE2020, Tokyo, Japan, 2–4 November 2020. [Google Scholar]
  13. Dohi, K.; Endo, T.; Purohit, H.; Tanabe, R.; Kawaguchi, Y. Flow-based self-supervised density estimation for anomalous sound detection. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 336–340. [Google Scholar] [CrossRef]
  14. Chen, H.; Ran, L.; Sun, X.; Cai, C. SW-WAVENET: Learning representation from spectrogram and wavegram using wavenet for anomalous sound detection. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  15. Wu, J.; Yang, F.; Hu, W. Unsupervised anomalous sound detection for industrial monitoring based on ArcFace classifier and gaussian mixture model. Appl. Acoust. 2023, 203, 109188. [Google Scholar] [CrossRef]
  16. Liu, Y.; Guan, J.; Zhu, Q.; Wang, W. Anomalous sound detection using spectral-temporal information fusion. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; IEEE: New York, NY, USA, 2022; pp. 816–820. [Google Scholar] [CrossRef]
  17. Zeng, X.M.; Song, Y.; Zhuo, Z.; Zhou, Y.; Li, Y.-H.; Xue, H.; Dai, L.-R.; McLoughlin, I. Joint generative-contrastive representation learning for anomalous sound detection. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  18. Wilkinghoff, K. Sub-cluster AdaCos: Learning representations for anomalous sound detection. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; IEEE: New York, NY, USA, 2021; pp. 1–8. [Google Scholar] [CrossRef]
  19. Choi, S.; Choi, J.W. Noisy-arcmix: Additive noisy angular margin loss combined with mixup for anomalous sound detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 516–520. [Google Scholar] [CrossRef]
  20. Kong, D.; Yu, H.; Yuan, G. Multi-Spectral and Multi-Temporal Features Fusion with SE Network for Anomalous Sound Detection. IEEE Access 2024, 12, 167262–167277. [Google Scholar] [CrossRef]
  21. Guan, J.; Xiao, F.; Liu, Y.; Zhu, Q.; Wang, W. Anomalous sound detection using audio representation with machine id based contrastive learning pretraining. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  22. Huang, S.; Fang, Z.; He, L. Noise supervised contrastive learning and feature-perturbed for anomalous sound detection. arXiv 2025, arXiv:2509.13853. [Google Scholar] [CrossRef]
  23. Wilkinghoff, K.; Yang, H.; Ebbers, J.; Germain, F.G.; Wichern, G.; Le Roux, J. Keeping the Balance: Anomaly Score Calculation for Domain Generalization. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5. [Google Scholar] [CrossRef]
  24. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
  25. Zhang, X.; Zhao, R.; Qiao, Y.; Wang, X.; Li, H. Adacos: Adaptively scaling cosine logits for effectively learning deep face representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10823–10832. [Google Scholar] [CrossRef]
  26. Wilkinghoff, K.; Kurth, F. Why do angular margin losses work well for semi-supervised anomalous sound detection? IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 608–622. [Google Scholar] [CrossRef]
  27. Qian, Q.; Shang, L.; Sun, B.; Hu, J.; Li, H.; Jin, R. Softtriple loss: Deep metric learning without triplet sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6450–6458. [Google Scholar] [CrossRef]
  28. Pham, L.; Phan, H.; Nguyen, T.; Palaniappan, R.; Mertins, A.; McLoughlin, I. Robust acoustic scene classification using a multi-spectrogram encoder-decoder framework. Digit. Signal Process. 2021, 110, 102943. [Google Scholar] [CrossRef]
  29. Rippel, O.; Snoek, J.; Adams, R.P. Spectral representations for convolutional neural networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2449–2457. [Google Scholar]
  30. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar] [CrossRef]
  31. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar] [CrossRef]
  32. Purohit, H.; Tanabe, R.; Ichige, T.; Endo, T.; Nikaido, Y.; Suefusa, K.; Kawaguchi, Y. MIMII Dataset: Sound dataset for malfunctioning industrial machine investigation and inspection. arXiv 2019, arXiv:1909.09347. [Google Scholar] [CrossRef]
  33. Koizumi, Y.; Saito, S.; Uematsu, H.; Harada, N.; Imoto, K. ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; IEEE: New York, NY, USA, 2019; pp. 313–317. [Google Scholar] [CrossRef]
  34. Neri, M.; Carli, M. Low-complexity attention-based unsupervised anomalous sound detection exploiting separable convolutions and angular loss. IEEE Sens. Lett. 2024, 8, 6014404. [Google Scholar] [CrossRef]
Figure 1. Overall structure of FGASpecNet.
Figure 2. Architecture of TgramNet.
Figure 3. Architecture of temporal attention.
Figure 4. Architecture of SpecNet.
Figure 5. Architecture of frequency-gated attention.
Figure 6. Detailed distribution of the dataset. (a) Overall dataset composition. (b) Sample counts for each machine type under different conditions.
Figure 7. The t-SNE visualization of latent embeddings on the test dataset for the pump machine type. (a) TASTgram. (b) FGASpecNet. Different colors represent different machine IDs. "•" and "×" denote normal and anomalous samples, respectively.
Table 1. Layer configuration of TgramNet. The block of LayerNorm, Leaky ReLU, and the small-kernel Conv1D is repeated three times.

Layer      | c   | k    | s   | p   | n
Conv1D     | 128 | 1024 | 512 | 512 | ×1
LayerNorm  | -   | -    | -   | -   | ×3
Leaky ReLU | -   | -    | -   | -   |
Conv1D     | 128 | 3    | 1   | 1   |
Table 2. Performance comparison in terms of AUC (%) and pAUC (%) on the test data of the development dataset; each cell reports AUC/pAUC. Boldface and underlined numbers in the original denote the best and second-best results, respectively.

Method      | Fan         | Pump        | Slider      | Valve       | ToyCar      | ToyConveyor | Average
AE          | 66.20/53.20 | 72.90/60.30 | 85.50/67.80 | 66.30/51.20 | 80.90/79.90 | 73.40/61.10 | 74.20/60.58
IDNN        | 67.71/52.90 | 73.76/61.07 | 86.45/67.58 | 84.09/64.94 | 78.69/69.22 | 71.07/59.70 | 76.96/62.57
MobileNetV2 | 80.19/74.40 | 82.53/76.50 | 95.27/85.22 | 88.65/87.98 | 87.66/85.92 | 69.71/56.43 | 84.34/77.74
Glow-Aff    | 74.90/65.30 | 84.30/73.80 | 94.60/82.80 | 91.40/75.00 | 92.20/84.10 | 71.50/59.00 | 85.20/73.79
STgram-MFN  | 94.04/88.97 | 91.94/81.75 | 99.55/97.61 | 99.64/98.44 | 94.44/87.68 | 74.57/63.60 | 92.36/86.34
TASTgram    | 97.82/94.78 | 94.05/85.40 | 99.50/97.39 | 99.97/99.85 | 96.40/89.91 | 76.58/66.43 | 94.07/88.96
FGASpecNet  | 98.05/94.73 | 96.03/88.24 | 99.49/97.32 | 99.93/99.61 | 96.99/90.43 | 79.75/67.72 | 95.04/89.68
Table 3. Ablation study in terms of average AUC (%) and pAUC (%). The FGA module is applied on top of SpecNet and is not evaluated as a standalone component. ✓ indicates that the corresponding module is enabled.

SpecNet | FGA | SoftTriple | AUC   | pAUC
        |     |            | 94.07 | 88.96
✓       |     |            | 94.42 | 89.15
        |     | ✓          | 94.57 | 88.95
✓       | ✓   |            | 94.68 | 89.38
✓       |     | ✓          | 93.90 | 88.80
✓       | ✓   | ✓          | 95.04 | 89.68
Table 4. Average AUC (%) and pAUC (%) results of different SoftTriple hyperparameters. The last row is the default configuration.

γ    | λ  | δ     | K | τ    | λ_ST | AUC   | pAUC
0.05 | 20 | 0.02  | 3 | 0    | 0.15 | 94.94 | 89.61
0.1  | 10 | 0.02  | 3 | 0    | 0.15 | 94.46 | 88.99
0.1  | 20 | 0.025 | 3 | 0    | 0.15 | 94.38 | 89.00
0.1  | 20 | 0.02  | 4 | 0    | 0.15 | 93.99 | 88.82
0.1  | 20 | 0.02  | 3 | 0.02 | 0.15 | 90.17 | 84.25
0.1  | 20 | 0.02  | 3 | 0    | 0.12 | 93.80 | 88.75
0.1  | 20 | 0.02  | 3 | 0    | 0.15 | 95.04 | 89.68
Table 5. Parameter counts and performance comparison in terms of average AUC (%) and pAUC (%).

Method      | Parameters | AUC   | pAUC
IDNN        | 46 k       | 76.96 | 62.57
MobileNetV2 | 1.1 M      | 84.34 | 77.74
Glow-Aff    | 30 M       | 85.20 | 73.90
STgram-MFN  | 1.1 M      | 92.36 | 86.34
TASTgram    | 1.2 M      | 94.07 | 88.96
FGASpecNet  | 1.3 M      | 95.04 | 89.68
