1. Introduction
In real-world industrial machine monitoring scenarios, normal operational noise is readily available, while anomalous sounds are inherently rare and difficult to collect without risking machine damage. Moreover, manually labeling anomalous noise is highly resource-intensive. Therefore, the goal of anomalous sound detection (ASD) is to determine whether a machine is in an abnormal state during inference [
1], using only normal audio samples for training. This constraint makes unsupervised and semi-supervised learning particularly well-suited for this task.
Previous research on anomalous sound detection can be primarily categorized into two approaches: the first utilizes autoencoders to model normal audio samples and distinguishes anomalies through reconstruction error analysis [
2,
3,
4,
5,
6], while the second employs machine noise ID-based supervised classification models.
For autoencoder models like VIDNN [
2], since audio data contains multiple machine types, separate encoders must be constructed for each machine ID. In practical applications, this approach requires significantly more training time and computational resources [
6]. The interpolated neural network addresses this by building an autoencoder through intermediate frame prediction, which avoids the trivial solution problem common in traditional autoencoders and better captures the time-frequency structure of audio signals.
However, due to the excessive generalization capability of autoencoders [
7], when the distinction between normal and abnormal machine noise is subtle, the autoencoder may reconstruct anomalous audio almost as well as normal audio despite being trained only on normal samples. This narrows the reconstruction-error gap and consequently degrades detection performance.
Machine ID classification-based methods have demonstrated significant advantages in anomalous sound detection (ASD) tasks [
8,
9,
10,
11,
12], surpassing traditional autoencoder (AE) architectures in detection performance by leveraging the ability to jointly learn feature distributions across multiple machines. Current research primarily focuses on modifying the MobileFaceNetV2 [
13] backbone, with improvements mainly following two technical directions. The first is the diversification of input features, such as ST-gram [
8] enhancing the temporal representation of log-Mel spectrograms through a temporal feature network (TgramNet), and OS-SCL [
11] employing dilated convolutions to construct TFnet for capturing deep frequency-domain features. The second is the optimization of loss functions, including the use of ArcFace [
14] for inter-class angular margin penalties, or Noise-ArcMix [
10] employing mixup [
15] strategies to improve model generalization. Although studies like CLP-SCF [
9] optimize feature space distribution through contrastive learning, these approaches fail to address a fundamental limitation—the existing architecture is directly adapted from MobileFaceNetV2, originally designed for face recognition, where its underlying assumptions (e.g., reliance on shape/contour-based spatial features) inherently differ from the time-frequency characteristics of audio spectrograms.
Indeed, facial images and audio spectrograms exhibit significant differences in data structure: the former relies on well-defined hierarchical shape structures and spatial continuity, while the latter, as a time-frequency representation, exhibits weak local correlations and globally dispersed patterns. This domain gap leads to three inherent drawbacks in current models regarding spectrogram feature fusion: (1) the lack of an adaptive attention mechanism for critical frequency bands; (2) the inefficiency of spatial convolution kernels in capturing long-range dependencies across frequency bands; and (3) the loss of time-frequency information during temporal feature aggregation.
To overcome the limitations of MobileFaceNetV2 in audio scene recognition, recent studies in the computer vision domain have demonstrated that attention mechanisms can effectively enhance model performance. However, traditional Squeeze-and-Excitation (SE) [
16] modules typically employ pooling operations to generate channel-wise weights, and such one-dimensional attention mechanisms struggle to fully capture the time-frequency characteristics of audio signals. In anomalous sound detection (ASD) tasks, model performance heavily relies on precise modeling of time-frequency features. Although attention mechanisms have been proven effective in processing sequential data, architectures such as the Vision Transformer (ViT) [
17,
18,
19] require partitioning the input tensor into patches, which inevitably leads to the loss of inherent information in the two-dimensional time-frequency structure. To address this issue, the proposed ATA-MSTF-Net adopts a separable convolution structure to directly generate two-dimensional time-frequency feature weights, enabling the model to adaptively enhance the representation of critical frequency bands and temporal frames, thereby achieving a more fine-grained modeling of audio textures.
From an architectural perspective, although MobileNet extensively uses 1 × 1 convolutions to effectively reduce computational complexity, such operations introduce fixed channel dependencies that limit the representational capacity of the model. To alleviate this problem, we innovatively introduce a lightweight Channel Shuffle mechanism [
20], which periodically perturbs the feature channels. This mechanism effectively breaks fixed inter-channel dependencies and promotes cross-channel information exchange without significantly increasing computational overhead, thereby enhancing the model’s representational power. Considering the temporal characteristics of audio, we further integrate an LSTM [
21] module into TgramNet, significantly improving its ability to model long-term dependencies and better capture the dynamic variation patterns of machine noise.
It is worth noting that this fine-grained modeling approach for audio time-frequency textures is highly consistent with the concept of feature co-optimization in the fields of incomplete multi-view learning and multi-label learning. In incomplete multi-view clustering, existing work has enhanced representation capability through a self-attention two-stage autoencoder combined with missing data recovery (RecFormer) [
22], or achieved structured consensus representation via sparse regularization and local graph embedding (LSIMVC) [
23]. In incomplete multi-view multi-label classification, some studies have realized cross-view consistency alignment through label-driven contrastive learning and quality-aware weighting (RANK) [
22], while others have addressed double-incomplete scenarios by integrating missing-view and missing-label information into fusion and classification modules to improve performance [
23]. These methods share technical similarities with our approach in terms of multi-modal feature fusion, key feature enhancement, and incomplete information utilization, collectively demonstrating the importance of efficient multi-source information modeling for improving performance in complex tasks.
Further research reveals that the temporal weighting mechanism of the original TAgram module suffers from insufficient flexibility. Inspired by the CBAM module [
24], we propose an improved solution: computing max-pooled and average-pooled features in parallel, concatenating them, and generating attention weights to highlight the more discriminative key time frames in the Mel spectrogram. Experimental results demonstrate that this enhanced temporal dynamic feature modeling method exhibits stronger adaptive capabilities, achieving superior feature representation in the time dimension.
Our key contributions can be summarized as follows:
We propose a novel Audio Texture Attention Mechanism that employs separable convolution to directly model two-dimensional time-frequency feature correlations, overcoming the limitations of traditional SE modules in time-frequency feature modeling.
We enhance the temporal weight modeling of the TAgram module by incorporating a CBAM-like structure, significantly improving the model’s ability to represent temporal dynamic features in audio signals.
We innovatively integrate an LSTM module into the lightweight network architecture, effectively strengthening the model’s ability to capture long-term dependencies.
Building on the ATA attention mechanism, we propose the ATA-MSTF-Net (Audio Texture-Aware Multi-Spectro-Temporal Attention Fusion Network), which achieves state-of-the-art performance on the DCASE 2020 Task 2 dataset.
2. Methods
This study adopts a classification-based approach by constructing a classification model for normal audio to learn its data distribution characteristics. The overall framework is illustrated in
Figure 1. During the model training phase, we utilize the machine type and machine ID of the audio as classification labels and optimize the feature space using a classification loss function. This ensures that the feature distances between samples of the same class are minimized, while those between samples of different classes are maximized.
In the inference phase, since no anomalous samples are introduced during training, the model’s classification confidence for anomalous audio inputs will approach zero, whereas, for normal audio, the confidence remains close to one. To further enhance the discriminability between normal and anomalous audio, this study employs a negative logarithmic transformation, as shown in Equation (1):

anomaly score = −log(p),(1)

where p represents the model’s output classification confidence (a probability value ranging between 0 and 1). When p approaches 0 (indicating anomalous audio), −log(p) tends toward +∞; when p approaches 1 (indicating normal audio), −log(p) tends toward 0.
This transformation significantly amplifies the confidence differences between the two types of samples. The experimental results demonstrate that when the negative logarithmic value of the confidence exceeds 0.1, the audio can be determined as anomalous.
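For concreteness, the following minimal sketch illustrates this scoring rule; the function name and the small epsilon guard are illustrative additions, while the 0.1 decision threshold follows the text.

```python
import numpy as np

def anomaly_score(confidence: float, threshold: float = 0.1) -> tuple[float, bool]:
    """Negative-log transformation of the classifier confidence (Equation (1)).

    `confidence` is the probability the classifier assigns to the clip's
    machine-ID class; the 0.1 decision threshold follows the text.
    """
    score = -np.log(confidence + 1e-12)  # small epsilon guards against log(0)
    return score, score > threshold      # True -> flagged as anomalous


# Example: a confident (normal) and an unconfident (anomalous) prediction.
print(anomaly_score(0.98))   # (~0.02, False)
print(anomaly_score(0.05))   # (~3.0, True)
```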
This section proposes an innovative ATA attention mechanism for mechanical noise audio classification tasks and constructs an ATA-MSTF-Net model based on this mechanism. The ATA attention mechanism, through its unique structural design, specifically enhances the feature extraction ability of the classifier, demonstrating significant advantages in time-frequency domain feature representation and effectively capturing key texture characteristics of audio signals.
Building upon this mechanism, the ATA-MSTF-Net model adopts a multi-modal feature fusion strategy. By processing multiple feature representations of the raw audio in parallel and incorporating complementary information from Mel spectrograms, the model significantly improves the discriminative power of input features, thereby optimizing its overall performance.
Figure 2a illustrates the structural design of the ATA attention mechanism tailored to the time-frequency characteristics of audio signals, while
Figure 3 details the overall architecture of the ATA-MSTF-Net model and its core modules.
2.1. ATA Attention
The core of the attention mechanism lies in dynamically generating weights to adjust feature representations, thereby enhancing model performance by suppressing irrelevant information and emphasizing key features. Its effectiveness heavily depends on the design of the weight generation method, which must be closely aligned with the structural characteristics of the specific data. Current attention weight generation methods can be broadly categorized into two types, each with notable limitations.
The first type is based on self-attention mechanisms. As illustrated in
Figure 2d, the model establishes long-range dependencies by computing the similarity between the query (Q) and key (K) matrices. However, such methods require flattening the input tensor into a one-dimensional sequence for processing. For audio spectrogram data, this leads to severe disruption of the time–frequency structure, resulting in the loss of local time–frequency correlations and a consequent decline in recognition performance.
The second type relies on convolutional structures, as shown in
Figure 2b,c, with the squeeze-and-excitation (SE) module being the most representative example. The SE mechanism uses global average pooling combined with fully connected layers to generate channel-wise attention weights. By performing dimensionality reduction followed by expansion, it filters out irrelevant information. However, it suffers from fundamental drawbacks: (1) the pooling operation leads to the loss of spatial detail; and (2) it can only produce a single weight per channel dimension, failing to capture the joint variation of two-dimensional time–frequency features. Although the subsequent EMA [
25] improves channel modeling without a dimensionality reduction, its weight generation approach still cannot establish fine-grained attention control for each pixel on the time–frequency plane.
To overcome these shortcomings, we propose the ATA mechanism, which directly uses a convolutional structure to generate attention weights, eliminating pooling and dimensionality reduction operations. It adopts Depthwise Separable Convolution as its core architecture, comprising three key components: Depthwise Convolution, Pointwise Convolution, and a Feed-Forward Enhancement Layer.
This design offers significant advantages in computational efficiency and feature extraction. First, depthwise convolution performs local feature extraction in the spatial dimensions (time–frequency plane), enabling the modeling of local time–frequency correlations at a very low computational cost. Second, pointwise convolution (1 × 1 convolution) is responsible for cross-channel feature integration, adaptively fusing the time–frequency features across channels. Through this cascaded operation, the model can accurately estimate the importance of each spatiotemporal location (time frame × frequency bin) in the input tensor, generating an attention map with fine-grained modulation capability. In the feed-forward enhancement layer, the generated attention map is multiplied element-wise with the original features. The result is then sequentially passed through a Dropout layer and a 1 × 1 convolution layer to enhance feature fusion, and finally combined via residual connections to achieve complementary and enhanced feature information.
As shown in
Figure 2a, the ATA attention can be expressed as follows:

F′ = Attention(F) ⊗ F,

where F represents the input audio features and Attention(F) is the generated attention weight map, in which each value indicates the importance of the corresponding element in F. The symbol ⊗ denotes element-wise multiplication.
Unlike traditional attention mechanisms (such as SE or EMA), the proposed ATA (Audio Texture Attention) mechanism abandons the constraints of normalization functions like sigmoid or softmax, and instead employs a fully convolutional structure to achieve adaptive feature weighting. This convolution operation can dynamically generate adjustment weights based on the local contextual information of the input features, rather than forcing the output into a uniform normalized distribution. This content-based adaptive modulation avoids the limitations of the attention weight range imposed by sigmoid/softmax.
We argue that the effectiveness of an attention mechanism lies in its ability to flexibly adjust according to the spatiotemporal distribution characteristics of the input features. In contrast, traditional normalization operations unnecessarily constrain the dynamic range of attention weights, thereby reducing the model’s representational capacity.
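A minimal PyTorch sketch of the ATA block as described above is given below. The depthwise kernel size, dropout probability, and the choice of keeping the channel count unchanged are assumptions not stated in the text; consistent with the design above, no sigmoid or softmax is applied to the attention map.

```python
import torch
import torch.nn as nn

class ATAAttention(nn.Module):
    """Sketch of the Audio Texture Attention (ATA) block of Section 2.1.

    Assumptions (not specified in the text): kernel size 3 for the depthwise
    convolution, dropout probability 0.1, unchanged channel count.
    """

    def __init__(self, channels: int, kernel_size: int = 3, p_drop: float = 0.1):
        super().__init__()
        # Depthwise convolution: local time-frequency correlations per channel.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Pointwise convolution: cross-channel fusion of time-frequency features.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        # Feed-forward enhancement: dropout + 1x1 conv applied after re-weighting.
        self.dropout = nn.Dropout(p_drop)
        self.ffn = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Attention map keeps the (B, C, F, T) shape of the input; no
        # sigmoid/softmax is applied, matching the unconstrained weighting.
        attention = self.pointwise(self.depthwise(f))
        weighted = attention * f                      # element-wise modulation
        enhanced = self.ffn(self.dropout(weighted))   # feed-forward enhancement
        return f + enhanced                           # residual connection


# Quick shape check on a (batch, channels, mel bins, frames) tensor.
if __name__ == "__main__":
    x = torch.randn(2, 64, 128, 313)
    print(ATAAttention(64)(x).shape)  # torch.Size([2, 64, 128, 313])
```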
2.2. ATA-MSTF-Net
As shown in
Figure 3, the network architecture of ATA-MSTF-Net features two key improvements.
First, the model adopts a multi-feature extraction architecture design, which integrates multiple signal processing methods to extract highly discriminative time-frequency features from raw audio data. This design is based on the following consideration: owing to the complex time–frequency characteristics of audio signals, a single feature extraction method is insufficient to comprehensively characterize their full informational attributes. To address this, the proposed ATA-MSTF-Net model employs a multi-branch feature extraction structure, primarily comprising two core modules: TF-gramNet and T-LSTM-gramNet. The TF-gramNet module is specifically designed to extract global spectral features, while the T-LSTM-gramNet module enhances the ability to model temporal dynamic features through a long short-term memory (LSTM) network. This differentiated module design enables the model to simultaneously capture both the static texture features and dynamic evolutionary patterns of audio signals. Additionally, the model incorporates a time attention mechanism to optimize input feature representation by weighting critical time frames in the Mel spectrogram, thereby more effectively capturing key classification cues. During the feature fusion stage, the feature vectors extracted from each branch are concatenated and fed into the classifier to maximize the extraction of audio information, ultimately forming the final detection decision. Experimental validation demonstrates that this multi-feature fusion strategy significantly improves the model’s robustness and generalization performance in complex acoustic environments, offering a novel technical approach to audio feature extraction.
The second improvement is that ATA-MSTF-Net innovatively integrates the Audio Texture Attention (ATA) mechanism with a lightweight MobileNet architecture to construct the Mobile-ATA-Classifier module, significantly improving the network’s ability to model audio texture features. The motivation for introducing ATA attention stems from recognizing the importance of audio textures in classification tasks—many critical classification cues do not reside in a single frequency band or time point but are instead distributed as texture patterns across the entire time–frequency map. The ATA attention mechanism can dynamically weight these regions, highlighting discriminative texture patterns while suppressing background noise. Combining this with the MobileNet architecture addresses efficiency and deployment considerations, ensuring strong expressive power while maintaining low computational resource requirements. This makes the entire classification module both lightweight and efficient, suitable for real-world production environments. Below, we provide a detailed explanation of each of these improved modules.
2.2.1. T-LSTM-gramNet
In anomalous sound detection (ASD) tasks, effective extraction of audio features is critical to model performance. The STgram architecture employs 1D large-kernel convolution (kernel size = 1024) to simulate the Mel-spectrogram extraction process, thereby supplementing high-frequency information in audio signals and enhancing feature representation through a three-layer convolutional structure. However, traditional convolution operations have inherent limitations in modeling temporal dependencies in audio data, making it difficult to effectively capture long-term temporal evolution characteristics such as machine noise patterns.
To address this limitation, we introduce an LSTM (Long Short-Term Memory) structure into the T-gram module, as illustrated in
Figure 4. The T-LSTM-gram module uses a three-layer LSTM network to capture trend variations in the raw audio while concatenating and fusing the results of the large-kernel convolution and convolutional modules, as described below:
As illustrated in
Figure 4, the architecture employs a Large Kernel 1D Convolution (LKConv) module to simulate the computation process of Mel-spectrograms. The LKConv module is configured with a kernel size of 1024 and a stride of 512, which correspond to the frame length and frame shift settings in audio-processing, respectively. By mapping the raw 1D audio signal to a 128-dimensional feature space, this module achieves a feature extraction functionality analogous to the combination of Fourier transform and Mel filter banks. For an input audio signal with a length of 160,000, the initial feature map
produced by this module has a dimensional structure of (Batch, 128, 313). Subsequent feature extraction is performed by a convolutional module (Convs), which consists of three cascaded convolutional layers for the progressive extraction of deep features. Temporal feature extraction is then accomplished through a three-layer LSTM network (LSTMs) to capture long-term dependencies in the audio signal. Finally, the concatenation (Concat) operation is applied along the channel dimension for three types of features, generating the final output feature
with a dimension of (Batch, 3, 128, 313).
The T-LSTM-gram module combines the advantages of CNNs in local time–frequency feature extraction with the strengths of LSTMs in modeling long-term temporal dependencies, thereby constructing a more discriminative audio feature representation. This innovative design effectively compensates for the limitations of traditional convolutional networks in temporal modeling.
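The following hedged sketch reproduces the T-LSTM-gram branch described above. The LKConv settings (kernel 1024, stride 512, 128 channels), the three-layer LSTM with hidden size 128, and the (Batch, 3, 128, 313) output follow the text; the internal configuration of the Convs block and the padding choice are assumptions.

```python
import torch
import torch.nn as nn

class TLSTMGramNet(nn.Module):
    """Sketch of the T-LSTM-gram branch (Section 2.2.1).

    The three-layer Convs block below (kernel 3, BatchNorm, LeakyReLU) is an
    assumed configuration; only its depth is given in the text.
    """

    def __init__(self, mel_bins: int = 128, win_len: int = 1024, hop_len: int = 512):
        super().__init__()
        # Large-kernel 1D convolution emulating framing + Mel filtering.
        self.lkconv = nn.Conv1d(1, mel_bins, kernel_size=win_len,
                                stride=hop_len, padding=win_len // 2)
        # Three cascaded 1D conv layers for deeper spectral features (assumed).
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(mel_bins, mel_bins, kernel_size=3, padding=1),
                          nn.BatchNorm1d(mel_bins), nn.LeakyReLU(0.2))
            for _ in range(3)
        ])
        # Three-layer LSTM over time to capture long-term dependencies.
        self.lstm = nn.LSTM(input_size=mel_bins, hidden_size=mel_bins,
                            num_layers=3, batch_first=True)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (Batch, 1, 160000) raw audio -> t_gram: (Batch, 128, 313)
        t_gram = self.lkconv(wav)
        conv_feat = self.convs(t_gram)                    # (Batch, 128, 313)
        lstm_feat, _ = self.lstm(t_gram.transpose(1, 2))  # (Batch, 313, 128)
        lstm_feat = lstm_feat.transpose(1, 2)             # (Batch, 128, 313)
        # Stack the three feature maps as channels: (Batch, 3, 128, 313)
        return torch.stack([t_gram, conv_feat, lstm_feat], dim=1)


if __name__ == "__main__":
    wav = torch.randn(2, 1, 160000)
    print(TLSTMGramNet()(wav).shape)  # torch.Size([2, 3, 128, 313])
```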
2.2.2. Time Attention
To address the limitations of the temporal attention mechanism in the TAgram module, this study proposes an improved Time Attention mechanism. In the original TASTgramNet, the TAgram module generates the temporal attention map by directly summing the results of max pooling and average pooling applied to the Mel spectrogram. This approach has notable shortcomings: first, the simple linear addition operation lacks learnable parameters, resulting in an overly rigid attention weight generation process; second, this static fusion strategy fails to adaptively capture the dynamic variations in key time frames across different audio scenarios.
To overcome these limitations, this study introduces an attention weight generation mechanism inspired by CBAM (Convolutional Block Attention Module). The improved design offers the following advantages: (1) by incorporating learnable attention weight parameters, it enables the adaptive weighting of temporal features; (2) it leverages context information modeling to enhance the model’s ability to discriminate important time frames; and (3) it employs a nonlinear fusion strategy to more precisely reflect the distribution of temporal importance in the Mel-spectrogram (
Figure 5).
The module adopts a dual pooling strategy for feature extraction along the frequency dimension: Max Pooling and Average Pooling are performed in parallel along the frequency axis to respectively capture the prominent spectral features and the overall distribution characteristics. These two complementary statistical features are then concatenated along the channel dimension, forming a joint representation of dimension 2 × T, where T is the number of time frames. To enhance the feature representation capability, this joint representation is passed through a convolutional layer for nonlinear transformation and feature fusion, enabling the convolution kernels to learn the optimal combination of the two statistics. Finally, after normalization by the Sigmoid activation function, probabilistic attention weights in the range [0, 1] are generated, mathematically expressed as follows:

A_T = σ(Conv([MaxPool_F(M); AvgPool_F(M)])),

where M denotes the Mel-spectrogram features and A_T represents the generated attention weights for the Mel-spectrogram. This design fully leverages the advantages of different pooling operations: Max Pooling captures salient feature responses, while Average Pooling preserves the overall distribution characteristics. The combination of the two reflects the importance distribution along the time dimension more comprehensively. This design achieves the following three advantages: (1) it retains the salient features and global statistical properties of the spectrogram; (2) it realizes adaptive feature fusion through learnable convolutional parameters; and (3) it generates attention weight distributions with a clear probabilistic interpretation.
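A possible implementation of this Time Attention block is sketched below; the kernel size of the fusion convolution is not given in the paper, so a CBAM-style kernel of 7 is assumed.

```python
import torch
import torch.nn as nn

class TimeAttention(nn.Module):
    """Sketch of the improved Time Attention block (Section 2.2.2).

    kernel_size=7 (as in CBAM's spatial attention) is an assumption.
    """

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Fuse the two pooled statistics (2 channels -> 1 channel) over time.
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (Batch, F, T) log-Mel spectrogram.
        max_pool, _ = mel.max(dim=1, keepdim=True)   # (Batch, 1, T) salient bands
        avg_pool = mel.mean(dim=1, keepdim=True)     # (Batch, 1, T) overall energy
        joint = torch.cat([max_pool, avg_pool], dim=1)   # (Batch, 2, T)
        weights = torch.sigmoid(self.conv(joint))        # (Batch, 1, T) in [0, 1]
        return mel * weights                             # re-weighted time frames


# Example: weight a (Batch, 128, 313) Mel spectrogram.
if __name__ == "__main__":
    mel = torch.randn(4, 128, 313)
    print(TimeAttention()(mel).shape)  # torch.Size([4, 128, 313])
```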
2.2.3. Mobile ATA Classifier
In the field of audio recognition, traditional classifier modules commonly use depthwise separable convolutions from the MobileNet architecture as the basic building units. Although this structure excels at reducing computational complexity by decomposing standard convolutions into depthwise convolutions and pointwise convolutions, it still exhibits significant limitations when processing complex audio time–frequency features: firstly, the channel-wise independence in the depthwise convolution stage results in insufficient cross-channel information interaction, making it difficult to capture critical time–frequency correlation features in audio signals; secondly, the fixed receptive field of the convolution kernels restricts the model’s ability to adapt to multi-scale time–frequency textures; thirdly, the simple stacking of pointwise convolutions lacks a dynamic channel fusion mechanism.
To overcome these limitations, this study innovatively combines the ATA attention mechanism with traditional convolution operations and proposes the Mobile-ATA-classifier module. This module adopts a three-layer cascade refinement design, where each layer consists of a standard feature-learning module and a downsampling module (as shown in
Figure 6a,b). The introduction of the ATA attention mechanism effectively compensates for the shortcomings of depthwise separable convolutions: by establishing attention weights across channels in the full dimension, it enhances the model’s perception of time–frequency features, enabling adaptive focus on key frequency regions. Meanwhile, the hierarchical structure design realizes progressive feature extraction from local to global. This innovative architecture significantly improves the model’s ability to represent complex audio features while maintaining computational efficiency advantages.
In the standard convolutional architecture, the model typically consists of three key components: a 1 × 1 expansion convolution for channel interaction, a depthwise separable convolution (DWConv) responsible for spatial information fusion, and a 1 × 1 convolution for channel fusion to achieve feature recombination. To more effectively capture the texture features of audio spectrograms, we optimized this basic structure as follows: first, a parallel mapping branch was introduced alongside the original spatial convolution layer to enhance multi-scale feature extraction capability; second, the ATA attention module was added before the spatial convolution to strengthen the representation of key texture components in the feature space. Notably, considering that the downsampling operation alters the spatial structure of the feature maps, the ATA attention module in the downsampling layer was moved after the spatial convolution, enabling it to more accurately learn the spatial texture features of the downsampled audio.
This improved design allows the model to better focus on critical texture information in audio signals, thereby enhancing the effectiveness of feature representation.
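As a rough illustration of how such a feature-learning layer could be assembled, the sketch below combines an expansion convolution, a compact stand-in for the ATA weight generator, a depthwise spatial convolution with a parallel mapping branch, and a projection convolution. The expansion ratio, kernel sizes, fusion by addition, and the residual connection are all assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def ata_weights(channels: int) -> nn.Sequential:
    """Compact stand-in for the ATA weight generator of Section 2.1
    (depthwise + pointwise convolution, no normalization)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
        nn.Conv2d(channels, channels, 1),
    )

class MobileATABlock(nn.Module):
    """Hedged sketch of one Mobile-ATA feature-learning layer (Figure 6a)."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(nn.Conv2d(channels, hidden, 1),
                                    nn.BatchNorm2d(hidden), nn.ReLU6())
        self.ata = ata_weights(hidden)               # texture attention before DWConv
        self.spatial = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1,
                                               groups=hidden),
                                     nn.BatchNorm2d(hidden), nn.ReLU6())
        self.mapping = nn.Conv2d(hidden, hidden, 1)  # assumed parallel mapping branch
        self.project = nn.Sequential(nn.Conv2d(hidden, channels, 1),
                                     nn.BatchNorm2d(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.expand(x)
        h = self.ata(h) * h                          # texture re-weighting
        h = self.spatial(h) + self.mapping(h)        # spatial conv + mapping branch
        return x + self.project(h)                   # residual connection (assumed)
```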
In traditional convolutional architectures, as shown in
Figure 6c,d, convolutional layers and downsampling convolutional layers are typically connected by two consecutive 1 × 1 convolutions. This fixed connection pattern causes channel information to flow along predefined paths, limiting the diversity of feature interactions. To enhance the dynamic fusion capability of channel information, this work introduces a channel reorganization method.
As shown in
Figure 7, we evenly divide the input feature channels into g groups, with each group containing n channels, resulting in a total of g × n channels. As a concrete example, assume the features are divided into four groups (g = 4). The original features can then be represented as

F = [F_1, F_2, F_3, F_4],

where F_1 to F_4 denote subsets of features, each comprising n channels.
Next, the feature recombination operation is performed, which consists of the following three steps:
Reshape: the g × n channels, initially arranged sequentially, are reorganized into a matrix with n rows and g columns.
Transpose: this matrix is transposed, resulting in a new matrix with g rows and n columns.
Flatten: the transposed matrix is then flattened back into a one-dimensional feature representation of size g × n.
After this transformation, the recombined features F′ are obtained. The key insight of this transformation is that it cross-combines channels holding the same positional index across the original groups. Specifically, the new group F′_1 consists of the first channel from each original group, F′_2 consists of the second channel from each group, and so on.
This recombination mechanism effectively establishes dynamic connections across feature groups, enabling deep interaction and information fusion between different feature subsets. The process enhances cross-group feature integration while maintaining computational efficiency.
This channel permutation strategy has several notable advantages: while maintaining computational efficiency, it significantly enhances feature representation by establishing cross-group feature correlations; the dynamic channel reorganization breaks the fixed connection pattern of traditional convolutions, enabling adaptive and more flexible interaction of feature information during propagation at each layer; and by increasing feature diversity, it builds a richer representation space for subsequent feature extraction. Experiments demonstrate that this structured feature reorganization effectively promotes information fusion across different channel features, allowing the model to learn more discriminative and robust channel interaction patterns and thereby improving the overall representational power of the network.
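A standard ShuffleNet-style implementation of the reshape–transpose–flatten recombination is sketched below; the (B, C, F, T) feature layout and the example group count of eight mirror the settings used in this work.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Channel Shuffle in the standard ShuffleNet formulation (Section 2.2.3).

    Channels split into `groups` groups of n channels each are recombined so
    that channels sharing the same index across the original groups are
    brought together.
    """
    batch, channels, freq, time = x.shape
    assert channels % groups == 0, "channel count must be divisible by groups"
    n = channels // groups
    # Reshape: (B, g*n, F, T) -> (B, g, n, F, T)
    x = x.view(batch, groups, n, freq, time)
    # Transpose the group and channel axes: (B, n, g, F, T)
    x = x.transpose(1, 2).contiguous()
    # Flatten back to (B, g*n, F, T)
    return x.view(batch, channels, freq, time)


# Example with the paper's setting of eight groups.
if __name__ == "__main__":
    feats = torch.randn(2, 64, 128, 313)
    print(channel_shuffle(feats, groups=8).shape)  # torch.Size([2, 64, 128, 313])
```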
3. Experiments and Evaluation
In the experiments, we used the development and additional datasets from the DCASE 2020 Challenge Task 2 [
26] to evaluate our model. This dataset consists of parts of the MIMII dataset [
27] and the ToyADMOS dataset [
28]. The MIMII dataset contains four types of machines (i.e., fan, pump, slider, and valve), each with seven different machines. The ToyADMOS dataset includes two machine types (i.e., ToyCar and ToyConveyor), with seven and six different machines, respectively. In the experiments, we used the training data (normal sounds) from the Task 2 development and additional datasets as the training set, and the test data (normal and anomalous sounds) from the development dataset was used for evaluation.
In this study, we employed AUC (Area Under the ROC Curve), partial AUC (pAUC), and minimum AUC (mAUC) as performance evaluation metrics to comprehensively assess the classifier’s performance from different perspectives. AUC, as a conventional metric, reflects the overall ability of the model to distinguish positive and negative samples across all possible classification thresholds; values closer to 1 indicate a better global ranking performance. Considering the strict requirements for low false alarm rates in practical industrial applications, we computed the pAUC within the false positive rate (FPR) range [0, 0.1], which better reflects model performance in critical operating regions. Furthermore, to evaluate the detection stability across different individual machines, we introduced the mAUC metric, defined as the minimum AUC among different machines of the same machine type. This design provides more practical guidance for model performance evaluation.
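As an illustration, the snippet below computes the three metrics with scikit-learn; the dictionary layout and function name are hypothetical, and the FPR bound of 0.1 follows the DCASE 2020 convention adopted here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_machine_type(scores_per_id: dict[str, tuple[np.ndarray, np.ndarray]],
                          max_fpr: float = 0.1) -> dict[str, float]:
    """Illustrative computation of AUC, pAUC, and mAUC for one machine type.

    `scores_per_id` maps each machine ID to (labels, anomaly_scores), where
    labels are 1 for anomalous and 0 for normal clips.
    """
    per_id_auc, per_id_pauc = [], []
    for labels, scores in scores_per_id.values():
        per_id_auc.append(roc_auc_score(labels, scores))
        per_id_pauc.append(roc_auc_score(labels, scores, max_fpr=max_fpr))
    return {
        "AUC": float(np.mean(per_id_auc)),    # average over machine IDs
        "pAUC": float(np.mean(per_id_pauc)),  # partial AUC, FPR in [0, max_fpr]
        "mAUC": float(np.min(per_id_auc)),    # worst machine ID of this type
    }
```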
This study formulates training as a classification task over the 41 classes defined by the combinations of machine type and machine ID, with detailed parameter configurations provided in
Table 1. The experiment adopts the Mobile-ATA classifier architecture, and the channel shuffle operation is configured with eight groups. The time-frequency diagram (TF-gram) feature extraction strictly adheres to the standardized procedure outlined in Reference [
11]. The T-LSTM network employs a three-layer architecture, with both the hidden layer and input layer dimensions uniformly set to 128.
During the audio signal preprocessing stage, all sample data were standardized using a sampling rate of 16 kHz and subjected to a Hamming window function. The parameters for generating the Mel spectrogram are configured as follows: 128 Mel filter banks were utilized, with valve-type data employing a 1024-point FFT (the same configuration was applied by default to other data types), and a fixed frame shift of 512 sampling points.
The AdamW optimizer was employed for parameter optimization during model training, with an initial learning rate of 0.0001 and a cosine annealing learning rate scheduling strategy. The loss function adopts Noise-ArcMix [
10]. The training process maintains a batch size of 64 and runs for 300 epochs. To ensure the reliability and consistency of the experiment, all machine audio data were processed using uniform standards and training configurations, thereby guaranteeing the comparability and reproducibility of the feature extraction and model training processes.
3.1. Performance Comparison
Table 2 presents a comprehensive performance comparison between ATA-MSTF-Net and current mainstream methods, including IDNN, MobileNetV2, Glow_Aff, ST-gram-MFN, CLP-SCF, TASTgram (NAMix), and TFSTgram (OS-SCL). Experimental results show that, compared to the current state-of-the-art TFSTgram (OS-SCL), ATA-MSTF-Net achieves improvements of 0.44% and 0.35% in average AUC and pAUC respectively, setting a new performance benchmark. Although its performance on the Slider and Valve machine categories is on par with the existing best methods, it demonstrates significant advantages across most other machine types. Notably, on the ToyConveyor dataset—where existing models generally perform poorly—ATA-MSTF-Net attains a remarkable 2.2% improvement. This outstanding result not only validates the effectiveness of the proposed method but also highlights its strong generalization ability in complex industrial scenarios. These experimental findings fully demonstrate the advancement and practical value of ATA-MSTF-Net in the field of industrial anomaly detection.
Table 3 presents a detailed comparison of detection performance based on machine ID using the mAUC metric. Notably, the experimental results show that detection among machines of the same type is often the most challenging. Among all tests, ATA-MSTF-Net demonstrates outstanding results on the ToyConveyor dataset, achieving a remarkable 9.89% improvement in mAUC compared to the current best model. This breakthrough strongly highlights the superiority of the proposed model. Although the performance on the Pump, Valve, and ToyCar datasets shows slight gaps compared to the optimal solutions, overall, ATA-MSTF-Net maintains competitive performance across all test datasets. These results convincingly confirm the excellent robustness and generalization capability of ATA-MSTF-Net in cross-machine-type anomaly detection tasks.
Table 4 highlights the number of parameters and the performance of our approach compared with those of the SOTA. Our system offers a good trade-off between model complexity and performance.
In the design of the Mobile-ATA-classifier, we innovatively introduced a Channel Shuffle mechanism to enhance the model’s discriminative ability for multi-source machine anomaly features. Through systematic experiments, we found that this module plays a key role in improving model performance. To further investigate the impact of the channel shuffle mechanism, we conducted an ablation study on the number of groups (g). The experimental results reveal a distinct nonlinear relationship between model performance and the number of groups: as g increases from 2 to 8, classification accuracy steadily improves; however, when g exceeds 16, performance begins to decline. This phenomenon can be explained from the perspective of feature learning:
When the number of groups is small (e.g., g = 2), inter-channel interactions are limited, making it difficult to fully explore the nonlinear correlations among deep features;
At a moderate number of groups (e.g., g = 8), the model can establish richer cross-channel connections between the two convolution layers, achieving enhanced feature expression through moderate feature perturbation;
However, when the number of groups is too large (g > 16), excessive feature perturbation disperses critical feature information, weakening the model’s ability to focus on discriminative features.
The results indicate that setting the number of groups to g = 8 yields the best classification performance, providing important guidance for the parameter configuration of the channel shuffle mechanism (
Table 5).
3.2. Ablation Study
To validate the effectiveness of the proposed modules, we evaluate their impact by modifying the classifier inputs and types. The results are shown in the following table.
This study adopts the TF+mel+T-gram+mobile classifier as the baseline comparison scheme, where the TF module is derived from [
9], and both the T-gram and mobile classifier components are inherited from the ST-gram approach. Integrating these two methods yields the baseline model. The experimental results indicate that the primary performance improvement stems from the introduction of the Mobile-ATA classifier, which demonstrates significant advantages across multiple datasets, with particularly strong results on the ToyConveyor and Pump datasets (
Table 6).
It is worth noting that although the Time Attention mechanism leads to a slight decrease in metrics on the Slider dataset (possibly due to the lack of pronounced periodic characteristics in the Slider’s acoustic signals, where anomalies are widely distributed across time frames rather than concentrated in specific periods, causing the attention mechanism to over-focus on certain time segments and thereby harming detection performance), its improvements on other datasets effectively compensate for this shortcoming. Overall, the experimental data clearly verify the effectiveness and general applicability of the proposed modules in enhancing anomaly detection performance.