1. Introduction
Audio-driven intelligent services—ranging from smart speaker voice assistants and abnormal sound detection in industrial facilities to environmental sound recognition in autonomous vehicles, healthcare monitoring, and urban safety or disaster alert systems—are rapidly proliferating across modern society and diverse industrial sectors. These technologies are regarded not only as enhancements to user experience but also as critical factors directly linked to safety and human life, and the associated market continues to grow steadily [1]. In line with this trend, the international standardization body MPEG (Moving Picture Experts Group) has recently initiated the standardization of Audio Coding for Machines (ACoM) [2].
However, since audio waveforms are inherently one-dimensional time-series signals, there are structural limitations when directly applying convolutional neural networks (CNNs) or vision transformer (ViT)-based architectures, which are primarily optimized for two-dimensional spatial inputs [3,4,5]. To address this issue, a widely adopted approach is to transform audio signals into two-dimensional visual representations—typically through time-frequency conversion methods such as the Short-Time Fourier Transform (STFT) or mel-spectrograms—and subsequently use these representations as inputs to deep learning models [6,7].
This trend is also evident in the DCASE (Detection and Classification of Acoustic Scenes and Events) Challenge, an international benchmark in audio-based AI. Both the official baseline and many top-performing systems adopt spectrogram-based representations, confirming that converting audio signals into images is an effective and widely accepted practice. In this work, we adopt the official Autoencoder baseline model of DCASE 2024 Task 2, while conducting all experiments on the DCASE 2023 Task 2 Development Dataset.
Such spectrogram-based approaches go beyond mere structural transformation, offering the advantage of feature learning grounded in visual properties. Spectrograms encompass diverse visual attributes—such as textures, shapes, and energy bands distributed across the time-frequency plane—and deep learning models leveraging these representations have recently demonstrated human-level or even superior accuracy across various tasks, thereby validating their effectiveness. However, in real-world industrial environments—where audio is collected at the edge and processed in the cloud—network bandwidth limitations and latency constraints often act as bottlenecks that can degrade model performance.
Conventional audio codecs are designed to maximize perceptual quality for human listeners, typically employing perceptual audio coding techniques that discard high-frequency components deemed inaudible [8]. In this process, essential information such as fine-grained frequency details or phase components required by machine learning models may be lost, leading to a trade-off between transmission efficiency and model accuracy [9]. Similarly, conventional video codecs are optimized for human visual perception, often discarding subtle frequency textures or fine boundary details that are imperceptible to humans. This, in turn, can distort the visual cues within spectrograms and adversely affect the performance of machine learning models [10].
Therefore, conventional codecs such as AAC and VVC, while efficient for human perception, are not well-suited for machine listening tasks. This limitation highlights the need for machine-oriented codecs that can preserve subtle spectral cues essential for anomaly detection and other learning-based tasks. At present, however, there is no established codec in the audio domain specifically optimized for machines. Feature Coding for Machines (FCM) has been actively studied and standardized in the image and video domains as part of the VCM (Video Coding for Machines) framework, but its application to audio remains unexplored. In parallel, Audio Coding for Machines (ACoM) has only recently entered the standardization process. Importantly, the current stage of ACoM (Phase 1) is limited to near-lossless signal compression, while feature-level lossy coding is planned for Phase 2. This situation explains the scarcity of prior work in the audio domain and highlights the novelty of our study in addressing this gap.
2. Related Work
2.1. Conventional Audio and Video Compression Methods: AAC (Advanced Audio Coding) and VVC (Versatile Video Coding)
Advanced Audio Coding (AAC), first introduced in 1997 as part of MPEG-2 Part 7, has since become a core component of the MPEG-4 standard and is now widely deployed, implemented in more than ten billion devices worldwide. The AAC family encompasses multiple profiles—including AAC-LC, HE-AAC, HE-AAC v2, and xHE-AAC—supporting a wide range of applications from high-quality streaming at bitrates above 128 kbps to low-bitrate IoT scenarios operating below 24 kbps [11]. All profiles employ perceptual audio coding techniques, maximizing compression efficiency by discarding high-frequency and phase components that are difficult for humans to perceive. However, this design has a known limitation: it may discard the fine-grained frequency patterns and phase information required by machine learning models.
Versatile Video Coding (VVC), established in 2020 by the Joint Video Experts Team (JVET) of ITU-T VCEG and ISO/IEC MPEG, is a next-generation video coding standard that offers improved compression efficiency over HEVC and supports a wide range of applications, including 8K UHD, HDR/WCG, and 360-degree video. However, VVC is likewise designed with optimization for human visual quality, which may result in the loss of delicate frequency patterns or fine edge details within spectrograms. In particular, this study does not process raw audio directly; instead, one-dimensional audio signals are transformed into spectrograms and treated as two-dimensional visual data. For this reason, VVC, the latest video coding standard, was selected as the comparative anchor. In summary, AAC was considered the representative baseline for perceptual audio compression, while VVC served as the representative baseline for spectrogram compression in the video domain. Using these two conventional approaches, we analyzed their limitations from the perspective of machine perception performance.
Figure 1 illustrates how these differences manifest in practice. The original log-mel spectrogram preserves fine spectral cues across the full frequency band. After AAC-LC encoding (32 kbps), high-frequency components and subtle spectral details are visibly attenuated, reflecting the perceptual coding strategy that removes content inaudible to humans. By contrast, VVC compression introduces distortions in the time-frequency textures and blurs fine boundaries, since it is optimized for human visual quality rather than for preserving anomaly-related structures. These visual differences demonstrate why conventional codecs, while effective for human-centered consumption, may be inadequate for machine perception tasks.
Consequently, AAC and VVC can be regarded as conventional compression techniques optimized for human-centered consumption of audio and video signals, respectively, but they fail to sufficiently preserve the information required for machine perception. This structural trade-off between performance and bandwidth efficiency in practical edge-cloud audio analysis systems underscores the limitations of existing codecs and is accelerating the transition toward next-generation standards such as ACoM.
2.2. Audio Coding for Machines (ACoM)
Audio Coding for Machines (ACoM) represents a novel standardization approach that compresses and transmits audio data not on the basis of human perceptual quality but rather according to the analytical performance of machine learning models [12]. This represents an attempt to overcome the structural limitation of conventional audio codecs, which discard certain information for perceptual optimization but consequently also eliminate critical feature information required by machines.
The ACoM standard is structured around a two-phase roadmap [13]. Phase 1 aims to define a common format for audio data compression—either lossless or near-lossless—suitable for dataset exchange and long-term preservation. This phase focuses on improving storage efficiency while preserving the quality of diverse audio sources, including medical acoustics, spatial audio, and sensor-based recordings.
Phase 2 adopts a feature-centric, lossy coding approach in which intermediate feature maps utilized within deep learning-based recognition models are extracted, quantized, compressed, and transmitted, rather than encoding the entire signals. This approach minimizes recognition performance degradation even at low bitrates and is evaluated as being particularly advantageous for real-time analysis in edge-cloud environments.
Accordingly, this study draws upon the concept of feature-based compression proposed in ACoM Phase 2 and explores techniques for efficiently transmitting the intermediate features of audio recognition models. This aligns with ACoM’s technical objective of enhancing transmission efficiency while preserving machine learning performance, and it experimentally supports the structural shift envisioned in next-generation audio coding standards.
MPEG is currently preparing a Call for Proposals (CfP) for the formal standardization of ACoM, and once evaluation metrics and reference implementations are established, approaches such as the one presented in this study may serve as candidate technologies for the standard.
2.3. Video Coding for Machines (VCM)
Video Coding for Machines (VCM) is an ISO/IEC MPEG standardization initiative that shifts the objective of video compression away from human viewing quality toward the optimization of machine perception tasks such as object detection, tracking and classification. In contrast to conventional codecs such as VVC and HEVC, it is conceptually distinguished by defining compression schemes tailored for data consumed directly by machines rather than by humans. In particular, it originated from the awareness that conventional video codecs, being designed around visual quality criteria, inevitably discard fine-grained feature information that is crucial for machine learning.
VCM can be broadly categorized into ROI-based pixel compression and intermediate feature map-based coding. The first track, ROI (Region of Interest)-based coding, enhances transmission efficiency by first detecting regions critical for object detection and tracking, then applying lower quantization parameters (QPs) to those areas while assigning higher QPs to less important background regions. By concentrating the bitstream on regions critical for recognition, this method has been reported to maintain key performance metrics—such as mean Average Precision (mAP) and Multiple Object Tracking Accuracy (MOTA)—even at lower bitrates compared to VVC.
The second track, Feature Coding for Machines (FCM), compresses and transmits intermediate feature maps extracted from deep learning networks, rather than the entire input images. The representative reference model, FCTM-7, eliminates less significant channels from multi-channel features, normalizes their distribution through statistical non-linear transformations, rearranges them into a two-dimensional grid, and finally encodes them using VVC. At the decoder side, hierarchical feature structures are reconstructed through processes such as unpacking, dequantization, and distribution correction networks.
In this way, VCM seeks to balance transmission efficiency and recognition performance through its two complementary approaches with distinct input domains. ROI-based compression is well-suited for applications where visual fidelity is critical, whereas FCM-based compression offers superior efficiency in edge-cloud environments constrained by bandwidth and requiring real-time processing.
At present, however, there are no published FCM methods specifically designed for the audio domain, which makes direct comparison difficult. Therefore, in this study we benchmark our proposed audio-oriented FCM approach against two state-of-the-art baselines: AAC-LC as the representative perceptual audio codec, and VVC as the state-of-the-art codec for spectrogram (image-like) compression. This comparison allows us to evaluate the effectiveness of the proposed framework relative to the strongest existing methods in both audio and visual coding domains.
2.4. Anomalous Sound Detection
The DCASE (Detection and Classification of Acoustic Scenes and Events) Challenge serves as an international benchmark for advancing audio-based artificial intelligence, providing standardized datasets and baseline models for a wide range of audio recognition tasks. In particular, DCASE 2024 Task 2 addresses the problem setting of first-shot unsupervised anomalous sound detection, aiming to identify anomalies in unseen machine types using only limited normal data. The official baseline model is based on an Autoencoder: input audio is first transformed into log-mel spectrograms through STFT and mel-filterbank processing, and then compressed and reconstructed by the Autoencoder. Anomaly scores are subsequently computed using reconstruction errors and the Mahalanobis distance. The baseline has proven effective in capturing complex frequency patterns and subtle acoustic anomalies, and was therefore adopted as a comparative reference in this study.
In this study, we build upon the baseline Autoencoder model by applying input data processed through different compression and reconstruction pathways—namely AAC, VVC, and FCM-based ACoM—within the same anomaly detection pipeline. The impact of each method on anomaly detection performance is quantitatively compared and evaluated. Through this analysis, we examine the structural trade-off between transmission efficiency and recognition accuracy, and discuss the necessity of audio coding schemes optimized for machine perception.
3. Method
This section discusses three audio data transmission approaches for anomalous sound detection: conventional perceptual audio coding using AAC, spectrogram image compression with VVC, and the proposed feature map compression based on FCM. Because these methods differ in their compression targets, processing pipelines, and information loss characteristics, a comparative analysis enables us to examine the correlation between transmission efficiency and anomaly detection performance. We deliberately selected these three approaches because they represent distinct compression principles: AAC-LC as a widely deployed perceptual audio codec, VVC as the state-of-the-art video codec applied to spectrograms, and FCM-based ACoM as a machine-oriented feature compression scheme. This choice allows us to focus on the effect of compression strategy itself under the same anomaly detection pipeline, rather than benchmarking anomaly detection algorithms.
3.1. AAC
AAC is a representative perceptual audio coding technology based on the MPEG-2/4 standards. It combines multiple techniques—such as MDCT-based filter banks, perceptual modeling, joint stereo coding, and optimized bit allocation—to achieve high compression efficiency and listening quality. In this study, we adopted the AAC-LC (Low Complexity) profile, which offers low computational complexity and suitability for real-time processing, considering transmission efficiency and standard compatibility in a single-channel, 16-bit, 16 kHz environment. AAC-LC maintains audio quality comparable to the original even at bitrates below 32 kbps, and operates reliably in resource-constrained environments such as IoT devices and mobile platforms.
Figure 2 illustrates the AAC-LC–based anomaly detection pipeline implemented in this study. The original audio signal is first compressed by the AAC encoder, and the resulting bitstream is then reconstructed through the decoder. The reconstructed audio is transformed into a log-mel spectrogram through STFT and mel-filterbank processing, and this spectrogram is then used as the input of the anomaly detection model. Although this approach is optimized for preserving auditory quality, it may result in the loss of subtle anomalous cues required for machine perception; thus, it serves as a baseline for comparison with other methods.
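For reference, this AAC round trip can be reproduced with ffmpeg, the tool used for AAC compression in our experiments (Section 4.3). The sketch below is a minimal example assuming ffmpeg's native AAC-LC encoder; the temporary file name and the 32 kbps default are illustrative choices, not the exact experimental settings.

```python
import subprocess

def aac_roundtrip(wav_in: str, wav_out: str, bitrate: str = "32k") -> None:
    """Encode a mono 16 kHz WAV with AAC-LC, then decode it back to
    16-bit PCM so that log-mel spectrograms can be extracted from it."""
    # Encode: WAV -> AAC-LC bitstream in an M4A container.
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_in, "-c:a", "aac", "-b:a", bitrate,
         "tmp.m4a"], check=True)
    # Decode: back to single-channel 16-bit PCM at 16 kHz.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "tmp.m4a", "-ar", "16000", "-ac", "1",
         "-sample_fmt", "s16", wav_out], check=True)
```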
3.2. VVC
VVC (Versatile Video Coding) is the latest video compression standard, offering approximately 50% bitrate reduction compared to HEVC while maintaining comparable visual quality. In this study, audio data were converted into log-mel spectrograms, normalized into the YUV 4:0:0 10-bit format, and then fed into the VVC encoder. The compressed bitstream is reconstructed by the VVC decoder, and the restored log-mel spectrogram is directly used as the input to the anomaly detection model.
Figure 3 illustrates the VVC-based spectrogram compression pipeline. In this process, the spectrogram is treated as image data, enabling high compression efficiency; however, since the compression is optimized for human visual quality, fine-grained frequency patterns and boundary information critical for machine perception may be lost. Nevertheless, it serves as an important comparative baseline in this study, as it represents the upper bound of performance achievable with state-of-the-art video compression techniques.
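To make the packing step concrete, the sketch below converts a min–max normalized log-mel spectrogram into a raw 10-bit monochrome (YUV 4:0:0) frame that a VVC encoder can consume. The commented EncoderApp call illustrates a typical VTM all-intra invocation; the exact configuration and QP values used in our experiments are not reproduced here, so treat them as assumptions.

```python
import numpy as np

def logmel_to_yuv400_10bit(logmel: np.ndarray, path: str) -> None:
    """Pack a normalized log-mel spectrogram (values in [0, 1]) into a
    single 10-bit monochrome frame stored as 16-bit samples."""
    y = np.round(logmel * 1023.0).astype(np.uint16)  # 10-bit quantization
    y.tofile(path)                                   # raw Y plane, no chroma

# A typical VTM all-intra call for a 313x128 frame (option names follow the
# public VTM EncoderApp; this particular configuration is an assumption):
#   EncoderApp -c encoder_intra_vtm.cfg -i spec.yuv -wdt 313 -hgt 128 \
#       -f 1 -fr 1 --InputChromaFormat=400 --InputBitDepth=10 -q 37 -b spec.bin
```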
3.3. ACoM
Finally, this study extends the Feature Coding for Machines (FCM) standard, currently under development by MPEG, to the task of audio anomaly detection. Unlike conventional audio and video codecs designed for perceptual optimization, FCM selectively compresses and transmits only intermediate feature maps from deep learning models, with the objective of efficiently conveying the information required for machine learning.
Figure 4 presents the overall structure of the proposed FCM-based anomaly detection pipeline. The audio signal is first converted into a log-mel spectrogram, from which intermediate feature maps are extracted in NN-part1; these feature maps are then compressed by the FCM encoder. The encoder first performs a Feature Reduction stage, in which multi-scale features of varying resolutions are normalized and padded to a uniform size, then merged by FENet into a single-scale 320-dimensional feature representation. The structure of FENet is shown in Figure 5, where four parallel encoding blocks complementarily fuse high- and low-resolution information. The output of the final block is passed through a Gain Unit, where it is transformed into a latent vector prepared for quantization. In the subsequent Feature Conversion stage, the vector is packed into frames, normalized using min–max scaling, quantized into 10-bit integers, and converted into the YUV format. Finally, in the Inner Codec stage, a VVC-based intra-mode codec generates the final bitstream.
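As a minimal sketch of the Feature Conversion stage, the function below tiles the channels of a feature tensor into one 2D frame, min–max normalizes it, and quantizes it to 10-bit integers ready for YUV packing. The tiling layout and function name are illustrative assumptions; the FCM reference software implements these steps with its own conventions.

```python
import numpy as np

def pack_and_quantize(feat: np.ndarray, cols: int = 20):
    """Tile a (C, H, W) feature tensor into a single 2D frame and
    quantize it to 10 bits. Returns the packed frame plus the
    (vmin, vmax) pair the decoder needs for dequantization."""
    c, h, w = feat.shape
    rows = int(np.ceil(c / cols))
    frame = np.zeros((rows * h, cols * w), dtype=np.float32)
    for i in range(c):                       # raster-scan channel placement
        r, q = divmod(i, cols)
        frame[r * h:(r + 1) * h, q * w:(q + 1) * w] = feat[i]
    vmin, vmax = float(frame.min()), float(frame.max())
    qframe = np.round((frame - vmin) / (vmax - vmin + 1e-12) * 1023.0)
    return qframe.astype(np.uint16), (vmin, vmax)
```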
The FCM decoder performs this process in reverse. The decoded feature maps progressively recover their resolution through DRNet, the structure of which is presented in Figure 6. DRNet consists of an Inverse Gain Unit and four decoding blocks, each performing upsampling at scales of 2, 4, 8, and 16. At each stage, a Feature Mixing Block is employed to fuse lower-level features during higher-resolution reconstruction, thereby minimizing the loss of fine details. More specifically, in the Inverse Gain Unit, each feature channel of the decoded latent vector is multiplied by the corresponding inverse gain value to recover the original dynamic range. The resulting output is sequentially passed through the four decoding blocks, where each block includes an upsampling layer and a Feature Adaptation Block that refines spatial and contextual information. Through these progressive decoding and fusion operations, a multi-scale feature hierarchy is reconstructed, producing a high-resolution feature map that closely resembles the original spectrogram. Finally, the reconstructed feature map is fed into our ASD model for anomaly detection rather than being passed to NN-part2.
The core idea of the proposed FCM approach is to use the intermediate feature maps of deep learning models—rather than raw waveforms or full spectrograms—as the transmission unit, thereby reducing the amount of data that must be transferred and maximizing transmission efficiency. This enables the preservation of anomaly detection performance even under low-bitrate conditions, offering significant advantages particularly in industrial IoT environments with constrained transmission bandwidth.
From a practical deployment perspective, this workflow can be implemented in a typical edge-cloud collaboration setting. In industrial IoT scenarios, lightweight encoding is performed on edge devices to minimize local computational load and bandwidth usage. The compressed bitstream is then transmitted to cloud servers, where the anomaly detection model is executed. This architecture enables real-time monitoring while reducing communication costs. Furthermore, with model compression or lightweight feature extraction networks, part of the anomaly detection pipeline could be shifted onto edge devices in future implementations to support ultra-low latency requirements.
4. Training Configuration
In this chapter, the three compression methods described in Section 3 are applied to the actual anomaly detection pipeline, and the training procedures and experimental settings are detailed. The overall workflow of the proposed system proceeds from raw audio input, through codec-specific compression and reconstruction, to the Autoencoder and anomaly scoring, as illustrated in Figure 7. In particular, since the encoding method of the input signal introduces structural differences, both the preprocessing of training data and the model learning pathway vary accordingly.
In this study, four input pathways were established for comparison. The four pathways are as follows: (1) the Raw path, which directly uses preprocessed original audio; (2) the AAC path, which employs audio encoded and decoded by AAC-LC; (3) the VVC path, which converts log-mel spectrograms into image formats and compresses them using VVC; and (4) the FCM-ACoM path, which compresses and reconstructs intermediate feature maps extracted from NN-part1 through the FCM framework.
All pathways employ the same Autoencoder-based anomaly detection network; however, since the characteristics of the input data differ, each was trained and evaluated independently.
4.1. Dataset and Preprocessing
The experiments were conducted using the DCASE 2023 Challenge Task 2 Development Dataset [14,15]. This study focuses on four machine types—Fan, Slider, ToyCar, and Valve—each represented by approximately 10 s mono audio clips sampled at 16 kHz with 16-bit resolution. For training and evaluation, the entire dataset was utilized without distinguishing between source and target domains. For each class, 1000 normal samples are provided for training, along with 100 normal and 100 anomalous samples for testing. The detailed composition is summarized in Table 1.
All audio signals were resampled to 16-bit PCM at 16 kHz, after which log-mel spectrograms were generated through the following preprocessing steps.
STFT Transformation: Frame size of 1024 points (64 ms), hop size of 512 samples, with a Hamming window applied.
Mel Filterbank: Applied a 128-channel mel filter to transform the spectrogram onto the mel scale.
Log-Scale Transformation: Applied logarithmic scaling to the amplitude spectrum to convert it into the dB scale.
Normalization: Performed min–max normalization on each clip, scaling the values to the range [0, 1].
Finally, two-dimensional log-mel spectrograms of size 128 × 313 were used as inputs to the model. This preprocessing pipeline clearly represents frequency–time patterns, providing inputs well-suited for vision-based deep learning models to learn features relevant to anomaly detection.
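A minimal preprocessing sketch, assuming librosa for the STFT and mel filterbank, is given below; the function name and the per-clip min–max normalization mirror the four steps listed above.

```python
import librosa
import numpy as np

def audio_to_logmel(path: str, sr: int = 16000, n_fft: int = 1024,
                    hop: int = 512, n_mels: int = 128) -> np.ndarray:
    """Convert a mono clip into a min-max normalized log-mel spectrogram.
    A 10 s clip at 16 kHz yields 313 frames, i.e., a 128 x 313 input."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hamming", n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)          # dB scale
    logmel = (logmel - logmel.min()) / (logmel.max() - logmel.min() + 1e-8)
    return logmel.astype(np.float32)                       # shape (128, 313)
```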
4.2. Anomaly Scoring and Decision Making
During testing, the reconstruction error $e$ and the latent vector $z$ are computed for the input spectrogram. These two signals are normalized and combined with weights to form the anomaly score, which is then used to determine anomalous conditions. The overall procedure is as follows.
4.2.1. Reconstruction Error
Let the input log-mel spectrogram be $X \in \mathbb{R}^{F \times T}$ and the reconstructed output of the Autoencoder be $\hat{X} \in \mathbb{R}^{F \times T}$. The reconstruction error is defined as

$$e = \frac{1}{FT}\,\big\lVert X - \hat{X} \big\rVert_F^2, \qquad (1)$$

where $F$ denotes the number of mel channels (128), $T$ the number of frames (313), and $\lVert \cdot \rVert_F$ the Frobenius norm. Equation (1) represents the mean squared error (MSE) averaged over all time-frequency bins, where higher values indicate greater reconstruction difficulty.
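In code, Equation (1) reduces to a mean over all time-frequency bins; a minimal NumPy version follows.

```python
import numpy as np

def reconstruction_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Eq. (1): squared Frobenius norm of (X - X_hat) divided by F*T,
    i.e., the MSE over all bins of the F x T log-mel spectrogram."""
    return float(np.mean((x - x_hat) ** 2))
```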
4.2.2. Latent Vectors and Mahalanobis Distance
Let the latent vector obtained from the bottleneck layer of the Autoencoder be $z \in \mathbb{R}^{d}$ (if a latent map is used, it is vectorized via spatial average pooling). From the latent distribution of the normal training samples, the mean $\mu$ and covariance $\Sigma$ are estimated. For numerical stability, we use $\Sigma_{\epsilon} = \Sigma + \epsilon I$, and compute the Mahalanobis distance as

$$d_M(z) = \sqrt{(z - \mu)^{\top}\, \Sigma_{\epsilon}^{-1}\, (z - \mu)}. \qquad (2)$$
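A minimal NumPy sketch of this estimation and distance computation follows; the regularization constant (1e-6 here) is an assumption, since the exact value of $\epsilon$ is not specified above.

```python
import numpy as np

def fit_latent_stats(z_train: np.ndarray, eps: float = 1e-6):
    """Estimate mean/covariance of normal latent vectors (rows of the
    (N, d) array z_train) and regularize the covariance diagonal so
    that the inverse in Eq. (2) is numerically stable."""
    mu = z_train.mean(axis=0)
    sigma = np.cov(z_train, rowvar=False) + eps * np.eye(z_train.shape[1])
    return mu, np.linalg.inv(sigma)

def mahalanobis(z: np.ndarray, mu: np.ndarray, sigma_inv: np.ndarray) -> float:
    """Eq. (2): Mahalanobis distance of a latent vector z."""
    d = z - mu
    return float(np.sqrt(d @ sigma_inv @ d))
```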
4.2.3. Score Normalization
Since $e$ and $d_M$ are on different scales, both are normalized to $[0, 1]$ using percentile-based min–max calibration derived from the validation set, as follows:

$$\tilde{e} = \operatorname{clip}\!\left(\frac{e - P_{q_{\mathrm{lo}}}(e)}{P_{q_{\mathrm{hi}}}(e) - P_{q_{\mathrm{lo}}}(e)}\right), \qquad \tilde{d} = \operatorname{clip}\!\left(\frac{d_M - P_{q_{\mathrm{lo}}}(d_M)}{P_{q_{\mathrm{hi}}}(d_M) - P_{q_{\mathrm{lo}}}(d_M)}\right), \qquad (3)$$

where $P_q(\cdot)$ denotes the $q$-th percentile and $\operatorname{clip}(\cdot)$ represents the operation of saturating values within the range [0, 1]. The use of percentiles enhances robustness against outliers.
4.2.4. Score Fusion
The final anomaly score is defined as

$$s = \alpha\, \tilde{e} + (1 - \alpha)\, \tilde{d}, \qquad (4)$$

where $\alpha \in [0, 1]$ controls the weighting between the normalized reconstruction error and the normalized Mahalanobis distance. The weighting factor $\alpha$ is selected via grid search on the validation set. Since reconstruction difficulty ($\tilde{e}$) and distributional deviation ($\tilde{d}$) are generally complementary, $\alpha$ can be optimized differently depending on the dataset and codec pathway.
4.2.5. Thresholding and Binary Decision
The threshold $\tau$ is chosen on the validation set to maximize the F1-score, and binary decisions are made as

$$\hat{y} = \begin{cases} 1 \ (\text{anomalous}), & s \geq \tau, \\ 0 \ (\text{normal}), & s < \tau, \end{cases} \qquad (5)$$

where $\hat{y}$ denotes the predicted class label. In this study, considering the domain generalization setting of DCASE, a single threshold $\tau$ is employed for each machine type (i.e., shared across source and target domains).
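Putting Equations (3)–(5) together, the sketch below implements the normalization, fusion, and threshold-selection steps; the percentile bounds, the default weight, and the candidate threshold grid are placeholders to be calibrated on the validation set as described above.

```python
import numpy as np

def percentile_norm(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Eq. (3): percentile-based min-max calibration clipped to [0, 1]."""
    return np.clip((x - lo) / (hi - lo + 1e-12), 0.0, 1.0)

def fuse_scores(e, d, e_bounds, d_bounds, alpha=0.5):
    """Eq. (4): weighted fusion of the normalized reconstruction error
    and Mahalanobis distance (alpha = 0.5 is a placeholder)."""
    e_n = percentile_norm(e, *e_bounds)
    d_n = percentile_norm(d, *d_bounds)
    return alpha * e_n + (1.0 - alpha) * d_n

def select_threshold(scores: np.ndarray, labels: np.ndarray, taus):
    """Eq. (5): choose the threshold maximizing F1 on the validation set."""
    best_tau, best_f1 = taus[0], -1.0
    for tau in taus:
        pred = scores >= tau
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1
```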
4.2.6. Evaluation Metrics
The primary evaluation metrics are AUC and F1-score. AUC reflects classification performance independent of the threshold, whereas F1-score captures the balance between precision and recall at the selected threshold. In reporting, we compare not only the performance across machine types but also the trends observed under different pathways (Raw, AAC, VVC, and FCM-ACoM) and varying bitrates.
4.2.7. Implementation Details
Covariance Estimation: For numerical stability when computing $\Sigma_{\epsilon}^{-1}$, a small positive $\epsilon$ is added to the diagonal of $\Sigma$. When the number of samples is small relative to the latent dimensionality, applying shrinkage (e.g., Ledoit–Wolf) also provides stability.
Clip-Level Scoring: Equations (1) and (2) assume clip-level score computation. When frame-level scores are required, temporal averaging or top-percentile aggregation (e.g., averaging the top-$k$% of frame scores) is applied.
Validation Split: Normal training samples are partitioned into training and validation subsets to estimate the normalization bounds in Equation (3), the weighting factor $\alpha$, and the threshold $\tau$. If a separate validation set containing anomalous samples is available, it is used for F1-score optimization.
4.3. Model Training Setup
The model employs the symmetric 2D convolutional Autoencoder architecture described in Section 3, with only the data transformation process differing according to the input pathway. In the Raw pathway, the original audio waveform is directly converted into log-mel spectrograms and fed into the network. In the AAC pathway, audio data are first compressed and reconstructed using the AAC-LC codec, after which the restored signals are transformed into log-mel spectrograms for input to the network. In the VVC pathway, spectrograms are converted into an image format, compressed and reconstructed using VVC, and the results are used as input. In the FCM-ACoM pathway, intermediate feature maps extracted from NN-part1 are compressed and reconstructed with a VVC-based codec; rather than being passed on to NN-part2, the restored feature maps are used as inputs to the anomaly detection model.
The training configuration was kept identical across all pathways. Optimization was performed using the Adam optimizer, with the learning rate adjusted by a cosine annealing scheduler with a cycle of 20 epochs. The batch size was set to 64, with a maximum of 100 training epochs, and early stopping was applied based on the validation loss. The loss function was mean squared error (MSE); however, for the FCM pathway, an additional regularization term was included to stabilize the latent space. All experiments were conducted on a Lenovo ThinkStation equipped with an NVIDIA RTX A6000 GPU (Lenovo Group Ltd., Beijing, China; NVIDIA Corporation, Santa Clara, CA, USA). For compression, ffmpeg was used for AAC, while the internal codec based on VTM 12.0 was employed for FCM.
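A compact training-loop sketch matching this configuration is shown below, assuming PyTorch; the initial learning rate (1e-3) and early-stopping patience are assumptions, since only the scheduler cycle, batch size, and epoch budget are specified above.

```python
import torch
from torch import nn

def train(model, train_loader, val_loader, device="cuda", patience=10):
    """Adam + cosine annealing (20-epoch cycle), MSE reconstruction loss,
    up to 100 epochs with early stopping on the validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=20)
    loss_fn = nn.MSELoss()
    best_val, wait = float("inf"), 0
    for epoch in range(100):
        model.train()
        for x in train_loader:               # x: (B, 1, 128, 313) log-mels
            x = x.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), x)      # reconstruction objective
            loss.backward()
            opt.step()
        sched.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(b.to(device)), b.to(device)).item()
                      for b in val_loader) / len(val_loader)
        if val < best_val:                   # early stopping bookkeeping
            best_val, wait = val, 0
        else:
            wait += 1
            if wait >= patience:
                break
```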
4.4. Inference and Evaluation Strategy
In the testing phase, the trained Autoencoder computes both the reconstruction error of the input and the Mahalanobis distance of its latent vector. These values are normalized, combined through a weighted average, and used to produce the final anomaly score. The threshold is set at the point that maximizes the F1-score on the validation set, and binary classification is then performed based on this criterion. Performance evaluation was conducted using AUC and F1-score to compare the anomaly detection performance across the four input pathways.
5. Experimental Results
In this chapter, we compare the anomalous sound detection (ASD) performance of the proposed ACoM method with that of the baseline pathways—AAC, VVC, and Raw—using the DCASE 2023 Task 2 dataset. All experiments were conducted by training independent models for each of the four machine types—Fan, Valve, ToyCar, and Slider—and performance was evaluated using the F1-score metric. In particular, we analyzed how detection performance varies with bitrate under each compression method, thereby validating their efficiency in bandwidth-constrained transmission environments. The overall F1-score results across all machine types and compression methods are summarized in Table 2.
5.1. Fan
As shown in Figure 8 and Table 2, the Raw data achieved an F1-score of 0.6754 for the Fan type. The proposed ACoM achieved an F1-score of 0.655 even at the lowest bitrate (1.3 kbps), outperforming AAC at 4 kbps (0.634). When the bitrate was increased to 1.7, 3.3, and 6.3 kbps, ACoM achieved F1-scores of 0.660, 0.664, and 0.667, respectively, approaching the performance level of the Raw data. This indicates that critical anomaly patterns can be preserved even under extremely low-bitrate conditions. In contrast, AAC achieved an F1-score of 0.634 at 4 kbps, confirming that its performance degrades at lower bitrates. VVC achieved an F1-score of 0.613 at 401.9 kbps and 0.621 at 95.5 kbps, both lower than the Raw baseline even at high bitrates. In particular, at 45.4 kbps, performance dropped markedly to an F1-score of 0.590.
5.2. Valve
As illustrated in Figure 9, the Raw input achieved an F1-score of 0.6654, while ACoM reached a comparable score of 0.667 at 6.3 kbps. Even at lower bitrates of 3.3, 1.7, and 1.3 kbps, ACoM achieved F1-scores of 0.654, 0.592, and 0.571, respectively, thereby maintaining a reasonable level of performance under low-bitrate conditions. AAC achieved its highest performance of 0.667 at 16 kbps, whereas at the lower bitrate of 4 kbps its performance dropped to 0.641. VVC achieved an F1-score of 0.664 at 401.9 kbps, nearly matching the Raw input, but its performance dropped to 0.557 at 95.5 kbps, indicating that it remains competitive primarily in high-bitrate settings.
5.3. ToyCar
As illustrated in Figure 10, the Raw input achieved an F1-score of 0.6619. ACoM recorded its highest performance of 0.641 at 1.7 kbps, while maintaining stable levels of 0.624 and 0.629 at 3.3 and 6.3 kbps, respectively. Notably, even at an ultra-low bitrate of 0.9 kbps, ACoM achieved a score of 0.605, demonstrating its effectiveness in low-bandwidth environments. AAC achieved relatively strong performance with a score of 0.632 at 16 kbps; however, its performance dropped sharply to 0.524 at 24 kbps. VVC achieved a score of 0.620 at 401.9 kbps, but its performance decreased to 0.580 at 25.4 kbps.
5.4. Slider
Among the four machine types, Slider yielded the most challenging results. As illustrated in Figure 11, the Raw input achieved a relatively high performance of 0.6437, whereas ACoM showed a sharp performance degradation at lower bitrates, recording 0.601 at 3.3 kbps, 0.569 at 1.7 kbps, 0.469 at 1.3 kbps, and 0.465 at 0.9 kbps. AAC achieved its best performance of 0.667 at both 4 kbps and 32 kbps, outperforming the proposed method on this dataset. This can be attributed to the subtle differences between normal and anomalous spectrograms in the Slider dataset and their monotonous pattern variations, suggesting that information lost during feature map compression had a significant impact on detection performance. VVC underperformed the Raw baseline across all bitrate ranges, with performance dropping to as low as 0.363 at 45.4 kbps.
In addition, Figure 11 shows only three data points for the ACoM curve, which is due to the x-axis being plotted on a logarithmic scale ranging from 1 to 1000 kbps. The data point at 0.9 kbps lies at the far left of the axis range and may appear overlapped or invisible in the plot; however, it actually corresponds to an F1-score of 0.465.
5.5. Comprehensive Analysis
Table 2 summarizes the F1-scores and corresponding bitrates for the four machine types. Overall, ACoM achieved comparable or superior performance to AAC in the Fan, Valve, and ToyCar datasets, while operating at less than half the bitrate. VVC approached the Raw-level performance in high-bitrate settings (above 200 kbps), but in low-bitrate ranges it failed to preserve fine patterns essential for machine perception, resulting in degraded performance. For the Slider dataset, due to the characteristics of the data and the limitations of feature compression, AAC achieved the highest performance, while both ACoM and VVC suffered substantial performance degradation at lower bitrates.
Consequently, the proposed ACoM demonstrates the potential to achieve both high transmission efficiency and stable performance in anomalous sound detection under low-bitrate conditions. VVC approached the original-level performance in high-bitrate settings, but at low bitrates it exhibited limitations in retaining the fine details necessary for machine perception. It is expected that retraining the FCM module with audio-specific adaptations will help reduce the performance gap, even on challenging datasets such as Slider.
6. Conclusions
In this study, we proposed an ACoM framework based on FCM, which contrasts with conventional signal-based compression methods (AAC and VVC), to enhance transmission efficiency for anomalous sound detection (ASD) systems under low-bitrate conditions. Unlike conventional approaches that encode the entire raw signal, the proposed method compresses and transmits only the intermediate feature maps of the deep learning model, thereby delivering the essential information required for machine perception more efficiently. To this end, we conducted a comparison across four input pathways—Raw, AAC, VVC, and ACoM—using the DCASE 2023 Task 2 dataset. Independent models were trained for each machine type, and performance was evaluated based on F1-score and bitrate.
Experimental results show that ACoM achieved equal or higher F1-scores than AAC in the Fan, Valve and ToyCar machine types, while operating at less than half the bitrate. In particular, ACoM achieved an F1-score of 0.655 at 1.3 kbps for Fan and recorded its highest performance at 6.3 kbps for Valve and 1.7 kbps for ToyCar. This suggests that critical anomalous sound patterns can be reliably preserved even under ultra-low bitrate conditions. In contrast, for the Slider dataset, the differences between normal and anomalous signals in the spectrogram were subtle, making the impact of information loss during FCM-based feature compression more pronounced; consequently, AAC demonstrated superior performance. This phenomenon can be attributed to the fact that FCM was applied without an audio-specific design, suggesting high potential for improvement through retraining and architectural optimization of the module in future work.
Moreover, across the experiments, increasing the bitrate generally led to gradual improvements in F1-score; however, in certain ranges, slight drops or fluctuations in performance were observed. This can be attributed to the fact that the evaluation in this study was based on binary decisions rather than continuous scores, whereby changes in classification boundaries near the threshold directly affected the F1-score. Such variability becomes particularly pronounced when the output distribution of the model changes slightly between adjacent bitrate settings.
VVC-based spectrogram compression achieved performance comparable to Raw in high-bitrate regions; however, below 100 kbps, it exhibited more pronounced degradation than both AAC and ACoM. This result indicates that VVC, being designed with a focus on human visual quality optimization, has limitations in preserving fine spectrogram patterns that are critical for machine learning. Therefore, the experiments confirmed that rather than merely applying state-of-the-art video compression techniques, it is necessary to employ feature-preserving coding schemes optimized for machine perception.
The contributions of this study can be summarized as follows. First, we quantitatively demonstrated the effectiveness of feature-based compression, moving beyond conventional signal-centric approaches in audio anomaly detection. Second, by significantly reducing the bitrate required to achieve comparable performance, we demonstrated the applicability of the proposed approach to bandwidth-constrained environments such as industrial IoT and remote monitoring. Third, through comparisons with VVC, we identified the limitations of human-centric compression and provided evidence supporting the need for machine-centric standardization.
In future work, we plan to redesign the FCM encoder–decoder architecture to specialize it for the audio domain and to enhance generalization across diverse machine types through multi-domain training. We also intend to explore various quantization and normalization techniques to further push the performance limits in low-bitrate regimes, aiming at optimal bitrate–performance trade-offs. Another important direction will be to comprehensively validate the effectiveness of the proposed approach through real-time transmission and detection experiments in practical industrial environments. In parallel, we will benchmark the framework against more advanced anomaly detection models, such as transformer-based and self-supervised methods, to further validate its generality. Moreover, we aim to investigate hybrid and adaptive compression principles that dynamically combine signal-level, spectrogram-level, and feature-level approaches, with the goal of enhancing robustness and broadening applicability in real-world scenarios. Finally, although the present study was limited to four machine types from the DCASE 2023 Task 2 dataset, future work will extend validation to larger and more diverse datasets, as well as more complex application scenarios such as industrial IoT and real-time monitoring under constrained network conditions.