1. Introduction
Audio-driven intelligent services—ranging from smart speaker voice assistants and abnormal sound detection in industrial facilities to environmental sound recognition in autonomous vehicles, healthcare monitoring, and urban safety or disaster alert systems—are rapidly proliferating across modern society and diverse industrial sectors. These technologies are regarded not only as enhancements to user experience but also as critical factors directly linked to safety and human life, and the associated market continues to grow steadily [1]. In line with this trend, the international standardization body MPEG (Moving Picture Experts Group) has recently initiated the standardization of Audio Coding for Machines (ACoM) [2].
However, since audio waveforms are inherently one-dimensional time-series signals, there are structural limitations when directly applying convolutional neural networks (CNNs) or vision transformer (ViT)-based architectures, which are primarily optimized for two-dimensional spatial inputs [3,4,5]. To address this issue, a widely adopted approach is to transform audio signals into two-dimensional visual representations—typically through time-frequency conversion methods such as the Short-Time Fourier Transform (STFT) or mel-spectrograms—and subsequently use these representations as inputs to deep learning models [6,7].
This trend is also evident in the DCASE (Detection and Classification of Acoustic Scenes and Events) Challenge, an international benchmark in audio-based AI. Both the official baseline and many top-performing systems adopt spectrogram-based representations, confirming that converting audio signals into images is an effective and widely accepted practice. In this work, we adopt the official Autoencoder baseline model of DCASE 2024 Task 2, while conducting all experiments on the DCASE 2023 Task 2 Development Dataset.
Such spectrogram-based approaches go beyond mere structural transformation, offering the advantage of feature learning grounded in visual properties. Spectrograms encompass diverse visual attributes—such as textures, shapes, and energy bands distributed across the time-frequency plane—and deep learning models leveraging these representations have recently demonstrated human-level or even superior accuracy across various tasks, thereby validating their effectiveness. However, in real-world industrial environments—where audio is collected at the edge and processed in the cloud—network bandwidth limitations and latency constraints often act as bottlenecks that can degrade model performance.
Conventional audio codecs are designed to maximize perceptual quality for human listeners, typically employing perceptual audio coding techniques that discard high-frequency components deemed inaudible [8]. In this process, essential information such as fine-grained frequency details or phase components required by machine learning models may be lost, leading to a trade-off between transmission efficiency and model accuracy [9]. Similarly, conventional video codecs are optimized for human visual perception, often discarding subtle frequency textures or fine boundary details that are imperceptible to humans. This, in turn, can distort the visual cues within spectrograms and adversely affect the performance of machine learning models [10].
Therefore, conventional codecs such as AAC and VVC, while efficient for human perception, are not well-suited for machine listening tasks. This limitation highlights the need for machine-oriented codecs that can preserve subtle spectral cues essential for anomaly detection and other learning-based tasks. At present, however, there is no established codec in the audio domain specifically optimized for machines. Feature Coding for Machines (FCM) has been actively studied and standardized in the image and video domains as part of the VCM (Video Coding for Machines) framework, but its application to audio remains unexplored. In parallel, Audio Coding for Machines (ACoM) has only recently entered the standardization process. Importantly, the current stage of ACoM (Phase 1) is limited to near-lossless signal compression, while feature-level lossy coding is planned for Phase 2. This situation explains the scarcity of prior work in the audio domain and highlights the novelty of our study in addressing this gap.
2. Related Work
2.1. Conventional Audio and Video Compression Methods: AAC (Advanced Audio Coding) and VVC (Versatile Video Coding)
Advanced Audio Coding (AAC), first introduced in 1997 as part of MPEG-2 Part 7, has since become a core component of the MPEG-4 standard and is now widely deployed, implemented in more than ten billion devices worldwide. The AAC family encompasses multiple profiles—including AAC-LC, HE-AAC, HE-AAC v2, and xHE-AAC—supporting a wide range of applications from high-quality streaming at bitrates above 128 kbps to low-bitrate IoT scenarios operating below 24 kbps [11]. All profiles employ perceptual audio coding techniques, maximizing compression efficiency by discarding high-frequency and phase components that are difficult for humans to perceive. However, this design has a known limitation: it may discard the fine-grained frequency patterns and phase information required by machine learning models.
Versatile Video Coding (VVC), established in 2020 by the Joint Video Experts Team (JVET) of ITU-T VCEG and ISO/IEC MPEG, is a next-generation video coding standard that offers improved compression efficiency over HEVC and supports a wide range of applications, including 8K UHD, HDR/WCG, and 360-degree video. However, VVC is likewise designed with optimization for human visual quality, which may result in the loss of delicate frequency patterns or fine edge details within spectrograms. In particular, this study does not process raw audio directly; instead, one-dimensional audio signals are transformed into spectrograms and treated as two-dimensional visual data. For this reason, VVC, the latest video coding standard, was selected as the comparative anchor. In summary, AAC was considered the representative baseline for perceptual audio compression, while VVC served as the representative baseline for spectrogram compression in the video domain. Using these two conventional approaches, we analyzed their limitations from the perspective of machine perception performance.
Figure 1 illustrates how these differences manifest in practice. The original log-mel spectrogram preserves fine spectral cues across the full frequency band. After AAC-LC encoding (32 kbps), high-frequency components and subtle spectral details are visibly attenuated, reflecting the perceptual coding strategy that removes content inaudible to humans. By contrast, VVC compression introduces distortions in the time-frequency textures and blurs fine boundaries, since it is optimized for human visual quality rather than for preserving anomaly-related structures. These visual differences demonstrate why conventional codecs, while effective for human-centered consumption, may be inadequate for machine perception tasks.
Consequently, AAC and VVC can be regarded as conventional compression techniques optimized for human-centered consumption of audio and video signals, respectively, but they fail to sufficiently preserve the information required for machine perception. This structural trade-off between performance and bandwidth efficiency in practical edge-cloud audio analysis systems underscores the limitations of existing codecs and is accelerating the transition toward next-generation standards such as ACoM.
2.2. Audio Coding for Machines (ACoM)
Audio Coding for Machines (ACoM) represents a novel standardization approach that compresses and transmits audio data not on the basis of human perceptual quality but rather according to the analytical performance of machine learning models [12]. This represents an attempt to overcome the structural limitation of conventional audio codecs, which discard certain information for perceptual optimization but consequently also eliminate critical feature information required by machines.
The ACoM standard is structured around a two-phase roadmap [13]. Phase 1 aims to define a common format for audio data compression—either lossless or near-lossless—suitable for dataset exchange and long-term preservation. This phase focuses on improving storage efficiency while preserving the quality of diverse audio sources, including medical acoustics, spatial audio, and sensor-based recordings.
Phase 2 adopts a feature-centric, lossy coding approach in which intermediate feature maps utilized within deep learning-based recognition models are extracted, quantized, compressed, and transmitted, rather than encoding the entire signals. This approach minimizes recognition performance degradation even at low bitrates and is evaluated as being particularly advantageous for real-time analysis in edge-cloud environments.
Accordingly, this study draws upon the concept of feature-based compression proposed in ACoM Phase 2 and explores techniques for efficiently transmitting the intermediate features of audio recognition models. This aligns with ACoM’s technical objective of enhancing transmission efficiency while preserving machine learning performance, and it experimentally supports the structural shift envisioned in next-generation audio coding standards.
MPEG is currently preparing a Call for Proposals (CfP) for the formal standardization of ACoM, and once evaluation metrics and reference implementations are established, approaches such as the one presented in this study may serve as candidate technologies for the standard.
2.3. Video Coding for Machines (VCM)
Video Coding for Machines (VCM) is an ISO/IEC MPEG standardization initiative that shifts the objective of video compression away from human viewing quality toward the optimization of machine perception tasks such as object detection, tracking and classification. In contrast to conventional codecs such as VVC and HEVC, it is conceptually distinguished by defining compression schemes tailored for data consumed directly by machines rather than by humans. In particular, it originated from the awareness that conventional video codecs, being designed around visual quality criteria, inevitably discard fine-grained feature information that is crucial for machine learning.
VCM can be broadly categorized into ROI-based pixel compression and intermediate feature map-based coding. The first track, ROI (Region of Interest)-based coding, enhances transmission efficiency by first detecting regions critical for object detection and tracking, then applying lower quantization parameters (QPs) to those areas while assigning higher QPs to less important background regions. By concentrating the bitstream on regions critical for recognition, this method has been reported to maintain key performance metrics—such as mean Average Precision (mAP) and Multiple Object Tracking Accuracy (MOTA)—even at lower bitrates compared to VVC.
The second track, Feature Coding for Machines (FCM), compresses and transmits intermediate feature maps extracted from deep learning networks, rather than the entire input images. The representative reference model, FCTM-7, eliminates less significant channels from multi-channel features, normalizes their distribution through statistical non-linear transformations, rearranges them into a two-dimensional grid, and finally encodes them using VVC. At the decoder side, hierarchical feature structures are reconstructed through processes such as unpacking, dequantization, and distribution correction networks.
In this way, VCM seeks to balance transmission efficiency and recognition performance through its two complementary approaches with distinct input domains. ROI-based compression is well-suited for applications where visual fidelity is critical, whereas FCM-based compression offers superior efficiency in edge-cloud environments constrained by bandwidth and requiring real-time processing.
At present, however, there are no published FCM methods specifically designed for the audio domain, which makes direct comparison difficult. Therefore, in this study we benchmark our proposed audio-oriented FCM approach against two state-of-the-art baselines: AAC-LC as the representative perceptual audio codec, and VVC as the state-of-the-art codec for spectrogram (image-like) compression. This comparison allows us to evaluate the effectiveness of the proposed framework relative to the strongest existing methods in both audio and visual coding domains.
2.4. Anomalous Sound Detection
The DCASE (Detection and Classification of Acoustic Scenes and Events) Challenge serves as an international benchmark for advancing audio-based artificial intelligence, providing standardized datasets and baseline models for a wide range of audio recognition tasks. In particular, DCASE 2024 Task 2 addresses the problem setting of first-shot unsupervised anomalous sound detection, aiming to identify anomalies in unseen machine types using only limited normal data. The official baseline model is based on an Autoencoder: input audio is first transformed into log-mel spectrograms through STFT and mel-filterbank processing, and then compressed and reconstructed by the Autoencoder. Anomaly scores are subsequently computed using reconstruction errors and the Mahalanobis distance. The baseline has proven effective in capturing complex frequency patterns and subtle acoustic anomalies, and was therefore adopted as a comparative reference in this study.
In this study, we build upon the baseline Autoencoder model by applying input data processed through different compression and reconstruction pathways—namely AAC, VVC, and FCM-based ACoM—within the same anomaly detection pipeline. The impact of each method on anomaly detection performance is quantitatively compared and evaluated. Through this analysis, we examine the structural trade-off between transmission efficiency and recognition accuracy, and discuss the necessity of audio coding schemes optimized for machine perception.
3. Method
This section discusses three audio data transmission approaches for anomalous sound detection: conventional perceptual audio coding using AAC, spectrogram image compression with VVC, and the proposed feature map compression based on FCM. Because these methods differ in their compression targets, processing pipelines, and information loss characteristics, a comparative analysis enables us to examine the correlation between transmission efficiency and anomaly detection performance. We deliberately selected these three approaches because they represent distinct compression principles: AAC-LC as a widely deployed perceptual audio codec, VVC as the state-of-the-art video codec applied to spectrograms, and FCM-based ACoM as a machine-oriented feature compression scheme. This choice allows us to focus on the effect of compression strategy itself under the same anomaly detection pipeline, rather than benchmarking anomaly detection algorithms.
3.1. AAC
AAC is a representative perceptual audio coding technology based on the MPEG-2/4 standards. It combines multiple techniques—such as MDCT-based filter banks, perceptual modeling, joint stereo coding, and optimized bit allocation—to achieve high compression efficiency and listening quality. In this study, we adopted the AAC-LC (Low Complexity) profile, which offers low computational complexity and suitability for real-time processing, considering transmission efficiency and standard compatibility in a single-channel, 16-bit, 16 kHz environment. AAC-LC maintains audio quality comparable to the original even at bitrates below 32 kbps, and operates reliably in resource-constrained environments such as IoT devices and mobile platforms.
Figure 2 illustrates the AAC-LC–based anomaly detection pipeline implemented in this study. The original audio signal is first compressed by the AAC encoder, and the resulting bitstream is then reconstructed through the decoder. The reconstructed audio is transformed into a log-mel spectrogram through STFT and mel-filterbank processing, and this spectrogram is then used as the input of the anomaly detection model. Although this approach is optimized for preserving auditory quality, it may result in the loss of subtle anomalous cues required for machine perception; thus, it serves as a baseline for comparison with other methods.
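For reference, this AAC round trip can be reproduced with ffmpeg, the tool used for AAC compression in our experiments (Section 4.3). The sketch below is a minimal example assuming ffmpeg's native AAC-LC encoder; the temporary file name and the 32 kbps default are illustrative choices, not the exact experimental settings.

```python
import subprocess

def aac_roundtrip(wav_in: str, wav_out: str, bitrate: str = "32k") -> None:
    """Encode a mono 16 kHz WAV with AAC-LC, then decode it back to
    16-bit PCM so that log-mel spectrograms can be extracted from it."""
    # Encode: WAV -> AAC-LC bitstream in an M4A container.
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_in, "-c:a", "aac", "-b:a", bitrate,
         "tmp.m4a"], check=True)
    # Decode: back to single-channel 16-bit PCM at 16 kHz.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "tmp.m4a", "-ar", "16000", "-ac", "1",
         "-sample_fmt", "s16", wav_out], check=True)
```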
3.2. VVC
VVC (Versatile Video Coding) is the latest video compression standard, offering approximately 50% bitrate reduction compared to HEVC while maintaining comparable visual quality. In this study, audio data were converted into log-mel spectrograms, normalized into the YUV 4:0:0 10-bit format, and then fed into the VVC encoder. The compressed bitstream is reconstructed by the VVC decoder, and the restored log-mel spectrogram is directly used as the input to the anomaly detection model.
Figure 3 illustrates the VVC-based spectrogram compression pipeline. In this process, the spectrogram is treated as image data, enabling high compression efficiency; however, since the compression is optimized for human visual quality, fine-grained frequency patterns and boundary information critical for machine perception may be lost. Nevertheless, it serves as an important comparative baseline in this study, as it represents the upper bound of performance achievable with state-of-the-art video compression techniques.
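To make the packing step concrete, the sketch below converts a min–max normalized log-mel spectrogram into a raw 10-bit monochrome (YUV 4:0:0) frame that a VVC encoder can consume. The commented EncoderApp call illustrates a typical VTM all-intra invocation; the exact configuration and QP values used in our experiments are not reproduced here, so treat them as assumptions.

```python
import numpy as np

def logmel_to_yuv400_10bit(logmel: np.ndarray, path: str) -> None:
    """Pack a normalized log-mel spectrogram (values in [0, 1]) into a
    single 10-bit monochrome frame stored as 16-bit samples."""
    y = np.round(logmel * 1023.0).astype(np.uint16)  # 10-bit quantization
    y.tofile(path)                                   # raw Y plane, no chroma

# A typical VTM all-intra call for a 313x128 frame (option names follow the
# public VTM EncoderApp; this particular configuration is an assumption):
#   EncoderApp -c encoder_intra_vtm.cfg -i spec.yuv -wdt 313 -hgt 128 \
#       -f 1 -fr 1 --InputChromaFormat=400 --InputBitDepth=10 -q 37 -b spec.bin
```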
3.3. ACoM
Finally, this study extends the Feature Coding for Machines (FCM) standard, currently under development by MPEG, to the task of audio anomaly detection. Unlike conventional audio and video codecs designed for perceptual optimization, FCM selectively compresses and transmits only intermediate feature maps from deep learning models, with the objective of efficiently conveying the information required for machine learning.
Figure 4 presents the overall structure of the proposed FCM-based anomaly detection pipeline. The audio signal is first converted into a log-mel spectrogram, from which intermediate feature maps are extracted in NN-part1; these feature maps are then compressed by the FCM encoder. The encoder first performs a Feature Reduction stage, in which multi-scale features of varying resolutions are normalized and padded to a uniform size, then merged by FENet into a single-scale 320-dimensional feature representation. The structure of FENet is shown in Figure 5, where four parallel encoding blocks complementarily fuse high- and low-resolution information. The output of the final block is passed through a Gain Unit, where it is transformed into a latent vector prepared for quantization. In the subsequent Feature Conversion stage, the vector is packed into frames, normalized using min–max scaling, quantized into 10-bit integers, and converted into the YUV format. Finally, in the Inner Codec stage, a VVC-based intra-mode codec generates the final bitstream.
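As a minimal sketch of the Feature Conversion stage, the function below tiles the channels of a feature tensor into one 2D frame, min–max normalizes it, and quantizes it to 10-bit integers ready for YUV packing. The tiling layout and function name are illustrative assumptions; the FCM reference software implements these steps with its own conventions.

```python
import numpy as np

def pack_and_quantize(feat: np.ndarray, cols: int = 20):
    """Tile a (C, H, W) feature tensor into a single 2D frame and
    quantize it to 10 bits. Returns the packed frame plus the
    (vmin, vmax) pair the decoder needs for dequantization."""
    c, h, w = feat.shape
    rows = int(np.ceil(c / cols))
    frame = np.zeros((rows * h, cols * w), dtype=np.float32)
    for i in range(c):                       # raster-scan channel placement
        r, q = divmod(i, cols)
        frame[r * h:(r + 1) * h, q * w:(q + 1) * w] = feat[i]
    vmin, vmax = float(frame.min()), float(frame.max())
    qframe = np.round((frame - vmin) / (vmax - vmin + 1e-12) * 1023.0)
    return qframe.astype(np.uint16), (vmin, vmax)
```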
The FCM decoder performs this process in reverse. The decoded feature maps progressively recover their resolution through DRNet, the structure of which is presented in Figure 6. DRNet consists of an Inverse Gain Unit and four decoding blocks, each performing upsampling at scales of 2, 4, 8, and 16. At each stage, a Feature Mixing Block is employed to fuse lower-level features during higher-resolution reconstruction, thereby minimizing the loss of fine details. More specifically, in the Inverse Gain Unit, each feature channel of the decoded latent vector is multiplied by the corresponding inverse gain value to recover the original dynamic range. The resulting output is sequentially passed through the four decoding blocks, where each block includes an upsampling layer and a Feature Adaptation Block that refines spatial and contextual information. Through these progressive decoding and fusion operations, a multi-scale feature hierarchy is reconstructed, producing a high-resolution feature map that closely resembles the original spectrogram. Finally, the reconstructed feature map is fed into our ASD model for anomaly detection rather than being passed to NN-part2.
The core idea of the proposed FCM approach is to use the intermediate feature maps of deep learning models—rather than raw waveforms or full spectrograms—as the transmission unit, thereby reducing the amount of data that must be transferred and maximizing transmission efficiency. This enables the preservation of anomaly detection performance even under low-bitrate conditions, offering significant advantages particularly in industrial IoT environments with constrained transmission bandwidth.
From a practical deployment perspective, this workflow can be implemented in a typical edge-cloud collaboration setting. In industrial IoT scenarios, lightweight encoding is performed on edge devices to minimize local computational load and bandwidth usage. The compressed bitstream is then transmitted to cloud servers, where the anomaly detection model is executed. This architecture enables real-time monitoring while reducing communication costs. Furthermore, with model compression or lightweight feature extraction networks, part of the anomaly detection pipeline could be shifted onto edge devices in future implementations to support ultra-low latency requirements.
4. Training Configuration
In this chapter, the three compression methods described in Section 3 are applied to the actual anomaly detection pipeline, and the training procedures and experimental settings are detailed. The overall workflow of the proposed system proceeds from raw audio input, through codec-specific compression and reconstruction, to the Autoencoder and anomaly scoring, as illustrated in Figure 7. In particular, since the encoding method of the input signal introduces structural differences, both the preprocessing of training data and the model learning pathway vary accordingly.
In this study, four input pathways were established for comparison. The four pathways are as follows: (1) the Raw path, which directly uses preprocessed original audio; (2) the AAC path, which employs audio encoded and decoded by AAC-LC; (3) the VVC path, which converts log-mel spectrograms into image formats and compresses them using VVC; and (4) the FCM-ACoM path, which compresses and reconstructs intermediate feature maps extracted from NN-part1 through the FCM framework.
All pathways employ the same Autoencoder-based anomaly detection network; however, since the characteristics of the input data differ, each was trained and evaluated independently.
4.1. Dataset and Preprocessing
The experiments were conducted using the DCASE 2023 Challenge Task 2 Development Dataset [14,15]. This study focuses on four machine types—Fan, Slider, ToyCar, and Valve—each represented by approximately 10 s mono audio clips sampled at 16 kHz with 16-bit resolution. For training and evaluation, the entire dataset was utilized without distinguishing between source and target domains. For each class, 1000 normal samples are provided for training, along with 100 normal and 100 anomalous samples for testing. The detailed composition is summarized in Table 1.
All audio signals were resampled to 16-bit PCM at 16 kHz, after which log-mel spectrograms were generated through the following preprocessing steps.
STFT Transformation: Frame size of 1024 points (64 ms), hop size of 512 samples, with a Hamming window applied.
Mel Filterbank: Applied a 128-channel mel filter to transform the spectrogram onto the mel scale.
Log-Scale Transformation: Applied logarithmic scaling to the amplitude spectrum to convert it into the dB scale.
Normalization: Performed min–max normalization on each clip, scaling the values to the range [0, 1].
Finally, two-dimensional log-mel spectrograms of size 128 × 313 were used as inputs to the model. This preprocessing pipeline clearly represents frequency–time patterns, providing inputs well-suited for vision-based deep learning models to learn features relevant to anomaly detection.
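A minimal preprocessing sketch, assuming librosa for the STFT and mel filterbank, is given below; the function name and the per-clip min–max normalization mirror the four steps listed above.

```python
import librosa
import numpy as np

def audio_to_logmel(path: str, sr: int = 16000, n_fft: int = 1024,
                    hop: int = 512, n_mels: int = 128) -> np.ndarray:
    """Convert a mono clip into a min-max normalized log-mel spectrogram.
    A 10 s clip at 16 kHz yields 313 frames, i.e., a 128 x 313 input."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hamming", n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)          # dB scale
    logmel = (logmel - logmel.min()) / (logmel.max() - logmel.min() + 1e-8)
    return logmel.astype(np.float32)                       # shape (128, 313)
```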
4.2. Anomaly Scoring and Decision Making
During testing, the reconstruction error $e$ and the latent vector $z$ are computed for the input spectrogram. These two signals are normalized and combined with weights to form the anomaly score, which is then used to determine anomalous conditions. The overall procedure is as follows.
4.2.1. Reconstruction Error
Let the input log-mel spectrogram be $X \in \mathbb{R}^{F \times T}$ and the reconstructed output of the Autoencoder be $\hat{X} \in \mathbb{R}^{F \times T}$. The reconstruction error is defined as

$$e = \frac{1}{FT}\,\big\lVert X - \hat{X} \big\rVert_F^2, \qquad (1)$$

where $F$ denotes the number of mel channels (128), $T$ the number of frames (313), and $\lVert \cdot \rVert_F$ the Frobenius norm. Equation (1) represents the mean squared error (MSE) averaged over all time-frequency bins, where higher values indicate greater reconstruction difficulty.
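In code, Equation (1) reduces to a mean over all time-frequency bins; a minimal NumPy version follows.

```python
import numpy as np

def reconstruction_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Eq. (1): squared Frobenius norm of (X - X_hat) divided by F*T,
    i.e., the MSE over all bins of the F x T log-mel spectrogram."""
    return float(np.mean((x - x_hat) ** 2))
```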
4.2.2. Latent Vectors and Mahalanobis Distance
Let the latent vector obtained from the bottleneck layer of the Autoencoder be $z \in \mathbb{R}^{d}$ (if a latent map is used, it is vectorized via spatial average pooling). From the latent distribution of the normal training samples, the mean $\mu$ and covariance $\Sigma$ are estimated. For numerical stability, we use $\Sigma_{\epsilon} = \Sigma + \epsilon I$, and compute the Mahalanobis distance as

$$d_M(z) = \sqrt{(z - \mu)^{\top}\, \Sigma_{\epsilon}^{-1}\, (z - \mu)}. \qquad (2)$$
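A minimal NumPy sketch of this estimation and distance computation follows; the regularization constant (1e-6 here) is an assumption, since the exact value of $\epsilon$ is not specified above.

```python
import numpy as np

def fit_latent_stats(z_train: np.ndarray, eps: float = 1e-6):
    """Estimate mean/covariance of normal latent vectors (rows of the
    (N, d) array z_train) and regularize the covariance diagonal so
    that the inverse in Eq. (2) is numerically stable."""
    mu = z_train.mean(axis=0)
    sigma = np.cov(z_train, rowvar=False) + eps * np.eye(z_train.shape[1])
    return mu, np.linalg.inv(sigma)

def mahalanobis(z: np.ndarray, mu: np.ndarray, sigma_inv: np.ndarray) -> float:
    """Eq. (2): Mahalanobis distance of a latent vector z."""
    d = z - mu
    return float(np.sqrt(d @ sigma_inv @ d))
```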
4.2.3. Score Normalization
Since $e$ and $d_M$ are on different scales, both are normalized to $[0, 1]$ using percentile-based min–max calibration derived from the validation set, as follows:

$$\tilde{e} = \operatorname{clip}\!\left(\frac{e - P_{q_{\mathrm{lo}}}(e)}{P_{q_{\mathrm{hi}}}(e) - P_{q_{\mathrm{lo}}}(e)}\right), \qquad \tilde{d} = \operatorname{clip}\!\left(\frac{d_M - P_{q_{\mathrm{lo}}}(d_M)}{P_{q_{\mathrm{hi}}}(d_M) - P_{q_{\mathrm{lo}}}(d_M)}\right), \qquad (3)$$

where $P_q(\cdot)$ denotes the $q$-th percentile and $\operatorname{clip}(\cdot)$ represents the operation of saturating values within the range [0, 1]. The use of percentiles enhances robustness against outliers.
4.2.4. Score Fusion
The final anomaly score is defined as

$$s = \alpha\, \tilde{e} + (1 - \alpha)\, \tilde{d}, \qquad (4)$$

where $\alpha \in [0, 1]$ controls the weighting between the normalized reconstruction error and the normalized Mahalanobis distance. The weighting factor $\alpha$ is selected via grid search on the validation set. Since reconstruction difficulty ($\tilde{e}$) and distributional deviation ($\tilde{d}$) are generally complementary, $\alpha$ can be optimized differently depending on the dataset and codec pathway.
4.2.5. Thresholding and Binary Decision
The threshold $\tau$ is chosen on the validation set to maximize the F1-score, and binary decisions are made as

$$\hat{y} = \begin{cases} 1 \ (\text{anomalous}), & s \geq \tau, \\ 0 \ (\text{normal}), & s < \tau, \end{cases} \qquad (5)$$

where $\hat{y}$ denotes the predicted class label. In this study, considering the domain generalization setting of DCASE, a single threshold $\tau$ is employed for each machine type (i.e., shared across source and target domains).
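Putting Equations (3)–(5) together, the sketch below implements the normalization, fusion, and threshold-selection steps; the percentile bounds, the default weight, and the candidate threshold grid are placeholders to be calibrated on the validation set as described above.

```python
import numpy as np

def percentile_norm(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Eq. (3): percentile-based min-max calibration clipped to [0, 1]."""
    return np.clip((x - lo) / (hi - lo + 1e-12), 0.0, 1.0)

def fuse_scores(e, d, e_bounds, d_bounds, alpha=0.5):
    """Eq. (4): weighted fusion of the normalized reconstruction error
    and Mahalanobis distance (alpha = 0.5 is a placeholder)."""
    e_n = percentile_norm(e, *e_bounds)
    d_n = percentile_norm(d, *d_bounds)
    return alpha * e_n + (1.0 - alpha) * d_n

def select_threshold(scores: np.ndarray, labels: np.ndarray, taus):
    """Eq. (5): choose the threshold maximizing F1 on the validation set."""
    best_tau, best_f1 = taus[0], -1.0
    for tau in taus:
        pred = scores >= tau
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1
```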
4.2.6. Evaluation Metrics
The primary evaluation metrics are AUC and F1-score. AUC reflects classification performance independent of the threshold, whereas F1-score captures the balance between precision and recall at the selected threshold. In reporting, we compare not only the performance across machine types but also the trends observed under different pathways (Raw, AAC, VVC, and FCM-ACoM) and varying bitrates.
4.2.7. Implementation Details
Covariance Estimation: For numerical stability when computing $\Sigma_{\epsilon}^{-1}$, a small positive $\epsilon$ is added to the diagonal of $\Sigma$. When the number of samples is small relative to the latent dimensionality, applying shrinkage (e.g., Ledoit–Wolf) also provides stability.
Clip-Level Scoring: Equations (1) and (2) assume clip-level score computation. When frame-level scores are required, temporal averaging or top-percentile aggregation (e.g., averaging the top-$k$% of frame scores) is applied.
Validation Split: Normal training samples are partitioned into training and validation subsets to estimate the normalization bounds in Equation (3), the weighting factor $\alpha$, and the threshold $\tau$. If a separate validation set containing anomalous samples is available, it is used for F1-score optimization.
4.3. Model Training Setup
The model employs the symmetric 2D convolutional Autoencoder architecture described in Section 3, with only the data transformation process differing according to the input pathway. In the Raw pathway, the original audio waveform is directly converted into log-mel spectrograms and fed into the network. In the AAC pathway, audio data are first compressed and reconstructed using the AAC-LC codec, after which the restored signals are transformed into log-mel spectrograms for input to the network. In the VVC pathway, spectrograms are converted into an image format, compressed and reconstructed using VVC, and the results are used as input. In the FCM-ACoM pathway, intermediate feature maps extracted from NN-part1 are compressed and reconstructed with a VVC-based codec; rather than being passed on to NN-part2, the restored feature maps are used as inputs to the anomaly detection model.
The training configuration was kept identical across all pathways. Optimization was performed using the Adam optimizer, with the learning rate adjusted by a cosine annealing scheduler with a cycle of 20 epochs. The batch size was set to 64, with a maximum of 100 training epochs, and early stopping was applied based on the validation loss. The loss function was mean squared error (MSE); however, for the FCM pathway, an additional regularization term was included to stabilize the latent space. All experiments were conducted on a Lenovo ThinkStation equipped with an NVIDIA RTX A6000 GPU (Lenovo Group Ltd., Beijing, China; NVIDIA Corporation, Santa Clara, CA, USA). For compression, ffmpeg was used for AAC, while the internal codec based on VTM 12.0 was employed for FCM.
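A compact training-loop sketch matching this configuration is shown below, assuming PyTorch; the initial learning rate (1e-3) and early-stopping patience are assumptions, since only the scheduler cycle, batch size, and epoch budget are specified above.

```python
import torch
from torch import nn

def train(model, train_loader, val_loader, device="cuda", patience=10):
    """Adam + cosine annealing (20-epoch cycle), MSE reconstruction loss,
    up to 100 epochs with early stopping on the validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=20)
    loss_fn = nn.MSELoss()
    best_val, wait = float("inf"), 0
    for epoch in range(100):
        model.train()
        for x in train_loader:               # x: (B, 1, 128, 313) log-mels
            x = x.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), x)      # reconstruction objective
            loss.backward()
            opt.step()
        sched.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(b.to(device)), b.to(device)).item()
                      for b in val_loader) / len(val_loader)
        if val < best_val:                   # early stopping bookkeeping
            best_val, wait = val, 0
        else:
            wait += 1
            if wait >= patience:
                break
```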
4.4. Inference and Evaluation Strategy
In the testing phase, the trained Autoencoder computes both the reconstruction error of the input and the Mahalanobis distance of its latent vector. These values are normalized, combined through a weighted average, and used to produce the final anomaly score. The threshold is set at the point that maximizes the F1-score on the validation set, and binary classification is then performed based on this criterion. Performance evaluation was conducted using AUC and F1-score to compare the anomaly detection performance across the four input pathways.
5. Experimental Results
In this chapter, we compare the anomalous sound detection (ASD) performance of the proposed ACoM method with that of the baseline pathways—AAC, VVC, and Raw—using the DCASE 2023 Task 2 dataset. All experiments were conducted by training independent models for each of the four machine types—Fan, Valve, ToyCar, and Slider—and performance was evaluated using the F1-score metric. In particular, we analyzed how detection performance varies with bitrate under each compression method, thereby validating their efficiency in bandwidth-constrained transmission environments. The overall F1-score results across all machine types and compression methods are summarized in Table 2.
5.1. Fan
As shown in Figure 8 and Table 2, the Raw data achieved an F1-score of 0.6754 for the Fan type. The proposed ACoM achieved an F1-score of 0.655 even at the lowest bitrate (1.3 kbps), outperforming AAC at 4 kbps (0.634). When the bitrate was increased to 1.7, 3.3, and 6.3 kbps, ACoM achieved F1-scores of 0.660, 0.664, and 0.667, respectively, approaching the performance level of the Raw data. This indicates that critical anomaly patterns can be preserved even under extremely low-bitrate conditions. In contrast, AAC achieved an F1-score of 0.634 at 4 kbps, confirming that its performance degrades at lower bitrates. VVC achieved an F1-score of 0.613 at 401.9 kbps and 0.621 at 95.5 kbps, both lower than the Raw baseline even at high bitrates. In particular, at 45.4 kbps, performance dropped markedly to an F1-score of 0.590.
5.2. Valve
As illustrated in Figure 9, the Raw input achieved an F1-score of 0.6654, while ACoM reached a comparable score of 0.667 at 6.3 kbps. Even at lower bitrates of 3.3, 1.7, and 1.3 kbps, ACoM achieved F1-scores of 0.654, 0.592, and 0.571, respectively, thereby maintaining a reasonable level of performance under low-bitrate conditions. AAC achieved its highest performance of 0.667 at 16 kbps, whereas at the lower bitrate of 4 kbps its performance dropped to 0.641. VVC achieved an F1-score of 0.664 at 401.9 kbps, nearly matching the Raw input, but its performance dropped to 0.557 at 95.5 kbps, indicating that it remains competitive primarily in high-bitrate settings.
5.3. ToyCar
As illustrated in Figure 10, the Raw input achieved an F1-score of 0.6619. ACoM recorded its highest performance of 0.641 at 1.7 kbps, while maintaining stable levels of 0.624 and 0.629 at 3.3 and 6.3 kbps, respectively. Notably, even at an ultra-low bitrate of 0.9 kbps, ACoM achieved a score of 0.605, demonstrating its effectiveness in low-bandwidth environments. AAC achieved relatively strong performance with a score of 0.632 at 16 kbps; however, its performance dropped sharply to 0.524 at 24 kbps. VVC achieved a score of 0.620 at 401.9 kbps, but its performance decreased to 0.580 at 25.4 kbps.
5.4. Slider
Among the four machine types, Slider yielded the most challenging results. As illustrated in Figure 11, the Raw input achieved a relatively high performance of 0.6437, whereas ACoM showed a sharp performance degradation at lower bitrates, recording 0.601 at 3.3 kbps, 0.569 at 1.7 kbps, 0.469 at 1.3 kbps, and 0.465 at 0.9 kbps. AAC achieved its best performance of 0.667 at both 4 kbps and 32 kbps, outperforming the proposed method on this dataset. This can be attributed to the subtle differences between normal and anomalous spectrograms in the Slider dataset and their monotonous pattern variations, suggesting that information lost during feature map compression had a significant impact on detection performance. VVC underperformed the Raw baseline across all bitrate ranges, with performance dropping to as low as 0.363 at 45.4 kbps.
In addition, Figure 11 shows only three data points for the ACoM curve, which is due to the x-axis being plotted on a logarithmic scale ranging from 1 to 1000 kbps. The data point at 0.9 kbps lies at the far left of the axis range and may appear overlapped or invisible in the plot; however, it actually corresponds to an F1-score of 0.465.
5.5. Comprehensive Analysis
Table 2 summarizes the F1-scores and corresponding bitrates for the four machine types. Overall, ACoM achieved comparable or superior performance to AAC in the Fan, Valve, and ToyCar datasets, while operating at less than half the bitrate. VVC approached the Raw-level performance in high-bitrate settings (above 200 kbps), but in low-bitrate ranges it failed to preserve fine patterns essential for machine perception, resulting in degraded performance. For the Slider dataset, due to the characteristics of the data and the limitations of feature compression, AAC achieved the highest performance, while both ACoM and VVC suffered substantial performance degradation at lower bitrates.
Consequently, the proposed ACoM demonstrates the potential to achieve both high transmission efficiency and stable performance in anomalous sound detection under low-bitrate conditions. VVC approached the original-level performance in high-bitrate settings, but at low bitrates it exhibited limitations in retaining the fine details necessary for machine perception. It is expected that retraining the FCM module with audio-specific adaptations will help reduce the performance gap, even on challenging datasets such as Slider.
6. Conclusions
In this study, we proposed an ACoM framework based on FCM, which contrasts with conventional signal-based compression methods (AAC and VVC), to enhance transmission efficiency for anomalous sound detection (ASD) systems under low-bitrate conditions. Unlike conventional approaches that encode the entire raw signal, the proposed method compresses and transmits only the intermediate feature maps of the deep learning model, thereby delivering the essential information required for machine perception more efficiently. To this end, we conducted a comparison across four input pathways—Raw, AAC, VVC, and ACoM—using the DCASE 2023 Task 2 dataset. Independent models were trained for each machine type, and performance was evaluated based on F1-score and bitrate.
Experimental results show that ACoM achieved equal or higher F1-scores than AAC in the Fan, Valve and ToyCar machine types, while operating at less than half the bitrate. In particular, ACoM achieved an F1-score of 0.655 at 1.3 kbps for Fan and recorded its highest performance at 6.3 kbps for Valve and 1.7 kbps for ToyCar. This suggests that critical anomalous sound patterns can be reliably preserved even under ultra-low bitrate conditions. In contrast, for the Slider dataset, the differences between normal and anomalous signals in the spectrogram were subtle, making the impact of information loss during FCM-based feature compression more pronounced; consequently, AAC demonstrated superior performance. This phenomenon can be attributed to the fact that FCM was applied without an audio-specific design, suggesting high potential for improvement through retraining and architectural optimization of the module in future work.
Moreover, across the experiments, increasing the bitrate generally led to gradual improvements in F1-score; however, in certain ranges, slight drops or fluctuations in performance were observed. This can be attributed to the fact that the evaluation in this study was based on binary decisions rather than continuous scores, whereby changes in classification boundaries near the threshold directly affected the F1-score. Such variability becomes particularly pronounced when the output distribution of the model changes slightly between adjacent bitrate settings.
VVC-based spectrogram compression achieved performance comparable to Raw in high-bitrate regions; however, below 100 kbps, it exhibited more pronounced degradation than both AAC and ACoM. This result indicates that VVC, being designed with a focus on human visual quality optimization, has limitations in preserving fine spectrogram patterns that are critical for machine learning. Therefore, the experiments confirmed that rather than merely applying state-of-the-art video compression techniques, it is necessary to employ feature-preserving coding schemes optimized for machine perception.
The contributions of this study can be summarized as follows. First, we quantitatively demonstrated the effectiveness of feature-based compression, moving beyond conventional signal-centric approaches in audio anomaly detection. Second, by significantly reducing the bitrate required to achieve comparable performance, we demonstrated the applicability of the proposed approach to bandwidth-constrained environments such as industrial IoT and remote monitoring. Third, through comparisons with VVC, we identified the limitations of human-centric compression and provided evidence supporting the need for machine-centric standardization.
In future work, we plan to redesign the FCM encoder–decoder architecture to specialize it for the audio domain and to enhance generalization across diverse machine types through multi-domain training. We also intend to explore various quantization and normalization techniques to further push the performance limits in low-bitrate regimes, aiming at optimal bitrate–performance trade-offs. Another important direction will be to comprehensively validate the effectiveness of the proposed approach through real-time transmission and detection experiments in practical industrial environments. In parallel, we will benchmark the framework against more advanced anomaly detection models, such as transformer-based and self-supervised methods, to further validate its generality. Moreover, we aim to investigate hybrid and adaptive compression principles that dynamically combine signal-level, spectrogram-level, and feature-level approaches, with the goal of enhancing robustness and broadening applicability in real-world scenarios. Finally, although the present study was limited to four machine types from the DCASE 2023 Task 2 dataset, future work will extend validation to larger and more diverse datasets, as well as more complex application scenarios such as industrial IoT and real-time monitoring under constrained network conditions.