1. Introduction
Sound Event Detection (SED) has become a pivotal research area within signal processing and machine learning, motivated by its broad range of applications, including surveillance systems, smart homes, multimedia indexing, bioacoustics, and healthcare monitoring. Traditionally, SED systems have relied on audio-specific representations such as mel-spectrograms [
1], employing architectures like Convolutional Recurrent Neural Networks (CRNNs) to analyze temporal and spectral features [
2]. Despite their widespread adoption, these conventional methods often struggle in complex acoustic environments, mainly due to sensitivity to noise, limited generalization, and difficulty detecting overlapping sound events.
Conformer architectures have also been investigated for this task [
3], demonstrating improved robustness in sound event classification compared to CRNNs. However, due to the global nature of the attention mechanism, these models tend to produce less temporally precise predictions than CRNNs, which better capture fine-grained temporal dynamics through their recurrent components. We target these limitations by predicting time–frequency event boxes directly on mel-spectrograms, which improves onset/offset precision and represents overlapping events via multiple instances.
Compounding these technical issues are significant data-related limitations, including the scarcity of high-quality, strongly-labeled datasets, subjectivity in manual annotations, and severe class imbalance, where common sounds dominate and critical but rare events are underrepresented. To overcome these obstacles, researchers increasingly explore strategies such as data augmentation, synthetic data generation, transfer learning, semi-supervised and unsupervised learning.
In parallel, advancements in computer vision, particularly in object detection techniques, offer promising alternative solutions. Inspired by the high accuracy and real-time performance of models like the YOLO (You Only Look Once) family [
4], recent research has begun exploiting structural similarities between visual data and mel-spectrograms [
5].
Because polyphonic SED requires multi-instance localization in time and frequency, single-stage detectors such as YOLO can be a reasonable choice: they jointly localize and classify multiple events in one pass, explicitly predicting multiple instances.
Building on this approach, this study introduces an exploratory SED framework by adapting modern object-detection architectures, specifically YOLOv8 and the more recent YOLOv11 [
6], through transfer learning [
7] directly applied to mel-spectrograms. This integration leverages robust visual recognition capabilities for audio analysis.
1.1. State of the Art in Sound Event Detection (SED)
Sound Event Detection (SED) seeks to identify what is happening in an acoustic scene and when it is happening, providing both the class label and the precise temporal boundaries of each event. Modern SED pipelines combine two closely related stages: data preparation, which turns raw waveforms into model-ready inputs, and data modeling, which learns to map those inputs to multi-label, time-aligned predictions.
1.1.1. Signal Representation and Augmentation
The prevailing front-end turns audio into two-dimensional time-frequency images by applying the Short-Time Fourier Transform followed by a Mel filter bank, producing mel-spectrograms that preserve perceptually important frequency cues and align naturally with convolutional processing [
1]. Because public SED corpora remain small relative to vision datasets [
8,
9], aggressive augmentation is essential. Time-stretching, pitch-shifting, and additive noise injection harden models against changes in playback speed, frequency variations, and background clutter, respectively [
10,
11,
12]. Mixup blends pairs of clips and labels to regularize the decision surface [
13], while SpecAugment and random erasing mask or overwrite random time-frequency patches, forcing the network to rely on broader context [
14,
15]. Synthetic audio produced with Generative Adversarial Networks (GANs) further expands the tail of rare events without costly annotation [
16]. Together, these techniques approximate the diversity of real-world acoustics and mitigate severe class imbalance.
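The masking and mixing strategies above can be sketched in a few lines of NumPy. This is a minimal illustration, not a tuned recipe: the mask widths, the Mixup Beta parameter, and the spectrogram dimensions below are arbitrary placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(spec_a, spec_b, labels_a, labels_b, alpha=0.2):
    """Blend two mel-spectrograms and their multi-hot labels (Mixup)."""
    lam = rng.beta(alpha, alpha)
    spec = lam * spec_a + (1.0 - lam) * spec_b
    labels = lam * labels_a + (1.0 - lam) * labels_b
    return spec, labels

def spec_augment(spec, max_time_width=20, max_freq_width=8):
    """Zero out one random time stripe and one frequency stripe
    (SpecAugment-style masking), forcing reliance on broader context."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    t0 = rng.integers(0, n_frames - max_time_width)
    f0 = rng.integers(0, n_mels - max_freq_width)
    spec[:, t0:t0 + rng.integers(1, max_time_width)] = 0.0
    spec[f0:f0 + rng.integers(1, max_freq_width), :] = 0.0
    return spec

# Example: two fake 64-band, 100-frame melgrams with 10-class multi-hot labels
a, b = rng.random((64, 100)), rng.random((64, 100))
ya, yb = np.zeros(10), np.zeros(10)
ya[2], yb[7] = 1.0, 1.0
mixed, y = mixup(a, b, ya, yb)
masked = spec_augment(mixed)
```

Note that Mixup produces soft labels (the blended label vector sums to one across the two source classes), which is precisely what regularizes the decision surface.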
1.1.2. Supervised Multi-Label Learning
With augmented mel-spectrogram stacks in hand, most leading systems train under a mix of supervised, semi-supervised, and unsupervised paradigms [
17]. Because multiple events can overlap, the task is framed as frame-wise multi-label classification: for every short hop (e.g., 10 ms), the model outputs a probability estimate for each sound class, yielding a prediction matrix whose rows track time and whose columns track classes. Classical Convolutional Recurrent Neural Networks (CRNNs) dominate thanks to CNN layers that capture local spectral structure and bidirectional RNN layers that model longer-term dynamics. Recent work increasingly replaces recurrence with self-attention, or full transformer–conformer blocks [
3], which scale better and capture long-range dependencies more efficiently.
Despite strong results, CRNN and Conformer pipelines inherit structural limitations from their frame-wise classification setup. CRNNs rely on post hoc thresholding/median filtering to turn frame-level class probability scores into events, which introduces boundary jitter and fragmented detections and provides no explicit mechanism to separate overlapping events (polyphony is only handled implicitly via multi-label frames). Conformers improve long-range context but their global attention tends to smooth these probabilities, degrading onset/offset precision while being more data-hungry and computationally heavier. Both families also show sensitivity to domain shift (backgrounds, devices) and class imbalance, typically requiring extensive augmentation or synthetic mixtures to generalize. These weaknesses motivate reframing SED around models that learn localization directly, rather than inferring it from frame-wise probabilities.
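The post hoc decoding that frame-wise pipelines depend on can be sketched as follows; the threshold, median window, and 10 ms hop are illustrative placeholders rather than settings from any cited system. Note how a single low-confidence frame inside an event must be bridged by the median filter, which is exactly where boundary jitter and fragmented detections originate.

```python
import numpy as np

def decode_events(posteriors, threshold=0.5, win=5, hop_s=0.01):
    """Turn per-frame posteriors for one class into (onset_s, offset_s) events:
    threshold, median-filter the binary sequence, then merge contiguous runs."""
    binary = (posteriors >= threshold).astype(float)
    pad = win // 2
    padded = np.pad(binary, pad, mode="edge")
    smooth = np.median(np.lib.stride_tricks.sliding_window_view(padded, win), axis=1)
    active = smooth >= 0.5
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            events.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:
        events.append((start * hop_s, len(active) * hop_s))
    return events

p = np.zeros(100)
p[20:40] = 0.9   # one event ...
p[40] = 0.2      # ... with a one-frame dropout the median filter must bridge
p[41:50] = 0.9
events = decode_events(p)   # a single event spanning roughly 0.2 s to 0.5 s
```

Every hyperparameter here (threshold, window length) shifts the recovered boundaries, which is the temporal imprecision that a localization-first, box-regression approach sidesteps.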
Unlike frame-level classifiers such as CRNN or Conformer, which estimate per-frame posterior probabilities and require post-processing to derive onset and offset times, YOLO adopts a localization-first paradigm that directly predicts event bounding boxes in the time–frequency domain. This eliminates several sources of temporal imprecision inherent to frame-level SED pipelines and motivates the adaptation of modern object detectors to the audio modality.
1.1.3. Cross-Domain Transfer
Cross-domain transfer constitutes a complementary advance: embedding extractor networks are first pre-trained on vast image corpora, then fine-tuned for audio. Visual pretraining supplies rich, modality-agnostic features that lessen the demand for labeled sound data, accelerate convergence, and endow models with greater robustness to frequency shifts, boosting performance in event detection and classification.
1.1.4. Evaluation
The systems are judged by how closely their predicted event lists (class + onset/offset) match expert annotations. Performance is measured with segment-level $F$-scores, event-based $F$-scores [
18] and PSDS [
18], emphasizing both correct class assignment and tight temporal alignment across diverse acoustic scenes.
Table 1 summarizes pros and cons of these metrics.
PSDS requires per-class posterior probabilities over time (a probability vector for every time interval). Our YOLO-style detector outputs sparse event segments with a single confidence score for the predicted class and no probabilities for the remaining classes. Those missing probabilities cannot be credibly reconstructed, so PSDS is not well defined for this output.
Additionally, there are technical mismatches that make PSDS unreliable here:
PSDS assumes a contiguous posterior timeline; YOLO gives isolated segments, forcing us to invent “empty” intervals and scores.
PSDS evaluates score calibration across classes; YOLO confidences are not calibrated or comparable across classes.
Sweeping thresholds in PSDS presumes a fixed posterior; with YOLO, threshold changes interact with NMS (remove duplicate detections that refer to the same event) and segmentization (convert the post-NMS set of boxes/proposals into clean event segments), altering the detection set itself.
Given these problems of PSDS as a metric to evaluate YOLO-based outputs, we report event-level $F$-scores with average temporal IoU (0.50–0.95), which directly evaluate what the model produces: class-consistent segments and their boundary accuracy.
Evaluation Metrics
For quantitative comparison, we adopt three event-based F-scores that differ only in the intersection-over-union (IoU) criterion used to declare a correct detection. In all cases, precision ($P$) and recall ($R$) are combined by the usual harmonic mean: $F = \frac{2PR}{P + R}$.
Consistent with [
18], we report event-based $F$-scores at multiple temporal IoU thresholds. We denote the temporal intersection-over-union threshold by $\tau$ and report $F_{\tau}$ for $\tau \in \{0.50, 0.95\}$ and the mean $F_{0.50:0.95}$.
$F_{0.50}$ counts a system prediction as correct when the temporal IoU between the predicted event interval and a ground-truth event is at least 0.5. This relatively lenient threshold rewards detectors that locate events approximately correctly, and it is therefore sensitive to missed detections but tolerant of minor boundary errors.
$F_{0.95}$ uses a strict IoU threshold of 0.95, crediting only predictions whose onset-offset boundaries almost perfectly overlap the reference. This measure emphasizes temporal precision and penalizes even small misalignments.
$F_{0.50:0.95}$ is the arithmetic mean of the $F$-scores computed at IoU thresholds $\tau \in \{0.50, 0.55, \ldots, 0.95\}$. Analogous to the mean average precision (mAP) metric in object detection, it provides a single figure that balances coarse recall with fine-grained temporal accuracy, and it is therefore our primary metric for overall ranking.
Reporting the trio of $F_{0.50}$, $F_{0.95}$, and $F_{0.50:0.95}$ affords a nuanced view of a model’s behavior across the full recall–precision spectrum, highlighting both boundary accuracy and robustness to cluttered, polyphonic scenes.
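A minimal implementation of these event-based scores might look as follows. The greedy one-to-one matching used here is a simplification (a full evaluator would use optimal bipartite matching), and the example reference and predicted events are invented for illustration.

```python
import numpy as np

def temporal_iou(a, b):
    """IoU of two 1-D intervals (onset_s, offset_s)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def f_score_at(preds, refs, tau):
    """A prediction is a true positive if some unmatched reference event of
    the same class overlaps it with temporal IoU >= tau (greedy matching)."""
    matched, tp = set(), 0
    for cls, seg in preds:
        for j, (rcls, rseg) in enumerate(refs):
            if j not in matched and cls == rcls and temporal_iou(seg, rseg) >= tau:
                matched.add(j)
                tp += 1
                break
    p = tp / len(preds) if preds else 0.0
    r = tp / len(refs) if refs else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

refs  = [("dog", (1.0, 2.0)), ("speech", (4.0, 7.0))]
preds = [("dog", (1.0, 2.0)), ("speech", (4.5, 7.0))]   # late speech onset
f50 = f_score_at(preds, refs, 0.50)
favg = np.mean([f_score_at(preds, refs, t) for t in np.arange(0.50, 0.951, 0.05)])
```

Here the late speech onset passes the lenient 0.50 criterion but fails the strict high-IoU thresholds, so the averaged score falls between the two extremes.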
International Challenges
The state of the art in SED has been fostered by the recent series of challenges organized by the Detection and Classification of Audio Scenes and Events (DCASE community,
https://dcase.community/challenge2023/task-sound-event-detection-with-weak-and-soft-labels, accessed on 1 February 2023) [
19]. This community organizes yearly challenges that foster competition and allow researchers to assess the state of the art in SED (and other related tasks) by providing a common benchmark and organizing an associated workshop for presenting the novelties in the area. In particular, we take the 2023 DCASE Task 4A challenge as a representative example of SED and use its datasets and experimental protocols [
19] for our research.
1.2. State of the Art in Object Detection
To achieve a comprehensive understanding of visual content, computer vision systems must extend beyond classifying entire images and instead identify individual objects along with their spatial locations [
20,
21]. This dual objective—object localization and classification—forms the foundation of object detection, a task central to numerous applications such as autonomous driving, video surveillance, image retrieval, and activity recognition [
4,
22].
Object detection has evolved significantly with the advancement of deep learning, particularly through the use of Convolutional Neural Networks (CNNs) [
23]. In CNNs, input images are transformed across successive layers of convolution and pooling, resulting in multi-channel feature maps that encode spatial and semantic hierarchies [
24,
25,
26]. These feature maps are then processed by fully connected layers to yield classification outputs. Filtering operations using learned kernels and non-linear activation functions (e.g., ReLU [
27], sigmoid [
28]) enable the network to extract high-level features from localized receptive fields. Pooling methods such as max pooling and average pooling [
29] improve robustness by summarizing feature responses and reducing dimensionality.
Generic object detection methods typically fall into two broad categories: region proposal-based and single-stage approaches. The former includes models like R-CNN [
22], SPP-net [
29], Fast R-CNN [
30], Faster R-CNN [
31], R-FCN [
32], FPN [
33], and Mask R-CNN [
34]. These frameworks first generate candidate regions of interest and then classify each region independently. While accurate, these methods can be computationally expensive due to their multi-stage nature.
In contrast, single-shot detectors streamline the detection pipeline by eliminating the region proposal step. Examples include MultiBox [
35], AttentionNet [
36], G-CNN [
37], SSD [
38], YOLO [
4], YOLOv2 [
39], DSSD [
40], and DSOD [
41]. These models treat detection as a regression problem, directly predicting object bounding boxes and class probabilities from feature maps in a unified architecture. The YOLO (You Only Look Once) family in particular has gained prominence for its balance of speed and accuracy, using anchor boxes and multi-scale detection heads to identify objects in real time. The success of object detection in computer vision can be attributed to the ability of deep neural networks to model complex visual patterns and spatial relationships.
2. Transfer Learning in Sound Event Detection
One of the main obstacles in Sound Event Detection (SED) is the scarcity of strongly labeled data, especially for rare or overlapping events. While computer vision researchers train on millions of annotated images in ImageNet [
9], audio corpora are far smaller, less diverse and heavily imbalanced. Transfer learning therefore dominates recent work, where knowledge captured by large-scale vision models is repurposed for audio tasks, dramatically reducing annotation cost and boosting accuracy [
42].
The standard recipe is to treat a time-frequency spectrogram as an “image”, load a convolutional network pretrained on ImageNet, and fine-tune its weights. VGG variants, for example, learn hierarchical time-frequency patterns that separate short knocks from long tonal sounds, outperforming models trained from scratch on polyphonic benchmarks [
43]. ResNet backbones, helped by focal loss and mixup augmentation, excel in acoustic-scene classification and robust event tagging [
44]. Lighter vision backbones such as MobileNet can act as frozen feature extractors whose outputs feed 1-D CNNs that model temporal continuity, giving competitive results with modest computation [
45]. Other studies fuse spectrograms with wavelet scalograms and pass the trio through CNN + GRU hybrids, capturing both multi-resolution details and long-range context [
46]. Performance, however, hinges on receptive-field tuning: convolutional filters must be refitted so that their spatial coverage matches meaningful audio scales, a step shown to be critical when adapting DenseNet or deeper ResNets [
47].
Beyond pure CNNs, transfer learning powers several complementary lines. Vision-derived embeddings have been plugged into Gaussian-mixture or one-class SVM detectors for audio anomaly spotting [
48,
49]. The state of the art now favors transformer families initially conceived for images: BEATs apply masked-autoencoder pretraining to spectrograms [
50]; PaSST introduces patch-out and domain-specific attention to focus on salient acoustic cues [
51]. Among recent audio-specific Transformer models, the Audio Spectrogram Transformer (AST) [
52] directly adapts the Vision Transformer to mel-spectrograms, achieving strong benchmarks in event classification. PaSST introduces patch-out regularization and frequency-domain masking to improve robustness and efficiency, while Data2Vec-Audio [
53] generalizes masked prediction across speech, vision, and text domains. These models operate on frame-level embeddings and produce event tag posteriors rather than explicit onset–offset boxes, whereas our YOLO-based framework predicts time–frequency bounding boxes directly. This distinction situates our work as a complementary “localization-first” alternative to frame-wise Transformer pipelines.
It is important to note that, unlike audio-specific foundation models pretrained on large-scale acoustic datasets (e.g., BEATs), the YOLO backbones used here derive their inductive priors from purely visual pretraining. This places our approach in an intermediate position between classical SED models trained only on DCASE data and modern audio-specialized transformers, highlighting an interesting direction for future research on multimodal or hybrid pretraining strategies.
Transfer learning has thus become indispensable for SED, allowing researchers to exploit mature computer-vision architectures instead of building bespoke audio models from scratch. Yet an important gap remains: object-detection networks, known for precise localization in images, have barely been used for sound event detection.
Prior transfer-learning work overwhelmingly treats spectrograms as inputs to classifiers, using vision backbones as feature extractors feeding CRNN/Transformer heads. Localization is an afterthought derived from posteriors. By contrast, we directly adapt modern single-shot object detectors (YOLOv8/YOLOv11) to mel-spectrograms so that the model learns event localization and classification jointly. Concretely, YOLO’s multi-scale heads improve detection of both brief, high-frequency transients and long, quasi-stationary sounds. Bounding-box regression can yield tighter onsets/offsets in our setup than post-processed posteriors, and native handling of multiple boxes represents overlapping events via multiple instances. We pair this with (i) on-the-fly mel-spectrogram generation to scale training without massive storage and (ii) a curriculum learning schedule plus DCASE-style synthetic data to mitigate class imbalance and reduce overfitting, elements that are largely complementary to prior CRNN/Conformer pipelines.
3. Methodology
The methodology presented in this work introduces a pragmatic approach to
Sound Event Detection (SED) by incorporating recent object detection models traditionally utilized within computer vision. This novel integration capitalizes on the robust and rapid detection capabilities inherent to YOLO (You Only Look Once) [
4] architectures, specifically the advanced YOLOv8 and the newest YOLOv11 versions [
54].
By repurposing these image-based neural network models for the audio domain, we effectively translate audio signals into visually interpretable formats, bridging the gap between audio and visual processing paradigms. The core aim of this methodology is to leverage YOLO’s design for fast detection in vision; we do not claim online guarantees in the audio setting. Concretely, multi-scale feature fusion (PANet) and multi-head detection localize both brief transients and long, quasi-stationary events, helping maintain useful representations as acoustic diversity and overlap increase.
Audio waveforms are pre-processed to obtain mel-spectrograms with 512 mel filters from 10 s audio clips; these are resized to align closely with the expected input of the YOLO architectures, an image of size 640 × 640 pixels. We denote the batch size by $B$, the number of classes by $C$, and the temporal IoU threshold by $\tau$. Each event is represented as a time–frequency bounding box defined by its onset, offset, lowest frequency, and highest frequency.
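For illustration, a time–frequency event box can be mapped to YOLO's normalized label line (class, x-center, y-center, width, height) as sketched below. The 10 s clip length and 512-band frequency axis follow the setup described here, while the event values are hypothetical and the sketch assumes the y coordinate grows with the mel-bin index (the actual orientation depends on how the spectrogram image is rendered).

```python
def event_to_yolo_label(cls_id, onset_s, offset_s, f_low_bin, f_high_bin,
                        clip_s=10.0, n_mels=512):
    """Map a time-frequency event box to YOLO's normalized
    (class, x_center, y_center, width, height) label line.
    x spans time, y spans mel frequency; all values lie in [0, 1]."""
    x_c = ((onset_s + offset_s) / 2.0) / clip_s
    w = (offset_s - onset_s) / clip_s
    y_c = ((f_low_bin + f_high_bin) / 2.0) / n_mels
    h = (f_high_bin - f_low_bin) / n_mels
    return f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A hypothetical event of class 3 from 2.0 s to 3.5 s occupying mel bins 100-300
label_line = event_to_yolo_label(3, 2.0, 3.5, 100, 300)
```

Because the coordinates are normalized, the same label is valid regardless of the final image resolution after resizing.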
Subsequently, we employ transfer learning by fine-tuning YOLO models that have been pretrained on massive visual datasets, thereby adapting their capabilities to the specialized requirements of audio spectrogram analysis. Synthetic data augmentation further enhances the robustness of our approach, enabling the generation of diverse training samples through a variety of signal transformations, effectively expanding the training set and ensuring greater generalization.
Finally, our experiments show that adopting a curriculum learning strategy starting with isolated and clean sound events and proceeding to more difficult cases is very important to achieve good results.
3.1. Model Implementation
The implementation of our sound event detection (SED) model hinges critically on transfer learning, specifically through fine-tuning the YOLO architecture. Therefore, we first include a short review of the YOLOv8 and YOLOv11 object detection models.
3.1.1. YOLOv8
Released in January 2023 by Ultralytics [
6], YOLOv8 represents a major leap forward in usability, versatility, and performance. YOLOv8 is structured around the CSPDarknet53 backbone, enhanced by path aggregation networks (PANets) for improved feature fusion across scales, crucial for accurately detecting objects of varying sizes.
YOLOv8 notably expands beyond pure object detection, offering seamless capabilities for classification, object segmentation, and pose estimation tasks. This versatility is enabled through task-specific head architectures that share a unified feature extraction backbone, significantly simplifying deployment in diverse applications. The algorithm was pretrained on large-scale datasets like COCO (Common Objects in Context), requiring RGB images as input, typically resized to standardized dimensions (640 × 640 pixels) to maintain consistency during training and inference. The primary outputs from YOLOv8 include bounding boxes, class probabilities, and confidence scores for detected objects, with additional segmentation masks or keypoint predictions when performing tasks like instance segmentation or pose estimation.
3.1.2. YOLOv11
YOLOv11 significantly advances YOLO’s capabilities, featuring notable enhancements in accuracy, efficiency, and task generalization. Building upon the successful architecture of YOLOv8, YOLOv11 integrates Transformer-based modules alongside convolutional neural networks, a hybrid approach aimed at capturing global context more effectively. This integration facilitates superior object representation, enabling YOLOv11 to better understand intricate relationships within the image data.
The model extends capabilities across diverse computer vision tasks, including advanced multi-object tracking, video segmentation, and enhanced real-time detection in complex scenarios. Pretraining involves significantly larger and more diverse datasets combining COCO, ImageNet, and additional specialized sets curated for complex scene recognition, demanding inputs of higher resolution images (typically 1024 × 1024 pixels or greater).
YOLOv11 produces comprehensive outputs tailored to its broad task spectrum, delivering bounding boxes, detailed segmentation maps, dense pixel-level classifications, and sophisticated multi-object tracking data. The inclusion of more extensive metadata within its outputs (e.g., object trajectory, movement predictions, and interaction analysis) underscores its suitability for advanced applications such as autonomous driving, surveillance, and interactive robotics.
Both YOLOv8 and YOLOv11 continue the YOLO tradition of high inference speed and are often used for practical applications requiring near-instantaneous responses coupled with high accuracy. Their evolution underscores a broader industry trend towards increasingly flexible, high-performance neural networks capable of addressing complex, real-world challenges efficiently.
3.1.3. Custom Model Implementation
By leveraging a medium-level fine-tuning approach, we repurposed the pretrained visual recognition capabilities of YOLOv8 to specifically detect and classify audio events represented visually as mel-spectrograms. Medium fine-tuning involves carefully adjusting several layers, particularly focusing on those deeper within the network. This ensures that the generic object detection features learned from extensive visual datasets are effectively adapted to the distinct characteristics and patterns inherent in audio event spectrograms.
Achieving efficient transfer learning on large amounts of data necessitated significant modifications to the original YOLOv8 implementation. Key among these were alterations to the trainer function. The standard training procedure of YOLOv8 typically involves stored datasets comprising preprocessed images. However, our approach requires the trainer function to dynamically process mel-spectrograms generated on the fly from raw audio data. This makes it possible to extend training to a very large number of melgram sections, modified with different data augmentation techniques, without the massive storage footprint that would otherwise slow down the whole training process.
This online processing eliminated the need for storing large quantities of intermediate image data, substantially reducing storage requirements. Similarly, critical adjustments were made to the validator and predictor functions.
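Conceptually, the modified trainer replaces a stored-image dataset with a streaming pipeline like the sketch below. Here `waveform_to_image` is a deliberately simplified stand-in (log-magnitude STFT only) for the full mel-spectrogram/colormap conversion used in the actual system, and the random clips are placeholders for real audio.

```python
import numpy as np

def waveform_to_image(wave, n_fft=1024, hop=512):
    """Stand-in feature extractor: magnitude STFT, log-compressed and
    normalized to [0, 1]. The real pipeline applies a mel filter bank,
    resizing, and a colormap instead."""
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T
    log_spec = np.log1p(spec)
    return (log_spec - log_spec.min()) / (log_spec.max() - log_spec.min() + 1e-12)

def training_stream(audio_clips, labels, augment=None):
    """Yield (image, label) pairs computed on the fly from raw audio,
    so no intermediate images ever touch the disk."""
    for wave, label in zip(audio_clips, labels):
        if augment is not None:
            wave = augment(wave)
        yield waveform_to_image(wave), label

rng = np.random.default_rng(0)
clips = [rng.standard_normal(16000) for _ in range(2)]   # two fake 1 s clips
stream = training_stream(clips, labels=[0, 1])
img, lab = next(stream)
```

Because augmentation happens inside the stream, every epoch can present differently transformed melgrams at the cost of compute rather than storage.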
The combined adjustments to the trainer, validator, and predictor functions were meticulously orchestrated to maintain computational efficiency. Ensuring synchronization between spectrogram generation and network processing tasks required detailed profiling and optimization of data flow.
In our case, the workflow (
Figure 1) starts with a medium-capacity detector: a YOLO variant with roughly 20–26 million parameters. This size balances accuracy and speed while retaining feature richness. Starting with this model provides a solid, efficient baseline before heavier or task-specific modules fine-tune results further.
4. Experimental Setup
In this section, we detail the experimental pipeline used in our work, which transforms raw audio data into a format suitable for training a state-of-the-art object detector (YOLOv8 and YOLOv11). The overall goal of our approach is to convert .wav files into a mel-spectrogram representation, further transform the spectrogram into a color image, and subsequently evaluate the effectiveness of different training strategies.
4.1. Audio Preprocessing and Mel-Spectrogram Generation
The first step in our pipeline involves reading audio files and computing their corresponding mel-spectrograms. The choice of parameters for generating the mel-spectrogram is critical, as these directly influence the resolution and quality of the features extracted from the audio. In our case, the chosen parameters are shown in
Table 2.
These parameters were chosen based on extensive literature review and preliminary experiments to balance between computational load and the preservation of salient audio features. The resulting mel-spectrogram provides a two-dimensional representation, where the horizontal axis corresponds to the time dimension and the vertical axis to frequency.
The choice of 512 mel filters and the final resizing of spectrograms to 640 × 640 pixels are motivated by the internal processing characteristics of YOLOv8 and YOLOv11. Although both models accept arbitrary spatial resolutions, they internally normalize inputs to 640 × 640 during feature extraction. Using 512 mel bands allows the vertical dimension of the spectrogram to approximate this internal resolution, thereby reducing interpolation artifacts and minimizing distortions introduced by YOLO’s non-modifiable resizing operations. This alignment improves reproducibility and ensures that model performance reflects the behavior of the detector rather than inconsistencies introduced by mismatched input scales.
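For reference, a mel-spectrogram front end can be sketched in plain NumPy as below. The sample rate, FFT size, and hop length are illustrative defaults, not the exact values of Table 2, and the HTK-style mel formula is one common convention among several.

```python
import numpy as np

def mel_filter_bank(sr=16000, n_fft=1024, n_mels=512, fmin=0.0, fmax=None):
    """Triangular mel filter bank (HTK-style mel scale),
    shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, len(bin_freqs)))
    for i in range(n_mels):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (bin_freqs - lo) / (ctr - lo)       # rising slope of the triangle
        down = (hi - bin_freqs) / (hi - ctr)     # falling slope
        fb[i] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

def mel_spectrogram(wave, sr=16000, n_fft=1024, hop=512, n_mels=512):
    """Windowed power STFT projected onto the mel filter bank."""
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    return mel_filter_bank(sr, n_fft, n_mels) @ power.T   # (n_mels, n_frames)

mel = mel_spectrogram(np.random.default_rng(0).standard_normal(16000))  # 1 s of noise
```

In practice a library such as librosa or torchaudio would replace this sketch; the point is that the output is a non-negative (n_mels, n_frames) matrix whose vertical dimension is chosen to approximate YOLO's internal 640-pixel resolution.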
4.2. Conversion from Mel-Spectrogram to Image
Once the mel-spectrogram is computed, it is essential to convert it into an RGB image that serves as input to the YOLOv8 and YOLOv11 network. The conversion function, implemented in Python v3.11 with PyTorch v2.2.2 and Matplotlib v3.8.2, performs the following operations:
Amplitude to Decibel Conversion: The function converts the amplitude values in the mel-spectrogram into decibels. This logarithmic transformation is crucial for visualizing the wide dynamic range of the audio signal.
Normalization: After conversion to decibels, the spectrogram is normalized to a [0, 1] range. This normalization ensures that the subsequent image processing steps operate on a standardized input, which is beneficial for training deep neural networks.
Resizing: The normalized spectrogram is resized to a target dimension of 640 × 640 pixels using bicubic interpolation. This resizing step is critical because the YOLOv8 and YOLOv11 model expects a consistent input size. The interpolation is applied in such a way as to preserve the overall structure and spectral features. It is important to note that the training audios have a length of 10 s. Therefore, the time warping applied is similar for all the training audios. For evaluation, we also use 10 s segments.
Color Mapping: To enhance the visual representation, a colormap (specifically the viridis colormap) is applied to the resized spectrogram. This mapping converts the grayscale representation into a color image, where different hues represent varying levels of intensity. The resulting image is then scaled to an 8-bit format (values between 0 and 255).
Final Output: The function returns the generated RGB image, which encapsulates the spectral content of the audio in a format directly compatible with YOLOv8 and YOLOv11.
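The steps above can be sketched as follows. To stay self-contained, this version substitutes nearest-neighbour resizing and a grayscale three-channel stack for the bicubic interpolation and viridis colormap of the actual pipeline; the 80 dB dynamic-range clip is also an illustrative choice.

```python
import numpy as np

def melgram_to_rgb(mel, out_size=640, top_db=80.0):
    """Convert a power mel-spectrogram to a 640x640 8-bit, 3-channel image."""
    # 1. amplitude -> decibels, clipped to a fixed dynamic range
    db = 10.0 * np.log10(np.maximum(mel, 1e-10))
    db = np.maximum(db, db.max() - top_db)
    # 2. normalize to [0, 1]
    norm = (db - db.min()) / (db.max() - db.min() + 1e-12)
    # 3. resize to out_size x out_size (nearest neighbour for brevity;
    #    the real pipeline uses bicubic interpolation)
    rows = np.arange(out_size) * norm.shape[0] // out_size
    cols = np.arange(out_size) * norm.shape[1] // out_size
    resized = norm[np.ix_(rows, cols)]
    # 4. scale to 8 bits and stack three identical channels
    #    (the real pipeline applies the viridis colormap here)
    gray = (resized * 255).astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)

img = melgram_to_rgb(np.random.default_rng(0).random((512, 862)))
```

Because all 10 s training clips pass through the same resize, the implied time warping is uniform across the dataset, as noted above.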
While RGB images (
Figure 2) do not include additional acoustic information in comparison with grayscale images, this choice allows us to use YOLO without architectural modifications, since the model is optimized for three-channel inputs. Future work could explore single-channel log-mel inputs or multi-resolution channel encodings, but such variants require substantial architectural changes and are beyond the scope of this study.
4.3. Synthetic Audio Generation and Training Strategies
This section presents the process of generating synthetic audio data and the training strategies employed in our YOLOv8- and YOLOv11-based sound event detection framework. The synthetic data generation process follows the methodology outlined by the DCASE challenge [
21], leveraging the Scaper soundscape synthesis and augmentation library [
55]. The generation process uses the Scaper library to synthesize diverse mixtures whose event and background distributions are aligned to AudioSet statistics. Foregrounds are drawn from DESED soundbanks, backgrounds from SINS and TUT Acoustic Scenes, and non-target events from FUSS, ensuring acoustic diversity consistent with real-world domestic environments.
This approach enables the creation of a diverse, strongly labeled dataset that closely approximates the distribution observed in the validation set. In the following sections, we provide an in-depth explanation of the data synthesis process, the construction of various audio subsets, and the two distinct training strategies implemented to optimize model performance.
The process begins with the selection of high-quality audio clips from multiple sources, ensuring that both the target and non-target event distributions reflect real-world conditions.
Foreground Audio Selection: We used all available foreground files from the DESED synthetic soundbank. These files contain isolated sound events that serve as the building blocks for our synthetic soundscapes. The selection was guided by the need to replicate the event distribution observed in the validation set of AudioSet, ensuring that the frequency and occurrence of events are representative of real-world scenarios.
Background Audio Integration: To create a realistic acoustic context, background files were incorporated into the synthesis process. Specifically, files annotated as “other” from the SINS dataset and audio clips from the TUT Acoustic Scenes 2017 development dataset were employed. These backgrounds provide a wide variety of domestic environmental sounds, thereby enhancing the complexity and realism of the synthetic audio.
Non-Target Classes from FUSS: In addition to the foreground and background sounds, clips containing non-target classes were extracted from the FUSS dataset. The selection of these clips was informed by FSD50K annotations, ensuring that the non-target events are accurately represented in the synthetic soundscapes. This step is crucial for training the model to distinguish between target and non-target audio events effectively.
Event distribution statistics, both for target and non-target classes, were computed based on annotations from approximately 90,000 clips in AudioSet. This statistical analysis allowed us to calibrate the synthetic data generation process, ensuring that the distribution of events in our synthetic dataset closely mirrors that of the validation set. The alignment of event distributions is vital for achieving high generalization and robustness in the detection model.
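As a rough illustration of this calibration step, per-class sampling weights can be derived from annotation counts and then used to draw event labels for each synthetic clip. The class counts below are illustrative placeholders, not the actual statistics computed from the ~90,000 AudioSet clips, and the function names are ours:

```python
import random

# Hypothetical per-class annotation counts (illustrative numbers only),
# standing in for statistics computed over ~90k AudioSet clips.
annotation_counts = {
    "Speech": 4000, "Dog": 1200, "Dishes": 900, "Alarm_bell_ringing": 700,
    "Cat": 600, "Blender": 500, "Running_water": 800, "Frying": 400,
    "Electric_shaver_toothbrush": 300, "Vacuum_cleaner": 350,
}

def class_probabilities(counts):
    """Normalize raw annotation counts into sampling probabilities."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def sample_events(probs, n_events, rng=random):
    """Draw event labels for one synthetic clip according to the
    calibrated class distribution."""
    labels, weights = zip(*probs.items())
    return rng.choices(labels, weights=weights, k=n_events)

probs = class_probabilities(annotation_counts)
events = sample_events(probs, n_events=3, rng=random.Random(0))
```

In the actual pipeline, labels drawn this way would parameterize Scaper's event specification for each generated soundscape.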
4.3.1. Custom Synthetic Dataset
The synthetic dataset comprises multiple subsets, each designed to represent different aspects of the detection model’s performance under varying conditions. A total of 10,000 audio files of 10 s each were generated for each subset. Subsets 1–6 consist entirely of synthetic audio generated with Scaper, constructed under DCASE guidelines for controlled overlap, backgrounds, and event balance. In contrast, Subset 7 corresponds to the real, strongly labeled DCASE Task 4A dataset, which is used exclusively for model evaluation. This configuration mirrors the official DCASE data protocol, in which training employs synthetic soundscapes while validation relies on real recordings. The dataset is organized as follows:
Subset 1: Synthetic Audio Files with Single-Class, Non-Overlapping Events without Background: This subset consists of 10 s audio clips containing multiple events from a single target class. The events are arranged sequentially without any overlap, and no background noise is added. This configuration isolates the target events, allowing the model to focus solely on the characteristics of the sound events without interference from extraneous audio information.
Subset 2: Synthetic Audio Files with Single-Class, Non-Overlapping Events with Background: Similar to the previous subset, this collection also features 10 s clips with multiple events from a single class. However, these clips include background audio, simulating a more realistic scenario where the target events occur amidst ambient noise. This subset is critical for assessing the model’s performance in environments where background sounds may mask or distort the target events.
Subset 3: Synthetic Audio Files with Multi-Class, Non-Overlapping Events with Background: This subset expands on the complexity by incorporating multiple target classes within each 10 s clip. The events are arranged sequentially without overlapping, and background audio is present. The introduction of multiple classes tests the model’s ability to accurately classify and localize distinct sound events in the presence of potential class confusion and background interference.
Subset 4: Synthetic Audio Files with Multi-Class, Overlapping Events without Background: In this configuration, 10 s clips are generated with events from various classes that overlap in time. The absence of background audio ensures that the overlapping events are the sole focus. This subset is particularly challenging as the overlapping events require the model to disentangle concurrent audio signals and correctly identify each event, highlighting its ability to manage temporal complexities.
Subset 5: Synthetic Audio Files with Multi-Class, Overlapping Events with Background: Representing the most complex and realistic scenario, this subset contains 10 s clips with overlapping events from multiple classes and includes background audio. The simultaneous presence of overlapping events and background noise creates a challenging environment for the detection model, pushing its limits in terms of both localization accuracy and classification robustness.
4.3.2. Additional Dataset
Subset 6: Synthetic Audio Files Provided by DCASE 2023—Task 4A: In addition to the subsets generated using the aforementioned methods, this subset also includes synthetic data and strongly labeled sound events provided directly by DCASE.
Subset 7: Real-Audio, Strongly Labeled Data Provided by DCASE 2023—Task 4A: The strongly labeled dataset from DCASE includes real audio data and serves as a gold standard for training. The annotations in this dataset are verified by human experts, providing high-quality ground truth for model evaluation. Integrating these data into our training regimen helps bridge the gap between synthetic and real-world audio scenarios.
Strongly Labeled Data Validation: This strongly labeled dataset serves as the validation set for DCASE and, in our case, is used as a test set to evaluate the models. It is designed such that the distribution of clips per class is similar to that of the weakly labeled training set. The validation set contains 1168 clips (4093 events) and is annotated with strong labels, including timestamps provided by human annotators.
4.4. Training Strategies
To perform the experiments, the entire spectrogram image is used as a single input for the object detection model. This method takes advantage of the full temporal and frequency resolution of the audio representation.
The full image contains all spectral information throughout the duration of the audio clip, allowing the network to learn features holistically. However, the use of the full image poses challenges in terms of variability in temporal duration, which could complicate the learning process.
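Under this full-image formulation, each strongly labeled event must be encoded as a normalized bounding box over the spectrogram image in the standard YOLO format. A minimal sketch of such a conversion, assuming events carry an onset/offset in seconds and a frequency band in mel-bin indices (the function and its defaults are ours, for illustration):

```python
def event_to_yolo_box(onset, offset, f_low, f_high,
                      clip_dur=10.0, n_mels=128):
    """Convert a time-frequency event annotation into a normalized
    YOLO box (x_center, y_center, width, height), all in [0, 1].
    The x axis spans time, the y axis spans frequency."""
    x_center = (onset + offset) / 2.0 / clip_dur
    width = (offset - onset) / clip_dur
    y_center = (f_low + f_high) / 2.0 / n_mels
    height = (f_high - f_low) / n_mels
    return x_center, y_center, width, height

# A 2 s event from t=2 s to t=4 s covering the full mel range:
box = event_to_yolo_box(2.0, 4.0, 0, 128)
```

Long, sustained events thus yield wide boxes, while brief transients yield very narrow ones, which foreshadows the scale issues discussed in the results.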
The experiments are designed to explore two strategies at the training/learning level using the generated images as input to YOLOv8 and YOLOv11. Both strategies are designed to assess the impact of dataset composition and presentation order on model performance.
Curriculum Learning: In the first strategy, training is conducted sequentially, processing one dataset subset at a time. The process begins with the medium-capacity detector and the first subset (10,000 audio clips of single-class, non-overlapping events without background). This subset is split into an 80% training set and a 20% validation set. Once training on this subset converges, the weights of the iteration that achieved the best fitness score (best.pt) or the weights of the last epoch (last.pt) are used as the starting point for training on the next subset. This process is repeated iteratively across all dataset subsets, in increasing order of complexity. This ordered exposure constrains the distribution shift between phases and consolidates features before introducing harder mixtures, which mitigates (though does not eliminate) catastrophic forgetting as scene complexity rises. In our runs, phase-to-phase weight carry-over (continuing from last.pt) acts as an additional consolidation step and yields slightly more stable performance than restarting from best.pt. We therefore report both, highlighting the stability of last.pt under rising complexity. The sequential approach allows the model to gradually adapt to increasing complexity (i.e., from Subset 1 to Subset 7), starting from simpler scenarios and progressing to more challenging environments with overlapping events and background noise. This step-by-step learning process is hypothesized to help the model build a robust internal representation of audio events, improving its generalization capabilities across diverse acoustic conditions.
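The phase-to-phase checkpoint hand-off described above can be sketched as follows. Here `train_on_subset` stands in for a full detector training run (in practice, a call to the YOLO trainer) and is stubbed, so only the carry-over logic is shown; all names are illustrative:

```python
def run_curriculum(subsets, train_on_subset, init_weights="yolo_medium.pt",
                   carry="last"):
    """Train phase by phase in increasing order of complexity, seeding
    each phase with a checkpoint of the previous one.
    carry: 'last' continues from last.pt, 'best' from best.pt."""
    weights = init_weights
    history = []
    for subset in subsets:
        # Each phase returns the paths of its final and best checkpoints.
        ckpts = train_on_subset(subset, start_weights=weights)
        weights = ckpts[carry]  # hand-off: last.pt or best.pt
        history.append((subset, weights))
    return history

# Stubbed trainer: pretend each phase writes phase-specific checkpoints.
def fake_train(subset, start_weights):
    return {"last": f"{subset}/last.pt", "best": f"{subset}/best.pt"}

log = run_curriculum(["subset1", "subset2", "subset3"], fake_train)
```

Setting `carry="best"` instead continues each phase from best.pt, mirroring the two variants compared in the experiments.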
Randomized Blended Training: The second training strategy mixes all the subsets randomly into a single training corpus. This approach exposes the model to a wide variety of audio conditions simultaneously, potentially enhancing its ability to generalize across different acoustic scenarios. The data are divided into batches of 10,000 clips, and each training session processes a portion of the randomly mixed dataset. Checkpoints from these sessions are periodically consolidated in the same way as in the curriculum learning approach. This strategy introduces significant heterogeneity into the training data, challenging the model to reconcile different distributions and event characteristics concurrently.
5. Results
We report whether a localization-first detector on spectrograms, together with curriculum learning, improves boundary accuracy, handles overlaps, and maintains accuracy as dataset complexity increases. Unless otherwise noted, all performance metrics (
Table 3,
Table 4,
Table 5,
Table 6 and
Table 7) are computed on the real, strongly labeled DCASE evaluation set (Subset 7). Although models were trained solely on synthetic subsets (Subsets 1–6), they are evaluated entirely on real recordings to assess generalization. To expose sensitivity to localization strictness beyond the averaged mAP and the 0.5/0.95 operating points, we also report F1 as a function of the IoU threshold for representative short-, intermediate-, and long-duration classes (see
Figure 3).
This section defines the different model variants that have been developed, presents the results obtained, and compares them with models from other research groups.
5.1. Models Proposed
We propose four curriculum-learning YOLO-based SED models:
YOLOv8-SED-CurrLear-Last. Employs curriculum learning as its training strategy: the model progressively learns from easier to more complex examples to improve generalization, with each phase starting from the weights of the last training epoch (last.pt).
YOLOv8-SED-CurrLear-Best. Employs curriculum learning: each phase starts from the weights of the iteration that achieved the best fitness score (best.pt).
YOLOv11-SED-CurrLear-Last. Applies curriculum learning to YOLOv11: each phase starts from the weights of the last training epoch (last.pt).
YOLOv11-SED-CurrLear-Best. Applies curriculum learning to YOLOv11: each phase starts from the weights of the iteration that achieved the best fitness score (best.pt).
The results obtained by these models on the Strongly Labeled Data Validation set described in
Section 4.3.2 are shown in
Table 3 in terms of mAP for different thresholds of the Intersection over Union (IoU) between the detected and ground-truth bounding boxes of the audio events. In particular, thresholds of 0.5 and 0.95 are presented, together with the average of the results over thresholds between 0.5 and 0.95.
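For reference, the IoU used to match detected and ground-truth boxes is the standard ratio of intersection area to union area. A minimal sketch (the coordinate convention and the example boxes are illustrative):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection shifted by 1 s against a 4 s ground-truth event
# (x in seconds, y spanning the full frequency axis 0..1):
iou = box_iou((2.0, 0.0, 6.0, 1.0), (3.0, 0.0, 7.0, 1.0))
```

This example shows why strict thresholds such as 0.95 are demanding: a one-second misalignment of a four-second event already drops the IoU to 0.6.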
Table 3.
Results of the proposed curriculum-learning YOLO-based SED models.
| Model | Checkpt. | mAP@0.5 | mAP@0.95 | mAP@0.5:0.95 |
|---|---|---|---|---|
| YOLO v8 | Last | | | |
| YOLO v8 | Best | | | |
| YOLO v11 | Last | | | |
| YOLO v11 | Best | | | |
An examination of the results presented in
Table 3 shows that, in general, keeping the model from the last epoch of each subset tends to produce slightly better results than keeping the model with the best validation score. This may be due to instabilities in the validation results, which can lead to undertraining in some circumstances.
We also propose four YOLO-based SED models leveraging full mel-spectrograms and randomly mixed real and synthetic subsets:
YOLOv8-SED-RandMix-Last. Trained on a random mix of all real and synthetic training subsets, starting from the weights of the last training epoch (last.pt).
YOLOv8-SED-RandMix-Best. Trained on a random mix of all real and synthetic training subsets, starting from the weights of the iteration that achieved the best fitness score (best.pt).
YOLOv11-SED-RandMix-Last. This variant also uses the entire mel-spectrogram and is trained on a random mix of all real and synthetic training subsets, starting from the weights of the last training epoch (last.pt).
YOLOv11-SED-RandMix-Best. Trained on a random mix of all real and synthetic training subsets, starting from the weights of the iteration that achieved the best fitness score (best.pt).
The results produced by these models on the Strongly Labeled Data Validation set described in
Section 4.3.2 are reported in
Table 4, which follows the same format as
Table 3.
Table 4.
Score of four YOLO-based SED models leveraging full mel-spectrograms and randomly mixed real and synthetic subsets.
| Model | Checkpt. | mAP@0.5 | mAP@0.95 | mAP@0.5:0.95 |
|---|---|---|---|---|
| YOLO v8 | Last | | | |
| YOLO v8 | Best | | | |
| YOLO v11 | Last | | | |
| YOLO v11 | Best | | | |
Comparing
Table 3 and
Table 4, Randomized Blended Training yields lower performance on average than Curriculum Learning. The gap between
Table 3 and
Table 4 indicates better retention when complexity is introduced gradually rather than all at once. In practice, the curriculum schedule preserves previously learned classes more reliably as overlap and background are added.
Our results also reveal a clear instance of catastrophic forgetting within the YOLO-based SED training pipeline. Although the curriculum schedule initially improves stability by introducing acoustic conditions gradually, performance degradation becomes evident as soon as the most complex subsets—especially those with overlapping events—are incorporated. At this stage, the model tends to overwrite representations learned from earlier, simpler subsets, indicating that single-shot object detectors struggle to retain previously acquired temporal–spectral patterns when confronted with increasing polyphony. This behavior aligns with long-standing observations in continual learning, where replay buffers, regularization mechanisms (e.g., EWC-like penalties), or distillation-based stabilizers are commonly required to preserve past knowledge. Incorporating such strategies into YOLO-based SED models represents a promising but nontrivial extension, demanding architectural and algorithmic adjustments that fall outside the scope of the present work. Nonetheless, identifying the exact point in the curriculum where forgetting emerges provides a valuable foundation for future research on robust continual learning approaches for audio event detection.
These observations suggest that the choice of training strategy plays a critical role in optimizing model performance, even when the overall training conditions remain largely comparable. Although the differences related to which checkpoint is kept from one training phase as the initialization of the next are small, they indicate a small but consistent advantage for keeping the weights of the last epoch in the case of curriculum learning. Based on these insights, the model YOLOv11-SED-CurrLear-Last is selected as the most effective configuration for subsequent evaluations.
To make application-specific operating points explicit,
Figure 3 plots F1 versus the IoU threshold for representative classes spanning short (dog, dishes), intermediate (speech), and long events (vacuum cleaner, electric shaver/toothbrush). The sweep shows that short transients degrade steeply as the IoU threshold increases, reflecting tighter onset/offset demands, whereas long, sustained events remain comparatively stable. This analysis complements the mean mAP by revealing class-dependent trade-offs between coarse and precise localization, addressing use cases (e.g., alarm-like events) that require stricter boundary tolerance.
5.2. Results by Class of the Best Model
Table 5 and
Table 6 trace the class-by-class evolution of precision in terms of mAP, evaluated on the Strongly Labeled Data Validation set, as training progresses through the subsets of the curriculum. Comparing the per-class results, the proposed model is particularly effective for long, sustained events such as vacuum cleaner, electric shaver/toothbrush, frying, running water, or blender. These events tend to last for much of the 10 s audio segment and therefore occupy a large portion of the image taken as input by the YOLO model. On the other hand, the model struggles with very short acoustic events such as dog barks or dishes clattering, which appear as almost vertical lines in the mel-spectrogram and occupy a very small fraction of the image; this explains the difficulty YOLO has in detecting them. At the full-clip (10 s) resolution, the detector heads operate with strides and receptive fields that are large relative to these near-vertical traces, which reduces their salience at small scales. In addition, box-based matching penalizes these events via low temporal IoU when their annotated extents are short, further depressing mAP at higher IoU thresholds.
Short-duration acoustic events (e.g., dog barks,
Figure 4 or dish impacts,
Figure 2) occupy only a very small temporal region in the mel-spectrogram and therefore correspond to tiny regions in the 640 × 640 image processed internally by YOLO. This severely limits the effective receptive field covering these events and often results in insufficient discriminative evidence for the detector to produce reliable bounding boxes. To illustrate this phenomenon, we include a visualization of the actual model input for an audio clip containing dog barks. This example highlights the minimal spatial footprint of short events and motivates future improvements based on temporal zooming or sliding-window segmentation, which could provide finer temporal granularity and enrich local context for the detector.
Finally, intermediate events such as cat meows, alarm bell ringing, and speech, which tend to be longer than dog barks and dish clatter but shorter than vacuum cleaner, shaver/toothbrush, or blender, obtain an intermediate level of results with our model. A direct conclusion is that the length of the event (and therefore the portion of the input image occupied by it) is critical for obtaining good performance with this approach. Consistently, YOLOv11-SED-CurrLear-Last performs worst on very brief events (e.g., dog barks, clattering dishes), as reflected by their low mAP@0.5:0.95 in
Table 5 and
Table 6.
The curriculum strategy revealed a form of catastrophic forgetting when subsets with highly overlapping events were introduced. This behavior suggests that YOLO, like other deep models, may benefit from continual learning mechanisms such as replay, regularization, or distillation. Incorporating such techniques remains an important avenue for future work.
| Classes | Subset.1 | Subset.2 | Subset.3 | Subset.4 |
|---|---|---|---|---|
| Alarm Bell Ringing | | | | |
| Blender | | | | |
| Cat | | | | |
| Dishes | | | | |
| Dog | | | | |
| Elec. Shaver Toothbrush | | | | |
| Frying | | | | |
| Running Water | | | | |
| Speech | | | | |
| Vacuum Cleaner | | | | |
Another interesting insight from
Table 5 is that the inclusion of more complex subsets produces catastrophic forgetting, especially for very short events but also for intermediate ones. The decrease in performance for some classes is particularly noticeable when Subset 4 (the first to include overlapping events) enters training. It is also noteworthy that long-duration events such as shaver/toothbrush, frying, or running water actually improve with this subset. This pattern is consistent with multi-scale feature learning: YOLO's feature pyramid better preserves long-extent patterns, while short, sparse footprints remain vulnerable when overlap is introduced. Hence, we observe partial catastrophic forgetting for short events under overlap, despite overall gains for long events. A possible cause is that the long-duration events mask the short-duration events and confound the model when it learns the latter. This limitation of YOLO models in dealing with short-duration events, especially when they overlap with other events, needs to be addressed in future research.
| Classes | Subset.5 | Subset.6 | Subset.7 |
|---|---|---|---|
| Alarm Bell Ringing | | | |
| Blender | | | |
| Cat | | | |
| Dishes | | | |
| Dog | | | |
| Elec. Shaver Toothbrush | | | |
| Frying | | | |
| Running Water | | | |
| Speech | | | |
| Vacuum Cleaner | | | |
This degradation, particularly visible for short-duration events such as dog bark, dishes, and cat in
Table 5 and
Table 6, illustrates a stability–plasticity trade-off typical of curriculum learning. Although we did not yet implement specific anti-forgetting mechanisms, future research will evaluate interleaved replay of earlier subsets, teacher–student distillation, and regularization approaches to preserve learned representations while introducing overlapping scenes.
5.3. Comparison with Other Models
This section compares the results obtained by our proposed model with those obtained by other models. In particular, we use the following models for comparison.
5.3.1. Models Without Pretrained Audio Encoders
First, we compare our proposed models with models that do not include pretrained audio encoders. In particular, we use these models for comparison:
Model 1—CRNN. This model corresponds to the DCASE Challenge 2023 Task 4A baseline. It is a Convolutional Recurrent Neural Network (CRNN) composed of seven convolutional layers with average pooling, followed by two bidirectional GRU (BGRU) layers with 128 units each. The model takes mel-spectrograms as input and employs the Mean Teacher method [
56] to leverage weakly and strongly labeled data and unlabeled data.
Model 2—FDY-CRNN. This model extends the DCASE baseline by introducing Frequency Dynamic Convolutions (FDYs) [
57]. Traditional convolutional operations are designed for the image domain, where translation equivariance across both spatial axes is usually assumed. However, in the case of audio, especially mel-spectrograms, shifting an acoustic event along the frequency axis can significantly change how it sounds. To address this limitation, FDY introduces frequency-adaptive attention weights that are used as convolutional kernels. This allows the model to handle frequency shifts more effectively by adapting the convolution operation along the frequency axis. The results reported for this model are obtained by replicating the approach described in [
57].
Model 3—FDY-Conformer. This approach replaces the RNN in the baseline CRNN architecture with several Conformer layers [
58], which combine convolutional modules with Transformers [
59] to capture both local and global patterns. It uses an FDY-based CNN for feature extraction and dimensionality reduction, followed by seven Conformer encoder blocks with multi-head self-attention (four heads) and an encoder dimension of 144. Conformer-based models have shown greater robustness in avoiding misclassification of sound events [
3,
60] compared to prevalent CRNNs.
Table 7 compares the performance of the best of all models proposed in this paper,
YOLOv11-SED-CurrLear-Last, with these additional models previously used for this task. The evaluation is performed across the 10 event classes considered in DCASE 2023 Task 4A. The last row presents the global results.
Table 7.
mAP for comparison among models without audio-pretrained encoders.
| Classes | Mod.1 | Mod.2 | Mod.3 | Proposed Model |
|---|---|---|---|---|
| Alarm Bell Ringing | | | | |
| Blender | | | | |
| Cat | | | | |
| Dishes | | | | |
| Dog | | | | |
| Elec. Shaver Toothbrush | | | | |
| Frying | | | | |
| Running Water | | | | |
| Speech | | | | |
| Vacuum Cleaner | | | | |
| Global | | | | |
Against architectures trained from scratch, the proposed curriculum-trained YOLOv11-SED-CurrLear-Last obtains the highest mean score among the compared baselines in our setting, edging ahead of the classic CRNN baseline and its FDY upgrade, while more than doubling the score of the FDY-Conformer. Its object-detection bias proves advantageous for quasi-stationary sound events, e.g., vacuum cleaner, electric shaver, and frying, yet it still struggles to detect brief events such as dog barks or clattering dishes. Overall, reframing SED as a visual detection problem and introducing complexity gradually via curriculum learning indicates that a vision-pretrained YOLO can be competitive with some audio-centric models in our setting.
To be completely fair in the comparison, we must acknowledge that our proposed models are pretrained on large image datasets, while the systems used for comparison in
Table 7 are trained from scratch on the audio events only. In the next section, we compare our proposed model against systems using audio encoders, typically pretrained on images and then fine-tuned on large amounts of audio.
5.3.2. Models with Pretrained Audio Encoders
In this section, we compare our proposed model with models that include pretrained audio encoders. These pretrained audio encoders are typically pretrained on images and then fine-tuned on large amounts of audio. The following models are used in this comparison:
Model 4—CRNN-BEATs. This model replicates the second baseline proposed in Task 4A of the DCASE Challenge 2023. It leverages audio embeddings from the pretrained BEATs (Bidirectional Encoder representation from Audio Transformers) model [
61]. These embeddings are concatenated with the features extracted by the CNN and then passed to the RNN. BEATs achieve state-of-the-art performance on the AudioSet benchmark through iterative optimization of an acoustic tokenizer and a self-supervised audio model.
Model 5—FDY-Conformer-BEATs. This model [
18] extends Model 3 by concatenating BEATs embeddings with CNN-extracted features, which are then processed by Conformer blocks instead of an RNN.
Conversely, these models fuse BEATs embeddings with CNN features, enabling a precise appraisal of the performance gains attributable to large-scale self-supervised audio pretraining within otherwise comparable backbones; the results are reported in
Table 8.
When self-supervised audio representations are added, the balance shifts: CRNN-BEATs leaps ahead on the global score, comfortably surpassing the best YOLO configuration, whereas FDY-Conformer-BEATs falls below both. This contrast underscores the decisive boost that large-scale audio pretraining (here, BEATs embeddings) gives to conventional spectrogram pipelines, a boost that the current YOLO adaptation, still limited to visual pretraining, cannot yet tap. Thus, while YOLO remains competitive, closing the gap with the top encoder-augmented baseline likely requires fusing it with audio-native embeddings or adopting multi-modal pretraining strategies. This time, the comparison is not completely fair to our proposed model, since the amount of audio used to train it is far more limited than that used to train BEATs.
The superior global score of the BEATs-based systems primarily reflects the benefit of large-scale audio-domain pretraining, which is absent in our vision-initialized YOLO models. This difference highlights the expected advantage of self-supervised acoustic encoders rather than a shortcoming of our approach.
6. Conclusions
While our models do not surpass the most advanced audio-specific Transformer models (e.g., AST, PaSST, BEATs, or Data2Vec-Audio), our results show that, operating directly on spectrograms, YOLO is competitive with strong non-pretrained baselines while offering distinct advantages in boundary precision.
Phase-wise analysis underscores that the stage introducing overlapping events is pivotal: while it enriches the acoustic diversity needed for real-world robustness, it can also trigger catastrophic forgetting, most acutely for events of very short duration. While our current curriculum partially mitigates this phenomenon, it does not fully prevent it. Future work will systematically explore strategies such as replay of earlier data, cross-phase distillation, and continual-learning regularization to reduce catastrophic forgetting in overlapping acoustic scenes. We explored mitigating this effect by freezing or "checkpointing" the best-performing model after each phase. However, because validation accuracy fluctuated markedly from phase to phase, the checkpoint strategy offered no consistent benefit; retaining a single model trained through all phases yielded slightly better and more stable performance. A principled schedule for introducing overlap that retains short events while adding polyphony remains an open problem.
In sum, our findings support curriculum learning as a practical route to better generalization under our data and protocol in sound-event detection, provided that overlapping examples are introduced with care and that model selection is based on the entire training trajectory rather than on intermediate checkpoints.
When compared with encoder-augmented baselines, our curriculum-trained YOLOv11-SED-CurrLear-Last surpasses FDY-Conformer-BEATs on the global score but still trails the CRNN-BEATs reference. This gap confirms that large self-supervised audio embeddings (e.g., BEATs) remain advantageous for peak accuracy. Nevertheless, the vision-first YOLO detector captures a substantial share of the relevant time-frequency structure without any audio-domain pretraining, an advantage in annotation-constrained deployments. It is also important to note that our YOLO-based detectors rely exclusively on vision-domain pretraining, whereas the BEATs-augmented baselines incorporate large-scale audio-native self-supervision. The gap observed in
Table 8 therefore reflects differences in pretraining modality and corpus scale. Bridging this gap requires adopting audio-native or multimodal pretraining, which we identify as a direction for future work.
The Scaper-generated soundscapes—crafted under DCASE guidelines with precise control over event overlap, background interference, and class balance—were not merely a convenient augmentation; they were indispensable for executing the curriculum-learning pipeline that ultimately underpins our strongest results. These purpose-built mixtures enabled a clean-to-complex progression that would have been impossible with real-world data alone, thereby tightening decision boundaries, reducing over-fitting, and safeguarding detector stability when confronted with scarce or imbalanced natural recordings. In concert with the strongly-labeled DCASE clips, the synthetic corpus furnished the structured diversity required for each curriculum phase, confirming its essential role in the success of our study.
Taken together, our results illustrate the practical benefits of cross-pollinating audio signal processing with mature computer-vision technology. They also open a path for future hybrids that merge vision-style detectors with self-supervised audio encoders and cross-domain evaluation metrics. Such interdisciplinary efforts will be essential for building faster, more scalable and more accurate sound-event detection systems capable of operating in the wild. We caution that conclusions are limited to the evaluated datasets and metrics; further study is required to assess robustness across recording conditions and label policies.
While the use of synthetic mixtures enables fully controlled, strongly labeled training, it may still limit generalization to unconstrained real-world recordings beyond the DCASE domain. This trade-off is inherent to the benchmark protocol, but future work will incorporate real annotated data and in-the-wild recordings to improve realism and robustness.
As for targeted mitigations, future work will evaluate higher frame-rate mel features and temporal up-sampling in the backbone, short-window sliding-patch (window = 1 s, hop = 0.5 s), class-balanced focal loss, and time-aware post-processing (e.g., class-specific minimum-duration priors).
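The class-specific minimum-duration prior mentioned above amounts to a simple post-processing filter over the detector's output; the detection format and the threshold values below are hypothetical:

```python
def filter_min_duration(detections, min_dur):
    """Drop detections shorter than a class-specific minimum duration.
    detections: list of (label, onset, offset) tuples, times in seconds.
    min_dur: mapping from label to its minimum plausible duration."""
    return [
        (label, on, off)
        for label, on, off in detections
        if (off - on) >= min_dur.get(label, 0.0)
    ]

# A 50 ms "Dog" detection is implausibly short and gets suppressed:
dets = [("Dog", 1.0, 1.05), ("Dog", 2.0, 2.5), ("Vacuum_cleaner", 0.0, 8.0)]
kept = filter_min_duration(dets, {"Dog": 0.1, "Vacuum_cleaner": 2.0})
```

Such priors are cheap to apply and directly target the spurious, near-zero-width boxes that short events tend to produce.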
Overall, YOLO-based SED represents a complementary direction to Transformer-based approaches: the former emphasizes one-step localized event detection, whereas the latter excels at global context modeling. Future research may explore hybrid or multimodal architectures that combine these strengths.
Although this work demonstrates that YOLO-based detectors can operate competitively without audio-specific pretraining, we acknowledge that a direct comparison with audio-specialized architectures such as AST, PaSST, or Data2Vec-Audio remains an open direction. Incorporating these models in future evaluations will provide a broader assessment of the relative strengths and limitations of localization-first versus embedding-first SED pipelines.
7. Opportunities for Future Work
Building upon our findings, several lines of investigation can be pursued:
Sliding Window Training: We propose to apply a sliding window (e.g., one second with an overlap of half a second) across full-band mel-spectrograms, sending each patch to YOLOv8/11. This segmentation enlarges the dataset, preserves temporal continuity, and may improve the detection of short events.
Training with Pseudo-Strong Labels: A promising path forward involves converting weak labels—commonly available in real-world datasets—into pseudo-strong annotations using methods such as multiple instance learning or temporal localization models.
Using Additional Datasets like WildDESED: While our work focused on synthetic data and standard DCASE datasets, external collections like WildDESED, which include more spontaneous, uncontrolled soundscapes, could challenge models with new types of audio complexity. For example, training on or evaluating with external libraries (e.g., BBC Sounds, Freesound) would test generalizability and robustness to unseen environments.
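The sliding-window scheme proposed in the first item can be sketched as follows, using the suggested window and hop values (the function itself is illustrative):

```python
def sliding_windows(total_dur, window=1.0, hop=0.5):
    """Yield (start, end) spans covering a clip with a fixed-size window
    and hop, as proposed for sliding-window training."""
    spans = []
    start = 0.0
    while start + window <= total_dur + 1e-9:  # tolerance for float drift
        spans.append((start, start + window))
        start += hop
    return spans

spans = sliding_windows(10.0)  # windows over one 10 s clip
```

Each span would then be cropped from the full mel-spectrogram and paired with the event boxes that intersect it, multiplying the number of training patches per clip.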