Applied Sciences
  • This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

24 December 2025

YOLO-Based Transfer Learning for Sound Event Detection Using Visual Object Detection Techniques

AUDIAS—Audio, Data Intelligence and Speech Research Group, Universidad Autónoma de Madrid, Calle Francisco Tomás y Valiente, 11, Ciudad Universitaria de Canto Blanco, 28049 Madrid, Spain
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Audio Signal Processing

Abstract

Traditional Sound Event Detection (SED) approaches rely either on specialized models or on such models combined with general-purpose audio embedding extractors. In this article, we reframe SED as an object detection task in the time–frequency plane and adapt modern YOLO detectors directly to audio. To our knowledge, this is among the first works to employ YOLOv8 and YOLOv11 not merely as feature extractors but as end-to-end models that localize and classify sound events on mel-spectrograms. Methodologically, our approach (i) generates mel-spectrograms on the fly from raw audio to streamline the pipeline and enable transfer learning from vision models; (ii) applies curriculum learning that exposes the detector to progressively more complex mixtures, improving robustness to overlaps; and (iii) augments training with synthetic audio constructed under DCASE 2023 guidelines to enrich rare classes and challenging scenarios. We compare our YOLO-based framework against strong CRNN and Conformer baselines in a DCASE-style setting. The method achieves competitive detection accuracy relative to these baselines, with gains in some overlapping or noisy conditions and shortcomings for several short-duration classes. These results suggest that adapting modern object detectors to audio can be effective in this setting, while broader generalization and encoder-augmented comparisons remain open.
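To make the reframing concrete, the following minimal sketch shows one way a timestamped sound event could be turned into a YOLO-style bounding box on a spectrogram image. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how boxes are built, so the names `SoundEvent`, `event_to_yolo_box`, and `to_label_line` are hypothetical, and the default of a box spanning the full frequency axis is an assumption made because SED annotations usually carry only onset and offset times. The output follows the common YOLO label convention of a class index followed by a normalized box center, width, and height.

```python
from dataclasses import dataclass


@dataclass
class SoundEvent:
    """Hypothetical annotation record: class index plus onset/offset in seconds."""
    class_id: int
    onset_s: float
    offset_s: float


def event_to_yolo_box(event, clip_duration_s, y_top=0.0, y_bottom=1.0):
    """Map a sound event to (class_id, x_center, y_center, width, height).

    All box values are normalized to [0, 1] relative to the spectrogram image.
    The horizontal axis is time. The vertical extent defaults to the whole
    image (an assumption, since the annotation has no frequency bounds).
    """
    if clip_duration_s <= 0:
        raise ValueError("clip duration must be positive")
    # Clamp the event to the clip so the box never leaves the image.
    onset = min(max(event.onset_s, 0.0), clip_duration_s)
    offset = min(max(event.offset_s, 0.0), clip_duration_s)
    if offset <= onset:
        raise ValueError("event has no extent inside the clip")
    width = (offset - onset) / clip_duration_s
    x_center = (onset + offset) / 2.0 / clip_duration_s
    height = y_bottom - y_top
    y_center = (y_top + y_bottom) / 2.0
    return (event.class_id, x_center, y_center, width, height)


def to_label_line(box):
    """Format a box as one line of a YOLO-style text label file."""
    return "%d %.6f %.6f %.6f %.6f" % box


if __name__ == "__main__":
    # A 2 s event starting at 2 s inside a 10 s clip.
    example = SoundEvent(class_id=3, onset_s=2.0, offset_s=4.0)
    print(to_label_line(event_to_yolo_box(example, clip_duration_s=10.0)))
```

With this mapping, overlapping events simply become overlapping boxes of different classes, which is the situation an object detector is designed to handle. Short events produce very narrow boxes, which may be one reason short-duration classes are harder in this formulation.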
