Article

You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection

Satvik Venkatesh, David Moffat and Eduardo Reck Miranda
1 Interdisciplinary Centre for Computer Music Research, University of Plymouth, Plymouth PL4 8AA, UK
2 Plymouth Marine Laboratory, Plymouth PL1 3DH, UK
* Author to whom correspondence should be addressed.
Academic Editor: Sławomir K. Zieliński
Appl. Sci. 2022, 12(7), 3293; https://doi.org/10.3390/app12073293
Received: 8 March 2022 / Revised: 21 March 2022 / Accepted: 22 March 2022 / Published: 24 March 2022
Abstract
Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. They are useful for audio-content analysis, speech recognition, audio-indexing, and music information retrieval. In recent years, most research has adopted segmentation-by-classification, a technique that divides audio into small frames and classifies each frame individually. In this paper, we present a novel approach called You Only Hear Once (YOHO), inspired by the YOLO algorithm widely adopted in computer vision. Instead of frame-based classification, we convert the detection of acoustic boundaries into a regression problem, with separate output neurons to detect the presence of an audio class and to predict its start and end points. YOHO's relative improvement in F-measure over the state-of-the-art convolutional recurrent neural network ranged from 1% to 6% across multiple datasets for audio segmentation and sound event detection. Because YOHO's output is more end-to-end and has fewer neurons to predict, inference is at least 6 times faster than segmentation-by-classification. In addition, as this approach predicts acoustic boundaries directly, post-processing and smoothing are about 7 times faster.
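To make the output formulation concrete, the sketch below shows one way a YOHO-style output head and loss could look in TensorFlow/Keras. The grid size, class count, activation choices, and loss weighting here are illustrative assumptions based on the abstract, not the authors' reference implementation.

```python
# A minimal sketch of a YOHO-style output head and loss in TensorFlow/Keras.
# NUM_CLASSES, TIME_BINS, and the loss weighting are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 2   # e.g., music and speech (assumed)
TIME_BINS = 9     # coarse output grid over one input window (assumed)


def yoho_head(features: tf.Tensor) -> tf.Tensor:
    """Map (batch, TIME_BINS, feat_dim) features to (batch, TIME_BINS, NUM_CLASSES, 3).

    For each time bin and class, the head emits three numbers: a presence
    score and the normalized start/end positions of the event within the bin.
    """
    out = tf.keras.layers.Dense(3 * NUM_CLASSES, activation="sigmoid")(features)
    return tf.keras.layers.Reshape((TIME_BINS, NUM_CLASSES, 3))(out)


def yoho_loss(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Squared-error regression loss in the spirit of YOLO.

    The presence neuron is always penalized; the start/end neurons only
    contribute where the class is actually present (y_true[..., 0] == 1),
    so absent classes do not pull the boundary predictions around.
    """
    presence_true = y_true[..., 0]
    presence_loss = tf.square(presence_true - y_pred[..., 0])
    boundary_loss = presence_true * tf.reduce_sum(
        tf.square(y_true[..., 1:] - y_pred[..., 1:]), axis=-1
    )
    return tf.reduce_mean(presence_loss + boundary_loss)
```

At inference time, bins whose presence score exceeds a threshold would be mapped back to absolute (start, end) times and adjacent detections of the same class merged; because the network already outputs boundaries, this step is much lighter than the frame-wise smoothing that segmentation-by-classification requires, which is consistent with the speed-ups reported in the abstract.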
Keywords: audio segmentation; sound event detection; you only look once; deep learning; regression; convolutional neural network; music-speech detection; convolutional recurrent neural network; radio
MDPI and ACS Style

Venkatesh, S.; Moffat, D.; Miranda, E.R. You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection. Appl. Sci. 2022, 12, 3293. https://doi.org/10.3390/app12073293
