2.1. Classification and Detection of Sound Events
Sound event classification in its simplest form requires assigning an event class label to each test audio recording, as illustrated in Figure 1A. Classification is performed on audio containing isolated sound events [10] or containing a target sound event and additional overlapping sounds [23]. In classification, the system output is a class label, and no information is provided about the temporal boundaries of the sounds. Audio containing multiple, possibly overlapping sounds can be classified into multiple classes, performing audio tagging, as illustrated in Figure 1B. Tagging of audio with sound event labels is used, for example, for improving the tags of Freesound audio samples [24], and has been proposed as an approach for audio surveillance of home environments [25]. Single-label classification is equivalent to tagging when a single tag is assigned per test file.
Sound event detection requires detecting onsets and offsets in addition to assigning class labels, and it usually involves detecting multiple sound events in a test recording. This results in assigning sound event labels to selected segments of the audio, as illustrated in Figure 1C. For overlapping sounds these segments overlap, creating a multi-level segmentation of the audio based on the number and temporal location of the recognized events. Detecting only the most prominent event at each time yields a monophonic output, which is a simplified representation of the polyphonic output. These cases are presented in Figure 2, together with the polyphonic annotation of the audio. Polyphonic sound event detection can be seen as frame-by-frame multi-class, multi-label classification of the test audio. In this respect, polyphonic sound event detection is similar to polyphonic music transcription, with sound events equivalent to musical notes and the polyphonic annotation similar to the piano-roll representation of music.
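As a concrete illustration of this frame-wise multi-label view, the following minimal Python sketch derives both a polyphonic and a monophonic output from hypothetical per-frame class likelihoods; the class names, frame count, and 0.5 threshold are illustrative assumptions rather than values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["speech", "music", "beep"]         # hypothetical class set
likelihoods = rng.random((10, len(classes)))  # stand-in per-frame class likelihoods

# Polyphonic output: an independent binary decision per frame and class,
# so several classes may be active in the same frame.
polyphonic = likelihoods > 0.5

# Monophonic output: keep only the most prominent class in each frame,
# a simplified representation of the polyphonic output.
monophonic = np.zeros_like(polyphonic)
monophonic[np.arange(len(likelihoods)), likelihoods.argmax(axis=1)] = True
```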
2.2. Building a Polyphonic Sound Event Detection System
In a multisource environment such as our everyday acoustic environment, multiple different sound sources can be active at the same time. For such data, the annotation contains overlapping event instances, as illustrated in Figure 2, with each event instance having an associated onset, offset, and label. The label is a textual description of the sound, such as “speech”, “beep”, or “music”. Sound event detection is treated as a supervised learning problem, with the event classes defined in advance and all sound instances used in training belonging to one of these classes. The aim of sound event detection is to provide a description of the acoustic input that is as close as possible to the reference. In this case, the requirement is for the sound event detection system to output detected sound event instances, each having an associated onset, offset, and a textual label belonging to one of the learned event classes.
The stages of a sound event detection system are illustrated in Figure 3. The training chain involves processing of the audio and annotations, followed by training of the sound event detection system. Acoustic features are extracted from the audio, and the training stage finds the mapping between the acoustic features and the sound event activities given by the annotations. The testing chain involves processing the test audio in the same way as for training, testing the system, and, if needed, postprocessing the system output to obtain a representation similar to the annotations.
The audio is processed in short frames of typically 20–200 ms to extract the audio features of choice. Representations of the signal spectrum are often used in sound event detection, such as mel-frequency cepstral coefficients [25], mel energies [17], or simply the amplitude or power spectrum [18]. The audio processing chain may include simple preprocessing of the audio such as normalization and pre-emphasis before feature extraction, or more complex preprocessing such as sound source separation and acoustic stream selection for reducing the complexity of the audio mixtures used in training [16]. The annotations are processed to obtain a representation suitable for the training method. In the illustrated example, the annotations are processed into a binary activity representation that provides the class activity information for each frame during system training.
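A minimal sketch of these two processing steps is given below; librosa is an assumed choice of feature-extraction library, and the frame settings, event times, and labels are illustrative rather than prescribed by the text:

```python
import numpy as np
import librosa  # assumed feature-extraction library

sr = 44100
audio = np.random.randn(3 * sr).astype(np.float32)  # stand-in for 3 s of real audio

# 40 ms analysis frames with a 20 ms hop, within the typical 20-200 ms range.
n_fft, hop = int(0.04 * sr), int(0.02 * sr)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                     hop_length=hop, n_mels=40)
features = librosa.power_to_db(mel).T  # shape: (n_frames, n_mels)

# Hypothetical annotation: (onset, offset, label) triples in seconds.
events = [(0.0, 2.5, "speech"), (1.0, 3.0, "music"), (2.0, 2.4, "beep")]
labels = ["speech", "music", "beep"]

# Binary activity matrix aligned with the feature frames: activity[i, k] = 1
# when class k is annotated as active during frame i.
n_frames = features.shape[0]
frame_sec = hop / sr
activity = np.zeros((n_frames, len(labels)), dtype=int)
for onset, offset, label in events:
    start, stop = int(onset / frame_sec), int(np.ceil(offset / frame_sec))
    activity[start:min(stop, n_frames), labels.index(label)] = 1
```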
Training uses the obtained audio features and the corresponding target output given by the reference for supervised learning. Possible learning approaches for this step include Gaussian mixture models [25], hidden Markov models [26], non-negative dictionaries [17], deep neural networks [18], etc. For testing, an audio recording goes through the same preprocessing and feature extraction as applied in the training stage. Afterwards, the trained system is used to map the audio features to event class likelihoods or direct decisions, according to the employed method. A further postprocessing step may be needed to smooth the system output and obtain a binary activity representation for the estimated event classes. Smoothing methods include median filtering [27], the use of a fixed-length decision-making window, majority voting, etc., while binarization is usually obtained by applying a threshold [17].
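For illustration, a minimal sketch of such postprocessing is shown below, assuming hypothetical per-frame likelihoods; the 0.5 threshold and 11-frame median window are illustrative choices, not recommended settings:

```python
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(1)
likelihoods = rng.random((100, 3))  # stand-in model output (frames x classes)

# Binarization: threshold each class likelihood independently.
binary = (likelihoods > 0.5).astype(float)

# Smoothing: median filtering per class removes spurious single-frame activations.
smoothed = np.column_stack([medfilt(binary[:, k], kernel_size=11)
                            for k in range(binary.shape[1])])
```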
Evaluation is done by comparing the system output with a reference available for the test data. Systems performing sound event classification are usually evaluated in terms of accuracy [2]. Studies involving both monophonic and polyphonic sound event detection report results using a variety of metrics, for example precision, recall, and F-score [6], only F-score [7], recognition rate and false positive rate [3], or false positive and false negative rates [1]. One of the first evaluation campaigns for sound event detection (CLEAR 2007) used the acoustic event error rate as the evaluation metric, expressed as a time percentage [29]. In this case, however, the system output was expected to be monophonic, while the ground truth was polyphonic. Later, the acoustic event error rate was redefined for frame-based calculation and used, for example, in DCASE 2013 [20], while a similarly defined error rate was used as a secondary metric in the MIREX Multiple Fundamental Frequency Estimation and Tracking task [30].
These metrics are well established in different research areas, but the temporal overlap in polyphonic sound event detection leads to changes in their calculation or interpretation. Simple metrics that count the number of correct predictions, as used in classification and information retrieval, must be defined to consider multiple classes at the same time. Evaluation in the neighboring fields of speech recognition and speaker diarization dynamically aligns the system output with the ground truth and measures the degree of misalignment between them. Polyphonic annotation and system output cannot be aligned in a unique way; therefore, an error rate defined on this misalignment must be adapted to the situation. In a similar way, evaluation of polyphonic music transcription uses metrics with modified definitions to account for overlapping notes played by different instruments.
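As an illustration of how such counting metrics extend to the polyphonic case, the sketch below computes precision, recall, and F-score over all (frame, class) cells of binary activity matrices, so that simultaneously active classes are all taken into account; the function name and toy matrices are illustrative:

```python
import numpy as np

def frame_based_scores(reference, output):
    """Precision, recall, and F-score counted over every (frame, class) cell
    of binary activity matrices, covering overlapping events."""
    tp = np.logical_and(reference == 1, output == 1).sum()
    fp = np.logical_and(reference == 0, output == 1).sum()
    fn = np.logical_and(reference == 1, output == 0).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score

reference = np.array([[1, 0], [1, 1], [0, 1]])  # toy reference, 3 frames x 2 classes
output = np.array([[1, 0], [1, 0], [1, 1]])     # toy system output
print(frame_based_scores(reference, output))    # (0.75, 0.75, 0.75)
```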
Obtaining the reference necessary for training and evaluating sound event detection systems is not a trivial task. One way to obtain annotated data is to create synthetic mixtures using isolated sound events, possibly allowing control of the signal-to-noise ratio and the amount of overlapping sounds [31]. This method has the advantage of being efficient and providing a detailed and exact reference, close to a true ground truth. However, synthetic mixtures cannot model the variability encountered in real life, where there is no control over the number and type of sound sources and their degree of overlap.
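A minimal sketch of creating such a mixture is given below; the function, the placement of the event at the start of the background, and the test signals are illustrative assumptions, but the key point stands: the event onset, offset, and class are known exactly by construction, yielding an exact reference:

```python
import numpy as np

def mix_at_snr(event, background, snr_db):
    """Scale an isolated event to the requested event-to-background SNR (in dB)
    and add it at the start of the background recording."""
    event_power = np.mean(event ** 2)
    background_power = np.mean(background ** 2)
    gain = np.sqrt(background_power / event_power * 10 ** (snr_db / 10))
    mixture = background.copy()
    mixture[:len(event)] += gain * event
    return mixture

sr = 16000
background = 0.1 * np.random.randn(10 * sr)            # stand-in ambient noise, 10 s
event = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # stand-in 1 s isolated event
mixture = mix_at_snr(event, background, snr_db=0.0)
# The reference annotation is exact by construction: onset 0.0 s, offset 1.0 s.
```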
Real-life audio data is easy to collect but very time-consuming to annotate. Currently, there are only a few, rather small, public datasets consisting of real-world recordings with polyphonic annotations. The DARES data [19] is one such example, but it is ill-suited for sound event detection due to the high number of classes compared to the number of examples (around 3200 sound event instances belonging to over 700 classes). The CLEAR evaluation data [29] is commercially available and contains audio recorded in controlled conditions in a meeting environment. TUT Sound Events [32] has recently been published for DCASE 2016 and contains sound events annotated using freely chosen labels. Nouns were used to characterize each sound source and verbs to characterize the sound production mechanism whenever possible, while the onset and offset locations were marked to match the perceived temporal location of the sounds. As a consequence, the obtained manual annotations are highly subjective.
For this type of data, no annotator agreement studies are available. One recent study on inter-annotator agreement is presented in [25], for tagging of audio recorded in a home environment. The annotation approach associated multiple labels with each 4-s segment of the audio recordings, based on a set of seven labels describing the sound sources present. With three annotators, three sets of multi-label annotations were obtained per segment. The work does not address the subjectivity of temporally delimiting the labeled sounds. The authors observed strong inter-annotator agreement for the labels “child speech”, “male speech”, “female speech”, and “video game/TV”, but relatively low agreement for “percussive sounds”, “broadband noise”, “other identifiable sounds”, and “silence/background”. The results suggest that annotators have difficulty assigning labels to ambiguous sound sources. For the more general task of sound event detection with a large number of classes, there is insufficient data generated by multiple annotators to assess inter-annotator agreement. For the purpose of this study, we consider the subjective manual annotation or the automatically generated synthetic annotation as correct, and use it as a reference to evaluate system performance.
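As one simple illustration of how such agreement could be quantified, the sketch below computes a per-label Jaccard index between two annotators; this is our illustrative choice, not the measure used in [25], and the annotations are made up:

```python
import numpy as np

# Hypothetical multi-label annotations from two annotators: rows are segments,
# columns are labels; 1 marks a label assigned to that segment.
annotator_1 = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
annotator_2 = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1]])

# Per-label Jaccard index: segments where both marked the label active,
# divided by segments where at least one of them did.
intersection = np.logical_and(annotator_1, annotator_2).sum(axis=0)
union = np.logical_or(annotator_1, annotator_2).sum(axis=0)
agreement = np.divide(intersection, union,
                      out=np.ones(union.shape), where=union > 0)
print(agreement)  # one agreement value per label
```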
Evaluation of the system output can be done at the different stages illustrated in the example in Figure 3. It is possible to compare the event activity matrix obtained after preprocessing the annotation with the system output in the same form. If the system output is further transformed into separate event instances, as in the annotations, the comparison can be performed at the level of individual events.
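A minimal sketch of the event-level comparison is given below; the greedy matching on labels and onset times, and the 200 ms tolerance, are illustrative simplifications (offset conditions are omitted for brevity):

```python
def count_correct_events(reference, estimated, onset_tolerance=0.2):
    """Greedily match estimated events to reference events with the same label
    and an onset within onset_tolerance seconds; returns the number of matches.
    Unmatched reference events are misses, unmatched estimated ones insertions."""
    used = set()
    correct = 0
    for ref_onset, ref_offset, ref_label in reference:
        for j, (est_onset, est_offset, est_label) in enumerate(estimated):
            if j not in used and est_label == ref_label \
                    and abs(est_onset - ref_onset) <= onset_tolerance:
                used.add(j)
                correct += 1
                break
    return correct

reference = [(0.0, 2.5, "speech"), (1.0, 3.0, "music")]
estimated = [(0.1, 2.4, "speech"), (2.0, 2.6, "beep")]
print(count_correct_events(reference, estimated))  # 1
```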