1. Introduction
Audio segmentation is an event-detection task that identifies audio events and their respective boundaries. It divides an audio signal into sections according to its constituent acoustic classes, such as speech, singing voice, and environmental sounds. Segmenting audio produces metadata about its content and often serves as a preprocessing step for a larger task such as speech and lyrics transcription, content recognition, or indexing. Audio segmentation of radio and television programmes has been highly relevant in the literature for detecting sections of speech, music, or both [1].
Regions of high acoustic change can either be detected using a distance metric or modelled as a supervised learning task [2]. Distance-based segmentation directly detects regions of high acoustic change in an unsupervised way. A distance metric such as the Euclidean distance [3], the Bayesian information criterion (BIC) [4], or the generalized likelihood ratio (GLR) [5] is computed to detect peaks of acoustic change. These peaks are associated with the transition from one acoustic class to another.
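As a rough illustration of this family of approaches (not the specific systems cited above), the following sketch computes a Euclidean distance between MFCC summaries of adjacent analysis windows and picks peaks as candidate boundaries; the use of librosa, the window and hop sizes, and the adaptive threshold are our own assumptions.

```python
import numpy as np
import librosa

def distance_boundaries(y, sr, win=1.0, hop=0.5):
    """Candidate boundaries as peaks of the Euclidean distance between
    MFCC means of adjacent analysis windows (illustrative only)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    frames_per_win = int(win * sr / 512)   # librosa's default hop_length is 512
    frames_per_hop = int(hop * sr / 512)
    means = [mfcc[:, i:i + frames_per_win].mean(axis=1)
             for i in range(0, mfcc.shape[1] - frames_per_win, frames_per_hop)]
    dist = np.array([np.linalg.norm(means[i + 1] - means[i])
                     for i in range(len(means) - 1)])
    if len(dist) < 3:
        return []
    threshold = dist.mean() + dist.std()   # simple adaptive threshold (assumption)
    # A local maximum above the threshold marks a likely class transition.
    peaks = [i for i in range(1, len(dist) - 1)
             if dist[i] > threshold and dist[i] >= dist[i - 1] and dist[i] >= dist[i + 1]]
    return [(p + 1) * hop for p in peaks]  # boundary times in seconds
```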
Segmentation-by-classification is a supervised learning method that divides audio into small frames, typically in the range of 10 to 25 ms. Each frame is individually classified as, for example, music or speech, and the boundaries of audio events are derived from these frame-level predictions. This technique gained popularity after deep learning was investigated for audio segmentation.
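As a minimal sketch of the post-processing step implied here (the function name and the 20 ms hop are our own assumptions), consecutive frame-level predictions can be merged into labelled segments whose boundaries fall where the predicted class changes:

```python
def frames_to_segments(frame_labels, frame_hop=0.02):
    """Merge consecutive identical frame predictions (e.g., 20 ms frames)
    into (start_s, end_s, label) segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start * frame_hop, i * frame_hop, frame_labels[start]))
            start = i
    return segments

# frames_to_segments(['speech'] * 100 + ['music'] * 50)
# -> [(0.0, 2.0, 'speech'), (2.0, 3.0, 'music')]
```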
Many studies have explored novel algorithms and neural network architectures to improve the performance of audio segmentation. However, there has been less focus on the availability of data to train these machine learning models. There are many openly available datasets that contain separate audio files of music and speech. For example, LibriSpeech contains a vast number of speech examples and the GTZAN genre recognition dataset contains many music examples. However, broadcast audio is carefully mixed by engineers and contains speech over background music as well as gradual fades in and out of music. Over the years, researchers have trained their models on broadcast audio but unfortunately cannot share their datasets due to copyright issues [6,7,8]. This hinders the reproducibility of research.
Among datasets for audio segmentation, MuSpeak [9] was an example dataset provided by the Music Information Retrieval Evaluation eXchange (MIREX) 2018 music and speech detection competition (https://music-ir.org/mirex/wiki/2018:Music_and/or_Speech_Detection, accessed on 26 February 2021). It is an openly available dataset that contains approximately 5 h of broadcast audio with annotations of music and speech. Moreover, a dataset called Open Broadcast Media Audio from TV (OpenBMAT) [10] comprises 27.4 h of audio, with annotations for the relative loudness of music, but not speech. To train models for segmentation, in-house datasets are generally annotated by the authors or outsourced to annotators. For instance, Schlüter et al. [6] paid students to annotate 42 h of radio broadcast. OpenBMAT was cross-annotated by three individuals.
Annotating broadcast audio is time consuming. The annotator needs to precisely detect transition points, silences between speech, and music smoothly fading in and out. Therefore, annotating an hour of audio usually takes four to five hours. For example, each annotator of OpenBMAT spent approximately 130 h [10]. Furthermore, judging the presence of background music at very low volumes is subjective and leads to differences amongst annotators. Hence, audio is annotated by multiple individuals, making the process expensive and laborious.
In this paper, we investigate the challenges involved in automatically mixing audio content so that it resembles radio data. The speech and music files used by the data synthesis procedure are restricted to openly available datasets to encourage reproducibility. We replicate the process of a mixing engineer by investigating fade curves and audio ducking. The literature describes many mixing principles adopted by radio stations, such as maintaining a sufficient loudness difference between speech and background music to keep the speech intelligible. We evaluate and implement these mixing principles and choose optimal parameters for our data synthesis algorithm.
Our initial findings using this data synthesis procedure were accepted as a conference paper at IEEE ICASSP [11]. The novel contributions of this journal paper are to: (1) compare state-of-the-art neural network architectures for audio segmentation; (2) investigate how the loudness difference between speech and background music influences the performance of segmentation; (3) examine how the size of the training set improves performance; and (4) compare real-world and synthetic training sets. The implementation, code, and pretrained models associated with this study are openly available in this GitHub repository (https://github.com/satvik-venkatesh/train-synth-audio-seg/, accessed on 26 February 2021).
Paper Structure
Section 2 explains the procedure to artificially synthesise radio data. In Section 3, we present the methods that are common to all experiments. We conduct four different experiments in Sections 4–7. For clarity, we explain the experimental set-up and results for each experiment within the section itself. Section 4 compares state-of-the-art neural network architectures. Subsequently, in Section 5, we investigate the effect of loudness difference between speech and background music. Section 6 evaluates how the size of the synthetic training set impacts performance. Finally, in Section 7, we compare machine learning models trained on synthetic and real-world data.
Please note that the results presented in Sections 4–6 are reported on both our validation set and test set. We present the results on the validation set because we are optimising model settings and fine-tuning parameters for data synthesis. All our decisions were informed by the validation set alone because we did not want to influence the test results of the main experiment in Section 7. However, for Sections 4–6, we also present the results on the test set to ensure that we are not overfitting the validation set. The results for the final experiment in Section 7 are reported only on the test set, which demonstrates the robustness of our data synthesis procedure.
2. Data Synthesis
2.1. Synthetic Examples
We considered four combinations of audio classes that commonly occur in radio programmes—speech, music, speech over music, or other. Another common feature in radio programmes is the smooth transition from one audio class to another. We synthesised audio examples with a fixed duration of 8 s. We felt that 8 s was long enough to clearly identify the audio class and to capture transitions from one audio class to another. For clarity, we categorised the examples into two types—(1) multi-class examples and (2) multi-label examples. The former covers audio in which music, speech, or noise occurs, with no simultaneous occurrence of two acoustic classes. In multi-label examples, we specifically focus on audio with speech over background music.
2.2. Audio Transitions
We observed two types of transition, which we term (1) normal fade transition and (2) cross-fade transition. In the former, an audio class fades out, followed by a period of silence, and then a new acoustic class fades in. An example of this transition is when a radio DJ introduces a song, followed by a short gap, and then the song starts playing. In a cross-fade, a new acoustic class fades in while the old one is fading out. For instance, one song smoothly cross-fades into another. Figure 1 illustrates the two types of audio transitions.
Each synthetic example can have either no transitions or at most one transition. Hence, for a multi-class example with no audio transitions, there are three possible occurrences—music, speech, or noise. If there is one transition, there are nine possible permutations between the classes. Note that we also included repetitions of the same audio class; for example, in an interview, two different voices occur with a pause in between. In radio broadcast, noise examples are less likely to occur than speech and music. Therefore, the probabilities of music, speech, and noise were set to 0.4, 0.4, and 0.2, respectively.
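A minimal sketch of this multi-class template selection is given below; the function name and the 50% chance of drawing a transition are illustrative assumptions (event probabilities are discussed at the end of Section 2).

```python
import random

CLASSES = ["music", "speech", "noise"]
CLASS_PROBS = [0.4, 0.4, 0.2]      # noise is less likely in radio broadcast

def multiclass_template(p_transition=0.5):
    """Draw the class sequence for one 8 s multi-class example: a single class,
    or two (possibly identical) classes separated by one transition."""
    if random.random() < p_transition:
        # sampling with replacement allows repetitions of the same class
        return random.choices(CLASSES, weights=CLASS_PROBS, k=2)
    return random.choices(CLASSES, weights=CLASS_PROBS, k=1)
```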
For multi-label examples, as shown in Figure 2, we performed audio ducking of background music, which is a common practice in broadcast audio. This process is explained in Section 2.6. If there are no audio transitions, there is only one possible combination—music+speech. If there is one audio transition, the possible permutations are given below, followed by an illustrative sketch.
Music+speech to music: Initially, audio ducking is performed on the music and the volume is increased after the speech stops.
Music+speech to speech: The background music fades out at the transition point.
Music to music+speech: Initially, music is being played and ducking is performed when the speech starts.
Speech to music+speech: Music fades in at the transition point.
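The sketch below enumerates these permutations as simple (before, after) templates; the data structure, function name, and the 50% chance of a transition are our own illustration, not the paper's implementation.

```python
import random

# (class before transition, class after transition) for multi-label examples
MULTILABEL_TRANSITIONS = [
    ("music+speech", "music"),    # ducked music rises after the speech stops
    ("music+speech", "speech"),   # background music fades out at the transition
    ("music", "music+speech"),    # ducking starts when the speech begins
    ("speech", "music+speech"),   # music fades in at the transition
]

def multilabel_template(p_transition=0.5):
    """Draw the template for one multi-label example: either speech over
    ducked music throughout, or one of the four permutations above."""
    if random.random() < p_transition:
        return list(random.choice(MULTILABEL_TRANSITIONS))
    return ["music+speech"]
```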
2.3. Time-Related Variables
In real-world radio data, many parameters, such as the duration of a fade curve and the time-stamp of an audio transition, are determined by the mixing engineer. For our artificial data synthesis, we randomised these parameters to obtain a variety of synthetic examples. All random sampling was done using uniform distributions within specified ranges.

Within the duration of an 8 s example, the time-stamp of an audio transition is randomised within the range of 1.5 s to 6.5 s. Subsequently, a fade duration is randomised within a feasible range. For example, if the time-stamp of the transition is at 5 s, the fade-out duration can range from a minimum of 0 s (no fade out) to a maximum of 3 s. This technique lets us render very quick as well as gradual audio transitions.
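A minimal sketch of this sampling step is given below, assuming (as suggested by the 5 s / 3 s example above) that the fade duration is bounded by the time remaining after the transition point; the function name is ours.

```python
import random

EXAMPLE_DUR = 8.0  # duration of each synthetic example (s)

def sample_transition_times():
    """Randomise the transition time-stamp and a feasible fade duration."""
    t = random.uniform(1.5, 6.5)                 # transition time-stamp (s)
    fade = random.uniform(0.0, EXAMPLE_DUR - t)  # 0 s corresponds to no fade
    return t, fade
```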
2.4. Fade Curves
When rendering an audio transition, mixing engineers can adopt a variety of fade curves. We considered the four most popular fade curves [12]—linear, exponential convex, exponential concave, and s-curve. Figure 3 shows the different types of fade curves. For each audio transition, we randomly chose a fade curve. The exponential convex, exponential concave, and s-curve require an exponent value, which was randomly chosen between 1.5 and 3.0.
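The sketch below shows one possible parameterisation of these gain envelopes; the exact curve definitions (in particular the s-curve) and the function name are assumptions rather than the paper's implementation.

```python
import numpy as np

def fade_curve(n_samples, shape="linear", exponent=None, fade_in=True):
    """Return a gain envelope in [0, 1] for the chosen fade shape."""
    x = np.linspace(0.0, 1.0, n_samples)
    if exponent is None:
        exponent = np.random.uniform(1.5, 3.0)   # exponent range used in the paper
    if shape == "linear":
        curve = x
    elif shape == "exp_convex":
        curve = x ** exponent
    elif shape == "exp_concave":
        curve = 1.0 - (1.0 - x) ** exponent
    elif shape == "s_curve":
        # two power-curve halves joined at the midpoint (one common definition)
        curve = np.where(x < 0.5,
                         0.5 * (2 * x) ** exponent,
                         1.0 - 0.5 * (2 * (1 - x)) ** exponent)
    else:
        raise ValueError(f"unknown fade shape: {shape}")
    return curve if fade_in else curve[::-1]
```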
2.5. Sampling Audio Files
Audio files pertaining to each class are stored in separate folders. Initially, a template for the synthetic audio example is designed by specifying the list of audio classes, transition points, and so on. Then, a random file belonging to the required audio class is selected. Finally, a random segment of the required duration is sampled from within the audio file.
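A minimal sketch of this sampling step is shown below, assuming one folder per class containing only audio files; the use of the soundfile library and the function name are our own choices.

```python
import os
import random
import soundfile as sf

def sample_segment(class_dir, seg_dur):
    """Pick a random file from the class folder and read a random
    segment of seg_dur seconds from it."""
    path = os.path.join(class_dir, random.choice(os.listdir(class_dir)))
    info = sf.info(path)
    n_needed = int(seg_dur * info.samplerate)
    start = random.randint(0, max(0, info.frames - n_needed))
    audio, sr = sf.read(path, start=start, frames=n_needed)
    return audio, sr
```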
2.6. Audio Ducking
Audio ducking is the process of reducing the level of one signal with respect to another. In this case, we reduce the volume of background music with respect to speech in order to make the speech more intelligible. Figure 2 shows an example of audio ducking in our system. This is a common practice in broadcast audio. Different radio stations have varying guidelines for mixing engineers on how to perform audio ducking. Torcoli et al. [13] conducted a comprehensive analysis of the loudness difference (LD) between speech and background music. Listeners from different backgrounds had varying preferences. On average, the LDs preferred by experts were 4 Loudness Units (LU) lower than those preferred by non-experts. In addition, individuals belonging to older age groups preferred greater LDs to clearly understand speech.
The literature does not provide us with an ideal value for LD. It depends on the mixing engineer, the target audience, and the nature of the audio content. Many broadcasters recommend a minimum of 7 to 10 LU for speech over music. Others, for instance the UK Digital Production Partnership, recommend a minimum LD of 4 LU [14].

Higher LDs make the background music quieter. This leads to clearer speech, but the music becomes less impactful. Again, this depends on the nature of the audio content. Depending on the programme, the LD can be as high as 23 LU [13]. Moreover, OpenBMAT, a music-detection dataset, contains audio files with music as low as −51 Loudness Units relative to Full Scale (LUFS).
There are two ways to perform audio ducking—volume automation and side-chain compression. We adopted the former technique because it makes LD values easier to calculate. The loudness of audio was calculated using the integrated loudness metric specified in ITU-R BS.1770-4 [15]. During data synthesis, we calculate the loudness of the speech segment and subsequently adjust the gain of the background music to obtain the required LD. In this study, we evaluate how the network trains over different ranges of LDs. Section 5 presents the methodology for these experiments.
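A minimal sketch of this gain-adjustment step is given below. It uses pyloudnorm as a stand-in for a BS.1770-style integrated loudness meter; the library choice, function name, and mixing of the full overlapping region are our own assumptions.

```python
import pyloudnorm as pyln  # BS.1770-style integrated loudness meter

def duck_music(speech, music, sr, ld):
    """Scale the music so that it sits `ld` LU below the speech, then mix
    the overlapping region (static gain, as in volume automation)."""
    meter = pyln.Meter(sr)
    speech_lufs = meter.integrated_loudness(speech)
    music_lufs = meter.integrated_loudness(music)
    gain_db = (speech_lufs - ld) - music_lufs     # gain to reach the target LD
    music = music * (10.0 ** (gain_db / 20.0))
    n = min(len(speech), len(music))
    return speech[:n] + music[:n]

# For each synthetic example, ld would be drawn uniformly at random,
# e.g. ld = random.uniform(min_ld, max_ld), as described in Section 5.
```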
Figure 4 depicts an overview of the data synthesis procedure. Note that in cases where audio ducking is performed, the network needs to predict the presence of both music and speech. In addition, when an audio class is fading in or out, the entire fade curve is labelled as 1; we do not take the power of the audio with respect to the mixture gain into account.

Unless mentioned otherwise, the probabilities for all events were equally weighted. For example, the chances of occurrence for multi-label and multi-class examples are 50% each. Four fade curves were considered in this paper and thus each fade curve has a probability of 25%. Similarly, the probabilities of other events were calculated based on the total number of possible occurrences.
5. Experiment II: Loudness Difference Selection
5.1. Experimental Set-Up
In this experiment, we used only synthetic data to train the neural network. For each audio example with speech over background music, we need to select an LD between the audio classes. This LD cannot be constant because the neural network will become biased towards learning a specific LD. Therefore, we chose random LD values from a uniform distribution. However, the literature does not provide us with a clear-cut range of LDs. For our data synthesis procedure, we need to select an optimal maximum and minimum value of LD.
First, we set the minimum value at 7 LU. The maximum value was varied from 18 to 54 LU in steps of 3 LU. For each configuration, we synthesised 5120 examples and trained the network on these examples. The choice of maximum LD is expected to influence only the performance on music. The greater the LD, the lower the volume of background music, and vice versa. If the background music is sufficiently loud, we can precisely detect the presence of music. However, if the background music becomes too quiet, it becomes harder for the listener to precisely detect the presence of music. Hence, we analysed the precision, recall, and F-measure of music. We repeated the experiment five times using different random seeds. Regression analysis was performed on the observations using SPSS [41], and analysis of variance (ANOVA) was conducted to evaluate the level of significance.
Similarly, we set the maximum value at 21 LU and varied the minimum value from 19 to −8 LU in steps of −3 LU. The smaller the LD, the louder the background music. Negative LDs correspond to cases where the background music is louder than the speech. Therefore, if the LD is too low or negative, we expect the precision of speech to be hindered. This often occurs in advertisements and radio jingles, which have smaller LDs. Again, for each configuration, we synthesised 5120 examples and trained the network on these examples. The minimum LD is expected to influence only the intelligibility of speech. Hence, we analysed the precision, recall, and F-measure of speech. We repeated the experiment five times. Regression analysis and ANOVA were performed on the observations.
5.2. Results
5.2.1. F-measure
F-measure is a metric that combines precision and recall by calculating their harmonic mean [42].
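Formally, in terms of precision $P$ and recall $R$,

$$F = \frac{2PR}{P + R}.$$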
Figure 5 presents a quadratic regression of how the F-measure of music changes with maximum LD. The maxima for the validation, test, and combined curves lie at 23, 27, and 24 LU, respectively. The regressions for the validation and test curves were statistically significant, whereas the combined curve that jointly fits the validation and test observations was not. Therefore, our results suggest that the optimal value of maximum LD lies somewhere between 23 and 27 LU.
Torcoli et al. [13] suggested that, depending on the nature of the broadcast content, the LD can be as high as 23 LU, which overlaps reasonably with our findings. However, based on F-measure, we did not find a single optimal value for the maximum LD but rather a range of values from 23 to 27 LU.
Figure 6 shows a regression analysis of how the F-measure of speech varies with minimum LD. The maxima for the validation, test, and combined curves lie at 8, 2, and 5 LU, respectively. The regressions for the validation and test curves were statistically significant, whereas the combined curve that jointly fits the validation and test observations was not. Therefore, our results suggest that the optimal value of minimum LD lies somewhere between 2 and 8 LU.
As explained in Section 2.6, many broadcasters recommend a minimum of 7 to 10 LU for speech over music and some recommend a minimum LD of 4 LU [13,14]. Note that these are only recommendations to make speech intelligible. However, there are cases where mixing engineers choose LDs close to zero or even negative LDs [13]. Therefore, in order to maximise F-measure, our results suggest that the minimum LD should lie between 2 and 8 LU.
As the regressions for the combined F-measure curves were statistically insignificant, the results may benefit from further analysis. Are we overfitting the validation set? Or is there an optimal LD for each dataset? To address these questions, we analyse precision and recall in the following subsection.
5.2.2. Precision and Recall
Figure 7a presents a regression analysis of how the precision of music varies with maximum LD. A quadratic curve was fitted to the observations. All curves clearly demonstrate that precision decreases as the maximum LD increases. Therefore, if the neural network is trained on examples that have very low volumes of background music, precision is hindered.
On the other hand, Figure 7b shows the relationship between the recall of music and maximum LD. Initially, recall increases as the LD increases. However, the maxima for the validation, test, and combined curves all lie at approximately 40 LU, showing that recall does not increase beyond an LD of 40 LU. In other words, the background music becomes too low to be perceived. Hence, we recommend that the maximum LD should never be greater than 40 LU, even if the researcher desires a high recall.
Figure 8a shows a regression analysis of speech precision with respect to minimum LD. Initially, as the minimum LD decreases, precision increases. The maxima for all the curves lie at approximately 10 LU; below 10 LU, precision decreases. This is an interesting observation because the same LD is preferred by human listeners [13]. Torcoli et al. [13] conducted a study on the LDs preferred by human listeners and, based on their test results, recommended a minimum of 10 LU between speech and background music. Conversely, many broadcasters use LDs of less than 10 LU, which still keep speech intelligible [13]. It is noteworthy that the machine learning models in this study also preferred a minimum LD of 10 LU to maximise the precision of speech.
Figure 8b shows that recall continuously improves as the minimum LD decreases. This is because the machine learning model learns more examples in which speech and background music have a small LD. It also shows that the real-world radio examples collected from BBC Radio Devon contain cases where speech and background music have small LDs. However, as shown in Figure 8a, the precision of speech is affected in these cases.
The results in this subsection and Section 5.2.1 clearly demonstrate that there is a trade-off between precision and recall when selecting the maximum and minimum LD. A researcher may adjust these settings according to their objectives. Moreover, F-measure assigns equal weighting to precision and recall, which might not always be desirable. For the further experiments in our study, we did not make any decisions based on the regression analysis because it includes the test set. Instead, we chose the values that obtained the highest mean F-measure on our validation set. The minimum and maximum LD were set to 4 and 33 LU, respectively.
8. Conclusions
In this study, we evaluated the challenges involved in synthesising data that resembles radio broadcast. We surveyed the literature to understand the various mixing principles adopted by broadcasting stations. We incorporated fade curves to smoothly render transitions and audio ducking to facilitate the intelligibility of speech.
We evaluated state-of-the-art neural network architectures for the task of audio segmentation and observed that the CRNN outperformed the other models. This paper also investigated the impact of the LD between speech and background music when training neural networks. There was a trade-off between precision and recall when selecting the maximum and minimum LDs. Moreover, if the LD was less than 10 LU, the precision of the network was affected. This choice of minimum LD for the machine learning model was similar to that of human listeners in the literature [13]. This paper also recommended that the LD between speech and background music should not be greater than 40 LU because the music becomes too quiet to be detected. This emphasises that the task of background music detection needs to be defined in greater detail by using threshold-related variables.
The results demonstrated that our data synthesis procedure is a highly effective technique for creating large-scale training sets. We compared the effectiveness of real-world and synthetic training sets. Interestingly, artificial data surpassed the performance of real-world data in some scenarios. It generalises better to other data distributions because it does not depend on human annotations.
As this paper has significantly reduced the time and resources required to label datasets, it opens up many possibilities for future work. Audio features like Mel spectrograms and MFCCs discard the phase of the audio and only consider its magnitude. This might hinder the algorithm's ability to capture audio transitions. End-to-end deep learning, as suggested by some researchers [7,48], seems like a promising approach to audio segmentation. The results in Section 6.2 showed that performance did not significantly improve beyond 10,240 examples. This suggests that other methods such as Generative Adversarial Networks (GANs) [49,50] might improve the quality of synthetic training sets.