1. Introduction
Broadcast audio transmissions, whether over Frequency Modulation (FM) or web radio, carry different kinds of content: commercial music tracks, advertising spots, talk shows, news, etc. Within this context, being able to differentiate the nature of the audio being broadcast is crucial for commercial or reporting reasons such as monitoring, metadata, and royalties. Depending on national regulations, radio stations can be asked to produce periodic reports containing all the songs played within a requested time range, providing metadata such as the date and time each song was played and its duration.
To automate the process, there currently are systems able to automatically recognize what song is being played, which at least removes the necessity for a human operator. Such systems are usually based on extracting identifying features from the audio and matching them against a large pre-existing song database, and they have to work 24/7, wasting resources on non-music content [1].
Automatic song identification technologies typically rely on “audio fingerprinting”, which is based on hashing unique landmarks or features of the signal and is designed to be robust to audio degradations (e.g., equalization, noise, etc.) and even handle pitch or tempo deviations [2]. As an example, one of the most commercially famous solutions, Shazam (formerly Shazam Entertainment Ltd., now owned by Apple, Inc., Cupertino, CA, USA) [3], is based on the identification of landmarks as peaks on the spectrogram, with the basic idea that the sequence of distinct time-frequency pairs is an identifier for a specific song. Once the audio has been fingerprinted, it is compared against a database using fast hash table lookups.
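For illustration, a toy sketch of this landmark-hashing idea is given below; it is not Shazam’s actual implementation, and the parameter values (peak prominence, fan-out, time window) are arbitrary assumptions. Spectrogram peaks are selected and pairs of nearby peaks are hashed into (hash, time) entries that can be matched against a database with fast lookups.

```python
# Toy illustration of landmark-based fingerprinting (not Shazam's actual code):
# select spectrogram peaks and hash pairs of nearby peaks into (hash, time) entries.
import numpy as np
from scipy import signal

def toy_fingerprints(audio, sr, fan_out=5):
    f, t, sxx = signal.spectrogram(audio, fs=sr, nperseg=1024, noverlap=512)
    log_sxx = 10 * np.log10(sxx + 1e-10)
    # Keep prominent local maxima in each time frame as "landmarks".
    peaks = []
    for ti in range(log_sxx.shape[1]):
        idx, _ = signal.find_peaks(log_sxx[:, ti], prominence=20)
        peaks.extend((ti, fi) for fi in idx)
    # Hash pairs of peaks: (f1, f2, dt) identifies the pair, anchored at time t1.
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= 64:
                hashes.append((hash((int(f1), int(f2), int(dt))), t1))
    return hashes  # to be matched against a database via hash table lookups
```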
Other solutions employ more advanced or more diverse input features, with technologies that include the following [2]:
Constellation maps, which take into account relationships between fingerprints, thus also considering their sequential nature;
High-level musical features such as the tempo, chord progression or lyrics. The usage of such metadata implies the existence of high-level models able to infer data such as harmony or to recognize (sung) voice and convert it into text (lyrics);
Feature-based high-level Machine Learning, which may employ classifiers trained on acoustic features usually derived from the spectrum, cepstrum (MFCC) [4] or pitch domain;
Deep Learning (DL), usually implemented with Convolutional Neural Networks (CNNs) trained on spectrogram images, which automatically extract relevant features [5].
Identifying specific songs is a task that current technologies such as Shazam handle relatively well, making them suitable for automating the tracking of songs played on radio broadcasts in order to count them and produce reports and statistics. However, this approach requires the identifier to operate continuously, 24/7, leading to numerous service calls. This remains true even during times when the radio station is not streaming music, such as during talk shows or advertisements. In these instances, the song identification service may be invoked unnecessarily, prompting interest in a preliminary method to differentiate between commercially played tracks and other content, such as ads, speech or non-music segments.
Incorporating such a discrimination system, which we will refer to as a “music vs. non-music” discrimination problem, helps reduce the inefficient use of resources for monitoring radio broadcasts, lowering operational costs.
As previously mentioned, the primary challenges in developing such systems stem from the presence of music in advertisements and background soundtracks during talk shows. The fragmented structure of talk shows, where pauses in speech allow the music to become more prominent, along with the variability in advertisements—some of which include jingles—further complicate the task.
The aim of this paper is to propose a system that distinguishes commercially played tracks (“music”) from everything else that is being broadcast (“non-music”) on radio, to be used as a preliminary filter that reduces service calls to song identifiers, which are in turn used to automate the reporting process.
Currently, the problem of preliminarily identifying whether a radio station is playing a commercial track can also be partially addressed by relying on broadcast metadata. However, such a solution is neither definitive nor robust: it is unsuitable for AM and for FM without RDS (Radio Data System), as well as for smaller stations that do not transmit metadata, and even with bigger stations human annotation errors and gaps are common.
The ACRCloud SDK service includes a radio broadcast suite, which reports “Cover Song Identification” and “Audio Fingerprinting” as its main resources and includes services like unidentified content detection and music-vs-speech classification [6]. ACRCloud’s services are mainly directed towards song and custom content identification, self-reportedly based on fingerprinting technologies and, to the knowledge of the authors, with no preliminary filter for avoiding unnecessary service calls: as stated by Mediarealm, it “listens to your internet stream in real-time, detects the songs you are playing, and sends the song data through to a service” [7]. Other notable services are Gracenote (Gracenote Inc., Emeryville, CA, USA) [8], which specializes in the addition of contextual metadata and allows for easier discrimination if paired with Speech-Music Discrimination (SMD), and TuneSat (TuneSat LLC, New York, NY, USA) [9], a fingerprinting-based royalty service that distinguishes when a song is broadcast as a plain commercial track from when it is part of an ad.
Other automatic solutions employ some kind of SMD algorithm to distinguish between music and human speech, which may then be enhanced by radio metadata (if available) or fingerprinting. In a seminal 1997 work, Raj et al. [10] explore the effects of background music corrupting speech recognition, which mirrors the problem addressed in this paper, finding that algorithms successfully applied to noisy speech also help improve recognition in the presence of background music [11].
Many more complex solutions nowadays employ some kind of Deep Learning (DL) system, often based on Convolutional Neural Networks (CNNs). CNNs are based on a backpropagation neural network architecture where each layer performs multidimensional convolution (filtering) instead of simple dot products, to the point that each “neuron” is actually a filter matrix whose coefficients are the learnable weights [12]. This behavior makes CNNs very suitable for image analysis, because their filtering nature allows the identification of spatially localized features within a pixel matrix, and their multi-layer architecture performs automatic feature extraction. Therefore, when dealing with audio analysis, it is customary to transform the signal into an image (pixel matrix) that holds useful characteristics of the sound, with spectrograms being the gold standard [13]. Spectrogram-based CNN deep learning systems have become a standard in audio analysis for various applications, including industrial-grade voice analysis [14], speaker recognition [15] and music classification [16].
As anticipated, most studies on the subject either focus on inter-song classification (by genre, cover identification, etc.) or music-vs-speech with no mention of ads or talk shows with background soundtracks.
A preliminary work from Wieser et al. [17] used an MFCC+SVM pipeline to distinguish music from speech in radio podcasts, notably reporting issues when songs had no clear harmonic context and featured mainly rhythmic elements. Another considered approach is the CNN developed by Jang et al. [18], which uses two public audio datasets featuring music, speech and noise samples: the Mirex 2015 MuSpeak set [19] and the GTZAN set [20]. These samples were mixed artificially in order to obtain three different combinations: music with speech, music with noise and speech with noise. The developed system reached accuracies spanning from 86.5% to 95.9%, but sources varied from CD tracks to internet videos, with little attention to the possible presence of audio artifacts typical of radio transmissions (distortion, compression); most notably, the overlaps were generated artificially and do not reflect real talk shows or ads.
Another approach is based on the Temporal Convolutional Network (TCN) developed by Lemaire and Holzapfel [21]. The TCN is a recently developed technology based on a common convolutional architecture that is also able to take temporal factors of the data into consideration, so it can infer rhythmic elements. Different public audio datasets were used to develop the proposed system: the MUSAN Corpus [22], the GTZAN music dataset [20], the Scheirer–Slaney Music Speech Corpus [23], Mirex 2015 MuSpeak [19], the OFAI Speech and Music Detection Dataset [24] and ESC-50: Dataset for Environmental Sound Classification [25]. Moreover, an ad hoc dataset featuring music, speech and noise samples was used: the Sveriges Radio dataset [21]. Both real broadcast samples and artificially generated samples mixing music, speech and noise were employed. Samples were resampled to 22,050 Hz and split into 90 s files. Data augmentation was used, with music and speech randomly overlapped in order to train the system on cases deemed “borderline”. The developed system reached an accuracy of 96.8%.
The implementation by Tsipas et al. [26] diverges from neural network-based methods, instead utilizing classical machine learning classifiers such as Support Vector Machines (SVMs), Random Forest and Logistic Regression on acoustic features representing temporal, spectral and cepstral characteristics. Although the maximum accuracy reaches 97.7%, the dataset is limited, being composed of under three hours of material in which speech and music samples were artificially mixed to re-create the “non-music” found in broadcasts.
A recent study by Bhattacharjee et al. [27] deals with the specific problem of music overlapped with speech and describes an interesting feature extraction pipeline, again centered on the presence of percussive elements, but has no direct connection to radio broadcasts.
The main difference between plain “music vs. speech” and “music vs. non-music” is that most radio content actually carries some kind of music background, be it a jingle in ads or a background track in talk shows. Thus, different sources overlap, and the problem becomes one of contextual analysis, where non-music segments have to be identified through the explicit or implicit detection of complex patterns such as the presence of speech (non-rhythmic, non-sung voices), more frequent changes in pace or sound, pauses in speech, etc.
To the knowledge of the authors, there currently are no reliable, publicly available systems for preliminary filtering of the broadcast content, which would in turn greatly reduce service calls to identifier software, minimizing cost and network traffic.
With these premises, the authors note that the very system proposed and implemented within this work is currently being employed successfully in the Italian industry, embedded in the solution called “Compilerò X Radio” [28] by DaVinci Solutions S.r.l., which offers an automatic music report generation system for Italian radio and web radio stations, compatible with the requirements of Italian royalty collecting authorities such as SIAE [29], SCF [30] and LEA [31,32].
As far as technology goes, in line with the previously mentioned studies, we decided to implement our solution with a custom Convolutional Neural Network (CNN) applied to spectrogram images. A common weakness of the existing literature is related to the datasets used, which often contain only music or speech that the authors mix artificially, or material that does not come from real radio broadcasts.
With the aim of building a reliable, case-specific dataset without having to rely on artificial data generation or excessive augmentation, we collected audio from real radio broadcasts of various stations, resulting in a custom dataset of 139 h of audio containing music and “non-music”.
The developed system is based on a custom-architecture CNN trained on spectrograms with the aim of distinguishing “music”, as in commercially played tracks, from “non-music”, as in everything else, including speech, talk shows or ads. The system has to work robustly and with fast inference because its use case requires it to discriminate audio content on 15 s samples in order to decide whether to make service calls to song identifiers. In our case, the 15 s sample is split into three 5 s chunks, and the final prediction is based on the most recurrent prediction among the three. Additional experiments are presented regarding the use of data augmentation and radio-like pre-processing of the raw audio data.
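A minimal sketch of this inference gating is shown below, assuming a trained Keras model; audio_to_melspectrogram stands for the spectrogram conversion described in Section 2, identify_song is a hypothetical external service call, and the class indices are assumptions.

```python
import numpy as np

CHUNK_SECONDS, SR = 5, 22050

def classify_buffer(model, audio_15s):
    """Split a 15 s buffer into three 5 s chunks and majority-vote the predictions."""
    chunks = np.split(audio_15s[: 3 * CHUNK_SECONDS * SR], 3)
    votes = []
    for chunk in chunks:
        spec = audio_to_melspectrogram(chunk)             # hypothetical helper, 256x256 image
        prob = model.predict(spec[np.newaxis, ..., np.newaxis], verbose=0)[0]
        votes.append(int(np.argmax(prob)))                # assumed: 0 = non-music, 1 = music
    return max(set(votes), key=votes.count)

def monitor_step(model, audio_15s):
    # Only call the (costly) song-identification service when music is detected.
    if classify_buffer(model, audio_15s) == 1:
        identify_song(audio_15s)                          # hypothetical external service
```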
2. Materials and Methods
2.1. Dataset
The dataset was developed specifically for the subject of this paper and consists of audio from radio broadcasts mined from internet streams. It was built following a partnership with an Italian local FM station named Love FM [33], which granted access to its broadcast database. Part of the dataset was also developed by recording other well-known Italian FM stations: Rai Radio 2 [34] was used as an additional source of training samples, and two other stations were used only in the test set (Virgin Radio [35] and R101 [36]) to check the performance of the models on material coming from previously unseen broadcasts.
More than 300 h of broadcasts were recorded and stored, but not all of them were used as training/testing material because we wanted to avoid using “consecutive” chunks coming from the same source, which could bias the training set or, especially, the testing performance if chunks from the same source appeared in both the training and test sets. All recordings were manually annotated and classified into two classes: music and non-music. In general, the music class only contains commercially played tracks, whereas non-music entails everything else, including speech or ads. Mixed content was classified according to the predominant source, both volume- and context-wise. As an example, a talk show where a speaker talks over background music is classified as non-music because the predominant part is the speaker’s voice.
A borderline case is represented by all-music jingles. Although they do not represent commercially sold music tracks, they ultimately can be considered original songs, and their structure and sound are identical to them. Therefore, they have to be considered as “music”, also taking into account the fact that they would be impossible to classify otherwise for a model trained to distinguish pure music from other elements.
All samples were ultimately recorded in the .wav PCM format with a 44,100 Hz sampling frequency and a bit depth of 16 bits, but the sources were varied and each radio sounded different. In order to homogenize the dataset, a pre-processing pipeline was applied.
Moreover, because radio broadcasting usually involves additional processing when sending data over FM rather than over web radio, a real broadcast preset was also applied to a version of the dataset in order to check the performance of the models.
All the audio material is then converted into mel-spectrograms, so the final dataset consists only of spectrogram images, keeping the two-class split into music and non-music. The two subsets contain the same number of samples in order to train a balanced model.
Finally, as a benchmark, we also evaluated our system on the GTZAN dataset, which contains 1000 music tracks of 30 s in length each [20].
2.2. Pre-Processing
All the audio material was cut into 5 s samples with a sample rate of 44,100 Hz and a bit depth of 16 bits, stored in the uncompressed WAV PCM format.
Given the huge amount of source material, it was possible to generate the 5 s samples by randomly selecting different parts of the audio, avoiding continuous sequences that could end up creating biases. In the end, a grand total of 100,000 non-consecutive broadcast chunks were used in the training set, for a total of roughly 139 h of audio material.
Audio samples are stored after formatting and pre-processing with the following specifications:
Non-compressed WAV PCM format;
Mono channel;
Original sample rate: 44,100 Hz, then reduced to 22,050 Hz by decimation;
Bit-depth: 16 bit;
Fade-in and fade-out made with rapid volume envelopes of 0.5 s at the beginning and at the end of the sample, resulting in a windowing process used to remove any possible artifacts introduced by the cutting process.
Because different radio stations provided audio with different bandwidths, we opted to reduce everything to the 20–11,025 Hz band. This was done by decimation (downsampling with an anti-aliasing low-pass filter), simply reducing the original sampling rate, which has the added advantage of reducing the file size as well as the amount of information the CNN has to process.
Acoustic information above 11 kHz is less critical for distinguishing between songs and non-music segments. This is because speech typically contains much less significant information beyond 8 kHz [37], whereas musical elements often do [38]. Consequently, audio segments categorized as “non-music” and consisting solely of speech remain clearly distinguishable from music due to the lack of relevant spectral information above 8 kHz. However, when background jingles are present, high-frequency information becomes less relevant compared with detecting overlapping voices and analyzing rhythmic elements, whose transient features reside predominantly in lower-frequency bands [38].
This pre-processing is applied indiscriminately to all samples entering our models in the training, test or inference phase.
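A minimal sketch of this pre-processing step, assuming the librosa and soundfile libraries (librosa resamples with an anti-aliasing filter, approximating the decimation described above); the file paths and the linear fade shape are assumptions.

```python
import librosa
import numpy as np
import soundfile as sf

TARGET_SR, FADE_SECONDS = 22050, 0.5

def preprocess_chunk(in_path, out_path):
    # Load as mono and resample to 22,050 Hz (band-limited resampling).
    y, sr = librosa.load(in_path, sr=TARGET_SR, mono=True)
    # 0.5 s fade-in/fade-out to remove artifacts introduced by the cut.
    n_fade = int(FADE_SECONDS * sr)
    env = np.ones_like(y)
    env[:n_fade] = np.linspace(0.0, 1.0, n_fade)
    env[-n_fade:] = np.linspace(1.0, 0.0, n_fade)
    y = y * env
    # Store as uncompressed 16-bit PCM WAV.
    sf.write(out_path, y, sr, subtype="PCM_16")
```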
Differently from web stations, in radio broadcasting, and especially in the FM industry, it is customary to process audio in order to reduce the dynamic range and reach a very high loudness. This processing is traditionally carried out using heavy compressors, limiters or stereo image controllers, and is now also implemented digitally [39]. Of course, this is a destructive process for the audio and creates noticeable differences in volume, dynamic range and even equalization because it inherently creates distortion. Therefore, we chose to also experiment by applying this processing to our dataset, in order to check whether performance would degrade on inputs taken from FM radio and to check the robustness of the models. In order to process the audio with a real radio broadcast pipeline, we used a digital preset provided by Love FM and used in its own broadcast, relying on a software named Thimeo Stereo Tool (v10.10) [40]. The preset includes the following audio processing chain:
Heavy multi-band compression in order to reduce dynamics and reach a very high loudness;
Equalization that emphasizes low, mid-low and high frequencies and reduces the presence of mid-high frequencies (roughly between 800 and 5000 Hz);
Hard-limiting to 0 dB, avoiding any clipping;
Stereo widening. This is applied before converting the audio to mono by simple channel averaging. Stereo wideners work by manipulating the phase, timing, or EQ differences between the left and right channels to create a sense of wider stereo imaging; in general, this may lead to more phase cancellation between the channels, resulting in a less rich, duller-sounding mono signal.
For consistency, models trained on processed versions of the dataset will also make inferences on audios that endured the same processing pipeline.
In order to transform audio into images suitable for CNN training, we opted to use mel-spectrograms [41]. Differently from “classical” spectrograms, they are generated using the Mel scale, a spectral re-weighting whose main aim is to reproduce human sound perception, which is not linear across the whole spectrum. Due to this behavior, it is one of the most used solutions for Machine Learning applications [42], and it follows this formula:

$$m = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$

where $f$ is the frequency in Hz and $m$ is the corresponding value in mels.
Spectrograms were generated from 5 s chunks and were saved as square images with a resolution of 256 × 256 pixels. Square images are chosen because of the filtering nature of the 2D convolutional layers of the CNN, which then require no padding.
Mel spectrograms are generated with the following characteristics:
Sample rate of 44,100 Hz, same as the source audios;
FFT size of 8192 in order to have a high-quality representation of the audio;
FFT step size of 828, whose value influences the image width;
FFT bin count of 256, whose value is equal to the image height;
Maximum frequency of 11,025 Hz, same as the input bandpass filtering;
Color scale set to grayscale, in order to simplify work for the CNN [43] by generating a 1-channel input.
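A minimal sketch of the mel-spectrogram generation with the parameters listed above, assuming librosa and Pillow; the dB scaling, normalization and resizing to 256 × 256 are assumptions about details not specified here.

```python
import librosa
import numpy as np
from PIL import Image

def audio_to_melspectrogram_image(y, sr=44100, out_path=None):
    # Mel spectrogram with the parameters listed above.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=8192, hop_length=828, n_mels=256, fmax=11025
    )
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Map to 8-bit grayscale and resize to a square 256x256 image for the CNN.
    img = 255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-9)
    img = Image.fromarray(img.astype(np.uint8)).resize((256, 256))
    if out_path:
        img.save(out_path)
    return np.asarray(img)
```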
Figure 1 details some Mel-spectrogram examples used for the CNN training.
2.5. CNN Architecture
We implemented custom CNN models trained on grayscale, 256 × 256-pixel mel-spectrogram images. We experimented with several architectures, changing the number of hidden layers and neurons/filters and using shallower or deeper nets. We also experimented with pre-trained CNN structures, namely ResNet50 [48] pre-trained on the ImageNet dataset, which has already proven effective in audio analysis. However, after several attempts, a custom architecture was chosen, as it performed best on experimental subsets, and it was subsequently used in all the different training scenarios of the present paper. Several models were trained on the same architecture in order to better assess performance and generalization power over different variations of the dataset: reduced, radio-processed and extended.
The CNN was developed using the Tensorflow framework [49], and the custom architecture is described with a flowchart in Figure 3.
The network architecture consists of a repeated sequence of three layers, applied four times, with the only variation being the filter parameter in each repetition: a Conv2D layer applies convolution with ReLU activation to extract features, followed by BatchNormalization for stability and MaxPooling2D for downsampling. The final three layers are responsible for flattening the data and reducing it to the target number of output classes, which in this case is two: a Flatten layer converts the data into a 1D array for the Dense layers, which perform the classification through the final output layer.
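A minimal sketch of an architecture consistent with this description, written with the Keras API; the exact filter counts and the size of the hidden Dense layer are assumptions, since those values are only given in Figure 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(256, 256, 1), n_classes=2):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Four repetitions of Conv2D -> BatchNormalization -> MaxPooling2D,
    # varying only the number of filters (these values are assumptions).
    for filters in (16, 32, 64, 128):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
    # Final three layers: flatten and classify into the two target classes.
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```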
An 80-20 training-test split has been performed so that each model was evaluated on 20% of the initial training set. As mentioned earlier, trained models were then re-evaluated on external validation data gathered from other radio stations.
Tensorflow checkpoints were used in order to save only the model with the highest accuracy, regardless of the epoch reached, and they were configured to store the model only if its accuracy exceeded 90%. A stopping criterion over the epochs was set in order to stop training once the maximum accuracy on the test set showed no improvement over the following 5 epochs. The authors note that no checkpoint or stop occurred beyond the 50th epoch.
The net was trained with 64-sample mini-batches, chosen as a compromise between a reasonably large batch for a more comprehensive gradient and the fact that, with hundreds of thousands of samples, there is enough room for iterations and generalization. We used the adaptive-learning-rate Adam optimizer [50] with an initial learning rate of 0.001, with the rate for each parameter being computed using the first and second moments of the backpropagation gradients:

$$\theta' = \theta - \frac{\eta}{\sqrt{\hat{v}} + \epsilon}\,\hat{m}$$

where $\theta'$ represents the new learnable set for the network, updated from $\theta$ with an adaptive learning rate $\eta$ scaled by the momentums $\hat{m}$ and $\hat{v}$; $\epsilon$ is an arbitrarily small value. The momentums $\hat{m}$ and $\hat{v}$ refer to the first and second moment of the gradient (computed using the Hadamard product [51] of the gradient of the loss function), normalized by the beta values raised to the power of the iteration. The learning rate thus adapts following two hyperparameters $\beta_1$ and $\beta_2$, usually initialized to values close to 1 (like 0.99) and raised to the power of the iteration:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}$$

where $t$ is the $t$-th iteration.
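A minimal sketch of the training configuration described above (Adam with an initial learning rate of 0.001, 64-sample mini-batches, checkpointing above 90% accuracy and early stopping with a patience of 5 epochs), assuming the Keras callbacks API and pre-built x_train/y_train arrays; the loss choice, file name and 50-epoch cap are assumptions.

```python
import tensorflow as tf

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # Keep only the best model, and only once validation accuracy exceeds 90%.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_accuracy",
        save_best_only=True, initial_value_threshold=0.90,
    ),
    # Stop when validation accuracy has not improved for 5 consecutive epochs.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=5, restore_best_weights=True
    ),
]

# x_train/y_train are the spectrogram images and labels; 20% held out for evaluation.
model.fit(
    x_train, y_train, validation_split=0.2,
    batch_size=64, epochs=50, callbacks=callbacks,
)
```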
3. Results
Three different models of the selected CNN architecture were trained on different versions of the dataset. The first one involves 70,000 initial samples and a 10× data augmentation that leads to 700,000 samples, and applies the previously mentioned broadcast pre-processing to the whole dataset. The second one features the same dataset and augmentation but without the additional processing. The last one employs an extended version of the dataset, with a total of 100,000 initial samples and a 7× data augmentation (again leading to 700,000 samples), to check the importance of data augmentation and of the ratio between augmented and non-augmented samples.
The statistical analysis yielded a Maximum Mean Discrepancy (MMD) of 0.0142 for the comparison of the distribution of the augmented vs. original dataset, which suggests similarity between the two sets. All intermediate possibilities (less/more augmentation, applying broadcast processing to the extended dataset, etc.) were tried as well but did not lead to noticeable differences in accuracy. Training results and a synopsis of the models are reported in Table 1.
The training accuracy value is the value returned by Tensorflow at the end of the epoch; it does not represent the accuracy of a real-world scenario, which is calculated by testing these models against live, real radio recordings.
As for validation, we ran the inference system described in the “Inference” section live on web radio streams, acquiring new audio samples from four stations: Love FM, Rai Radio 2, Virgin Radio and R101.
Only the first two radios provided samples used in the training set, whereas Virgin and R101 were previously unseen by the models. A total of 1360 samples, i.e., 2 h of validation audio, were gathered by our live test and were annotated by human operators with the following criteria that mirror those used in the training dataset:
Every bit of a commercially played music track is considered “music”;
Music-only jingles, which ultimately consist of an “original” song (albeit commercial), are also considered “music”;
Talk shows with background music are considered “non-music”;
Ads with slogans, pre-recorded voices or non-musical elements are considered “non-music” regardless of the background;
Speech-only, or noise, are considered “non-music” for obvious reasons.
Validation results are broken down according to the following factors:
Inference model used (broadcast processing, no-broadcast, extended);
Radio (Love FM, Rai Radio 2, Virgin, R101);
Aggregated 5 s or 15 s input. In the latter case, the final accuracy refers to predictions obtained by majority voting among the three consecutive 5 s chunk predictions.
Validation results are reported in Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 and contain the results of our models applied on new recordings of stations that were included in the training set, as well as new datasets recorded from new radio stations, namely R101 and Virgin (with different schedules and audio pre-processing).
Table 8 reports the results on the GTZAN dataset for the task of detecting music, with the same division into genres originally featured in the dataset metadata. Because all samples in GTZAN are music, the best possible scenario was obtaining 100% accuracy in its detection, which was achieved on 15 s buffers.
4. Discussion
With the aim of building a solid pipeline and training on real-world data, with no artificial audio and with adequate dimensionality, we collected 139 h of non-consecutive audio chunks from two radio broadcasters, trained a custom CNN architecture and validated it with a custom system running live on web radios.
In the literature, there is a certain scarcity of automatic systems that detect music or non-music in radio broadcasts. Although some studies report promising performance, their main weakness lies in the dataset, which is often too small or synthetic, built by manually adding a background to speech data. The “non-music” data built this way do not reflect what is usually broadcast on radio stations, which involves peculiar covariates such as professionally recorded and compressed speech uttered by voice professionals in a controlled environment, mixed with background music and/or sound effects, and ultimately processed for broadcasting.
Table 9 displays a brief comparison of studies alongside ours.
Three main versions of the CNN were trained in order to explore different aspects: the first one is trained on a reduced dataset (70,000 samples) enlarged by a 10× data augmentation. The second one employs radio-like processing featuring heavy compression and similar artifacts, applied with a real radio preset: this net was used to check whether our system would still work if applied to FM radios, where the audio is processed. The third net uses more real training data and less augmented data.
Before dealing with the results, some considerations are due regarding the input data: as anticipated, we strove to collect many real-world, uncorrelated, non-consecutive samples. The differences between radio stations, even in web streaming, called for a “normalization” pre-processing procedure, which led us to convert all audio to mono .wav, to tame cutting artifacts with windowing (fade-in and fade-out) and to reduce the passband to a maximum frequency of 11,025 Hz. This operation also facilitates computation for the CNN, as the spectrogram contains less potentially misleading information: most of the high-frequency (>10 kHz) content is either hiss, negligible components of some distorted or wide-band instruments, or the high-frequency part of transients typical of percussion in music or consonants in speech [38]. Moreover, some radio stations already low-pass their material, so it was necessary to find a common cutoff frequency.
Audio data augmentation procedures were carried out using artifacts that reflect real-world modifications of radio audio data, especially when taken from streaming or FM stations: pitch shifting, rate changes and added noise. Graphical SpecAugment procedures were also used to mask parts of the spectrograms, as they have proven effective in other studies.
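A minimal sketch of such augmentation, assuming librosa for the audio-domain transforms and a simple SpecAugment-style masking applied to the mel-spectrogram; all magnitudes and mask widths below are arbitrary assumptions, not the values used in our pipeline.

```python
import librosa
import numpy as np

rng = np.random.default_rng()

def augment_audio(y, sr):
    # Pitch shift, rate change and additive noise (magnitudes are assumptions).
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-1.0, 1.0))
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    return y + rng.normal(0.0, 0.005, size=y.shape)

def spec_augment(mel, n_freq_masks=1, n_time_masks=1, max_width=20):
    # SpecAugment-style masking: zero out random frequency bands and time spans.
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    for _ in range(n_freq_masks):
        w = rng.integers(1, max_width)
        f0 = rng.integers(0, n_mels - w)
        mel[f0 : f0 + w, :] = 0.0
    for _ in range(n_time_masks):
        w = rng.integers(1, max_width)
        t0 = rng.integers(0, n_frames - w)
        mel[:, t0 : t0 + w] = 0.0
    return mel
```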
The underlying distribution of the augmented data was compared to the original one and achieved an MMD of 0.01421, which indicates similarity between the two distributions and reflects the desired result when performing augmentation, considering that synthetic data should reflect the original distribution to some degree to avoid biasing the models without being too similar (MMD < 0.01) to avoid redundancy [47].
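For reference, a generic sketch of a (biased) RBF-kernel MMD estimator between two feature sets; the kernel, its bandwidth and the features actually used to obtain the reported MMD values are not specified here, so they are assumptions.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Squared Maximum Mean Discrepancy between sample sets x and y (RBF kernel).

    x, y: arrays of shape (n_samples, n_features)."""
    def rbf(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * d2)
    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()

# e.g., x = features of original samples, y = features of augmented samples
```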
Looking at the prediction results, it is safe to say that, with accuracies topping 97%, the proposed system works sufficiently well to be implemented and put in production.
The models were tested live on newly recorded chunks, with consecutiveness limited to three chunks for a total of 15 s. They were tested on radios whose material was used in the training set as well as new stations, and the performances were only negligibly lower for the latter case. We also experimented using material from just one radio (Love FM) as a training set, and in this case, results were much worse, with an average accuracy of 72.80% on 5 s chunks and 70.58% on 15 s chunks. This suggests that the differences in material and audio processing between different radio stations are not negligible, and a comprehensive dataset should come from multiple sources.
The best-performing model was the one that implemented broadcast-like processing on the training data, with a maximum accuracy of 98.23% on 15 s chunks from Love FM and an average accuracy of 96.47% throughout all radio stations. The broadcast-like processing was performed using a real radio preset and reflects exactly the material that goes on FM stations. Whenever validation chunks were taken from web streaming where processing was not applied, then we re-applied it before inference.
This is a promising result because it not only ensures that the system works even on FM streams and heavily processed/noisier audio, but it even offers higher performance. We attribute this to the fact that the information lost through heavy compression and stereo expansion is not crucial to the differentiation between music and non-music, and its removal may, in fact, simplify the task for the CNN, which now has to deal with fewer variations in dynamics. These advantages have to be weighed against the fact that the additional processing has obvious computational costs and leads to less lightweight models.
The worst-performing model among the three was the one without broadcast processing and with a larger share of augmented data, which suggests, rather unsurprisingly, that it is always better to enlarge the dataset with new, real audio data. However, the differences in performance are small, which in turn confirms the effectiveness of data augmentation.
The use case for the proposed system allows for longer latencies, which led us to predict on every 15 s of audio instead of 5 s, to check whether the majority voting among three chunks would perform better than a single prediction. Results across the different windowing configurations are not fully consistent because, even though a single 5 s prediction may be correct, the overall result of the 15 s prediction group could be incorrect, or vice versa. Thus, the start time of the window group can directly influence the results.
The advantage of using longer samples resides in the fact that for a few seconds of audio, content like talk shows or ads may have pauses in which only the background music is present, leading to misclassification. In fact, using 15 s brought higher accuracies and the current product sold by DaVinci S.r.l. employs that version with success and client satisfaction.
As far as errors are concerned, musical jingles, which consist only of original songs, have been labeled as “music” and are usually predicted as such. Most of the classification errors came from advertising spots featuring music overlapped with normal speech, which were wrongly classified as music. In general, it is safe to say that the more intrusive the background music is, the higher the risk of making an unnecessary service call when a talk show or ad is actually playing. The model performs well when tested on the public GTZAN dataset containing samples of commercial songs. Using 5 s chunks leads to sub-optimal performance for genres that contain many spoken parts, such as hip hop. However, the 15 s configuration achieves perfect accuracy on the whole dataset, demonstrating the ability of our system to detect music. We argue that such a system should not be evaluated on non-music data that are synthetic or composed of home recordings of pure speech, because broadcast transmissions exhibit drastically different characteristics, especially related to the professional recording and pre-processing of speech mixed at the source with music and sound effects and then re-processed to enter a broadcast.
Although the nature of the error differs depending on whether the misclassified segment is music or non-music, the authors believe that both are non-critical cases, especially when a music segment is misclassified as non-music. Wrongly detecting talk shows/ads as “music” leads to an unnecessary service call, but with error rates around 5%, the increase in service cost is negligible, and from a logistic point of view the issue is solved when the identifier simply finds that there is no commercial track. On the other hand, wrong detections of music as non-music, which are less frequent and mostly linked to spoken parts in songs or rap tracks, may lead to a song going undetected for royalty purposes; however, considering that commercial tracks last a few minutes, it is almost impossible to miss them when analyzing 15 s chunks: it only takes one correct chunk over a few minutes of song to trigger a call to the identifier service, which will in turn correctly label the song.
5. Conclusions and Future Work
This paper aimed to develop an intelligent system able to predict whether a broadcast stream is playing music or not, with the specific use case of employing it as a proprietary, preliminary monitoring system that reduces service calls to music identifiers in order to minimize costs and avoid having them run 24/7.
For this purpose, we collected over 139 h of audio in the form of 5 s chunks from two radio stations and trained three versions of a custom CNN architecture, the first one employing a 10× data augmentation, the second including a broadcast-like pre-processing and the third using more real data and a 7× data augmentation.
Results show how the proposed models all perform well for the use case: we validated the system by building a live tool that listens and records audio from web radios and calls the CNN for inference. The best-performing model is the one that involves broadcast-like processing and analyses three consecutive chunks for a total of 15 s of audio. This is possible because the latencies requested for calling music identifier services are usually longer than that. Maximum accuracy is 98.23% reached on a single radio station, whereas average accuracy is 96.47%.
Preliminary conclusions suggest that heavy pre-processing is a convenient solution, as it not only improves accuracy by removing information unnecessary to the CNN but also adapts the system to FM broadcasts. Training on audio that all comes from a single radio station yields worse performance, which confirms the importance of multifaceted input data for DL systems. Using real data instead of augmented data brings negligible improvements when moving from 70,000 to 100,000 real samples, but still leads to a slight increase in accuracy. Accordingly, the most accurate model for 5 s windows is the “Extended” model, peaking at 97.35%.
To the knowledge of the authors and after a literature review, no other studies have concentrated on music versus non-music discrimination on broadcasts with a dataset of this size, especially considering that most contributions manually crafted training data by artificially mixing music with speech to replicate talk shows.
One of the inherent limitations of such a study is represented by the borderline case of jingles, which resemble original songs in structure and sound, making it impossible for models trained to separate music from non-music to classify them differently from music. As of today, jingles still unavoidably lead to service calls to music identifiers, which mark them as “unrecognized”, closing the issue. Preliminarily reducing those calls could only be achieved through contextual analysis embedding radio metadata (outside the scope of this paper), through refinement of the inference models by re-training on the specific jingles of the considered radio station, or through the addition of more complex models that attempt contextual analysis by detecting lyrics, while still being subject to errors.
The promising results displayed in this paper and their value are confirmed by the fact that a commercial product was made, and it is currently in use by several clients; however, the results also suggest that better systems could be built by training on more radio stations.
Future works and perspectives include the expansion of the dataset, the improvement of performances in terms of latency and model dimensions, and especially the refinement of the classification with the possible addition of radio metadata and contextual analysis. Due to the peculiar nature of non-music bits in radio broadcasts, we do not consider artificial mixes of public speech and music or noise to be comparable. Therefore, every dataset extension should be performed by including a different radio station in the dataset, which in turn brings different show formats, voice/music balance and processing solutions. However, the artificial mix of (professionally recorded) speech with music could be a promising solution for data augmentation.
Because our system showed promising performances to the point that it is being used on the market, we would like to also concentrate our future works on improving its computational performances, especially from the point of view of file size and ease of deployment, because latency is not an issue considering that songs usually last a few minutes.
The system proposed in this paper is embedded in a product on the market called “Compilerò X Radio”, sold by DaVinci Solutions S.r.l. and currently in use by various Italian clients in association with a song identifier for royalties.