Web Radio Automation for Audio Stream Management in the Era of Big Data

Radio is evolving in a changing digital media ecosystem. Audio-on-demand has shaped the landscape of big unstructured audio data available online. In this paper, a framework for knowledge extraction is introduced, to improve discoverability and enrichment of the provided content. A web application for live radio production and streaming is developed. The application offers typical live mixing and broadcasting functionality, while performing real-time annotation as a background process by logging user operation events. For the needs of a typical radio station, a supervised speaker classification model is trained for the recognition of 24 known speakers. The model is based on a convolutional neural network (CNN) architecture. Since not all speakers are known in radio shows, a CNN-based speaker diarization method is also proposed. The trained model is used for the extraction of fixed-size identity d-vectors. Several clustering algorithms are evaluated, having the d-vectors as input. The supervised speaker recognition model for 24 speakers scores an accuracy of 88.34%, while unsupervised speaker diarization scores a maximum accuracy of 87.22%, as tested on an audio file with speech segments from three unknown speakers. The results are considered encouraging regarding the applicability of the proposed methodology.


Introduction
In the past decade, there has been a breakthrough concerning the production and distribution of digital content on the web and in social media. This explosion of available data has highlighted the importance of developing suitable tools and frameworks for editing and managing the created content, aiming at more efficient distribution and consumption. Meanwhile, there have been important developments in data-driven techniques for automated semantic information retrieval of broadcast content, based on machine and deep learning models [1][2][3][4][5][6][7][8][9][10][11]. In radio and television production, validation and media asset management are essential for the semantic annotation and archiving of content. Radio producers can also benefit from semantic analysis tools for more efficient content production [12,13].

Radio in an Evolving Ecosystem
The transition from analog radio to the digital world has followed a different path than digital television. There was no global mandatory transition from analog to digital broadcasting in the radio spectrum. One big factor that shaped the landscape was the rise of web radio. A big part of radio consumption takes place on personal computers through web radio station websites. Smartphone devices, which are a major choice for mobile radio listening, are no longer equipped with radio receivers, but depend on audio transmitted over 3G/4G networks [14]. Traditional radio is increasingly supplemented by connected devices and services, creating an Internet of Things [25]. Mobile data and the upcoming 5G network technology promise All-IP Network (AIPN) services for cloud computing [26].
Audio and speech analytics refer to the extraction of information from unstructured audio data. Main applications include the domains of customer service, call centers, social media, the media industry, health care, and content-based analytics. These industries produce big audio data streams daily [24].
Deep learning has been associated with big data handling [27][28][29]. Such techniques utilize massive amounts of data for hierarchical feature extraction to provide complex abstractions and data representations [27][28][29]. There are many technical challenges to be addressed in managing large-scale, high-dimensional, rapidly changing data. While the field is still considered low in maturity [29], deep learning has been shown to have greater potential for the efficient management of big data volumes than traditional machine learning approaches [27][28][29].

Speaker Diarization
Speaker diarization is defined as the problem of deciding "who spoke when?" [30,31], which serves many applications in broadcasting [32], conferencing, and intelligent information retrieval. It includes the sub-tasks of segmenting the input audio and assigning each segment to a certain speaker. Two main approaches are dominant in the literature: top-down and bottom-up clustering [30,31]. The number of clusters corresponds to the number of different speakers. In the first approach, the model is initialized with one (or a few) clusters, while in the latter it starts with an excessive number of clusters. In both strategies, the goal is to converge to the optimum number of clusters/speakers. The diarization error rate (DER) is commonly used to measure the performance of speaker diarization systems. It is defined as the sum of missed speech error, false alarm speech error, and speaker error [31]. The first two errors refer to voice activity detection errors, while the third refers to the assignment of speech segments to a wrong speaker. In some cases, DER only refers to speaker error to simplify the evaluation [33].
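In its standard form, DER can be expressed as the three error times normalized by the total speech time:

$$\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{fa}} + T_{\mathrm{spk}}}{T_{\mathrm{total}}}$$

where $T_{\mathrm{miss}}$ is the duration of missed speech, $T_{\mathrm{fa}}$ the duration of false-alarm speech, $T_{\mathrm{spk}}$ the speech time attributed to the wrong speaker, and $T_{\mathrm{total}}$ the total speech time. When only speaker error is considered, the first two terms are dropped.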
The identity vector (i-vector) has been the standard feature extraction procedure for speaker recognition and, by extension, speaker diarization. The audio input is segmented in an unsupervised way into 1-2 s segments, from which i-vectors are extracted [34]. A factor-analysis front-end along with principal components analysis of i-vectors is investigated in [35], using data from telephone conversations. Spectral clustering is also proposed as an alternative to K-means for the stage following i-vector extraction and principal components analysis [33]. Another clustering approach is evaluated against agglomerative clustering, incorporating integer linear programming (ILP) in order to find an optimal clustering solution, leading to a DER decrease [36]. ILP is also integrated into LIUM, the open-source toolbox for broadcast news diarization [32]. When diarizing a big collection of recordings, the clusters defined after applying diarization separately to each recording can be used in a two-stage clustering approach to compress the information [37].
In the most common scenario, Gaussian mixture models and factor analysis are used to reduce dimensionality, resulting in the compressed representation of i-vectors, which are then compared using probabilistic linear discriminant analysis (PLDA). In [38], a deep neural network (DNN) is trained to learn a fixed embedding and scoring metric to replace the stages of i-vector extraction and PLDA scoring used in baseline techniques, and is shown to outperform them. The DNN maps speech utterances to fixed x-vector embeddings, having as its input 30-dimensional mel-frequency cepstral coefficients (MFCCs) with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 s. The experimentation is extended to multi-speaker recordings, achieving state-of-the-art accuracy on common databases [39]. In a supervised approach, an ANN architecture is investigated in [40]. MFCC features are extracted from two audio frames and are used as input to the first layer of the ANN. The ANN is trained as a two-class classifier, deciding whether the two frames belong to the same speaker or not.
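As an illustration of such a front-end, the sketch below computes 30-dimensional MFCCs over 25 ms frames and applies sliding-window mean normalization over up to 3 s; the 10 ms hop and the 16 kHz sampling rate are assumptions for the example, not values reported in [38].

```python
# A minimal sketch of an x-vector style front-end: 30-dim MFCCs from 25 ms
# frames, mean-normalized over a sliding window of up to 3 s.
import librosa
import numpy as np

def xvector_features(y, sr=16000, n_mfcc=30, frame_s=0.025, hop_s=0.010, cmn_s=3.0):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(frame_s * sr),
                                hop_length=int(hop_s * sr))
    half = int(cmn_s / hop_s) // 2
    normalized = np.empty_like(mfcc)
    for t in range(mfcc.shape[1]):
        lo, hi = max(0, t - half), min(mfcc.shape[1], t + half + 1)
        normalized[:, t] = mfcc[:, t] - mfcc[:, lo:hi].mean(axis=1)
    return normalized
```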

Research Aims
The motivation of the current research is to make use of domain knowledge in the field of radio production and state-of-the-art machine learning practice to enhance the processes of radio production, distribution and consumption. The directions set by industry experts for the future of radio, as well as the advances in big data management, have been taken into consideration.
The recent popularity of podcasts and audio-on-demand, as described in Section 1.2, highlights the need for discoverability. To facilitate customization and personalization, listeners have to be able to access radio content based on several criteria, like topic selection, radio producer, guests, music aesthetics, etc. While radio stations may manually provide some tags and descriptions, this direction requires knowledge extraction from the unstructured audio streams.
The main contributors of big audio data are radio stations, producers, and amateur users. All of these categories can benefit from the adaptation of web radio to emerging listening habits. In this paper, a framework is proposed that links knowledge extraction to the production process. A web application for live radio production is presented that integrates real-time semantic annotation and logging. By providing the application publicly, radio producers can contribute without disrupting their usual workflows. A semi-structured XML scheme is described to enhance access and management of stored content. Additionally, deep learning techniques are evaluated for unsupervised knowledge extraction.
In Section 2, the concept and the functionality of the web application are presented. The knowledge structure is also explained within the general framework of on-demand access to radio content. An approach for speaker recognition and diarization based on deep convolutional neural network models is introduced. The implementation and experimental procedures are explained thoroughly. In Section 3, the evaluation results are presented. In Section 4, the conclusions of the research are discussed, and future research goals are set.

A Framework for Knowledge Extraction from Radio Content
Radio shows are usually recorded to be stored by the station for archiving purposes or to be made accessible to users online. In the most common scenario, the producer is responsible for writing a small description that accompanies the provided podcast, while the audio file itself remains unstructured. The vision of personalized and customized access to radio content requires structured information extraction from audio files, so that the content can be discovered by listeners. Users have to be able to browse podcasts based on search queries regarding the content. Moreover, listeners should be able to access the parts of interest in a specific audio file, based on the content.
As it has been described in Section 1.2, one of the main visions for the new radio concerns the enrichment of live broadcast and on-demand content with visual information that can be accessed across multiple platforms and social media. For this reason, along with segmentation and intelligent information retrieval, textual information concerning the broadcast content can be provided to the audience.
The proposed knowledge extraction scheme is demonstrated in Figure 1. The delivered content is segmented into music and speech segments [48]. Music metadata is extracted from every music segment, providing information concerning the title, artist, genre, etc. For the analysis of the speech segments, speaker transition detection, speaker diarization, or speaker recognition in the case of known speakers/radio producers is applied, to divide discussions into specific-speaker speech excerpts. Speech-to-text allows the extraction of transcripts for every excerpt. The structure of the segmented audio file is shown in Figure 1.
Given this information, interactive transcripts of podcasts and radio shows can be provided so that users can navigate through the document, search for keywords, read the generated text, acquire information, and jump to the desired audio segment. For example, a user interested in commentary on a specific topic by several analysts can search using suitable keywords and listen to the specific parts of the conversations. This is also valuable for hearing-impaired audiences, who are usually excluded from anything that happens on the radio. An implementation of such an interactive transcript in HTML and JavaScript is shown in Figure 2. Every annotation of a different speaker with the respective starting point in time is a functional button that plays the audio at the specified time of the selected utterance.
Furthermore, when real-time segmentation and annotation is available, textual data streaming containing semantic information of the broadcast audio is possible, on the web, on social media and other platforms, or even as dynamic label segment (DLS) in digital audio broadcasting (DAB) [49].

A Web Application for Live Radio Production and Annotation
The technologies involved in the extraction of the aforementioned knowledge scheme include speech/music classification, song identification, speaker diarization, speaker recognition, and speech-to-text. These problems are mostly addressed with supervised and unsupervised machine learning. While all of these fields have grown significantly in the past decade, data-driven predictions always involve a certain error rate. Additionally, some of them, like speaker recognition, cannot be treated globally, but a dedicated model has to be trained for a specified group of known speakers, e.g., the employees of a radio station. In this case, vast amounts of labeled data are required. Manual data labeling is a time-consuming process, vulnerable to human error. This often sets a bottleneck to the viability of deep learning approaches, since small organizations cannot afford this cost.

To surpass this obstacle, a web application has been developed. The application covers the common workflows and functionality of most live mixing and radio production software, while, at the same time, integrating semantic enhancement of the produced files on-the-fly. The functionality is available to the user through an HTML5-based graphical user interface, which is demonstrated in Figure 3. The producer can load audio files to create playlists, adjust the audio settings, apply crossfade for transitions, and turn on and off speaker microphones. For every change of sound source (microphone inputs) or songs from the playlist, a transition event is registered in the log file. This log file serves as the annotation file, containing the time segmentation and information accompanying the audio file. Depending on the recording and streaming setup, the inputs of the computer can match the different channels of a multichannel audio interface or the mixed output of the external audio mixing console. In the first case, every speaker is logged separately. In the second, only the transition from music to speech is annotated. The metadata of the songs are also stored in the log file, as shown in Figure 3.
For broadcasting, two open-source tools have been integrated into the application. Liquidsoap [50] is a popular and reliable solution for web radio and television stations for multimedia streaming. On top of liquidsoap, Icecast [51] is used as a streaming server for audio/visual content, to create the web radio channel. It supports all popular formats and allows the addition of more. The audience can receive the broadcast stream through an internet browser or a dedicated application capable of receiving and playing audio data streams, like the open-source VLC media player.
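To make the event-logging step concrete, the following is a minimal sketch of how a source-transition event could be appended to an XML annotation log; all element and attribute names (annotation, event, source, etc.) are hypothetical illustrations, since the exact semi-structured scheme is defined by the application itself.

```python
# A minimal sketch of appending transition events to an XML annotation log.
# Element and attribute names are hypothetical examples.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def log_transition(root, source, metadata=None):
    """Append a transition event (e.g., microphone on/off or next song) to the log."""
    event = ET.SubElement(root, "event",
                          time=datetime.now(timezone.utc).isoformat(),
                          source=source)            # e.g., "mic_1" or "playlist"
    for key, value in (metadata or {}).items():     # e.g., song title, artist, genre
        ET.SubElement(event, key).text = str(value)
    return event

log = ET.Element("annotation", audiofile="show.wav")
log_transition(log, "playlist", {"title": "Track A", "artist": "Artist B"})
log_transition(log, "mic_1")
ET.ElementTree(log).write("show.xml")
```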
In Figure 4, the architecture of the developed application is demonstrated [52]. During live broadcasting, the annotation log file can be enriched with information concerning audience engagement and ratings (e.g., Google Analytics), as well as live interaction, comments, emoticons, etc. This provides further semantic metadata concerning quality-of-experience evaluation, emotional public reaction, etc. [53,54]. Such analytics, in correlation with the delivered content, provide insight for future planning. The baseline metadata scheme can also be extended to involve speech [55,56] and music [57,58] emotional cues. As depicted in Figure 4, the functionality that concerns different groups of interest is unified in a common framework. Radio stations and producers want to efficiently annotate and manage their produced content, enrich it with live information and metadata, make it discoverable and more appealing, and gather and analyze analytics from public interaction during the shows. In addition, they can achieve the above by using free and publicly available software. The audience can address their need for accessing radio content on demand, have a richer, augmented experience, and personalize and customize the consumed radio content according to their interests and listening habits. Meanwhile, the automated creation of labeled big audio data provides the engineering and academic community with huge, high-quality, publicly available datasets for developing multi-purpose models.
Figure 4. The architecture of the web application for web radio production, streaming, and annotation.

Speaker Recognition with Convolutional Neural Networks
Speaker recognition is the supervised machine learning task of assigning speaker classes to audio segments. For this task, a fixed group of speakers/classes and a respective annotated dataset are required. Speaker recognition fits the needs of radio stations, where a certain number of known speakers/producers are present in podcasts. However, it is not possible to have a universal model for all interested stations. A dedicated model has to be trained for every defined speaker group. Radio stations can make use of the web application presented in Section 2.2 for a period of time to initialize the annotated dataset needed for their speaker recognition purposes.
For modeling and experimentation purposes, we have used the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [59]. The RAVDESS database contains audiovisual speech and song files. For speaker recognition training and evaluation, we have used the audio-only speech subset of the database. As explained in the accompanying documentation, the database contains 1440 recordings of 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. The dataset covers the case of a radio station with 24 regular producers with gender balance.
To make the most of the available data and improve generalization, some common audio data augmentation techniques [56,60] have been applied, as sketched in the code example after the list:

•	Background Noise: Additive Gaussian White Noise (AGWN) was added to the existing audio files. This technique doubles the number of files used for training and helps generalization in noisy conditions.
•	Dynamic Range: The audio files provided in the RAVDESS database are not normalized. Machine learning models are vulnerable to overfitting to energy characteristics. In the training session, we used the existing audio files as well as normalized versions of them. This aims at making recognition performance robust to energy fluctuations, e.g., with the speaker moving farther from and closer to the microphone.
•	Time Shift: Time shifting of audio segments is achieved by extracting features from heavily overlapping windows to increase the number of instances. This is similar to the image cropping data augmentation technique that has been popularized in visual object detection. In our approach, 90% overlap between successive observation windows was chosen.
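The following Python sketch illustrates the three augmentations; the noise level and helper names are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of the three augmentation steps, assuming a mono waveform
# y in [-1, 1] loaded at fs = 44,100 Hz; the SNR value is an assumption.
import numpy as np

def add_white_noise(y, snr_db=30.0):
    """Background Noise: additive Gaussian white noise at an assumed SNR."""
    rms = np.sqrt(np.mean(y ** 2))
    noise_rms = rms / (10 ** (snr_db / 20.0))
    return y + np.random.normal(0.0, noise_rms, size=y.shape)

def peak_normalize(y):
    """Dynamic Range: a peak-normalized copy used alongside the original file."""
    return y / (np.max(np.abs(y)) + 1e-9)

def overlapping_windows(y, sr=44100, win_s=1.0, overlap=0.9):
    """Time Shift: heavily overlapping 1 s observation windows."""
    win = int(win_s * sr)
    step = int(win * (1.0 - overlap))
    return [y[i:i + win] for i in range(0, len(y) - win + 1, step)]
```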
A convolutional neural network architecture was used for classification [48,56]. The CNN is much more lightweight than LSTM architectures, while it also models spectro-temporal information when it is fed with spectrograms as input [48,56]. The architecture of the network along with the selected hyperparameter values are presented in Table 1 and Figure 5. CNNs are vulnerable to overfitting the training data; dropout layers mitigate this by randomly deactivating a portion of units during training. To estimate overfitting, the accuracy on the training and the validation set are compared. In the initial experiment, training accuracy was found to be much higher, even with the dropout layers. Thus, L2 regularization was added in Layer 12. The model was compiled using the categorical cross-entropy loss function, for the estimation of multi-class probabilities. The Adamax optimizer, an alternative to the popular Adam optimizer [61], was chosen after experimentation.
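As an illustration, a Keras model along these lines could be sketched as follows. Only the 64-unit dense Layer 10, the dropout Layer 11, and the L2-regularized 24-way softmax output Layer 12 are taken from the text; the convolutional blocks, filter counts, and dropout rates of Layers 1-9 are assumptions, since the full hyperparameter set is given in Table 1 rather than in the text.

```python
# A minimal sketch of the described CNN classifier. Layers 1-9 use assumed
# hyperparameters; Layers 10-12 follow the text.
from tensorflow.keras import layers, models, regularizers

NUM_SPEAKERS = 24
INPUT_SHAPE = (128, 87, 1)  # 128 Mel bands x assumed frames per 1 s window

model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=INPUT_SHAPE),  # 1
    layers.MaxPooling2D((2, 2)),                                            # 2
    layers.Dropout(0.25),                                                   # 3
    layers.Conv2D(32, (3, 3), activation="relu"),                           # 4
    layers.MaxPooling2D((2, 2)),                                            # 5
    layers.Dropout(0.25),                                                   # 6
    layers.Conv2D(64, (3, 3), activation="relu"),                           # 7
    layers.MaxPooling2D((2, 2)),                                            # 8
    layers.Flatten(),                                                       # 9
    layers.Dense(64, activation="relu"),       # 10: source of the 64-dim d-vector
    layers.Dropout(0.5),                       # 11
    layers.Dense(NUM_SPEAKERS, activation="softmax",
                 kernel_regularizer=regularizers.l2(0.01)),  # 12: L2-regularized output
])

model.compile(optimizer="adamax",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```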
Implementation of the described architecture and training was carried out using the Keras toolbox for the Python programming language [62]. The librosa toolkit for Python [63] was used to extract Mel-scale spectrograms with a dimension of 128 Mel-coefficients from the audio files, with a sampling frequency of fs = 44,100 samples/s, for windows of 1 s with 90% overlap [48,56]. The extracted spectrograms were used as input to the 2D convolutional neural network.
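A minimal sketch of this feature extraction step is given below; the STFT hop length inside each 1 s window is not specified in the text, so librosa defaults are assumed.

```python
# A minimal sketch of the described feature extraction: 128-band Mel-scale
# spectrograms over 1 s windows with 90% overlap at fs = 44,100 samples/s.
import librosa
import numpy as np

def mel_windows(path, sr=44100, win_s=1.0, overlap=0.9, n_mels=128):
    y, _ = librosa.load(path, sr=sr)
    win = int(win_s * sr)
    step = int(win * (1.0 - overlap))
    specs = []
    for start in range(0, len(y) - win + 1, step):
        segment = y[start:start + win]
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=n_mels)
        specs.append(librosa.power_to_db(mel, ref=np.max))
    return np.array(specs)  # shape: (num_windows, 128, frames_per_window)
```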

Speaker Diarization
While speaker recognition is sufficient for the case of known speakers, this is not always the case in radio podcasts. Many producers have guests in their shows. Speaker diarization is needed to segment audio files where unknown speakers are present. As described in the literature review of Section 1.3, speaker diarization is an unsupervised clustering task. Identity vectors (i-vectors) are extracted from unstructured audio files to be used as input to clustering algorithms. In the past few years, state-of-the-art approaches have used deep learning models to extract such identity vectors, which in this case are called d-vectors.
The trained model described in Section 2.3, which was fit for multi-class speaker recognition, is used for the formulation of the d-vectors. The two last layers of the network, the dropout Layer 11 and the dense neural network Layer 12, which are used for the classification task, are discarded. The resulting architecture is shown in Figure 5.
Having as input Mel-scale spectrograms of the same shape as the ones used for training (fs = 44,100, window = 1 s, 128 Mel-coefficients), the output of the model is the 64-dimensional output vector of the dense neural network Layer 10, which is the final layer of the modified network. This fixed-size vector of 64 values is used as the d-vector for clustering and speaker diarization.
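A minimal sketch of this truncation in Keras is shown below, assuming the trained classifier from Section 2.3 is available as `model` and that the dropout and softmax layers are its last two layers.

```python
# A minimal sketch of d-vector extraction: the last two layers (dropout and the
# 24-way softmax) are discarded and the 64-unit dense Layer 10 becomes the output.
import numpy as np
from tensorflow.keras import models

embedding_model = models.Model(inputs=model.input,
                               outputs=model.layers[-3].output)  # 64-unit dense layer

# mel_specs: Mel-scale spectrogram windows shaped like the training input
d_vectors = embedding_model.predict(mel_specs[..., np.newaxis])  # (num_windows, 64)
```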
The d-vector should be efficient for the clustering of audio files with unknown speakers who were not included in the training dataset. This is why a cross-corpus evaluation was performed. An audio file of 20 min in length was used, containing speech from three speakers, two male and one female. The three speakers were not included in the group of 24 speakers of the RAVDESS dataset, and they were recorded in the same studio, using the same equipment. The d-vector is extracted from windows of 1 s with a 90% overlap, as described.
The performance of several clustering algorithms in the task of speaker diarization with d-vector input is evaluated. The results for all algorithms are presented in Section 3. K-means [64] is one of the most popular clustering algorithms in the literature, iteratively assigning observations to the nearest centroid and updating the centroids so as to minimize the inertia. The initial values of the centroids are chosen using the k-means++ algorithm for faster convergence. Agglomerative clustering is a hierarchical unsupervised classification approach, starting from individual observations and performing successive merges [65]. The Ward linkage criterion was selected, which minimizes the variance of the clusters being merged. The linkage is computed using the Euclidean distance. The Birch algorithm builds a tree called the clustering feature tree (CFT) [66]. The maximum number of sub-clusters in each node was set to 50 and the threshold for the merging/splitting of neighboring sub-clusters to 0.5.
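A minimal sketch of the three evaluated algorithms applied to the d-vectors, using scikit-learn with the parameter values given above and assuming the number of speakers is known (three in the described test), could look as follows.

```python
# A minimal sketch of the evaluated clustering algorithms with d-vector input;
# d_vectors is the (num_windows, 64) array from the previous step.
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch

n_speakers = 3

kmeans_labels = KMeans(n_clusters=n_speakers, init="k-means++").fit_predict(d_vectors)

agglo_labels = AgglomerativeClustering(n_clusters=n_speakers,
                                       linkage="ward").fit_predict(d_vectors)

birch_labels = Birch(n_clusters=n_speakers,
                     branching_factor=50,   # max sub-clusters per node
                     threshold=0.5).fit_predict(d_vectors)
```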

Results
The data augmentation procedures described in Section 2.3 resulted in a dataset with 144,271 instances, balanced between the 24 classes. For the evaluation of the speaker recognition model, the dataset was split into three sets, following common practice: training (70%), validation (15%), and testing (15%). The validation set was used for hyperparameter tuning, while the test set contained unseen data held out for evaluation. After the fine-tuning of the model, as presented in Section 2.3, the classification accuracy on the test set was estimated at 88.34%. The learning curves for the training and validation sets are shown in Figure 6. The results in the three sets (training/validation/test) show a balanced performance on known and unseen data and indicate that overfitting issues have been addressed efficiently. A graphical representation of the confusion matrix is depicted in Figure 7, indicating the relation between the predicted and actual speakers.
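A minimal sketch of this split using scikit-learn, where `X` holds the spectrogram instances and `y` the one-hot speaker labels (both names are illustrative), could look as follows.

```python
# A minimal sketch of the 70/15/15 train/validation/test split described above.
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y.argmax(axis=1))
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest.argmax(axis=1))
```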
For the speaker diarization evaluation, the clustering algorithms described in Section 2.4 were evaluated for their performance in clustering unknown speakers. We have used a cross-corpus, vocabulary-independent and language-independent evaluation strategy. The unknown speakers have been selected from the AESDD dataset described in [55]. The unsupervised diarization results are shown in Table 2. The BIRCH clustering algorithm scores the best results in every experiment.

Conclusions
We have presented a framework that addresses the need for evolution in radio production practice within the evolving digital media ecosystem, as well as the challenges of efficiently managing publicly available big audio data. Radio production has to adapt in the direction of providing richer content and more discoverable audio-on-demand, to maintain a broader audience and be more appealing to younger audiences. We have proposed a knowledge extraction scheme for podcast segmentation and annotation, involving speech/music detection, speaker recognition and diarization, speech-to-text for transcription extraction, and music metadata extraction. While many supervised and unsupervised machine learning approaches to audio information retrieval appear in the literature, we estimate that the most effective, robust, and inexpensive approach in terms of working hours is the annotation of radio shows during the production stage. For this reason, an application has been developed, covering live production functionality while providing real-time event logging. The annotated audio files can be browsed through a dedicated interactive GUI which integrates the extracted information. This methodology may also be applied for live metadata streaming to enrich the provided content.
The proposed web radio application can also be used for the creation of a dataset to train models for the recognition of the regular speakers of radio stations. A CNN model was trained and evaluated for 24 different speakers, a number that corresponds to the upper limit of a typical radio station's needs. The recognition accuracy of the model is 88.34% for windows of 1 s. Since it is quite common for radio shows to have one or more guests along with the regular speakers, an unsupervised clustering module was also considered necessary. Following a state-of-the-art approach to speaker diarization, the CNN model was used to extract identity vectors. Experimentation with two, three, and four unknown speakers was conducted, resulting in unsupervised clustering with a maximum of 82.33%-96.72% accuracy for the BIRCH algorithm, which corresponds to a DER of 17.67%-3.28%, since voice activity detection errors are not taken into consideration. This score is very close to the results for supervised speaker recognition. Prior knowledge of the number of guests/clusters proved very important for performance. The resulting DER outperforms commonly reported results of i-vector approaches [41] (DER = 20.54%-42.63%), and is comparable to state-of-the-art d-vector approaches including: ANN [40] (DER = 25.9%-32%), CNN [47] (DER = 15.3%-24.6%), and LSTM [41] (DER = 12.3%-27.3%). The results from papers [40,41,47] are mentioned to provide further insight to the reader concerning the state of the art. However, a direct comparison between the results is not applicable due to different problem definitions in every study (different datasets, numbers of speakers, experimental parameters, etc.).

Limitations and Future Work
The main limitations of the proposed system are set by the diarization error rates. A more lightweight diarization methodology has been proposed, which produces results comparable to the state of the art and clearly outperforms older i-vector systems. Moreover, the clustering performance for an undefined number of unknown speakers is satisfactory only for the BIRCH algorithm. Since the novelty of the present research lies mostly in the convergence of state-of-the-art technologies in a modular approach, the individual modules (speaker diarization, speaker recognition, speech-to-text) can be updated throughout the use of the framework, based on developments in the aforementioned fields. Furthermore, in the presented approach, speaker recognition is performed on individual windows of 1 s. With aggregation over longer complete speech excerpts, which is the most common case in real-life scenarios, robustness is expected to improve through majority voting across successive frames and the discarding of outlier values; thus, the proposed methodology is considered applicable.
The proposed automation is expected to create value through its use. The benefits concern radio organizations, which can organize and document their archives more efficiently, the audience, who can enjoy a more personalized and customized experience, as well as the scientific community. Every annotated podcast available online can serve as a valuable big dataset; manually labeling such data would correspond to many working hours and an unavoidable annotation error rate. The future plans of the project include collaboration with radio stations for testing and feedback collection. The application will be made available as open source to make it more appealing to radio producers. However, in the initial stage, it is estimated that the most efficient kick-start would be collaboration with some of the many academic web radio stations.