Investigation of Spoken-Language Detection and Classification in Broadcasted Audio Content

Abstract: The current paper focuses on the investigation of spoken-language classification in audio broadcasting content. The approach reflects a real-world scenario, encountered in modern media/monitoring organizations, where semi-automated indexing/documentation is deployed, which could be facilitated by the proposed language detection preprocessing. Multilingual audio recordings of specific radio streams are formed into a small dataset, which is used for the adaptive classification experiments, without seeking, at this step, a generic language recognition model. Specifically, hierarchical discrimination schemes are followed to separate voice signals before classifying the spoken languages. Supervised and unsupervised machine learning is utilized at various windowing configurations to test the validity of our hypothesis. Besides the analysis of the achieved recognition scores (partial and overall), late integration models are proposed for semi-automatic annotation of new audio recordings. Hence, data augmentation mechanisms are offered, aiming at gradually formulating a Generic Audio Language Classification Repository. This database constitutes a program-adaptive collection that, besides the self-indexing metadata mechanisms, could facilitate generic language classification models in the future, through state-of-the-art techniques like deep learning. This approach matches the investigatory inception of the project, which seeks indicators that could be applied in a second step with a larger dataset and/or an already pre-trained model, with the purpose of delivering overall results.

Extending the above notes, contemporary technological developments that are employed in the categorization of the plethora of data, found in various forms, may provide media agencies with the fastest and most reliable outcomes regarding information retrieval. In this framework, language identification/discrimination can be deployed in a large amount of multilingual audio recordings that are collected daily in newsrooms and need to be monitored and further analyzed. Nowadays, we have a polyglot worldwide environment within nations, as well as at the international level, so audio-driven Natural Language Processing (NLP) is not easily deployable in physical everyday communication [34]. The same applies to the detection of words belonging to different languages, mixed within a sentence, a very common feature in conversational speech (including radio broadcasting), e.g., to present the right terms, titles or locations. In this context, language identification and discrimination, as an automated process, can become a valuable tool, especially for content management, labeling and retrieval purposes. The incoming data streams from worldwide radio stations would be automatically classified, creating large digital audio repositories in the newsrooms. Afterwards, the assorted files could be forwarded to an additional kind of elaboration, leading to the creation of numerous content archives. For instance, term recognition processes, encountered in audio and audiovisual content, and generically speech-to-text (STT) services can be expedited by stressing, first, the preprocessing language detection problem.
The present paper investigates language discrimination of audio files derived from real-world broadcasted radio programs. The aim is to first detect and then classify the differentiated patterns that appear in everyday radio productions, such as speech signals, phone conversations and music interferences. A small dataset serves as the starting point of the project, since this is an investigatory approach that will be gradually enhanced with more experiments. At this point, we are not seeking a generic solution but rather indicators that could be applied in a second step, with a larger dataset that would deliver overall results. The above-stated limits make the problem under discussion even more demanding and difficult to deal with. To the best of our knowledge, this is the first time that such a challenging task has been confronted within the restrictions of narrow dataset availability and the associated training difficulties, even within the investigative character of the current work. The rest of the paper is organized as follows. Section 2.1 presents the problem definition and motivation. The proposed framework is analyzed in Sections 2.2 to 2.6, considering all the involved data selection, configuration and training aspects. Experimental results are analyzed and discussed in Section 3, followed by discussion and future work remarks.

Problem Definition and Motivation
This paper investigates machine learning techniques and methodologies that will support multilingual semantic analysis of broadcasted content, deriving from European (and world-wide) radio organizations. The current work is an extension of previous, already published research that is continually elaborated [1-3,33], aiming at meeting the demands of users and industries for audio broadcasting content description and efficient post-management. Since the interconnectivity and spread of Web 2.0 is increasing (moving to Semantic Web 3.0 and beyond), more potential features are engaged for radio networks to continue to develop cross-border broadcasting. Given the achieved progress in separating the main patterns of the radio broadcasted streams (the Voice, Phone, Music (VPM) scheme, i.e., main speakers, telephone conversations and music interferences) [1-3], the proposed framework investigates language detection mechanisms, targeting the segmentation of spoken content at a linguistic level. Focused research was conducted in speaker diarization/verification problems or sentiment analysis via Gaussian Mixture Modeling with cepstral properties [8,17]. Moreover, innovative Deep Learning and Convolutional Neural Network architectures are deployed in this direction [4,5,35] with 2-D input features [7,15]. In addition, several experiments were conducted either on singing voices [18] or at the utterance level [19]. Despite the progress at the algorithmic level, many related efforts shifted to the development of mechanisms for archiving increased language material [36][37][38][39][40]. Furthermore, although the first attempts at resolving this problem date back many years, the multicultural and differentiated linguistic nature of the radio industry reduces the possibility of achieving effective recognition scores. Therefore, new methodologies need to be developed for language identification demands.
The aforementioned task is quite demanding because of the diversity of audio patterns that appear in radio productions. Specifically, common radio broadcasting usually includes audio signals deriving from the main speakers' voices (frequently, with strong overlapping), telephone dialogues (exposing distinctive characteristics that depend on the communication channel properties), music interferences (imposed with fade in/out transitions or background music), various SFX and radio jingles or even noise segments (ambient/environmental noise, pops and clicks, distortions, etc.) [1-3,33]. It has to be noted that broadcasted music content depends mainly on the presenters' or audiences' preferences and, consequently, may not be in accordance with the respective language of origin (i.e., a German radio broadcast may include either German or English songs). The same applies to the corresponding live dialogues and comments that precede or follow music playback, making it rather inevitable to encounter multilingual terms. Most of all, it is not possible to ascribe universal linguistic labeling to the entire recording based on the main language of the associated program.
Taking the above aspects into consideration, and the experience gained from previous research, hierarchical classification approaches are best suited to resolve this task, allowing the initial, highly accurate VPM scheme to isolate non-voice segments, i.e., to exclude them from the subsequent language classification process. Furthermore, both theory and previous experimentation showed that supervised machine learning solutions outperform the accuracy of unsupervised/clustering methods, with the counterbalance of the need for time-accurate standardized semantic annotation, which is rather an old fashioned and time-consuming human-centric procedure [1]. While large-scale labeled audio repositories are currently available to be involved in deep learning processes (e.g., audio books), again, past testing revealed that spontaneous speech and non-stopping sound of real-world radio streams do pose some distinctive features that would not be easily confronted with generic solutions. In fact, program-adaptive training solutions proved to be even more advantageous, since they provide adaptation and generalization mechanisms to the specific speakers, the jingles and the favorite music of each radio show [1- 3,33]. In this context, the grouping of multiple shows with similar characteristics may also be feasible. Concerning the specific problem under study, it is further expected that speaker information can be associated with the different multi-lingual patterns, thus facilitating language recognition through similar adaptations (i.e., a speaker would have specific voicing and pronouncing features while speaking different languages, which could be more easily detected and associated).
The idea and the motive behind the current work is the examination of whether it is possible to train such a language detection system with a small initial dataset. After that, following the strategy that was originally deployed in [33], ensemble learning by means of late integration techniques could be applied [41], combining multiple unsupervised and supervised modalities with different time windowing configurations. Hence, with the only restriction that the initially trained model should be speaker-independent, matching labels between the different modalities would offer a ground-truth data augmentation mechanism through semi-automated annotation (i.e., even by requiring users' feedback to verify high-confidence instances). Thereafter, sessions of semi-supervised learning [42] could be iteratively deployed, thus offering the wanted gradual growth of both generic and program-adaptive repositories (the principal data flow of this process is given in [33], while more specific details on the current task are provided in the corresponding implementation sections).
Media organizations, communication researchers and news monitoring service providers could be benefited by such an information retrieval system. Distinct spoken words that are recognizable in a multilingual environment could accelerate the processes of content documentation, management, summarization, media analytics monitoring, etc. Classified audio files, subjected to transcription and/or translation, could be propelled to other media rather than radio. In addition, they might be used for automatic interpretation of worldwide radio streams. On the next level, the categorized and annotated data, extracted through the implementation of semantic analysis and filtering, could be exploited for content recommendations to users or journalists, enhancing their work while avoiding time-consuming personal quests in vast repositories. An additional outcome of this research, in its fully developed phase, could be the valuable provision of feedback to the radio producers themselves, through the analysis of their own used vocabulary, for instance, by using term-statistics on the frequency of foreign words in their dialogues. This user-centric metric approach could lead them to personal and professional improvements. In another context, it could give linguists significant content-related data on the path of a nation's heritage, based on the alterations of the spoken language through the additions of foreign vocabulary.

Proposed Use Scenario Framework
The far-reaching use scenario of the recommended methodology is depicted in Figure 1. In terms of problem definition, a new long-term sound record (i.e., of 1-h duration or more) is subjected to pattern analysis, segmentation, and indexing, targeting proper documentation and semantic labeling that could offer efficient content-management automation. As already explained, the current work focuses on the semi-supervised classification and annotation of linguistic information. The input audio signals refer to the associated streams of specific radio programs, which are captured along with their live transmission (via RF-broadcasting or webcasting). Two types of different ground-truth datasets are involved. The first one refers to a "generic multilingual audio repository," with the time-marks of the different language segments, which is not expected to match well the individualities of the different radio shows (as argued in the previous section). For this reason, a small dataset is initially formed for each different radio show, investigating the feasibility of the proposed classification scheme through the subsequent experimentation. After this initiation step, the iterative process led to the augmentation of the starting repository, within the analysis of the same recording (i.e., the same broadcasting stream). Given that the proposed procedure will be deployed on multiple episodes of the same radio-show, a modular repository is created, containing multiple records of the same program (or groups of programs). Proceeding to other shows, the dataset will be enhanced by ingesting additional groups of respective records. Hence, the generic repository of Figure 1 is composed by the smaller ones (as the connecting dotted line implies). In simpler terms, a small set is involved in the beginning of the training sessions, posing the most demanding part of the whole system. 
This is the principal motive and aim of the current paper: to answer whether this kind of traditional machine learning is feasible. Next, provided that the whole iterative process can rely on this assumption, the further looping operations involve the augmentation of the collection with new audio stream entries. Hence, a well-organized and self-indexed database is promoted. This gradually incremented repository can be utilized for media assets management and retrieval purposes, also facilitating future research experimentation on the broader multidisciplinary domain. The smaller dataset contains instances of a specific radio show or group of shows with similar characteristics, aiming at facilitating the training of program-adaptive pattern recognition modules. The main idea is that both the "local" and the "global" ground-truth databases will be elaborated, as the method is deployed with multiple broadcasting sound sequences as inputs. Likewise, the formed training pairs will be further exploited in the implementation of associated "generic" and "specific" language recognition systems. Hence, the anticipated results are equally related to the materialization of the iterative self-learning procedures and the formation of a well-organized big-data structure (that is assembled by multiple featured sub-groups). The dataflow of the proposed architecture begins with the engagement of an initial broadcast audio content that is subjected to audio windowing and feature extraction processes, as Figure 1 depicts. In this process, a prime clustering (unsupervised classification) is feasible for the investigative linguistic data grouping (according to hierarchical VPM schemes, as will be explained later). Moreover, feature representations of the windowed sound signals are fed to pre-trained "generic" and "radio show-adaptive" language recognition models (provided that the latter are available; if not, they are omitted, relying entirely on the broad modules).
Supervised and unsupervised classification outcomes are then produced for different windowing-lengths and algorithms, in which confident pattern recognition can be indicated based on the number (C) of the classifiers that provide identical or consistent results [13]. The term "consistent" is used because clustering methods do not actually deliver specific pattern information but only their grouping, so the class is implied by the comparison with the pre-trained modalities. These confidently classified audio segments can be subsequently annotated (semi-autonomously) for the acquisition of voice and language class tags, to support the show-adaptive supervised classification and the formulation of the associated ground-truth database. It has to be noted that the implementation of this work follows an iterative character of a multivariate experimental setup, especially in terms of window lengths, aiming at identifying common grounds (agreement) in the classification results deriving from the different paths of machine learning. In this way, the pre-trained models would contribute to the formulation of a Generic Audio Language Repository that could be further tested in future implementations such as Deep Architectures.
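The agreement rule described above can be illustrated with a minimal sketch. It assumes that every classifier (or window configuration) contributes one label per audio segment and that cluster indices have already been mapped to class labels. The function name `confident_label` and the threshold `min_agree` (playing the role of C) are hypothetical and not taken from the original implementation.

```python
from collections import Counter

def confident_label(votes, min_agree):
    """Return the most frequent label if at least `min_agree` voters agree, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None

# Hypothetical votes for three segments, each from four classifiers/window setups.
segment_votes = [
    ["Vgr", "Vgr", "Vgr", "Veng"],   # three of four agree
    ["Vfr", "Vger", "Veng", "Vgr"],  # no agreement
    ["Veng", "Veng", "Veng", "Veng"],
]

# Segments labeled None would be left for manual annotation (assumption).
auto_labels = [confident_label(v, min_agree=3) for v in segment_votes]
```

Segments that receive a label could be added to the ground-truth repository, possibly after user verification, while the remaining ones stay unlabeled.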

Data Collection-Content Preprocessing
For the conducted experiments, radio content was collected from broadcasts in four different languages, namely native Greek, English, French, and German, which are among the most commonly spoken foreign languages in Europe (at least from a Greek radio audience point of view). The audio signals were formatted (transcoded) to PCM (Pulse-Code Modulation) Wav files (16-bit depth, 44,100 Hz sample rate). At the same time, the stereo property was discarded, since it could serve only for the music/genre discrimination and not for voice (and language) recognition, as it was thoroughly studied in [16][17][18][19][20].
Since the radio content derives from broadcasting in different countries, it is anticipated that several differentiations in structure/fashion may appear. In order to overcome this obstacle, only the common audio patterns were retained, therefore aiming at avoiding homogeneity problems in the conducted semantic analysis. Specifically, each radio segment of the data collection had a 10-min duration, which involved 8 min of main speakers' voices (male and female), 1-min phone conversations and 1-min of music clips, forming a quite representative proportion/ratio of a data sequence with temporal length of 1 h, which is a typical duration of a broadcast show. In this context, the entire multilingual audio signal had an overall duration of 40 min (4 × 10). As already justified, this small duration was a somewhat conscious choice, since it reflects the real-world nature of the given scenario, depicted in Figure 1.

Definition of Classification Taxonomies
As described in the problem definition section, the methodology that was implemented in the current work was constituted by the development of two classification taxonomies, in a hierarchical fashion. Specifically, the first level involves an initial scheme in the radio content, aiming to separate the voice signals of the main speakers (V), the telephone conversations (P) and the music interferences (M), therefore forming the VPM discrimination scheme. It has to be noted that the VPM taxonomy was firstly introduced and examined in [14][15][16][17][18][19][20] for Greek radio broadcasts, to isolate the main voices for validation and indexing purposes. Hence, the adoption of the correspondent VPM scheme in the current work serves as an extension to [14][15][16][17][18][19][20], in the context of generic classification of speech/non-speech radio signals, deriving from an augmented database of multilingual radio broadcasts. Thereafter, the second discrimination level includes the hierarchical classification of the spoken language sub-categories, constituting the Language Detection (LD) scheme. As mentioned before, the LD taxonomy was based only on the speakers' voice samples (that were successfully discriminated via the previous VPM scheme), because of the degradation/deterioration that could be implicated by the unique properties of the phone signals [14,15] and the disorientation by the multinational character of the music data. Moreover, music language can be easily identified in textual processing terms, using the associated meta-data of the song files (title, artists, etc.). The phone category, on the other hand, is not present in all radio streams and it holds a much shorter duration. Hence, it might not be worth complicating the LD hierarchy in such a way that it will not be applicable in the presented scenarios. 
Even within the same radio program, telephone voices would have significant differentiation, emanated by both the speech features of the involved persons and the nature/quality of the connecting lines. Given that the detection of P windows has proved quite easy [14][15][16][17][18][19][20], these segments could be further processed to match the V patterns (i.e., in terms of spectral shaping and dynamic range accommodation). Thus, in case that language recognition of the few phone events is considered necessary, mapping to the used LD scheme could be an alternative approach. Overall, the implemented hierarchical taxonomy was formed in a trade-off kind of concept (to balance between genericity and complexity) and is presented in Figure 2.
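The two-level decision logic can be sketched as follows. This is only an illustration of the hierarchy: `vpm_model` and `ld_model` are hypothetical callables standing in for the trained classifiers, and the toy rules and thresholds inside them are placeholders, not results of this work.

```python
def classify_hierarchically(features, vpm_model, ld_model):
    """First level: VPM. Second level: language detection, applied to voice (V) only."""
    vpm_label = vpm_model(features)
    if vpm_label != "V":
        # Phone and music windows are excluded from language detection.
        return vpm_label, None
    return "V", ld_model(features)

# Placeholder models (assumptions for illustration only).
def toy_vpm_model(features):
    return "M" if features["brightness"] > 0.5 else "V"

def toy_ld_model(features):
    return "Vgr" if features["rolloff"] < 0.3 else "Veng"

result_music = classify_hierarchically({"brightness": 0.9, "rolloff": 0.1}, toy_vpm_model, toy_ld_model)
result_voice = classify_hierarchically({"brightness": 0.2, "rolloff": 0.1}, toy_vpm_model, toy_ld_model)
```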

Windowing Process-Annotation
One of the most crucial steps in audio data mining problems is the signal windowing, since the segmentation length influences the performance of the classification models. In the current work, the audio content was segmented into short frames via Python script coding [21]. Several window lengths were utilized, specifically 100 ms, 500 ms, 1000 ms and 2000 ms, aiming at investigating the impact of the temporal windowing and at indicating the most effective lengths for the current language discrimination process. In addition, the differentiated window length analysis was considered necessary, since two hierarchical schemes were involved in the subsequent classification experiments. Table 1 presents the population of the input samples in each category with the correspondent labeling of Figure 2, according to the following notations:
• M, P, V for Music, Phone and Voice samples, respectively.
• Vgr, Veng, Vfr, Vger for Greek, English, French, German Voice samples, respectively.
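As a rough illustration of the windowing step, the sketch below cuts a mono signal into consecutive windows for the four lengths mentioned above. It assumes non-overlapping windows and drops any incomplete trailing window. The text does not state whether the original script used overlap, so both choices are assumptions, and the function name is hypothetical.

```python
def split_into_windows(samples, sample_rate, window_ms):
    """Cut a mono signal into consecutive, non-overlapping windows of `window_ms`."""
    window_len = int(sample_rate * window_ms / 1000)
    n_windows = len(samples) // window_len  # an incomplete last window is dropped
    return [samples[i * window_len:(i + 1) * window_len] for i in range(n_windows)]

sample_rate = 44100                   # sample rate used for the collected WAV files
signal = [0.0] * (sample_rate * 2)    # two seconds of silence as a stand-in signal

# Number of windows produced for each window length under study.
counts = {ms: len(split_into_windows(signal, sample_rate, ms))
          for ms in (100, 500, 1000, 2000)}
```

Under these assumptions, shorter windows yield proportionally more samples per class, which is why the population of each category differs across window lengths.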

Feature Engine-Feature Evaluation
After the completion of the segmentation step, the formulated datasets of the audio samples were subjected to the feature extraction procedure. Specifically, from each audio frame, an initial set of audio properties was extracted. In the current work, 56 features (Table 2) were computed, taking into consideration previous experience, trial and error tests and bibliographic suggestions [1-11], while the extraction process was conducted via the MIRToolbox specialized software in the Matlab environment [43]. The audio properties include time-domain variables (number of peaks, RMS (Root Mean Square) energy, number of onsets, rhythmic parameters, zero-crossing rate, etc.), spectral characteristics (rolloff frequencies, brightness, spectral statistics, etc.) and cepstral features (Mel Frequency Cepstral Coefficients). A thorough description of the extracted feature set can be addressed in [13][14][15][22][23][24][25][26]. The values of the computed audio properties from each audio frame (100 ms, 500 ms, 1000 ms, 2000 ms) were combined to the respective annotations of Table 1, in order to formulate the ground-truth database, which is necessary for the subsequent data mining experiments, and specifically for training the supervised machine learning models.
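To make two of the time-domain properties concrete, the following sketch computes the RMS energy and a zero-crossing rate for a single frame. It is a simplified illustration in plain Python and does not reproduce the MIRToolbox implementation used in this work. In particular, the zero-crossing rate is expressed here as the fraction of adjacent sample pairs with a sign change, which is an assumption about normalization.

```python
import math

def rms_energy(frame):
    """Root mean square energy of one audio frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

alternating = [1.0, -1.0, 1.0, -1.0]   # toy frame, changes sign at every step
constant = [0.5, 0.5, 0.5, 0.5]        # toy frame without sign changes

features = {
    "alternating": (rms_energy(alternating), zero_crossing_rate(alternating)),
    "constant": (rms_energy(constant), zero_crossing_rate(constant)),
}
```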
The extracted audio properties usually implicate different discriminative performance, with their efficiency and suitability to be strongly related to the specific task under investigation. For this reason, their impact is examined in the current work via an evaluation process, that algorithmically ranks the most powerful features of the 2-layer classification problem. Specifically, the "InfoGain Attribute Evaluation" method was utilized in the WEKA environment [44], which estimates the importance of each property separately, by computing the related achieved information gain with entropy measures (for the correspondent classification scheme). Table 3 presents the feature ranking that was formulated for the first discrimination scheme (VPM), with the implementation of the differentiated window length during the segmentation step. It has to be noted that Table 3 exhibits only the first 10 properties that prevailed during the evaluation tests, while the hierarchy/ranking continues for the whole 56-dimensional feature set. As Table 3 presents, the feature ranking for the VPM discrimination scheme involves slight variations in relationship with the respective temporal lengths. The supremacy of the spectral properties is evident, since the associated features are placed in the first ranking positions, i.e., the brightness (for 3000 Hz, 4000 Hz and 8000 Hz threshold frequencies), the rolloff frequencies (for all threshold energies) and the spectral statistical values of centroid, spread, kurtosis and flatness. Furthermore, the high discriminative power of the cepstral coefficients (mfccs) in speech/non speech classification problems is validated, as these properties hold high order ranking in the feature hierarchy.
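The entropy-based ranking idea can be sketched as follows. The example computes the information gain of a numeric feature after splitting it at a single threshold. This is a simplification chosen for brevity: WEKA applies its own discretization to numeric attributes, so the sketch shows the principle rather than the exact procedure. The feature values and thresholds are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric feature at `threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

labels = ["V", "V", "M", "M"]
brightness = [0.1, 0.2, 0.8, 0.9]   # toy values that separate the two classes
rms = [0.3, 0.7, 0.3, 0.7]          # toy values that carry no class information

gain_brightness = info_gain(brightness, labels, threshold=0.5)
gain_rms = info_gain(rms, labels, threshold=0.5)
```

Ranking features by such gains orders them by how much each one, taken alone, reduces class uncertainty, which is why the ranking is only an initial indication of usefulness.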
The extracted feature vector was also subjected to the evaluation process, on the basis of the second discrimination layer LD. Table 4 presents the feature ranking that was formulated via the InfoGain Attribute algorithm. However, the hierarchy of LD scheme involves both spectral properties (brightness measures, rolloff frequencies, spectral centroid and spread) and temporal features (rms energy, zerocross, attacktime, flux, rhythm_clarity) in the first ten prevailing positions. Since the LD scheme is focused on the discrimination of (multilingual) voice signals, the prevalence of cepstral coefficients (mfccs) is somehow diminished, compared to their impact in the VPM classification layer. The aforementioned feature rankings of Tables 3 and 4 are based on the computation of entropy metrics, hence, they represent a comparative feature analysis (an initial indication of their impact) rather than a strict evaluation of their actual efficiency. Performance and suitability of the selected feature vector are derived from the discrimination rates of the subsequent machine learning experiments, in which all the attributes are simultaneously implicated/exploited in the classification process. Because of the investigatory nature of the current research along with the restricted sample size, the aforementioned hierarchy of audio properties cannot be generalized at this step, towards the extraction of solid conclusions, since potential differentiations could occur while moving to the augmented multilingual audio repository. Nevertheless, the feature ranking results can be useful toward combined, early and late temporal (feature) integration decision making, combining multiple modalities/machines.
The specific ranking was conducted entirely quantitatively, based on the used information evaluation algorithms. A potential explanation of the higher ranking of the spectral attributes might be found on the basis of the differentiated letters/phonemes distribution [45][46][47][48] in the various languages (for example, the increased energy containing "t" and "p" phonemes/utterances). The same effect can be explained on the fact that some languages favor explosive-like speech segments and/or instances. For instance, the rolloff_0.99 parameter (that is constantly first in all LD time-windows) efficiently detects such transients and their associated high spectra (not solely but in combination with other features). Again, the focus of the current paper and its investigative character does not leave room for other related experiments, beyond the provided ranking with the associated trial and error empirical observation and justification comments. In truth, it would be risky to attempt such an interpretation within the small-size audio dataset used, where slight variations in the recording conditions and the particularities of the different broadcasted streams could have a stronger effect than the speaker or the language attributes.

Configuration and Validation of the Training and Evaluation Procedures
The audio samples that were collected, segmented and annotated in the previous steps, along with the respective values of the extracted features, constitute the training pairs for the machine learning and testing experiments, based on both the formulated classification schemes (VPM, LD). Extensive experiments were conducted in [14], aiming at comparing supervised classification methods (decision trees, artificial neural systems, regressions, etc.) based on their overall and partial discrimination rates, in various implementations and schemes of broadcast audio content. In this context, the utilization of artificial neural networks (multilayer perceptrons) achieved increased and more balanced classification rates. Consequently, artificial neural systems (ANS) were selected as the main supervised machine learning technique in the current work. Several experiments were conducted regarding the network topology, in order to achieve efficient training performance, leading to structures of one hidden layer (with sigmoid activation function) and a linear output layer, while approximately 20-23 neurons were engaged in the hidden layer (via trial and error tests). Furthermore, the k-fold validation method was implemented for training purposes, which divides the initial input data set into k subsets; (k-1) subsets are then exploited for training the classifier and the remaining subset is utilized for model validation, while the whole process is repeated k times iteratively [13,14]. The k-fold validation technique aims to evaluate the performance of the ANS topologies and, furthermore, to contribute to the formulation of generalized classification rules. The value of k = 10 was selected in the supervised machine learning experiments.
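The k-fold procedure described above can be sketched in a few lines. The example only generates the index splits and leaves classifier training aside. Samples are assigned to folds in an interleaved, deterministic way, which is an assumption made to keep the sketch reproducible, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices); each sample is validated exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation

# 50 hypothetical samples, k = 10 as in the supervised experiments.
splits = list(k_fold_indices(50, 10))
```

In each of the k rounds, a classifier would be trained on `train` and scored on `validation`, and the k scores would then be averaged.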
In particular, the eight-minute duration of audio speech signal involves eight different speakers, at each language (one minute each). Regarding the k-fold validation, the process was involved both at the Voice/Music/Phone taxonomy (with k = 10) and in the language discrimination task (with k = 8). However, it has to be clarified that in all cases the training process was completely different at each session, while feature treatment (offset, scaling, etc.) was also applied individually, i.e., only the training samples were involved in every case. Furthermore, a randomization of the samples was applied at the audio level, with the precaution of avoiding the same speakers to be engaged both in training and validation tasks. Thus, seven out of the eight speakers were used for training, leaving the remaining one for testing/evaluation purposes. The same operation was conducted eight times, with each training loop leaving aside as unseen data a speaker for each language, also ensuring the avoidance of co-existence of same data in both training/testing subsets. In this concept, we were also able to tackle the partial scores in each language.
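The speaker-independent splitting can be illustrated as below. The sketch assumes that each windowed sample carries a language tag and a speaker index from 0 to 7 within its language, and that fold i holds out speaker i of every language. The dictionary layout and the number of windows per speaker are hypothetical.

```python
def speaker_folds(samples, n_speakers=8):
    """Yield (train, test) lists; fold i holds out speaker i of every language."""
    for held_out in range(n_speakers):
        train = [s for s in samples if s["speaker"] != held_out]
        test = [s for s in samples if s["speaker"] == held_out]
        yield train, test

# Toy dataset: 4 languages x 8 speakers x 3 windows per speaker (placeholder sizes).
samples = [{"language": lang, "speaker": spk, "window": w}
           for lang in ("Vgr", "Veng", "Vfr", "Vger")
           for spk in range(8)
           for w in range(3)]

folds = list(speaker_folds(samples))
```

Because the split is made on speakers rather than on individual windows, no speaker contributes windows to both the training and the evaluation subset of the same fold.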
Additional experiments were conducted under the hold-out validation approach, this time forming pairs of two different speakers and languages (21 in total) as evaluation sets, while using the remaining data for training purposes (again, initiating entirely different/isolated training sessions for each assessment cycle). The results showed that the difference (with the k-fold validation) in the observed accuracy scores varied between 1% and 3%, thus validating the soundness of the approach. While this analysis was considered adequate for the investigative character of the paper, indicative testing was still performed on additional (entirely unseen) audio recordings, revealing almost identical performance. Among others, mixed language sentences (i.e., Greek with English words) spoken by the same person were included, also anticipating the ability of the system to monitor language alteration at the word level. While, again, the recognition scores remained at the same levels, the configuration of proper window and hop lengths was tricky, depending on the speech rate of the involved speaker and the related timing characteristics of the associated recordings and languages.
On the other hand, the popular K-Means algorithm was selected as the respective unsupervised learning method (clustering process), aiming at inspecting and detecting the formation of groups of data (clusters of feature values), according to a similarity metric. The measure that determines the assignment of each sample to a cluster is usually a distance metric (Euclidean, Manhattan, Chebyshev, Min-Max, etc.) that estimates the nearness of each sample to the cluster center. The Euclidean distance, the most commonly used metric, was adopted in the current work (for simplicity), in order to investigate the possible formation of data clusters, based on the defined discrimination categories of the proposed schemes.
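For illustration, the assignment and update steps of K-Means with the Euclidean distance can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the function names, the explicit `initial_centers` argument (chosen here to keep the toy run deterministic) and the two-dimensional toy points are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally sized feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, initial_centers, max_iter=100):
    """Plain K-Means: assign each point to its nearest center, then update centers."""
    centers = [list(c) for c in initial_centers]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        new_assignments = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no sample changed cluster
        assignments = new_assignments
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

# Toy feature vectors (hypothetical): two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
assignments, centers = k_means(points, initial_centers=[points[0], points[3]])
```

Swapping `euclidean` for another distance function (e.g., Manhattan) changes only the assignment step, although the mean-based update is strictly matched to the Euclidean case.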
The overall pattern recognition performance (P) of the ANS modules was estimated for each of the implemented schemes from the generated confusion matrices. Specifically, the classification rate is the % ratio of the number of correctly classified samples to the total number of input samples [1-3]. In the same way, the partial recognition rate P(X) of class X is the % ratio of the correctly classified samples (within class X) to the total number of samples that class X includes, according to the ground-truth annotations. In contrast, the clustering models attempt to detect the formation of data groups directly. Hence, the previous metrics cannot be used as they stand for the performance estimation, since the unsupervised method (K-Means) does not take the ground-truth annotations into account. It has to be noted that one of the main objectives of the current investigation is to examine the feasibility of automatic unsupervised classification of audio signals through clustering strategies and to compare the results with the corresponding ones of supervised machine learning (ANS). For this reason, the data clusters formed by K-Means were compared to the respective annotated classes of ANS, solely to evaluate the cluster concentration. Consequently, the partial integrity/performance P(U) of a cluster U is calculated as the % ratio of the number of samples of the ground-truth class X that were assigned to cluster U to the total number of class-X samples. This metric essentially represents a % estimate of the resemblance between cluster U and class X (a notion related to cluster purity). In this way, the overall performance of the clustering process (P) was computed as the % proportion of the correctly gathered samples (summed over all classes X) to the total number of samples.
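The metrics defined above can be illustrated with a short sketch. All counts below are toy values, not results from the paper, and the function names are hypothetical. In addition, the rule used to pair each ground-truth class X with a cluster U (the cluster that gathers most of the class-X samples) is an assumption made here for illustration, since the text does not spell out the matching procedure.

```python
def overall_rate(confusion):
    """P: % of correctly classified samples; confusion[true][predicted] = count."""
    correct = sum(row.get(cls, 0) for cls, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return 100.0 * correct / total

def partial_rate(confusion, cls):
    """P(X): % of class-X samples that were classified as X."""
    row = confusion[cls]
    return 100.0 * row.get(cls, 0) / sum(row.values())

def cluster_rates(contingency):
    """P(U) per matched cluster and overall P; contingency[class][cluster] = count.

    Assumption: class X is paired with the cluster holding most of its samples.
    """
    mapping = {cls: max(row, key=row.get) for cls, row in contingency.items()}
    partial = {
        mapping[cls]: 100.0 * row[mapping[cls]] / sum(row.values())
        for cls, row in contingency.items()
    }
    gathered = sum(row[mapping[cls]] for cls, row in contingency.items())
    total = sum(sum(row.values()) for row in contingency.values())
    return partial, 100.0 * gathered / total

# Toy supervised confusion matrix (hypothetical counts).
confusion = {
    "voice": {"voice": 90, "phone": 6, "music": 4},
    "phone": {"voice": 5, "phone": 40, "music": 5},
    "music": {"voice": 2, "phone": 3, "music": 45},
}
# Toy class-by-cluster contingency table (hypothetical counts).
contingency = {
    "voice": {0: 80, 1: 15, 2: 5},
    "phone": {0: 10, 1: 35, 2: 5},
    "music": {0: 4, 1: 6, 2: 40},
}

p_supervised = overall_rate(confusion)          # 87.5
p_voice = partial_rate(confusion, "voice")      # 90.0
p_clusters, p_clustering = cluster_rates(contingency)
```

Because P(U) is normalized by the size of class X rather than by the size of cluster U, it does not penalize a cluster that also absorbs samples of other classes; with this simple matching rule, two classes could even be paired with the same cluster, which a stricter one-to-one assignment would prevent.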
These metrics also allowed the performance to be evaluated independently of the size of the formed clusters, which, in real-world radio content, are expected to have unbalanced size distributions. The number of clusters in the unsupervised classification experiments was configured manually to four (in accordance with the attempted formation of groups of data), based on trial-and-error experiments, while also employing the expectation maximization (EM) algorithm [44], which takes into account the probability distributions of the involved samples. Moreover, this parameter could also be set by the user based on prior knowledge of specific shows, thus supporting the program-adaptive classification process in the real-world scenario that this research investigates.

Performance Evaluation Results
The supervised and clustering techniques (ANS and K-Means) were deployed independently in the first discrimination scheme (VPM). Thereafter, the machine learning techniques could be combined in the second classification layer (LD), promoting either a strictly supervised/unsupervised character of the classification modules or a hybrid coupling. Figure 3 presents the above combinations for the two layers of the adopted taxonomy. For simplicity, the letters S and U are used for supervised and unsupervised classification, respectively. It has to be noted that the clustering "path" brings more automation into the whole semantic analysis process, since it avoids the ground-truth data formulation that supervised learning demands. However, it cannot stand alone, because it does not offer a class labeling outcome (either a comparison with a supervised module or some subjective labeling is needed). Given the hierarchical scheme with the two classification layers (VPM, LD) and the two different types of classifiers (S, U), four combinations of the machine learning methods, namely SS, SU, US and UU, are formed according to each path (Figure 3).
Table 5 and Figure 4 show the classification and clustering results for the first discrimination layer (VPM), for the respective window lengths of 100 ms, 500 ms, 1000 ms and 2000 ms. Higher overall and partial discrimination rates can be observed for the VPM scheme when the supervised ANS method is used. The maximum classification percentage for the voice signal is achieved with the segmentation window of 1000 ms. The discrimination percentage of 99.58% corresponds to the correct classification of 1912 out of 1920 voice samples, leaving eight misclassified samples. Moreover, the clustering approach also formed data groups efficiently, which corresponded to the respective categories of the ground-truth set.
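The four classifier paths of Figure 3 and the gating role of the first layer can be sketched as follows. This is only a schematic outline under stated assumptions: `run_hierarchy`, the stand-in layer functions and the toy segments are hypothetical, and the actual ANS or K-Means modules would take the place of the stand-ins.

```python
from itertools import product

LAYER_MODES = ("S", "U")  # S = supervised, U = unsupervised

# One letter per layer (VPM first, LD second): SS, SU, US, UU.
paths = ["".join(combo) for combo in product(LAYER_MODES, repeat=2)]

def run_hierarchy(segments, vpm_layer, ld_layer):
    """Apply the VPM layer to every segment and the LD layer to voice segments only."""
    results = []
    for segment in segments:
        category = vpm_layer(segment)
        language = ld_layer(segment) if category == "voice" else None
        results.append((category, language))
    return results

# Hypothetical stand-ins for trained modules; segments are toy dictionaries.
def toy_vpm(segment):
    return segment["kind"]

def toy_ld(segment):
    return segment["lang"]

segments = [
    {"kind": "voice", "lang": "Greek"},
    {"kind": "music", "lang": None},
    {"kind": "voice", "lang": "English"},
]
labels = run_hierarchy(segments, toy_vpm, toy_ld)
```

In a path that contains an unsupervised layer, the corresponding function would return a cluster index rather than a class name, so a mapping from clusters to labels would still be required before the gating test.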
For the clustering approach, the maximum overall and partial discrimination rates are also obtained with the 1000 ms window length, namely 89.75% and 89.95%, respectively. In this case, the voice cluster involved 1727 samples out of the 1920 previously annotated. It has to be noted that the misclassified data (eight for the supervised ANS and 193 for the unsupervised K-Means) had to be removed before proceeding to the next discrimination scheme (LD), in order to avoid the propagation of classification errors through the hierarchical implementation. In this way, only the 1912 samples (voice signal with a duration of 31 min and 52 s) and the 1727 samples (voice signal with a duration of 28 min and 47 s), respectively, were engaged in the language discrimination process, while also retaining the strategy of parametric window lengths for comparison purposes. While this accommodation was made for purely experimental purposes, this kind of screening is not feasible in real-world scenarios, in which error propagation is inevitable. However, this unwanted effect can be diminished by adjusting the confidence classification parameter (C) that was explained with regard to the analysis of Figure 1.
Table 6 and Figure 5 present the overall pattern recognition scores for the combination of the different classifiers in the LD scheme, including the partial recognition rates for each language. Again, the paths starting with supervised language detection (SU, SS) seem to perform better than the associated unsupervised paths (UU, US). However, it seems that the S and U models do not have significant recognition differences in the LD layer, which can be exploited during the implementation of late integration and/or ensemble learning approaches. Overall, even with such small datasets, the achieved accuracy is considered quite satisfactory, thus providing an initial proof of concept for the proposed methodology presented in Figure 1.
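The voice-sample counts, rates and durations reported for the 1000 ms window in the VPM layer can be cross-checked with simple arithmetic. The sketch below assumes that each sample corresponds to one non-overlapping 1000 ms window, which is consistent with the quoted durations; the helper names are hypothetical.

```python
TOTAL_VOICE = 1920   # annotated voice samples (1000 ms window)
WINDOW_MS = 1000     # assumption: one sample per non-overlapping window

def rate(correct, total=TOTAL_VOICE):
    """Recognition rate in %, rounded to two decimals."""
    return round(100.0 * correct / total, 2)

def duration(samples, window_ms=WINDOW_MS):
    """Total duration of the retained samples as (minutes, seconds)."""
    total_seconds = samples * window_ms // 1000
    return divmod(total_seconds, 60)

ans_rate = rate(1912)                 # supervised ANS, voice class
kmeans_rate = rate(1727)              # unsupervised K-Means, voice cluster
ans_duration = duration(1912)         # (minutes, seconds) passed on to LD
kmeans_duration = duration(1727)
ans_errors = TOTAL_VOICE - 1912       # samples removed before the LD layer
kmeans_errors = TOTAL_VOICE - 1727
```

The values reproduce the figures in the text: 99.58% and 89.95%, 31 min 52 s and 28 min 47 s, and 8 and 193 removed samples, respectively.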
More specifically, the conducted investigations indicated the feasibility of detecting and classifying the spoken language, even when the audio data derive from short-duration broadcast content. Therefore, the experiments validated the possibility of quickly and efficiently identifying the spoken language, based on a small amount of input data, i.e., short-duration annotated files with the voices of the main speakers. Furthermore, previous work has shown that radio-program-adaptive classification is beneficial and outperforms generic solutions [1,33]. The problem under discussion reflects a real-world scenario, encountered in modern media/monitoring organizations, where semi-automated indexing and documentation are needed, which could be facilitated by the proposed language detection preprocessing. On these grounds, the experimentation with a small dataset is essential (and actually more demanding) for the potential formulation of a quick-decision model. Hence, the target of this work was not to implement a generic (i.e., for every purpose) language recognition model, for which extended ground-truth data (e.g., using audio books) could be formed to serve the pattern recognition needs (i.e., exhaustive learning, deep learning architectures, etc.). Such an attempt would possibly be very useful once the Generic Audio Language Classification Repository has been gradually augmented via the iterative radio-broadcast-adaptive operation of Figure 1. However, even for this subsequent step, as described above, the validation of efficient classification performance for reduced-size input data is a prerequisite before moving to further analysis, which justifies the experimentation on short-duration audio signals as an initial exploratory procedure.
Consequently, this strategy matches the investigatory inception of the project, which seeks indicators that could be applied in a second step with a larger dataset and/or an already pre-trained model, with the purpose of delivering overall results.

Discussion and Future Work
The current work addressed the problem of classifying audio content derived from European radio productions, based on the spoken language. In order to support this process, a hierarchical structure was proposed and implemented, consisting of two successive discrimination schemes (VPM and LD). Moreover, several segmentation window lengths were tested for comparison purposes, while supervised machine learning and clustering techniques were employed, both independently and combined. In the first step, the conducted experiments achieved high percentages of voice signal detection/isolation (above 99%). The language classification process achieved overall and partial discrimination rates above 90% in most cases, indicating the distinctive characteristics of each spoken language. This implementation was supported by an effective multi-domain feature engine, as indicated by the high classification performance. Finally, the successful formation of data clusters favored the integration of automation in the whole semantic analysis process, because of their independence from the ground-truth data.
In the context of future work, the presented approach could be extended to involve radio broadcasts in more European countries/languages (Spanish, Italian, etc.), or even dialects, which constitute slight variations of the same language, usually dependent on origin. For this purpose, the adaptive classification process could be fed with specialized broadcast data that involve these kinds of language variations, possibly deriving from radio content in the regions of the dialects. It is also expected that specific radio producers would be attached to specific dialects, which favors the adaptive nature of the approach. In the same direction, another pattern recognition scheme that the model could involve is the identification of a male/female voice in the broadcasted signals (taking advantage of its binary nature and the hierarchical structure used). The aforementioned goals can be supported by the computation, evaluation and experimentation with an expanded feature set (even in trial-and-error tests), because of the complex and specialized character of voice data in radio productions (articulation, speed, transmission channels, etc.). Furthermore, more thorough research may be conducted at the phonetic level, with very small window lengths, in order to capture special sound properties that depend on air movement (from the lungs to the mouth, nose, etc.).
Since radio broadcasts appeal to differentiated interests of the audience, the semantic analysis process may be extended to the thematic classification of radio content, dependent on the "character" of the corresponding broadcast (news, athletic, music programs, etc.). These discrimination capabilities can be empowered by intelligent pattern recognition systems that extract and formulate effective metadata mechanisms for efficient content description, storage, access and management, meeting the users' demands and expectations.
As already explained, the motivation behind this work emanates from specific practical needs in documenting and indexing audio broadcasted content, using short-duration annotated podcast samples from indicative/past radio program streams. In this context, the whole approach, with the small dataset and the machine training difficulties, constitutes a very demanding problem, which also signifies the research contribution of the conducted work.
Clearly, the ultimate target remains the gradual implementation of the full potential of the proposed methodology, as depicted in Figure 1. In this context, iterative self-learning procedures can be elaborated for hierarchical language (and generally audio) classification in broadcasting content, taking advantage of the well-organized big-data structure (assembled from multiple program-adaptive sub-groups). Moreover, depending on the size of the "global" and "local" repositories, more sophisticated deep learning architectures can further advance the potential of this initiative.