Restricted Boltzmann Machine Vectors for Speaker Clustering and Tracking Tasks in TV Broadcast Shows

Abstract: Restricted Boltzmann Machines (RBMs) have shown success in both the front-end and back-end of speaker verification systems. In this paper, we propose applying RBMs to the front-end for the tasks of speaker clustering and speaker tracking in TV broadcast shows. RBMs are trained to transform utterances into a vector based representation. Because of the lack of data for a test speaker, we propose RBM adaptation to a global model. First, the global model, which is referred to as the universal RBM, is trained with all the available background data. Then an adapted RBM model is trained with the data of each test speaker. The visible to hidden weight matrices of the adapted models are concatenated along with the bias vectors and are whitened to generate the vector representation of speakers. These vectors, referred to as RBM vectors, were shown to preserve speaker-specific information and are used in the tasks of speaker clustering and speaker tracking. The evaluation was performed on the audio recordings of Catalan TV Broadcast shows. The experimental results show that our proposed speaker clustering system gained up to 12% relative improvement, in terms of Equal Impurity (EI), over the baseline system. In the task of speaker tracking, our system has a relative improvement of 11% and 7% over the baseline system using cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring, respectively.


Introduction
Deep learning has been successfully applied to various tasks in image and speech technologies in recent decades. This success has led the research community to use these techniques in speaker recognition tasks [1-5]. Deep learning has been applied to extract bottleneck features (BNF) and to compute Gaussian Mixture Model (GMM) posterior probabilities in a hybrid Deep Neural Network-Hidden Markov Model (DNN-HMM) model [6,7]. At the front end, deep learning is capable of learning deep features from acoustic features, which are used in several speaker recognition tasks [5,8-11]. Deep learning has also been applied to learning a vector representation of a speaker for speaker verification, such as in References [5,12-14]. Some interesting works address the performance loss under degraded speech conditions and acoustic mismatch between the enrollment and test phases of speaker recognition systems [15,16]. There are also several recent approaches to fast training, for example, the Extreme Learning Machine (ELM), which has been extremely efficient in representational learning and several other learning tasks [17-19].
Unsupervised deep learning architectures such as Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs) and Deep Autoencoders have strong representational learning power.
speaker. The target speakers are first enrolled in the system. We represent all the segments and target speakers by RBM vectors. Then, the RBM vectors of all the segments are scored against the RBM vectors of all the target speakers using cosine and PLDA scoring. We have found that the RBM vector representation of speakers is successful in both these tasks, as it is in speaker verification. The experimental results show that the RBM vector outperforms the conventional i-vector based systems using both the cosine and PLDA scoring methods.
The rest of the paper is organized as follows: Section 2 explains the detailed procedure of the proposed vector representation of speakers by using RBMs; Section 3 contains a brief description of the speaker clustering system; Section 4 contains a detailed description of the fundamental stages of our speaker tracking system; Section 5 describes the experimental setup, the database used and how the experiments were carried out; the results obtained are discussed in Section 6; and finally, in Section 7, some conclusions are drawn as the findings of this paper.

RBM Vector Representation
In this paper, we propose the use of a compact, vector based representation of speakers using RBM adaptation for speaker tracking and speaker clustering tasks. Figure 1 shows a detailed block diagram of the proposed RBM vector extraction. First, a global model, referred to as the Universal RBM (URBM), is trained with a large amount of background data. The URBM is then adapted to the data of every test speaker, so that one RBM is trained per test speaker. The visible to hidden weight matrices of these adapted models are used to generate the desired vector representation for the corresponding speaker. These vector representations of speakers are then used in the above-mentioned tasks with cosine/PLDA scoring. The whole process of the vector representation of speakers has three main steps, namely URBM training, RBM adaptation and RBM vector extraction using PCA whitening with dimensionality reduction.

URBM Training
To extract the desired RBM vector, the first step is to train a global or universal model with a large amount of available background speakers' utterances. This global model is referred to as URBM, which is supposed to convey speaker-independent information. The URBM is trained as a single model with the features extracted from all the background speakers' data. For the real valued input features, we have used Gaussian real-valued units for the visible layer of the RBM [37]. The training is performed using the CD-1 algorithm [38,39] assuming that the inputs have zero mean and unit variance. Thus, the features are Mean Variance Normalized (MVN) before the RBM training. Finally, the universal model is trained with a large number of training samples generated from the feature vectors of the background speakers' utterances. This universal model is supposed to learn both speaker and session variabilities from the large background data [22].

RBM Adaptation
After the URBM training, we perform speaker adaptation for every test speaker. The adapted RBM model is trained only with the data of the corresponding speaker, in order to capture speaker-specific information. In this step, the RBM model of the speaker segment is initialized with the parameters (weights and biases) of the URBM. In other words, the adaptation step drives the URBM model in a speaker-specific direction. This kind of adaptation technique has been successfully applied in References [25,40-42]. The adaptation is also carried out by the CD-1 algorithm. As we have only one weight matrix in an RBM, all the information learned by an RBM is in the weight matrix and it is supposed to convey speaker-specific information of the corresponding speaker.
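The URBM training and adaptation steps described above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `train_rbm_cd1`, the 0.01 initial weight scale, the use of the mean reconstruction for the Gaussian visible units, weight decay implemented as an L2 penalty, and the random toy data standing in for MVN features are all assumptions. Adaptation is expressed simply by starting CD-1 from a copy of the URBM parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(X, n_hidden, epochs, lr, weight_decay, batch_size,
                  init=None, seed=0):
    """CD-1 for an RBM with Gaussian (unit-variance) visible and binary hidden units.

    X is assumed to be mean-variance normalised (rows are input vectors).
    init is an optional (W, b_vis, b_hid) tuple, e.g. the URBM parameters.
    """
    rng = np.random.default_rng(seed)
    n_visible = X.shape[1]
    if init is None:
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # assumed init scale
        b_vis = np.zeros(n_visible)
        b_hid = np.zeros(n_hidden)
    else:
        W, b_vis, b_hid = (p.copy() for p in init)  # adaptation starts from URBM
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            v0 = X[order[start:start + batch_size]]
            # Positive phase: hidden probabilities and a binary sample.
            h0_prob = sigmoid(v0 @ W + b_hid)
            h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
            # Negative phase: one Gibbs step, mean reconstruction of the visibles.
            v1 = h0 @ W.T + b_vis
            h1_prob = sigmoid(v1 @ W + b_hid)
            n = len(v0)
            grad_W = (v0.T @ h0_prob - v1.T @ h1_prob) / n
            W += lr * (grad_W - weight_decay * W)
            b_vis += lr * (v0 - v1).mean(axis=0)
            b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid

# Toy data standing in for normalised features (hypothetical, not from the paper).
rng = np.random.default_rng(1)
background = rng.standard_normal((500, 80))
speaker = 0.5 * rng.standard_normal((120, 80))

urbm = train_rbm_cd1(background, n_hidden=400, epochs=2, lr=0.0005,
                     weight_decay=0.0002, batch_size=100)
adapted = train_rbm_cd1(speaker, n_hidden=400, epochs=2, lr=0.005,
                        weight_decay=0.000002, batch_size=64, init=urbm)
```

The two calls differ only in their data, hyperparameters and starting point, which mirrors the idea that the adapted model is the universal model pushed in a speaker-specific direction.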

RBM Vector Extraction
An RBM model is assigned to each test speaker after the adaptation step. The visible to hidden weight matrices of the adapted RBMs, along with their corresponding bias vectors, are concatenated in order to generate a higher dimensional speaker vector. These are referred to as RBM supervectors. After this, PCA whitening with dimensionality reduction is applied to the RBM supervectors in order to generate the lower dimensional RBM vectors. The PCA whitening transforms the original data to the principal component space, which de-correlates the data components. The PCA is trained with the RBM supervectors extracted from the background speakers' utterances and is applied to the RBM supervectors of the test speakers. All the RBM supervectors are mean-normalized before being subjected to PCA whitening and dimensionality reduction. The extracted RBM vectors are supposed to convey enough speaker-specific information to discriminate between different speakers. Figure 2 shows a visualization of a pair of RBM vectors (top and bottom) extracted from different utterances of two different speakers randomly selected from the test audios. From the figure, it is clear that the two RBM vectors extracted for Speaker 1 look similar to each other but differ from those extracted for Speaker 2, and vice versa. In our previous work [22], it has been shown that the RBM vector extracted in this way is successful in learning speaker-specific information in a speaker verification task. Thus, we make use of the RBM vector in the tasks of speaker clustering and speaker tracking.
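The supervector construction and PCA whitening described in this section can be sketched as follows. This is a minimal sketch under assumptions: the helper names, the concatenation order (weights, then visible biases, then hidden biases), the SVD-based PCA and the small `eps` are illustrative choices, and the toy dimensions are far smaller than the 80 × 400 RBMs used in the experiments.

```python
import numpy as np

def rbm_supervector(W, b_vis, b_hid):
    """Concatenate weights and biases of one adapted RBM (order is an assumption)."""
    return np.concatenate([W.ravel(), b_vis, b_hid])

def fit_pca_whitening(background_supervectors, n_components):
    """Estimate mean, principal directions and eigenvalues on background data."""
    mean = background_supervectors.mean(axis=0)
    centred = background_supervectors - mean
    # Thin SVD: covariance eigenvalues are s**2 / (n - 1).
    _, s, Vt = np.linalg.svd(centred, full_matrices=False)
    eigvals = s[:n_components] ** 2 / (len(background_supervectors) - 1)
    return mean, Vt[:n_components], eigvals

def rbm_vectors(supervectors, mean, components, eigvals, eps=1e-10):
    """Mean-normalise, project and whiten supervectors (one per row)."""
    centred = np.atleast_2d(supervectors) - mean
    return centred @ components.T / np.sqrt(eigvals + eps)

# Toy example: 50 hypothetical background RBMs with 6 visible and 4 hidden units.
rng = np.random.default_rng(0)
background = np.stack([
    rbm_supervector(rng.standard_normal((6, 4)),
                    rng.standard_normal(6),
                    rng.standard_normal(4))
    for _ in range(50)
])
mean, components, eigvals = fit_pca_whitening(background, n_components=10)
whitened_background = rbm_vectors(background, mean, components, eigvals)
test_vector = rbm_vectors(background[0] + 0.1, mean, components, eigvals)
```

With 80 visible and 400 hidden units, a supervector built this way would have 80 × 400 + 80 + 400 = 32,480 elements, which the PCA then reduces to the chosen RBM vector length.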

Speaker Clustering
In order to evaluate the effect of RBM vectors in a speaker clustering task, we considered the conventional bottom-up AHC clustering system with the options of single and average linkages. We did not consider the model retraining approach because it is computationally costly compared to the linkage approaches to clustering [28]. The system starts with an initial number of clusters equal to the total number of speaker segments. Iteratively, the segments that are more likely to be from the same speaker are clustered together until a stopping criterion is reached. The stopping criterion can be a threshold on the score that decides whether to merge clusters, or a desired (known) number of clusters. The clustering algorithm is based on computing a distance/similarity matrix M(X) between all the speakers' segments, where X is the set of segments to be clustered. Hence, the RBM vectors of all the segments are extracted and the matrix M(X) is computed by scoring all the RBM vectors against each other. Thus, for N RBM vectors, the matrix M(X) has dimensions N × N. In every iteration, the segments with the minimum distance (or maximum similarity) score are clustered together and the matrix M(X) is updated. The corresponding rows and columns of the clustered segments are removed from M(X) and a new row and column are added. The new row and column contain the scores between the new and old clusters. The new scores are computed according to the linkage algorithm used. For example, segments $S_a$ and $S_b$ are clustered in $S_{ab}$. Then the scores between the new cluster ($S_{ab}$) and an old segment ($S_n$) are computed as follows:

(a) Average Linkage:

$$s(S_{ab}, S_n) = \frac{s(S_a, S_n) + s(S_b, S_n)}{2} \qquad (1)$$

(b) Single Linkage:

$$s(S_{ab}, S_n) = \max\{s(S_a, S_n),\, s(S_b, S_n)\} \qquad (2)$$

where $s(S_{ab}, S_n)$ is the score between the new cluster $S_{ab}$ and the old segment $S_n$, while $s(S_a, S_n)$ is the score between the old segments $S_a$ and $S_n$. In this way, the process is iterated until a stopping criterion is met.
There are two methods to control the iterations: (1) to fix a threshold and (2) to add an additional information to the system about the desired (known) number of clusters. The system stops when this number is reached. In this work, we did not let the system know any desired number of clusters and we have used the thresholding method. We have tuned a threshold in order to see the performance of the system at different possible working points. The system performance is measured with respect to a ground truth cluster label.
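The threshold-controlled clustering loop described in this section can be sketched as below. It is a minimal sketch assuming similarity scores (higher means more alike), with average linkage taken as the mean of the two old scores and single linkage as the larger of the two; the function name `ahc_cluster` and the toy vectors are hypothetical.

```python
import numpy as np

def ahc_cluster(similarity, threshold, linkage="average"):
    """Bottom-up clustering on an N x N similarity matrix; returns one label per segment."""
    sim = np.array(similarity, dtype=float)
    n_segments = len(sim)
    np.fill_diagonal(sim, -np.inf)           # never merge a cluster with itself
    clusters = [[i] for i in range(n_segments)]
    while len(clusters) > 1:
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[a, b] < threshold:            # stopping criterion: best score too low
            break
        if linkage == "average":
            new_row = (sim[a] + sim[b]) / 2.0
        elif linkage == "single":
            new_row = np.maximum(sim[a], sim[b])
        else:
            raise ValueError("linkage must be 'average' or 'single'")
        keep = [k for k in range(len(clusters)) if k not in (a, b)]
        # Remove the rows/columns of a and b, then append the merged cluster.
        updated = np.full((len(keep) + 1, len(keep) + 1), -np.inf)
        updated[:-1, :-1] = sim[np.ix_(keep, keep)]
        updated[-1, :-1] = new_row[keep]
        updated[:-1, -1] = new_row[keep]
        merged = clusters[a] + clusters[b]
        clusters = [clusters[k] for k in keep] + [merged]
        sim = updated
    labels = np.empty(n_segments, dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Toy cosine-similarity matrix for four hypothetical segment vectors.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
cosine = unit @ unit.T
labels_average = ahc_cluster(cosine, threshold=0.5, linkage="average")
labels_single = ahc_cluster(cosine, threshold=0.5, linkage="single")
```

Sweeping `threshold` moves the system between many small clusters and a few large ones, which is how different working points can be obtained.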

Speaker Tracking
We extend our previous work in Reference [24] in order to investigate the effect of RBM vectors on a speaker tracking task. We implemented a two stage speaker tracking system, that is, speaker segmentation and speaker identification. Figure 3 shows the basic steps of speaker segmentation, RBM vector extraction and identification. First of all, the audio is segmented according to the speaker change points. The speaker change points are detected using 'the sliding window and searching for speaker change' approach. A fixed length window is slid over the audio with a very small shift and speaker change is detected using some distance metric. We have used the Divergence Shape distance as a distance metric in this paper. The distance is thresholded in order to decide if the neighboring windows are spoken by the same speaker or whether there exists a speaker change. As a result of these speaker change points, the audio is segmented. In the next stage, a speaker identification of the target speakers against the segments is performed in order to know 'to which target speaker the corresponding segment belongs?' All the target speakers and segments are transformed into a vector based representation by means of RBMs, that is, RBM vectors. These RBM vectors are scored using cosine and PLDA scoring methods. In the following sections, the two stages of our speaker tracking system are discussed in detail.


Speaker Segmentation
As shown in Figure 3, first an energy-based Speech Activity Detection (SAD) is performed on the audio. Then, the speech parts are segmented into small segments of d seconds with an overlap of (d − ∆) seconds, where ∆ is the shift. This is referred to as initial segmentation in the segmentation part of Figure 3. The segments generated in this step are referred to as small segments. The shift ∆ defines the resolution of speaker change detection. Then, Mel-Frequency Cepstral Coefficient (MFCC) features are extracted for every small segment. In order to detect speaker change points, the Divergence Shape distance is computed between every pair of adjacent small segments and is thresholded. We compute the Divergence Shape distance as in References [34,36], using the following simplified expression:

$$D(i, j) = \frac{1}{2}\,\mathrm{tr}\left[(C_i - C_j)\left(C_j^{-1} - C_i^{-1}\right)\right] \qquad (3)$$

where tr is the trace function that sums the diagonal elements of a matrix, $C_i$ is the covariance of the features from small segment $S_i$ and $C_j$ is the covariance of the features from small segment $S_j$. A speaker change point is marked if the distance at that point is greater than the distances at the two neighboring points (one before and one after) and a threshold at that point. For example, a speaker change point at small segment $S_i$ occurs if:

$$D(i, i+1) > D(i, i+2) \qquad (4)$$

and

$$D(i, i+1) > D(i-1, i) \qquad (5)$$

and

$$D(i, i+1) > \mathrm{Threshold}_i \qquad (6)$$

where $D(i, i+1)$, $D(i, i+2)$ and $D(i-1, i)$ are the Divergence Shape distances of small segment $S_i$ to $S_{i+1}$, $S_i$ to $S_{i+2}$ and $S_{i-1}$ to $S_i$, respectively. $\mathrm{Threshold}_i$ is an adaptive threshold which is computed for every small segment and is defined in [34] as:

$$\mathrm{Threshold}_i = \alpha \cdot \frac{1}{N} \sum_{n=0}^{N-1} D(i-n-1,\, i-n) \qquad (7)$$

where α is a scaling factor that needs to be tuned experimentally. We have evaluated the segmentation with different values of α, which we will discuss in Section 6. In Equation (7), N is the number of previous distances used for predicting the threshold. Once we detect the speaker change points by using this method, we segment the audio at these points. The segments generated will be used in the next step, that is, speaker identification.
It is worth noting that we did not perform any refining algorithm for the speaker change points. Rather, we fixed the value of α so as to minimize the Miss Detection error in order not to miss a speaker change. This is because a False Alarm error can possibly be corrected in the speaker identification stage but a Miss Detection error cannot be corrected.
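The building blocks of this change detector can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, full covariance matrices are assumed to be invertible, and the threshold is taken to be α times the mean of the supplied previous distances. The caller decides which segment pairs supply the "before" and "after" distances.

```python
import numpy as np

def divergence_shape(features_i, features_j):
    """Divergence Shape distance between two segments (rows are feature frames)."""
    C_i = np.cov(features_i, rowvar=False)
    C_j = np.cov(features_j, rowvar=False)
    diff_inv = np.linalg.inv(C_j) - np.linalg.inv(C_i)
    return 0.5 * np.trace((C_i - C_j) @ diff_inv)

def adaptive_threshold(previous_distances, alpha):
    """Scaled mean of the N previous distances (N = len(previous_distances))."""
    return alpha * float(np.mean(previous_distances))

def is_change_point(d_here, d_before, d_after, threshold):
    """A change is marked when the local distance beats both neighbours and the threshold."""
    return d_here > d_before and d_here > d_after and d_here > threshold

# Toy features: two hypothetical 'speakers' with different covariance structure.
rng = np.random.default_rng(0)
segment_a = rng.standard_normal((300, 5))
segment_b = 3.0 * rng.standard_normal((300, 5))
same = divergence_shape(segment_a, segment_a)
different = divergence_shape(segment_a, segment_b)
threshold = adaptive_threshold([1.0, 2.0, 3.0], alpha=2.0)
```

Because the distance depends only on covariances, it is symmetric in its arguments and is zero when both segments have identical covariance.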

Speaker Identification
The second stage of our speaker tracking system performs a conventional speaker identification test on the segments and target speakers, as shown in Figure 3. The goal is to determine which target speaker each segment belongs to. We propose the use of the RBM vector representation for both the target speakers and the segments generated in the segmentation stage. The MFCC features are extracted both for target speakers and segments. Then, RBM vectors are extracted and all the segments are tested against the target speakers using cosine and PLDA scoring. Assume that $S_{T_m,S_n}$ represents the cosine/PLDA score for testing the target speaker $T_m$ against the segment $S_n$. For a segment under test, first we select a potential candidate among all the target speakers. The target speaker with the maximum score is the potential candidate for the segment. Then, if the maximum score is greater than a threshold, the identity of that target speaker is assigned to that segment. Generally, the identity of the target $T_m$ is assigned to the segment $S_n$ according to:

$$\mathrm{id}(S_n) = T_{\hat{m}}, \quad \hat{m} = \arg\max_{m} S_{T_m,S_n}, \quad \text{if } \max_{m} S_{T_m,S_n} > \lambda \qquad (8)$$

where λ is a threshold used to decide whether the segment under test belongs to any of the target speakers. If the maximum score is less than λ, the segment is not assigned to any of the target speakers. This is reflected as a Missed Speaker Time (MST) error for the target speaker to which the segment actually belongs. There are no speakers that should be rejected by the system because we consider all the speakers as possible target speakers. We have performed experiments with different values of λ in order to analyze the effect of the proposed RBM vectors at all possible working points.
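The decision rule above can be sketched with cosine scoring as follows. This is a minimal sketch: the function names and the use of -1 to mark a rejected segment are assumptions made for illustration, and the toy two-dimensional vectors stand in for RBM vectors.

```python
import numpy as np

def cosine_scores(target_vectors, segment_vectors):
    """Score every target (rows) against every segment (columns)."""
    t = target_vectors / np.linalg.norm(target_vectors, axis=1, keepdims=True)
    s = segment_vectors / np.linalg.norm(segment_vectors, axis=1, keepdims=True)
    return t @ s.T

def identify(scores, lam):
    """Pick the best-scoring target per segment; return -1 if its score is not above lam."""
    best_target = scores.argmax(axis=0)
    best_score = scores.max(axis=0)
    return np.where(best_score > lam, best_target, -1)

# Toy vectors for two hypothetical targets and three segments.
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
segments = np.array([[0.9, 0.1], [0.1, 0.8], [0.7, 0.7]])
scores = cosine_scores(targets, segments)
decisions = identify(scores, lam=0.8)
```

In this toy case the third segment scores about 0.71 against both targets, so it stays unassigned at λ = 0.8. Lowering λ would assign it to a target, trading missed speaker time for possible false alarms.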

Database
The experiments are performed on the AGORA database, which contains audio recordings of 34 TV shows from Catalan broadcast TV3 [43] (in total 68 audios of approximately 38 min each). These audios contain segments from 871 adult Catalan and 157 adult Spanish speakers. For all the experiments in this work, we selected 38 audio files for testing and 30 audios are used as background data. The background data were used to train the Universal Background Model (UBM) and Total Variability (T) matrix for the baseline i-vector system. For the proposed system, the background data were used to train the URBM and PCA. We manually extracted 2631 speaker segments from the test audios, according to ground truth rich transcription. These segments were used in the speaker clustering experiments. In the testing audios, 414 different speakers appear which were used as target speakers for the tracking experiments. For an audio file, all the speakers are considered as possible target speakers. A priori knowledge is required to enroll the target speakers in the system. Thus, the target speakers are enrolled using i-vectors and RBM vector approaches for the baseline and proposed systems, respectively. The target speakers are enrolled with 30 s of utterances. These enrollment utterances of target speakers are manually selected from the corresponding audio file (in which they appear) according to the ground truth rich transcription. It is worth noting that each target speaker appears in at least one of the test segments.

Baseline and RBM Vector Setup
For all the experiments, 20 dimensional MFCC features were extracted, for both the baseline and proposed systems, using a Hamming window of 25 ms with a 10 ms shift. A 512 component UBM was trained to extract i-vectors for the baseline system and the PLDA was trained with the background i-vectors. More recent and competitive features, for example BottleNeck Features (BNF), could have been used. However, these features (either in the baseline or in the proposed approach) would require a huge amount of labeled background data (for example, phonetic labels). MFCC features, on the other hand, do not require labeled data for training our models. This is the strength of our proposed RBM vectors, which were trained in a completely unsupervised manner. The UBM training, i-vector extraction, i-vector testing and PLDA training were carried out using Alize, a free open source toolkit [44]. For the proposed system, more than 3000 speaker segments were extracted from the background audios according to the ground truth rich transcription. For each segment, the features of 4 neighboring frames were concatenated in order to generate 80-dimensional feature inputs to the RBMs. With a shift of one frame, we generated almost 10 million samples for the URBM training. All the RBMs used in this paper consisted of 80 visible and 400 hidden units. The URBM was trained for 200 epochs with a learning rate of 0.0005, weight decay of 0.0002 and a batch size of 100. All the adapted RBM models for the segments and target speakers were trained for 200 epochs with a learning rate of 0.005, weight decay of 0.000002 and a batch size of 64.
For the baseline i-vector system, the hyperparameters were set to the typical values that are commonly used in speaker recognition tasks. For the proposed RBM vector system, the set of hyperparameters, that is, the visible and hidden units in all the RBMs, the number of epochs and batch size for the URBM, and learning rate for the adapted RBM models were adopted from our previous work in Reference [22]. For the adapted RBM models, we used a higher value for the number of epochs and a slightly lower value for batch size because the segments were very short as compared to our previous work in Reference [22].
The PCA was trained with the background RBM supervectors and was applied to the background RBM supervectors and test RBM supervectors, as discussed in Section 2.3. Finally, fixed dimensional RBM vectors were extracted for the test speakers that were used in the speaker tracking and clustering experiments. Different dimensions for the RBM vectors were evaluated in the experiments which is discussed in Section 6.
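The construction of the 80-dimensional RBM inputs from 20-dimensional MFCC frames (4 neighboring frames, shift of one frame) can be sketched as below. The function name `stack_frames` and the choice to concatenate each frame with the frames that follow it are assumptions; the source does not say how the context window is aligned.

```python
import numpy as np

def stack_frames(mfcc, context=4):
    """Concatenate `context` consecutive frames, sliding by one frame."""
    n_frames, n_dims = mfcc.shape
    n_out = n_frames - context + 1
    if n_out <= 0:
        return np.empty((0, context * n_dims))
    # Column block k holds frame t + k for every output row t.
    return np.hstack([mfcc[k:k + n_out] for k in range(context)])

# Toy MFCC matrix: 10 frames of 20 coefficients (hypothetical values).
mfcc = np.arange(10 * 20, dtype=float).reshape(10, 20)
rbm_inputs = stack_frames(mfcc, context=4)
```

A segment with T frames therefore yields T − 3 training samples, which is how a few thousand background segments can produce millions of URBM inputs.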

Evaluation Metrics
The results of the speaker clustering system were evaluated in terms of Cluster Impurity (CI). CI measures the quality of a cluster, to what extent a cluster contains segments from different speakers. However, this metric has a trivial solution when there is only one segment per cluster. To deal with this, Speaker Impurity (SI) was measured at the same time. SI measures to what extent a speaker is distributed among clusters. There is always a trade-off between these two metrics [45]. CI and SI were plotted against each other in an Impurity Trade-off (IT) curve and an Equal Impurity (EI) point was marked as a working point.
We evaluated the results for speaker segmentation in terms of False Alarm Rate (FAR) and Miss Detection Rate (MDR), as discussed in Reference [46]. The overall speaker tracking system was evaluated in terms of False Alarm (FA) and Missed Speaker Time (MST). In this case, FA is the percentage of duration (in seconds) that is falsely accepted for a target speaker while MST is the percentage of duration (in seconds) that is falsely rejected for a target speaker.
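The trade-off between Cluster Impurity and Speaker Impurity can be illustrated with a small sketch. The definitions used here are an assumption for illustration only: impurity is computed by counting segments, as one minus the fraction of segments that belong to the dominant speaker of their cluster (CI) or to the dominant cluster of their speaker (SI). The exact definitions in Reference [45] may weight segments differently, for example by duration.

```python
from collections import Counter

def impurities(cluster_labels, speaker_labels):
    """Return (cluster impurity, speaker impurity) using segment counts."""
    n = len(cluster_labels)
    pair_counts = Counter(zip(cluster_labels, speaker_labels))
    dominant_in_cluster = {}
    dominant_in_speaker = {}
    for (cluster, speaker), count in pair_counts.items():
        dominant_in_cluster[cluster] = max(dominant_in_cluster.get(cluster, 0), count)
        dominant_in_speaker[speaker] = max(dominant_in_speaker.get(speaker, 0), count)
    ci = 1.0 - sum(dominant_in_cluster.values()) / n
    si = 1.0 - sum(dominant_in_speaker.values()) / n
    return ci, si

speakers = ["a", "a", "b", "b"]
one_segment_per_cluster = impurities([0, 1, 2, 3], speakers)
single_cluster = impurities([0, 0, 0, 0], speakers)
perfect = impurities([0, 0, 1, 1], speakers)
```

With one segment per cluster the CI is trivially zero while the SI is high, and a single big cluster shows the opposite, which is why the two metrics are read together on an IT curve.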

Speaker Clustering
Different lengths for RBM vectors, as well as for i-vectors, were evaluated using cosine scoring and the average linkage clustering algorithm. The results are shown in the second column of Table 1. From the Table, it can be observed that if the dimension is increased, the performance is improved, both in case of i-vectors and RBM vectors, in terms of Equal Impurity (EI). However, in the case of i-vectors, the best choice is 800 dimension. In case of RBM vectors, the 2000 dimensional RBM vectors perform better than the others. In this case, a relative improvement of 11% is achieved compared to 800 dimensional i-vectors. A further increase in the length of RBM vectors beyond 2000 degrades the performance in terms of EI.
The third column of Table 1 compares the performance of the RBM vector with the baseline i-vectors in the case of the single linkage algorithm for clustering using cosine scoring. From the table it is seen that single linkage was a better choice for our experiments. In this case, a minimum EI of 37.14% is obtained with 2000 dimensional RBM vectors which has a relative improvement of 12% over 800 dimensional i-vectors. Finally, we evaluated the proposed system using PLDA scoring as well. The PLDA was trained using background RBM vectors for 15 iterations. The number of eigenvoices were set to 250, 450 and 500 for RBM vectors of dimensions 400, 800 and 2000, respectively. All the RBM vectors were subjected to length normalization prior to PLDA training. As per the previous results, we performed this experiment with the single linkage algorithm only. The results were compared with i-vectors in the fourth column of Table 1. It was observed that 800 and 2000 dimensional RBM vectors have a better EI compared to the respective similar dimensional i-vectors. In this case, the RBM vectors of dimension 2000 have a minimum EI of 31.68% which results in a relative improvement of 11% over the 800 dimensional i-vectors. However, in the case of 400 dimensions, the i-vectors outperform RBM vectors.
The Impurity Trade-off (IT) curves for the baseline, as well as the proposed system, are shown in Figure 4. Figure 4a shows the evaluation of different dimensions of i-vectors and RBM vectors in the average linkage clustering using cosine scoring. It can be seen that RBM vectors of length 2000 give a better performance than 800 dimensional i-vectors at all working points. On the other hand, RBM vectors of dimensions 400, 800, 2400 and 3000 perform worse than i-vectors. It is observed that 400 and 800 dimensional RBM vectors could not capture enough information about the speaker, while 2400 and 3000 dimensional RBM vectors include unnecessary information which degrades the performance. In Figure 4b, it can be seen that the RBM vectors perform better at all working points as compared to i-vectors using their respective cosine and PLDA scoring. Notably, in low Speaker Impurity regions, the RBM vector with cosine scoring outperforms the baseline i-vector with PLDA scoring. Overall, the 2000 dimensional RBM vector shows a consistently improved performance compared to i-vectors.

Speaker Tracking
The application of RBM vectors was further extended to a speaker tracking task. For speaker change detection and segmentation, 20 dimensional MFCC features were extracted for all the small segments using a Hamming window of 25 ms with a 10 ms shift. We performed segmentation using different sizes of small segments in order to see the behaviour at different working points, that is, the d parameter discussed in Section 4.1 was set to 2, 2.5 and 3 s. The value of ∆ was set to 0.25 s. Speech parts shorter than d were discarded in these experiments. Figure 5 shows the graph of FAR against MDR for different values of d and α. The results were computed accepting a tolerance (collar) of ±0.25 s in the position of detected speaker change points. From the figure it is clear that the best choice for d is a 3 s window.
The MDR for this window is not very sensitive to alpha as compared to the other window sizes. This is because in our experiments, the segments less than the selected window size were discarded. Thus the segments have longer durations, which have strong boundaries with the neighbouring segments as compared to a window size of 2 and 2.5 s. A strong boundary is not very likely to be missed by the system. That is why, when we vary α, the MDR does not vary a lot and thus the MDR seems to be insensitive. On the other hand, if the window size is small, the segments have weak boundaries with the neighbour segments and are relatively more likely to be missed by the system.
Our actual working point is marked as a black circle, which is obtained for α = 2 (in Equation (7)). We performed the final segmentation at this point, which has a lower Miss Detection Rate (MDR) than False Alarm Rate (FAR). At this point, a FAR of 10% and an MDR of 7.8% are achieved. There is a trade-off between the two metrics (FAR and MDR): one can decrease one of the metrics at the cost of increasing the other.
The segments generated in the speaker segmentation were then tested against the target speakers for the tracking task. Table 2 shows the results of speaker tracking for different lengths of RBM vector in terms of Equal Error Rate (EER). In this case, the EER is the point at which FA and MST coincide. The second column of Table 2 shows the comparison of the RBM vector with the baseline i-vectors using cosine scoring. We fixed the length of i-vectors to 800 as a conclusion of the speaker clustering experiments. It is observed that, as the length of the RBM vector is increased, the performance is improved. The best EER of 3.30% was obtained using the 2000 dimensional RBM vector, which gained a relative improvement of 11.76% as compared to the baseline 800 dimensional i-vectors. Increasing the dimensions of the RBM vectors does not affect the computational costs of training the models. The dimensions of RBM vectors are only controlled by the number of components retained while applying PCA to the RBM supervectors, as discussed in Section 2.3.
The third column of Table 2 shows the comparison of the RBM vector with the baseline i-vectors using the PLDA scoring method. For the RBM vector/PLDA framework, the PLDA is trained using the background RBM vectors for 15 iterations. The number of eigenvoices is set to 350, 450 and 500 for RBM vectors of lengths 600, 800 and 2000, respectively. All the RBM vectors are subjected to length normalization prior to PLDA training.
From the table, it is clear that the 2000 dimensional RBM vector/PLDA system outperforms the 800 dimensional i-vector/PLDA system by a relative improvement of 7.74%. In the case of PLDA post processing, increasing the dimensions of the RBM vectors will increase the computational costs of PLDA training. This is because the PLDA model is trained on higher dimensional background RBM vectors. Figure 6 shows the comparison of Detection Error Trade-off (DET) curves for the baseline as well as the proposed system. These graphs are obtained by tuning the λ parameter in Equation (8).
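The relative improvements quoted above are presumably computed in the usual way, as the reduction in EER divided by the baseline EER. Under that assumption, the reported cosine-scoring figures (an EER of 3.30% and a relative improvement of 11.76%) imply a baseline EER of roughly 3.74%. This is a back-calculation for illustration, not a value read from Table 2.

```latex
\[
\text{relative improvement}
  = \frac{\mathrm{EER}_{\text{baseline}} - \mathrm{EER}_{\text{proposed}}}
         {\mathrm{EER}_{\text{baseline}}}
\quad\Longrightarrow\quad
\mathrm{EER}_{\text{baseline}}
  = \frac{\mathrm{EER}_{\text{proposed}}}{1 - \text{relative improvement}}
  = \frac{3.30\%}{1 - 0.1176} \approx 3.74\%
\]
```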
In Figure 6a we have evaluated different lengths of RBM vectors by comparing them with i-vectors using cosine scoring. It can be observed that RBM vectors of lengths 800 and 2000 give a better performance than the baseline i-vectors at low MST points only. An RBM vector of length 600 is comparable with the baseline i-vectors in this region. On the other hand, at low FA points the baseline i-vectors outperform RBM vectors of either length. However, at very few working points in the low FA region, the RBM vector of length 2400 is comparable with the baseline i-vectors. In Figure 6b we have shown a comparison of the 2000 dimensional RBM vector (which gives the best results with the cosine scoring method) with baseline i-vectors using both cosine and PLDA scoring. From the figure, a similar kind of behavior is observed for RBM vectors using PLDA scoring as well. It can be seen that the 2000 dimensional RBM vector outperforms the baseline i-vectors in low MST regions using both cosine and PLDA scoring. However, in the low FA regions, the i-vector/PLDA framework still performs better, which was also the case using the cosine scoring method.
The plots in Figure 6 are not very smooth and seem insensitive to λ. This is because the segments are not necessarily of the same duration. As the errors (FA and MST) depend on the duration of segments, a false acceptance/rejection does not affect the error in a linear manner. Sometimes a certain value of λ will falsely accept/reject a long segment, which strongly affects the error, whereas a false acceptance/rejection of a short segment has only a minimal effect on the error.
We show the error variations of our experiments in Figure 7. The box plots in Figure 7 show the EER distribution of 38 test shows for the proposed RBM vector and i-vector based speaker tracking systems. Each box plot shows the minimum, lower quartile, mean, upper quartile, and maximum EER scores.
Figure 7. (a) EER variation using cosine scoring; (b) EER variation using PLDA scoring.

Figure 7a,b depict the box plots for different lengths of RBM vectors and 800 dimensional i-vectors using cosine and PLDA scoring, respectively. Figure 7a shows that RBM vectors reduce the EER variations as compared to i-vectors. It is seen that when we increased the length of RBM vectors, the EER variation was reduced further. The lowest EER variation is observed for the 2000 dimensional RBM vector. Similarly, Figure 7b shows the same behaviour in EER variation using PLDA scoring. The EER variation was reduced in a similar manner for 800 and 2000 dimensional RBM vectors. However, the mean EER was lower for the 2000 dimensional RBM vector. Overall, the respective mean values of EER were lower for PLDA scoring as compared to cosine scoring.

Conclusions
In this paper, we have proposed the use of Restricted Boltzmann Machine (RBM) vectors for the tasks of speaker tracking and speaker clustering in TV broadcast shows. RBM is applied for learning a fixed dimensional vector representation of a speaker which is referred to as an RBM vector. First, a Universal RBM model is trained with a large amount of available background data. Then an adapted RBM model is trained per test speaker. The visible to hidden weight matrices along with the bias vectors of these adapted models are concatenated to generate RBM supervectors. The RBM supervectors are further subjected to a PCA whitening with dimensionality reduction to extract the desired RBM vectors. These RBM vectors are used in the tasks of speaker clustering and speaker tracking. For speaker clustering experiments, two linkage algorithms for an AHC approach are explored with RBM vectors scored using cosine and PLDA. Using cosine scoring, the performance of the proposed system is better for both the linkage algorithms as compared to i-vector based clustering. Overall, the single linkage algorithm with 2000 dimensional RBM vectors is the best choice for our experiments, using both cosine and PLDA scoring. For speaker tracking experiments, we performed speaker segmentation followed by a speaker identification. We proposed the use of RBM vectors for the speaker identification stage. In general, the proposed system is more effective in low MST regions. The experimental results have shown that, in terms of EER, the proposed system outperforms the baseline i-vectors system using both cosine and PLDA scoring methods. We conclude that the RBM vectors can be successfully used as a speaker representation in speaker clustering and speaker tracking tasks.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: