# The Domain Mismatch Problem in the Broadcast Speaker Attribution Task


## Abstract


## 1. Introduction

- To study the influence of diarization on the performance of speaker attribution systems;
- To analyze the impact of domain mismatch between models and data;
- To propose robust approximations that mitigate the domain mismatch between models and data under analysis.

## 2. The Speaker Attribution Problem

## 3. Experimental Protocol

#### 3.1. Albayzín Corpus and Allowed Data

#### 3.2. Performance Metrics for Diarization and Speaker Attribution

## 4. Methodology

#### 4.1. Front-End, SCPD and Embedding Extractor Blocks

#### 4.2. PLDA Tree-Based Clustering Block

#### 4.3. The Identity Assignment Block

#### 4.4. The Direct Assignment Approach

#### 4.5. Clustering and Assignment: The Indirect Assignment Approximation

#### 4.6. Hybrid Solution

#### 4.7. Semisupervised Alternative

#### 4.8. Open-Set vs. Closed-Set Conditions

## 5. Results

- An illustration of the influence of diarization on the speaker attribution problem;
- A depiction of the impact of broadcast domain variability on the speaker attribution task;
- A proposal of alternative approximations to deal with this variability, with special emphasis on unseen domains.

#### 5.1. The Influence of Diarization

#### 5.2. Broadcast Domain Mismatch in Speaker Attribution

#### 5.3. Semisupervised Solutions

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
AER | Assignment Error Rate |
AHC | Agglomerative Hierarchical Clustering |
ASR | Automatic Speech Recognition |
CARTV | Corporación Aragonesa de Radio y Televisión |
CLR | Cross Likelihood Ratio |
DER | Diarization Error Rate |
JFA | Joint Factor Analysis |
LSTM | Long Short-Term Memory |
MAP | Maximum a Posteriori |
MGB | Multi-Genre Broadcast |
PLDA | Probabilistic Linear Discriminant Analysis |
RTTH | Red Temática de Tecnologías del Habla |
RTVE | Radio Televisión Española |
SCPD | Speaker Change Point Detection |
TDNN | Time Delay Neural Network |
VAD | Voice Activity Detection |


**Figure 1.** Concept diagram of speaker attribution. For the given audio, we assign the portion of speech generated by each enrolled speaker. We must also detect the audio belonging to non-enrolled speakers (red arrow).

**Figure 3.** Diagram of the direct assignment approach. The embeddings obtained from the different parts of the given audio are independently assigned to the identities. Each embedding can be assigned either to an enrolled identity or to the generic unknown identity (red arrow).

**Figure 4.** Flowchart of the direct assignment approach. Red and yellow boxes, respectively, represent the embedding extraction pipelines for the evaluation audio $\mathsf{\Omega}$ (online) and enrollment audios $\mathsf{\Omega}_{enroll}$ (offline). The obtained embeddings ($\mathsf{\Phi}$ and $\mathsf{\Phi}_{enroll}$) are taken into account in the identity assignment block.
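The direct assignment idea in Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation: cosine similarity stands in for the scoring backend, and the function names, the fixed `threshold`, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def direct_assignment(eval_embeddings, enrolled, threshold=0.5):
    """Label every evaluation embedding independently of the others.

    `enrolled` maps an identity name to one enrollment embedding.
    An embedding whose best score stays below `threshold` is labeled
    "unknown" (the open-set case). Cosine scoring is an assumption here.
    """
    labels = []
    for phi in eval_embeddings:
        best_id, best_score = "unknown", threshold
        for identity, phi_enroll in enrolled.items():
            score = cosine(phi, phi_enroll)
            if score >= best_score:
                best_id, best_score = identity, score
        labels.append(best_id)
    return labels


# Toy data (hypothetical): two enrolled speakers, three evaluation segments.
enrolled = {"spk_A": [1.0, 0.0], "spk_B": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]
print(direct_assignment(segments, enrolled))
# ['spk_A', 'spk_B', 'unknown']
```

Because each segment is scored on its own, a single noisy or short segment can be mislabeled even when its neighbors clearly belong to one speaker.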

**Figure 5.** Diagram of the indirect assignment approach. Embeddings from the audio are first clustered during diarization ($C_{1}, \ldots, C_{3}$). Then, clusters are assigned to the available identities, either the enrolled speakers or the unknown generic cluster (red arrow).

**Figure 6.** Flowchart of the indirect assignment approach. Red and yellow boxes, respectively, stand for the embedding extraction pipelines for the evaluation audio $\mathsf{\Omega}$ (online) and enrollment audios $\mathsf{\Omega}_{enroll}$ (offline). The green box represents the diarization system, which clusters the evaluation embeddings $\mathsf{\Phi}$ to obtain diarization labels $\Theta_{DIAR}$. The obtained embeddings ($\mathsf{\Phi}$ and $\mathsf{\Phi}_{enroll}$) as well as the estimated labels $\Theta_{DIAR}$ are taken into account in the identity assignment block.
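A minimal sketch of the indirect assignment flow in Figures 5 and 6 follows. It assumes the diarization labels are already available, represents each cluster by the mean of its embeddings, and scores with cosine similarity. All three choices, along with the function names and toy data, are illustrative assumptions rather than the method described in the paper.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def indirect_assignment(eval_embeddings, diar_labels, enrolled, threshold=0.5):
    """Assign whole diarization clusters, not single embeddings, to identities.

    `diar_labels[i]` is the diarization cluster of `eval_embeddings[i]`.
    Each cluster is summarized by its mean embedding (an assumption) and
    receives one identity, which all of its members inherit.
    """
    clusters = defaultdict(list)
    for phi, cluster in zip(eval_embeddings, diar_labels):
        clusters[cluster].append(phi)

    cluster_identity = {}
    for cluster, members in clusters.items():
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        best_id, best_score = "unknown", threshold
        for identity, phi_enroll in enrolled.items():
            score = cosine(centroid, phi_enroll)
            if score >= best_score:
                best_id, best_score = identity, score
        cluster_identity[cluster] = best_id

    return [cluster_identity[cluster] for cluster in diar_labels]


# Toy data (hypothetical): the second segment alone is closer to spk_B,
# but diarization groups it with a clear spk_A segment.
enrolled = {"spk_A": [1.0, 0.0], "spk_B": [0.0, 1.0]}
segments = [[1.0, 0.0], [0.3, 0.4], [-1.0, -1.0]]
diar = ["C1", "C1", "C2"]
print(indirect_assignment(segments, diar, enrolled))
# ['spk_A', 'spk_A', 'unknown']
```

The toy example shows the trade-off: pooling evidence per cluster can correct an ambiguous segment, but any diarization error is propagated to every member of the affected cluster.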

**Figure 7.** Diagram of the hybrid approach. Embeddings ($\varphi_{1}, \ldots, \varphi_{4}$) are sequentially assigned to the available clusters at each time $t$. Initially, the available clusters are only those for the enrolled speakers. When an embedding is not assigned to an existing cluster ($t = 3$), it creates an extra cluster for an unknown speaker (red arrow). This new cluster is then available for subsequent assignments.

**Figure 8.** Flowchart for the hybrid approach. Red and yellow boxes, respectively, stand for the embedding extraction pipelines for the evaluation audio $\mathsf{\Omega}$ (online) and the enrollment audios $\mathsf{\Omega}_{enroll}$ (offline). Both sets of embeddings are used in the new hybrid clustering and identity assignment block.
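The sequential behavior described in Figure 7 can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' system: clusters are represented by a running mean, scoring uses cosine similarity, and the `threshold`, the `unknown_N` naming scheme, and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def hybrid_assignment(eval_embeddings, enrolled, threshold=0.5):
    """Assign embeddings one by one, creating clusters for unknown speakers.

    The cluster pool starts with the enrolled speakers only. An embedding
    that matches no cluster opens a new "unknown_N" cluster, which later
    embeddings may join. Running-mean cluster models are an assumption.
    """
    # cluster name -> (sum of member embeddings, number of members)
    clusters = {name: (list(vec), 1) for name, vec in enrolled.items()}
    labels = []
    n_unknown = 0
    for phi in eval_embeddings:
        best_name, best_score = None, threshold
        for name, (total, count) in clusters.items():
            centroid = [value / count for value in total]
            score = cosine(phi, centroid)
            if score >= best_score:
                best_name, best_score = name, score
        if best_name is None:
            # No cluster is close enough: open a cluster for a new speaker.
            n_unknown += 1
            best_name = f"unknown_{n_unknown}"
            clusters[best_name] = (list(phi), 1)
        else:
            total, count = clusters[best_name]
            clusters[best_name] = ([t + p for t, p in zip(total, phi)], count + 1)
        labels.append(best_name)
    return labels


# Toy data (hypothetical): the third embedding matches no enrolled speaker,
# so it opens a cluster that the fourth embedding then joins.
enrolled = {"spk_A": [1.0, 0.0], "spk_B": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.0]]
print(hybrid_assignment(segments, enrolled))
# ['spk_A', 'spk_B', 'unknown_1', 'unknown_1']
```

Unlike the direct approach, non-enrolled speakers are separated from one another here rather than collapsed into a single generic unknown label.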

**Table 1.** DER (%) results for the Albayzín 2020 corpus, including results for both development and test subsets.

Scenario | Development DER (%) | Test DER (%) |
---|---|---|
Closed scenario | 6.72 | 8.67 |
Open scenario | 17.27 | 15.16 |

**Table 2.** Study of the impact of diarization on speaker attribution with oracle calibration. Experiments carried out with direct (without diarization) and indirect assignment (with diarization) systems. Three degrees of calibration generality are shown. AER (%) results for the Albayzín 2020 development and test subsets. Experiment corresponding to the open condition.

Data Subset | Subset-Level, Direct | Subset-Level, Indirect | Show-Level, Direct | Show-Level, Indirect | Audio-Level, Direct | Audio-Level, Indirect |
---|---|---|---|---|---|---|
Dev. subset | 41.91 | 37.45 | 41.27 | 35.88 | 39.89 | 29.09 |
Eval. subset | 48.19 | 34.87 | 41.70 | 28.10 | 40.31 | 26.54 |

**Table 3.** AER (%) results for the Albayzín 2020 corpus. Results of direct and indirect assignment as well as the hybrid systems, including results for both development and test subsets. Experiment corresponding to closed and open conditions.

Subset | Closed, Direct | Closed, Indirect | Closed, Hybrid | Open, Direct | Open, Indirect | Open, Hybrid |
---|---|---|---|---|---|---|
Dev. subset | 13.73 | 15.27 | 15.89 | 41.91 | 37.45 | 37.68 |
Eval. subset | 25.11 | 17.20 | 16.49 | 65.31 | 60.34 | 31.95 |
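To compare systems across conditions, it can help to express the differences in Table 3 as relative AER reductions. The short sketch below only re-expresses the tabulated figures; the helper name `relative_reduction` and the choice of the direct system as baseline are assumptions made for illustration.

```python
def relative_reduction(baseline, candidate):
    """Relative reduction (%) of `candidate` with respect to `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# AER (%) on the evaluation subset, open condition, copied from Table 3.
aer_open_eval = {"direct": 65.31, "indirect": 60.34, "hybrid": 31.95}

for system in ("indirect", "hybrid"):
    gain = relative_reduction(aer_open_eval["direct"], aer_open_eval[system])
    print(f"{system}: {gain:.2f}% relative AER reduction vs. direct")
```

A negative value from the same helper indicates a degradation, as happens for the indirect and hybrid systems on the development subset in the closed condition.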

**Table 4.** AER (%) results for the Albayzín 2020 corpus for the assisted configuration. Results from indirect assignment and hybrid systems, including results for both development and test subsets. Experiment corresponds to an open-set condition.

Data Subset | Unsupervised, Indirect | Unsupervised, Hybrid | Semisupervised, Indirect | Semisupervised, Hybrid |
---|---|---|---|---|
Dev. subset | 38.86 | 39.07 | 42.40 | 38.45 |
Eval. subset | 59.00 | 30.56 | 30.66 | 28.74 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Citation

Viñals, I.; Ortega, A.; Miguel, A.; Lleida, E.
The Domain Mismatch Problem in the Broadcast Speaker Attribution Task. *Appl. Sci.* **2021**, *11*, 8521.
https://doi.org/10.3390/app11188521