# The Domain Mismatch Problem in the Broadcast Speaker Attribution Task

## Abstract

## 1. Introduction

- To study the influence of diarization on the performance of speaker attribution systems;
- To analyze the impact of domain mismatch between models and data;
- To propose robust approaches that mitigate the domain mismatch between the models and the data under analysis.

## 2. The Speaker Attribution Problem

## 3. Experimental Protocol

#### 3.1. Albayzín Corpus and Allowed Data

#### 3.2. Performance Metrics for Diarization and Speaker Attribution

## 4. Methodology

#### 4.1. Front-End, SCPD and Embedding Extractor Blocks

#### 4.2. PLDA Tree-Based Clustering Block

#### 4.3. The Identity Assignment Block

#### 4.4. The Direct Assignment Approach

#### 4.5. Clustering and Assignment: The Indirect Assignment Approach

#### 4.6. Hybrid Solution

#### 4.7. Semisupervised Alternative

#### 4.8. Open-Set vs. Closed-Set Conditions

## 5. Results

- An illustration of the influence of diarization on the speaker attribution problem;
- A depiction of the impact of broadcast domain variability on the speaker attribution task;
- A proposal of alternative approaches to deal with this variability, with special emphasis on unseen domains.

#### 5.1. The Influence of Diarization

#### 5.2. Broadcast Domain Mismatch in Speaker Attribution

#### 5.3. Semisupervised Solutions

## 6. Conclusions

**Figure 1.** Concept diagram of speaker attribution. For a given audio, we identify the portions of speech produced by each enrolled speaker. We must also detect the audio belonging to non-enrolled speakers (red arrow).

**Figure 3.** Diagram of the direct assignment approach. The embeddings obtained from the different parts of the given audio are assigned to the identities independently of one another. Each embedding can be assigned to an enrolled identity or to the generic unknown identity (red arrow).

**Figure 4.** Flowchart of the direct assignment approach. Red and yellow boxes represent the embedding extraction pipelines for the evaluation audio $\mathsf{\Omega}$ (online) and the enrollment audios ${\mathsf{\Omega}}_{enroll}$ (offline), respectively. The resulting embeddings ($\mathsf{\Phi}$ and ${\mathsf{\Phi}}_{enroll}$) are the inputs to the identity assignment block.
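
The direct assignment idea in Figures 3 and 4 can be sketched in a few lines. This is only an illustration under stated assumptions, not the system evaluated here: cosine similarity stands in for the actual scoring back-end, each enrolled speaker is represented by a single reference embedding, and the names `direct_assignment`, `enrolled`, and `threshold` (with its value of 0.5) are hypothetical.

```python
import math

UNKNOWN = "unknown"  # generic identity for non-enrolled speakers


def cosine(u, v):
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def direct_assignment(embeddings, enrolled, threshold=0.5):
    """Label each embedding independently of all the others.

    An embedding takes the best-scoring enrolled identity if its score
    exceeds the (hypothetical) threshold; otherwise it is labeled UNKNOWN.
    """
    labels = []
    for phi in embeddings:
        best_id, best_score = UNKNOWN, threshold
        for identity, reference in enrolled.items():
            score = cosine(phi, reference)
            if score > best_score:
                best_id, best_score = identity, score
        labels.append(best_id)
    return labels


# Toy data: two enrolled speakers and three evaluation embeddings.
enrolled = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]
labels = direct_assignment(segments, enrolled)
print(labels)
```

Because every embedding is scored in isolation, a single noisy segment can flip between identities. The indirect approach addresses this by pooling evidence within clusters.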

**Figure 5.** Diagram of the indirect assignment approach. Embeddings from the audio are first clustered during diarization ($C_{1}, \ldots, C_{3}$). The clusters are then assigned to the available identities, either the enrolled speakers or the generic unknown identity (red arrow).

**Figure 6.** Flowchart of the indirect assignment approach. Red and yellow boxes represent the embedding extraction pipelines for the evaluation audio $\mathsf{\Omega}$ (online) and the enrollment audios ${\mathsf{\Omega}}_{enroll}$ (offline), respectively. The green box represents a diarization system, which clusters the evaluation embeddings $\mathsf{\Phi}$ to obtain the diarization labels ${\Theta}_{DIAR}$. The embeddings ($\mathsf{\Phi}$ and ${\mathsf{\Phi}}_{enroll}$) and the estimated labels ${\Theta}_{DIAR}$ are the inputs to the identity assignment block.
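
The indirect assignment flow in Figures 5 and 6 can be sketched as follows. The sketch takes the diarization labels as given and does not implement diarization itself. Several choices are assumptions made for illustration: clusters are summarized by the mean of their embeddings, cosine similarity replaces the actual scoring back-end, and the names `indirect_assignment`, `diar_labels`, and `threshold` are hypothetical.

```python
import math
from collections import defaultdict

UNKNOWN = "unknown"  # generic identity for non-enrolled speakers


def cosine(u, v):
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def indirect_assignment(embeddings, diar_labels, enrolled, threshold=0.5):
    """Assign an identity to each diarization cluster, then propagate it.

    diar_labels[i] is the cluster index of embeddings[i], as produced by
    a diarization system that is outside the scope of this sketch.
    """
    clusters = defaultdict(list)
    for phi, cluster_id in zip(embeddings, diar_labels):
        clusters[cluster_id].append(phi)

    cluster_identity = {}
    for cluster_id, members in clusters.items():
        centroid = mean_vector(members)
        best_id, best_score = UNKNOWN, threshold
        for identity, reference in enrolled.items():
            score = cosine(centroid, reference)
            if score > best_score:
                best_id, best_score = identity, score
        cluster_identity[cluster_id] = best_id

    # Every embedding inherits the identity decided for its cluster.
    return [cluster_identity[cluster_id] for cluster_id in diar_labels]


# Toy data: four embeddings grouped into three clusters.
enrolled = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.6, 0.5], [0.1, 0.8], [-1.0, -1.0]]
diar_labels = [0, 0, 1, 2]
labels = indirect_assignment(segments, diar_labels, enrolled)
print(labels)
```

The decision is made once per cluster, so its quality depends directly on the diarization labels: a merged or split cluster propagates its error to every segment it contains.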

**Figure 7.** Diagram of the hybrid approach. Embeddings ($\varphi_{1}, \ldots, \varphi_{4}$) are sequentially assigned to the available clusters at each time $t$. Initially, the only available clusters are those of the enrolled speakers. When an embedding is not assigned to any existing cluster ($t = 3$), it creates a new cluster for an unknown speaker (red arrow). This new cluster is then available for subsequent assignments.

**Figure 8.** Flowchart of the hybrid approach. Red and yellow boxes represent the embedding extraction pipelines for the evaluation audio $\mathsf{\Omega}$ (online) and the enrollment audios ${\mathsf{\Omega}}_{enroll}$ (offline), respectively. Both sets of embeddings are used in the new hybrid clustering and identity assignment block.
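
The sequential behavior described in Figures 7 and 8 can be sketched as a greedy online clustering seeded with the enrolled speakers. This is a simplified illustration under assumptions, not the block used in the experiments: clusters are scored by the cosine similarity to their running mean, the decision is a hard threshold, and the names `hybrid_assignment`, `threshold`, and the `unknown_N` cluster labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hybrid_assignment(embeddings, enrolled, threshold=0.5):
    """Sequentially assign embeddings to clusters seeded by enrollment.

    The initial clusters are the enrolled speakers. An embedding that
    matches no existing cluster opens a new unknown-speaker cluster,
    which later embeddings may join.
    """
    clusters = {identity: [reference] for identity, reference in enrolled.items()}
    labels = []
    unknown_count = 0
    for phi in embeddings:
        best_id, best_score = None, threshold
        for cluster_id, members in clusters.items():
            score = cosine(phi, mean_vector(members))
            if score > best_score:
                best_id, best_score = cluster_id, score
        if best_id is None:
            # No cluster is close enough: create one for a new unknown speaker.
            unknown_count += 1
            best_id = f"unknown_{unknown_count}"
            clusters[best_id] = []
        clusters[best_id].append(phi)
        labels.append(best_id)
    return labels


# Toy data mirroring Figure 7: the third embedding opens a new cluster.
enrolled = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0], [-0.9, -1.0]]
labels = hybrid_assignment(segments, enrolled)
print(labels)
```

For attribution scoring, all `unknown_N` clusters would be mapped to the single generic unknown identity. Keeping them separate during processing lets later segments from the same non-enrolled speaker join that speaker's cluster instead of being compared only against enrolled identities.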

**Table 1.** DER (%) results for the Albayzín 2020 corpus on the development and test subsets.

| Scenario | Development DER (%) | Test DER (%) |
|---|---|---|
| Closed scenario | 6.72 | 8.67 |
| Open scenario | 17.27 | 15.16 |

**Table 2.** Impact of diarization on speaker attribution with oracle calibration. Experiments were carried out with the direct assignment system (without diarization) and the indirect assignment system (with diarization). Three degrees of calibration generality are shown: subset level, show level, and audio level. AER (%) results for the Albayzín 2020 development and test subsets under the open-set condition.

| Data subset | Subset-level, direct | Subset-level, indirect | Show-level, direct | Show-level, indirect | Audio-level, direct | Audio-level, indirect |
|---|---|---|---|---|---|---|
| Dev. subset | 41.91 | 37.45 | 41.27 | 35.88 | 39.89 | 29.09 |
| Eval. subset | 48.19 | 34.87 | 41.70 | 28.10 | 40.31 | 26.54 |

**Table 3.** AER (%) results for the Albayzín 2020 corpus with the direct assignment, indirect assignment, and hybrid systems on the development and test subsets, under both closed-set and open-set conditions.

| Subset | Closed, direct | Closed, indirect | Closed, hybrid | Open, direct | Open, indirect | Open, hybrid |
|---|---|---|---|---|---|---|
| Dev. subset | 13.73 | 15.27 | 15.89 | 41.91 | 37.45 | 37.68 |
| Eval. subset | 25.11 | 17.20 | 16.49 | 65.31 | 60.34 | 31.95 |

**Table 4.** AER (%) results for the Albayzín 2020 corpus in the assisted configuration with the indirect assignment and hybrid systems on the development and test subsets, under the open-set condition.

| Data subset | Unsupervised, indirect | Unsupervised, hybrid | Semisupervised, indirect | Semisupervised, hybrid |
|---|---|---|---|---|
| Dev. subset | 38.86 | 39.07 | 42.40 | 38.45 |
| Eval. subset | 59.00 | 30.56 | 30.66 | 28.74 |

