1. Introduction
Speaker change detection (SCD) is an important task in audio processing, particularly for applications such as speech segmentation, speaker diarization, and automatic transcription. Accurate detection of speaker transitions is critical in multi-person conversations, especially in meeting scenarios, where challenges such as overlapping speech and background noise can make correct detection difficult.
Traditional SCD methods rely on audio features such as pitch, Mel-frequency cepstral coefficients (MFCCs), and other cepstral coefficients, combined with various approaches such as metric-based and neural network-based methods. However, these approaches often underperform in complex acoustic environments, such as those with low signal-to-noise ratios or simultaneous speech. Multi-modal approaches to SCD have also been explored, but they face practical limitations. These approaches rely on ground-truth transcripts, which are often impractical in real-world scenarios due to transcription errors and limited availability. Another issue is that binary cross-entropy (BCE) loss is typically used as the main objective function, which causes models to fail to learn useful features from imbalanced data. An SCD dataset usually contains many negative samples (no change points) and far fewer positive samples (change points), so the data are inherently imbalanced.
To address these challenges, we propose an end-to-end multi-modal SCD model with a self-attention layer based on large pre-trained models and enhance it with focal loss to mitigate the issue of imbalanced data. The model employs a pre-trained large audio model to extract acoustic features from raw audio and a large language model (LLM) to derive semantic and contextual information from transcripts generated by an automatic speech recognition (ASR) system. By leveraging both modalities in a practical, real-world setting, this approach enhances detection accuracy in challenging acoustic conditions while avoiding the reliance on manually annotated transcripts. Furthermore, this study investigates the impact of different configurations, including audio-only, text-only, and multi-modal setups, to evaluate their effectiveness in SCD. A comparison between BCE loss and different focal loss configurations is carried out. Additionally, we explore different fine-tuning strategies, examining the impact of updating the pre-trained models to optimize performance while balancing computational efficiency. The evaluation is conducted using the multi-talker meeting corpus [
1], a widely recognized dataset in speech and audio processing research. Experimental results show that updating the pre-trained models’ parameters during training significantly outperforms freezing them for all modalities. The audio-only model outperforms the text-only model in various settings, and the multi-modal configuration surpasses both. The self-attention layer introduced in the approach effectively improves the F1-score across modalities compared to the model without it. The proposed approach outperforms the baselines as well as previous multi-modal approaches.
The rest of the work is organized as follows: (i)
Section 2 reviews related studies on SCD, including multi-modal approaches. (ii)
Section 3 presents the approach, including task formalization and the proposed model. (iii)
Section 4 provides details on the experiments, including model setup, data statistics, baselines, and a discussion of the results. (iv)
Section 5 provides a discussion of the obtained results. (v)
Section 6 discusses the limitations of the approach and potential future directions. (vi)
Section 7 concludes the work with a summary of the key findings.
2. Related Work
Existing approaches to SCD problems can be categorized into two groups: (i) unsupervised approaches; (ii) supervised approaches.
Most unsupervised approaches are considered metric-based methods that calculate the discrepancy between two consecutive frames or segments. A change point is detected if the distance between the two segments exceeds a predefined threshold; otherwise, it is considered that no change point exists. In this direction, one of the simplest metric-based approaches [
2] is to use a distance function, such as Kullback–Leibler (KL) divergence [
3], to measure the similarity between two consecutive segments modeled with Gaussian mixture models (GMMs). This approach is very sensitive to the predefined threshold and is difficult to generalize to unseen audio data. Early methods for SCD combine GMMs with the Bayesian information criterion (BIC) [
4,
5,
6]. Specifically, each segment is modeled with a separate GMM (referred to as the two-speaker model), and the two consecutive segments together are modeled with a single GMM (referred to as the single-speaker model). BIC is used to calculate a score that estimates how well these GMMs predict the given segments. The Generalized Likelihood Ratio (GLR) [
7] can also be applied with GMMs. It measures the likelihood ratio between two distributions of these segments. Studies have shown that using neural network-based speaker embeddings [
8,
9] outperforms traditional audio features such as mel-frequency cepstral coefficients (MFCCs), pitch, and filter banks in SCD. In the work [
10], the authors applied a neural network to calculate speaker embeddings at the frame level, such as d-vectors, and used them with metric-based segmentation to detect speaker changes, achieving a fast and effective SCD model based on deep speaker vectors.
In the supervised direction, many machine learning methods have been applied to SCD, including conditional random field (CRF) [
11], which was applied to speaker detection and tracking. In that work, the authors proposed a variational CRF training algorithm for SCD that outperforms GMM-based methods; however, it requires knowing the number of speakers in the audio in advance. In addition to traditional models, various neural networks [
12,
13] have been applied to SCD in recent years.
In the work [
13], the authors proposed an end-to-end SCD approach that also handles overlapping speaker changes. They frame this as a multi-label classification task, which is trained with permutation-invariant training. SincNet [
14] and a bidirectional Long Short-Term Memory (LSTM) network were applied to short audio chunks with higher temporal resolution. The authors reported improved results on both voice activity detection and audio segmentation.
Beyond SCD-specific methods, several studies have contributed important insights into improving ASR systems more generally. In a practical application, Grossinho et al. [
15] developed a multi-modal ASR system for speech therapy that integrates audio and visual cues, demonstrating improved recognition in noisy environments. Dziadzio et al. [
16] showed that language models trained on speech transcripts significantly outperform those trained on written texts, even when using smaller corpora. Wanumen and Florez [
17] highlighted how architectural choices in phoneme recognition systems affect performance, especially when combining modular components. To address noise robustness, Deng [
18] introduced a Bayesian framework that unifies front-end and back-end compensation techniques.
In recent developments, studies on SCD based on pre-training technologies have become more frequent. In the work [
12], the authors proposed a model for speaker embeddings, which was pre-trained under three conditions: (i) gender classification; (ii) contrastive loss; (iii) triplet loss. The model then takes two 2 s audio segments and feeds them into a classification layer to identify whether they belong to one speaker or more. During training, the pre-trained model is kept frozen; only the classification part is fine-tuned for low-latency SCD in broadcast news. The text-based and multi-modal-based SCD approach is another promising research direction [
19,
20,
21,
22]. For the text-based approach, Anidjar et al. [
19] proposed a text-only speaker change detection method using word embeddings. Instead of relying on audio, the model uses fixed-length text windows and predicts speaker changes based on semantic similarity. Tested on Hebrew conversations, the model achieved strong performance and generalized well to unseen speakers. The approach is useful in settings where audio is unavailable or not suitable.
Zhao et al. [
20] present a text-based SCD method that does not rely on audio signals. The authors propose a hierarchical recurrent neural network (RNN) architecture with static sentence-level attention, allowing the model to effectively incorporate surrounding context when determining speaker changes in text transcripts. The model processes dialogue in two levels: an LSTM encodes individual sentences, and another LSTM handles the sequence of sentence representations. A static attention mechanism helps the model focus on informative parts of the context. Experiments on a large dataset of TV talk show transcripts demonstrate that the method outperforms traditional and non-attention neural models, especially in settings where only textual data are available.
For the multi-modal approach, in the work [
21], the authors use a Transformer-Transducer (T-T) model to turn the audio into speaker-turn-augmented transcriptions, adding a special symbol to the transcriptions to indicate a speaker turn for the next training stage. These transcriptions with the speaker-turn symbol are then used as training data to retrain the previously obtained ASR model, after which a new model is trained with a token-level loss function.
Anidjar et al. [
22] propose a hybrid approach to speaker change detection by combining both speech and text-based features. The authors use a Transformer-based model to detect speaker changes in ASR-generated transcripts, while also incorporating audio cues like speaker embeddings. By evaluating multiple fusion strategies (early, late, and hybrid fusion), the work demonstrates that joint modeling of audio and text outperforms using either modality alone. The method achieves strong results on broadcast news datasets, highlighting the complementary nature of acoustic and textual features in SCD.
Another multi-modal approach was introduced by Jung et al. [
23], a state-of-the-art multi-modal SCD model that combines both audio and text features using an encoder-decoder architecture. Instead of aligning speaker embeddings with individual words, the authors extract X-vector embeddings from 1.5 s audio windows, allowing for more robust speaker representation. The model architecture features modality-specific encoders (a BERT-based encoder for text and a CNN-based encoder for audio), followed by a Transformer decoder that performs multi-modal fusion and prediction. The system is trained using ASR-generated transcripts, not gold-standard text, making it suitable for real-world deployment. It outputs token-level speaker change predictions and is trained with a binary cross-entropy loss. An evaluation on the AMI dataset shows that the model outperforms previous single-modal and multi-modal baselines.
The last two previous works are compared with the proposed approach in this study. The main difference of our model is the use of large pre-trained models for both text and audio modalities, along with the application of a self-attention layer to enhance audio feature extraction. To address the data imbalance issue, which is inherent to the speaker change detection task, we employ focal loss to further improve the model’s performance.
3. Approach
3.1. Problem Formalization
Speaker change detection aims to find the time points in an audio stream where the speaker changes. Formally, we are given an input sequence of audio features $X = (x_1, x_2, \ldots, x_T)$, where $x_t$ is the audio feature vector at time frame or segment $t$, and $T$ is the total number of frames or segments. The task is to predict a binary label sequence $Y = (y_1, y_2, \ldots, y_T)$, where $y_t = 1$ if a speaker change occurs at frame or segment $t$, and $y_t = 0$ otherwise.
The goal is to find a function $f_\theta$, parameterized by $\theta$, that minimizes the difference between the predicted labels $\hat{Y} = f_\theta(X)$ and the true labels $Y$.
3.2. Multi-Modal SCD Model
Speaker change detection in audio streams can be addressed using a multi-modal approach that integrates audio and text data.
Figure 1 illustrates the proposed multi-modal SCD model, which uses both audio and text to determine when speakers change during a meeting. First, the multi-talker meeting audio is split into 1 s segments. Each segment passes through a pre-trained audio model (a CNN feature extractor followed by Transformer layers) that produces a speech feature embedding, a compact summary of the segment. The audio is converted into text using an automatic speech recognition system, rather than using ground-truth transcripts, which makes the approach more practical. The text is split into tokens and processed with a pre-trained language model to create a text embedding. Both the speech and text embeddings are then combined in a fusion layer. After that, a classification layer uses the combined information to decide whether a speaker change happens at each time step.
This model employs two pre-trained large models: (i) a large audio model to extract acoustic features from raw audio; (ii) a large language model to derive semantic information from the transcript. The combination of these modalities enhances the accuracy of detecting speaker transitions.
3.2.1. Audio Feature Extraction
In the model, audio features are extracted using the pre-trained Wav2Vec2 encoder, which transforms raw audio signals into contextualized hidden representations. Given a 1 s raw audio waveform segment represented as a vector $\mathbf{a} \in \mathbb{R}^{L}$, where $L$ is the number of audio samples (e.g., $L = 16{,}000$ for 16 kHz sampling), we pass this input through the Wav2Vec2 encoder to obtain a sequence of feature vectors:

$$\mathbf{H} = \mathrm{Wav2Vec2}(\mathbf{a}) = (\mathbf{h}_1, \ldots, \mathbf{h}_T), \quad \mathbf{h}_t \in \mathbb{R}^{d}, \qquad (1)$$

where $T$ is the number of time frames after subsampling by the encoder, and $d$ is the dimensionality of each feature vector.
To enhance this representation, we further apply a self-attention mechanism that enables the model to re-weight the importance of different time steps based on their relevance to speaker change detection. The attention-enhanced audio representation is computed using the output of Equation (1):

$$\mathbf{A} = \mathrm{SelfAttention}(\mathbf{H}) = (\mathbf{a}_1, \ldots, \mathbf{a}_T), \quad \mathbf{a}_t \in \mathbb{R}^{d}.$$

To convert the temporal sequence $\mathbf{A}$ into a fixed-size vector suitable for classification, we apply mean pooling across the time dimension to the output of the attention layer:

$$\mathbf{z}_{\mathrm{audio}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{a}_t.$$
This final vector serves as the aggregated audio embedding, capturing both global and locally attended information from the original speech segment. It is used either directly (in audio-only mode) or concatenated with text embeddings (in multi-modal mode) for downstream classification.
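As a minimal sketch of this audio branch, the block below combines a Wav2Vec2 encoder, a self-attention layer, and mean pooling using PyTorch and Hugging Face transformers. The class name, checkpoint identifier (facebook/wav2vec2-base), and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioEncoder(nn.Module):
    """Illustrative Wav2Vec2 encoder + self-attention + mean pooling."""

    def __init__(self, model_name="facebook/wav2vec2-base", num_heads=8):
        super().__init__()
        self.wav2vec2 = Wav2Vec2Model.from_pretrained(model_name)
        hidden = self.wav2vec2.config.hidden_size          # 768 for the base model
        self.self_attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)

    def forward(self, waveform):                            # waveform: (batch, 16000)
        # Equation (1): contextualized frame-level features H of shape (batch, T, d)
        hidden_states = self.wav2vec2(waveform).last_hidden_state
        # Self-attention re-weights frames with respect to each other
        attn_out, _ = self.self_attn(hidden_states, hidden_states, hidden_states)
        # Mean pooling over time yields the fixed-size audio embedding z_audio
        return attn_out.mean(dim=1)                         # (batch, 768)
```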
3.2.2. Text Feature Extraction
The audio is segmented into 1 s intervals, and an automatic speech recognition (ASR) system is applied to generate transcripts for each segment, rather than relying on ground truth transcriptions, which are impractical for real-world applications. The text processing leverages a pre-trained large language model, Bert [
24], and its contextual embeddings. The transcript is tokenized and input into Bert, producing embeddings for each token. The embedding of the
[CLS] token is selected to represent the sentence’s semantic content:
$$\mathbf{z}_{\mathrm{text}} = \mathrm{BERT}(\text{transcript})_{[\mathrm{CLS}]},$$

where $\mathbf{z}_{\mathrm{text}}$ encapsulates the overall meaning of the text.
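A minimal sketch of this text branch, assuming the bert-base-cased checkpoint from Hugging Face transformers; the function name, the maximum sequence length, and the use of no_grad (appropriate only for the frozen mode) are illustrative choices.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")

def text_embedding(transcript: str) -> torch.Tensor:
    """Return the [CLS] embedding of an ASR transcript (illustrative)."""
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():                       # drop no_grad when fine-tuning (NF mode)
        outputs = bert(**inputs)
    # The [CLS] token is the first position of the last hidden layer
    return outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)
```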
3.2.3. Feature Fusion and Classification
To integrate the audio and text modalities, the resulting audio and text embeddings are concatenated:

$$\mathbf{z}_{\mathrm{fusion}} = [\mathbf{z}_{\mathrm{audio}}; \mathbf{z}_{\mathrm{text}}],$$

where $\mathbf{z}_{\mathrm{fusion}}$ is the fusion embedding obtained by concatenating the embeddings from the text and audio extraction modules.
The resulting fused representation serves as input to a fully connected neural network for classification. The classifier consists of three linear layers with decreasing dimensions. Each linear transformation is followed by layer normalization to stabilize training and improve generalization. Non-linearity is introduced using the ReLU activation function, and dropout regularization is applied after each hidden layer to mitigate overfitting. The final output is passed through a sigmoid activation function to produce a probability score for the classification task.
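A sketch of the fusion and classification stage is given below. The layer sizes and dropout follow the configuration reported in Section 4.1, while the input dimensions (768 for each modality) and the class itself are illustrative assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates audio and text embeddings and predicts P(speaker change)."""

    def __init__(self, audio_dim=768, text_dim=768, dropout=0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 512), nn.LayerNorm(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.LayerNorm(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    def forward(self, z_audio, z_text):
        # Fusion by simple concatenation of the two modality embeddings
        z_fusion = torch.cat([z_audio, z_text], dim=-1)
        # Sigmoid turns the logit into a speaker-change probability
        return torch.sigmoid(self.net(z_fusion)).squeeze(-1)
```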
To optimize the model, we employ focal loss, which is designed to address class imbalance by down-weighting well-classified examples and focusing more on hard-to-classify instances. Given the predicted probability and the ground-truth label, the focal loss is computed as follows:
$$\mathcal{L}_{\mathrm{focal}} = \alpha_t \, (1 - p_t)^{\gamma} \, \mathcal{L}_{\mathrm{BCE}},$$

where $\mathcal{L}_{\mathrm{BCE}} = -\log(p_t)$ is the binary cross-entropy loss, $p_t$ represents the predicted probability for the true class, $\alpha_t$ is a weighting factor that balances the contribution of positive and negative samples, and $\gamma$ controls the down-weighting of easy examples.
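A compact sketch of this focal loss in PyTorch is shown below. The clamping is added here for numerical stability, and the default $\alpha$ and $\gamma$ values mirror those reported in Section 4.1; the function name is illustrative.

```python
import torch

def focal_loss(probs, targets, alpha=0.31, gamma=2.0):
    """Binary focal loss on predicted probabilities (illustrative sketch)."""
    probs = probs.clamp(1e-7, 1 - 1e-7)
    # p_t is the probability the model assigns to the true class
    p_t = torch.where(targets == 1, probs, 1 - probs)
    # alpha weights positives, (1 - alpha) weights negatives
    alpha_t = torch.where(targets == 1,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1 - alpha))
    bce = -torch.log(p_t)                      # binary cross-entropy term
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```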
By combining structured feature extraction with a classification strategy, the model effectively learns meaningful audio representations while addressing class imbalance in the dataset.
4. Experiments
4.1. Model Setup
The model utilizes pre-trained Wav2Vec2 and Bert, specifically the wav2vec2-base and bert-base-cased variants. Training employs the AdamW optimizer with a learning rate scheduler, ensuring stable updates and effective convergence. The base model components (Wav2Vec2 and Bert) and the classifier are assigned separate learning rates, with the classifier using the higher of the two, and weight decay is applied for regularization. To enhance stability, a warmup ratio of 0.1 is used, meaning the learning rate gradually increases during the first 10% of total training steps before following a linear decay. Training is conducted for 10 epochs, with the total number of training steps computed as the product of the number of epochs and the length of the training dataloader. Additionally, a random seed is set to ensure reproducibility across experiments.
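The optimizer and scheduler setup could look roughly as follows. The learning-rate and weight-decay values are placeholders (the exact values are not reproduced here), and `model` (with `.wav2vec2`, `.bert`, and `.classifier` attributes) and `train_dataloader` are assumed to exist; this is a sketch of the structure, not the authors' training script.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Placeholder hyperparameters: lower LR for the backbones, higher LR for the classifier.
BACKBONE_LR, CLASSIFIER_LR, WEIGHT_DECAY = 1e-5, 1e-4, 0.01   # assumed values
EPOCHS, WARMUP_RATIO = 10, 0.1

param_groups = [
    {"params": list(model.wav2vec2.parameters()) + list(model.bert.parameters()),
     "lr": BACKBONE_LR},
    {"params": model.classifier.parameters(), "lr": CLASSIFIER_LR},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=WEIGHT_DECAY)

# Total steps = epochs x length of the training dataloader, as described above
total_steps = EPOCHS * len(train_dataloader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(WARMUP_RATIO * total_steps),
    num_training_steps=total_steps,
)
```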
The classifier is designed to process features with a hidden size of 768. The number of self-attention heads is set to 8. The classifier consists of three fully connected layers with 512 and 256 hidden units, each followed by layer normalization, ReLU activation, and dropout (0.4), leading to a final output layer for classification.
Each of the three configurations, multi-modal, text-only, and audio-only, supports two training modes: (i) a freeze mode (F); (ii) a non-freeze mode (NF). In freeze mode, all Wav2Vec2 and Bert parameters are fixed, and only the classification layers are trained. In non-freeze mode, the corresponding base model(s) are trained together with the classification layers. For the focal loss of the model, $\alpha$ is set to 0.31 and $\gamma$ is set to 2.0. For convenience in naming, we use eMD to refer to the proposed approach. Using the different settings described above, we define two versions of eMD: (i) a multi-modal model without self-attention and cross-attention layers, which consists only of the three fully connected layers mentioned earlier for classification; this version is denoted simply as eMD. (ii) A model with self-attention layers, which enhances feature interaction; this version is referred to below as the eMD model with self-attention.
4.2. Baselines
To check how well the proposed models work, we compare them with these baselines:
(i) Neural Network (NN) Baseline: This model takes a flattened vector of 13-dimensional MFCC features as input. It has fully connected layers with non-linear activation functions. Then, a sigmoid output layer predicts the chance of a speaker change. This baseline captures feature patterns in each segment but does not model time relations between segments.
(ii) Bidirectional Long Short-Term Memory (BiLSTM): This model takes sequences of 13-dimensional MFCC feature vectors. Its bidirectional structure helps capture both past and future time relations. BiLSTM outputs are pooled over time to form a fixed-size representation. Then, a fully connected layer with a sigmoid activation is used for classification.
(iii) Fine-Tuned Approach with Multilayer Perceptron (MLP): This model is a strong baseline similar to this work. It uses only audio and has a single linear classification layer. It is referred to as Wav2Vec2-MLP.
4.3. Dataset Statistics
Experiments were conducted using the AMI meeting corpus [
1], a well-established audio dataset commonly utilized in speech and audio processing research. The AMI corpus provides a diverse collection of meeting recordings, including noisy and interrupted speech, allowing for the evaluation of the model under a variety of conditions and scenarios. The dataset used for training and evaluating the multi-modal speaker change detection model is detailed in two tables.
Table 1 presents the overall statistics of the training dataset, which includes 137 meetings with an average of 790.99 change points per meeting. The dataset exhibits an average of 306.61 overlap points, indicating instances of simultaneous speech, and an average of 3.99 speakers per meeting. The average length of each meeting is 34.11 min, with a total duration of 4673.69 min. The class distribution shows that 30.85% of the data are positive instances (indicating speaker changes), while 69.15% are negative instances (no speaker changes).
Table 2 provides statistics for the test set, covering eight meetings identified by specific IDs (e.g., EN2002a, EN2002b, etc.). It reports the total number of change points, overlap points, and speakers per meeting, as well as the audio length in minutes. For example, the EN2002a meeting has 1366 change points, 678 overlaps, 4 speakers, and a duration of 43.90 min. The test set averages 821.31 change points, 328.19 overlaps, 3.94 speakers, and 33.38 min per meeting. Additionally, the class distribution for each meeting is provided, with percentages for positive (P) and negative (N) classes. On average, 31.40% of the test set instances are positive, and 68.60% are negative, reflecting a similar imbalance to the training set. These statistics highlight the complexity and variability of the dataset, which are particularly due to overlapping speech and the distribution of speaker changes.
4.4. Pre-Processing
The pre-processing pipeline is designed to prepare audio data from the AMI meeting corpus for speaker change detection tasks. We use the standard split of the AMI dataset into training and test sets, as provided by the Hugging Face datasets library, and load corresponding metadata files. For each meeting, the audio is segmented into fixed-length chunks, typically 1 s using librosa. Each segment is then labeled as a speaker change or not based on its temporal overlap with annotated change points. The segments are then standardized to exactly 16,000 samples by zero-padding or truncation, ensuring consistent input size for downstream modeling.
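A simplified sketch of this segmentation and labeling step is shown below. It assumes the change-point annotations are available as a list of timestamps in seconds; the function and variable names are illustrative.

```python
import numpy as np
import librosa

SAMPLE_RATE, SEGMENT_SECONDS = 16000, 1.0
SEGMENT_SAMPLES = int(SAMPLE_RATE * SEGMENT_SECONDS)   # 16,000 samples per segment

def segment_audio(wav_path, change_points):
    """Split a meeting into 1 s segments and label those containing a change point."""
    audio, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    segments, labels = [], []
    for start in range(0, len(audio), SEGMENT_SAMPLES):
        chunk = audio[start:start + SEGMENT_SAMPLES]
        # Zero-pad (or truncate) to exactly 16,000 samples
        chunk = np.pad(chunk, (0, SEGMENT_SAMPLES - len(chunk)))[:SEGMENT_SAMPLES]
        t0, t1 = start / SAMPLE_RATE, (start + SEGMENT_SAMPLES) / SAMPLE_RATE
        # Positive label if any annotated change point falls inside the segment
        labels.append(int(any(t0 <= cp < t1 for cp in change_points)))
        segments.append(chunk)
    return np.stack(segments), np.array(labels)
```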
Each audio segment is fed into the whisper-base model, used as the ASR system to obtain transcripts from audio. The hyperparameters for the Whisper model are set as follows: temperature is set to 0, best_of is set to 1, and beam_size is set to 5. To generate text embeddings, we use the bert-base-cased model, a case-sensitive pre-trained language model (PLM) trained on a large collection of English texts. No additional text normalization is applied to the transcripts.
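The ASR step could be sketched with the openai-whisper package as follows; the decoding options mirror the hyperparameters listed above, while the function name and the float32 conversion are illustrative details.

```python
import whisper

asr_model = whisper.load_model("base")   # whisper-base checkpoint

def transcribe_segment(segment_audio):
    """Transcribe a 1 s, 16 kHz waveform (NumPy array) with the settings above."""
    result = asr_model.transcribe(
        segment_audio.astype("float32"),
        temperature=0.0, best_of=1, beam_size=5,
    )
    return result["text"].strip()
```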
4.5. Evaluation Metrics
Several metrics are employed to assess the performance of the baseline and proposed methods. The false alarm rate (FAR) measures the proportion of incorrect speaker change detections, which is calculated as the ratio of false positives to all non-change points:

$$\mathrm{FAR} = \frac{FP}{FP + TN}.$$

The missed detection rate (MDR) evaluates the frequency of undetected speaker changes, which is determined as the ratio of false negatives to all true change points:

$$\mathrm{MDR} = \frac{FN}{FN + TP}.$$

The Hit rate measures how well the model correctly identifies actual speaker change points and is the complement of the MDR:

$$\mathrm{Hit\ Rate} = \frac{TP}{TP + FN} = 1 - \mathrm{MDR}.$$
Additionally, precision is used to indicate the accuracy of detected speaker changes, recall assesses the ability to identify all true speaker changes, and the F1-score provides a balanced measure of precision and recall. These metrics collectively offer a comprehensive evaluation of the model’s effectiveness in speaker change detection. It should be noted that the average results (precision, recall, F1-score, FAR, and MDR) reported below are calculated using macro averaging.
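As a reference, the metrics above can be computed from confusion-matrix counts as in the following sketch; the function name is illustrative, and the unguarded divisions assume both classes are present.

```python
import numpy as np

def scd_metrics(y_true, y_pred):
    """FAR, MDR, hit rate, precision, recall, and F1 from binary labels (sketch)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    far = fp / (fp + tn)               # false alarms over all non-change points
    mdr = fn / (fn + tp)               # misses over all true change points
    hit = 1.0 - mdr                    # = tp / (tp + fn)
    precision = tp / (tp + fp)
    recall = hit
    f1 = 2 * precision * recall / (precision + recall)
    return {"FAR": far, "MDR": mdr, "Hit": hit,
            "Precision": precision, "Recall": recall, "F1": f1}
```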
4.6. Results
4.6.1. Results of the eMD Model Without Self-Attention
Table 3 and
Table 4 detail the performance of the
eMD model (without self-attention) for SCD in the freeze and non-freeze modes across three modalities: Text, Audio, and Multi-modal.
The Multi-modal approach consistently outperformed single modality configurations in both the freeze and non-freeze settings. For instance, in the freeze models, the Multi-modal modality achieved a Hit Rate of 79.88% in meeting EN2002a (45.51% P, 54.49% N), compared to 72.28% for Text and 78.34% for Audio. In the non-freeze models, the Multi-modal Hit Rate reached 82.98% in EN2002d (40.20% P, 59.80% N), surpassing Text (70.43%) and Audio (82.26%) modalities. The non-freeze models exhibited marked improvements over freeze models, particularly in Audio and Multi-modal settings. The FAR in Audio modality decreased from a range of 19.08–40.48% in freeze models to 8.32–25.99% in the non-freeze models. Similarly, the Multi-modal MDR reduced from 20.12–30.53% (freeze) to 17.02–33.33% (non-freeze), indicating enhanced detection accuracy.
The class imbalance significantly impacted the Text modality in the freeze models, resulting in elevated MDR and reduced Hit Rates. For example, in meeting ES2004b (25.46% P, 74.54% N), the Text modality recorded an MDR of 43.72% and a Hit Rate of 56.28%, reflecting a bias toward the majority negative class. This imbalance compromised the detection of speaker changes. The non-freeze models, particularly Multi-modal, effectively addressed these challenges. In ES2004b, the Multi-modal non-freeze model improved the Hit Rate to 72.53% and reduced the MDR to 27.47%, demonstrating the advantage of adaptive learning and multi-modal integration in handling imbalanced distributions.
Figure 2 compares the performance of the freeze (F) and non-freeze (NF) models across three setups of the
eMD model: text-only, audio-only, and multi-modal. In
Figure 2a, the stacked bar chart shows the average FAR and MDR. The hatched bars represent FAR, while the solid bars represent MDR. The light blue bars correspond to freeze mode, and the red bars correspond to non-freeze mode. Generally, the multi-modal configuration shows a lower FAR and MDR compared to the text-only and audio-only setups. All non-freeze models outperform the freeze models in all three settings. In the audio-only mode, it can be seen that the FAR decreases significantly from 31.49% to 25.98%, while the MDR drops from 25.24% to 16.93%.
Figure 2b shows the averaged results of F1-scores. The text-only configuration achieves an F1-score of 54.22% with freeze and 55.26% with non-freeze. The audio-only configuration records an F1-score of 60.51% under freeze and 62.58% under non-freeze. The multi-modal configuration performs best, with an F1-score of 70.54% for freeze and 70.91% for non-freeze. The results demonstrate that the multi-modal approach outperforms both text-only and audio-only configurations, with a slight improvement in the non-freeze mode.
4.6.2. Results of the eMD Model with Self-Attention
Table 5 and
Table 6 show the results of the
eMD model with self-attention. This variant applies a self-attention layer only to the multi-modal and audio-only configurations, as previous experiments showed that adding self-attention to the text-only models did not improve performance. In the freeze (F) mode (
Table 5), where feature extraction layers remain unchanged, the audio-only setup outperforms the text-only setup. This suggests that audio inherently provides more informative cues for speaker change detection than text. The multi-modal setup performs best, leveraging both modalities to achieve the highest Hit Rates across all meetings.
In the non-freeze (NF) mode (
Table 6), where feature extraction layers are fine-tuned, overall performance improves. The text-only model sees a reduction in FAR from 87–92% to 77–87%, and MDR drops to 9.07% in the best case. The audio-only model continues to outperform the text-only model, with FAR decreasing to 16.80% (IS1009c) and 17.23% (IS1009b), and MDR remaining below 14%. The multi-modal approach remains the most effective, maintaining FAR around 20% and MDR between 9 and 16%. The multi-modal setup achieves the most significant improvements across the most imbalanced meetings (TS3003c, IS1009c, ES2004b). Fine-tuning substantially reduces FAR, dropping from 57.93% to 21.06% in TS3003c, from 76.07% to 14.04% in IS1009c, and from 78.26% to 21.97% in ES2004b. These results suggest that utilizing both text and audio in a fine-tuned model can help reduce the impact of class imbalance, making multi-modal processing a promising approach for speaker change detection in imbalanced scenarios.
The precision, recall, and F1-score for the
eMD models, with and without self-attention, are reported in
Appendix A.
Figure 3 compares the performance of freeze (F) and non-freeze (NF) models across three setups of the
eMD model with self-attention: text-only, audio-only, and multi-modal. In
Figure 3a, the stacked bar chart shows the average FAR and MDR. The hatched bars represent FAR, while the solid bars represent MDR. The light blue bars correspond to freeze mode, and the red bars correspond to non-freeze mode. It is clearly shown that, when eMD uses the attention layer, the fine-tuned model (NF) reduces both FAR and MDR significantly compared to the freeze model. For example, the multi-modal model in freeze mode achieves a high FAR of 69.81% and a low MDR of 10.54%; the fine-tuned model (the red bars), trading off between these two rates, reduces the false alarm rate from 69.81% to 21.78%.
Figure 3b reports the F1-scores for these models and shows that the highest F1-score, 73.39%, was obtained by the multi-modal model with NF.
4.6.3. Optimization Parameters for Focal Loss
In order to test how focal loss [
25] influences the model’s performance in the SCD imbalanced situation, we predefined various values of the focal loss parameters, namely $\alpha$ and $\gamma$. The former, $\alpha$, is a weighting factor that balances class importance in imbalanced datasets. A lower $\alpha$ (<0.5) gives more weight to the majority class (non-change points), while a higher $\alpha$ (>0.5) gives more weight to the minority class (speaker change points). The latter, $\gamma$, is a focusing parameter that reduces the weight of easy samples, making the model focus more on hard-to-classify examples. A higher $\gamma$ increases this effect, but too high a value can cause training instability.
We selected a highly imbalanced meeting from the training set to analyze different parameter combinations and reduce training time.
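A sketch of such a parameter sweep is given below. The candidate $\alpha$ and $\gamma$ grids, as well as the helpers build_model, train, evaluate, train_subset, and dev_set, are hypothetical placeholders rather than the exact procedure used here; the sweep reuses the focal_loss function sketched in Section 3.2.3.

```python
import itertools

alphas = [0.25, 0.31, 0.5, 0.75]        # assumed candidate values
gammas = [0.0, 1.0, 2.0, 3.0]           # gamma = 0.0 reduces focal loss to weighted BCE

results = {}
for alpha, gamma in itertools.product(alphas, gammas):
    model = build_model()                               # hypothetical helper building eMD
    train(model, train_subset,                          # hypothetical training loop
          loss_fn=lambda p, y: focal_loss(p, y, alpha, gamma))
    results[(alpha, gamma)] = evaluate(model, dev_set)  # returns FAR, MDR, F1, ...

best = max(results, key=lambda k: results[k]["F1"])     # configuration with the best F1
```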
Figure 4 presents the curves for different combinations of values of $\alpha$ and $\gamma$. Traditional binary cross-entropy (BCE) loss is also included for comparison and is denoted as BCEloss.
Figure 4a,b illustrate how the FAR and MDR change with different values of $\alpha$ and $\gamma$. It can be observed that several of these settings, as well as BCEloss, achieve comparably low FAR scores. Among them, the setting with $\gamma = 0$, for which the focal loss reduces to a weighted BCE loss, performs slightly better than the others. From
Figure 4b, it can also be seen that one of the settings, together with BCEloss, results in a higher MDR compared to the other settings.
Figure 4c displays F1-scores across different settings, clearly showing that focal loss outperforms BCEloss. When $\alpha = 0.31$ (orange curve) and $\gamma = 2.0$, the model achieves the highest F1-score. These values are chosen for the focal loss, and the final model is trained using this configuration.
4.6.4. Comparison with Baselines
The performance of the
eMD models, with and without self-attention, is evaluated in
Table 7, which indicates improvements of up to 2.73% in the F1-score over the baseline Wav2Vec2-MLP. The comparison includes the baseline models: Neural Network (NN), Bidirectional Long Short-Term Memory (BiLSTM), and Wav2Vec2-MLP. The NN struggles with a high missed detection rate, while the BiLSTM shows improvement but remains weak in detecting speaker changes. Wav2Vec2-MLP provides a stronger baseline with higher precision, though it still misses some detections.
The eMD model without self-attention, tested in audio-only and multi-modal modes, improves recall compared to Wav2Vec2-MLP, but its false alarm rate increases. Its multi-modal setup offers a slight recall boost over the audio-only version, with similar false alarm rates, suggesting limited additional benefit. The eMD model incorporating self-attention further enhances recall in both audio-only and multi-modal modes, reducing missed detections despite a higher false alarm rate. Its multi-modal configuration performs best, achieving the lowest missed detection rate (14.15%) and the highest recall (85.85%) among all models, with an F1-score of 73.39%, surpassing Wav2Vec2-MLP’s overall score.
Compared to the baselines, the attention-based multi-modal eMD stands out, leveraging attention and multi-modal data to outperform Wav2Vec2-MLP, with notable improvements in recall (which corresponds to the hit rate, an important metric for SCD because it measures how well the model captures actual speaker change points) and reduced missed detections.
4.6.5. Comparison with Previous Work
Table 8 compares the proposed approach with previous multi-modal methods and a pitch-based approach. However, a direct comparison is not possible due to differences in experimental settings. For instance, Zhao et al. [
21] use a custom definition of precision and recall, making the task easier. Other studies apply forgiveness collars when identifying change points. Despite these differences,
Table 8 shows that the proposed approach performs competitively and achieves better results than previous multi-modal methods.
4.7. Visualization of Attention Layer
Figure 5 presents the average intra-segment self-attention maps for two types of audio segments: those containing speaker changes (
Figure 5a) and those without speaker changes (
Figure 5b). Each attention matrix is of size 50 × 50, reflecting attention weights between 50 temporal frames (approximately one frame every 20 milliseconds in a 1 s segment).
Notably, the attention map for speaker change segments exhibits more diffuse and scattered patterns. The attention weights are distributed more broadly across off-diagonal regions, indicating that the model attends to a wider temporal context when a speaker transition is occurring. This suggests increased uncertainty or contextual reevaluation during such transitions.
In contrast, normal segments show strongly diagonal attention patterns, meaning each frame primarily attends to itself and its immediate neighbors. This reflects temporal stability and high model confidence when no speaker change is detected.
These patterns align with our entropy-based analysis. As shown in
Figure 6, attention entropy is significantly higher in speaker change segments than in normal ones, suggesting that the model distributes attention more broadly when detecting transitions. This further supports the hypothesis that the model attends more diffusely under speaker transition conditions.
5. Discussion
The results across the various configurations of the proposed eMD models, with and without self-attention, demonstrate consistent improvements in SCD. These gains can be attributed to three key design choices: multi-modal integration, fine-tuning of the pre-trained models, and the use of self-attention layers in combination with focal loss.
Multi-modal integration proved to be effective across all settings. Models that jointly processed audio and text consistently outperformed those relying on a single modality. This advantage was especially apparent in imbalanced data scenarios, where speaker change points were rare. While audio captures acoustic variations, text offers semantic information, forming a more complete signal for identifying transitions.
Fine-tuning the PLM further enhanced performance, particularly for the audio and multi-modal configurations. Models with the unfrozen pre-trained model adapted more effectively to the data, reducing both the FAR and MDR. These improvements were reflected in the overall F1-scores, suggesting that trainable backbones are more effective at capturing the complex interaction patterns and overlapping speech dynamics characteristic of multi-talker meetings.
Introducing self-attention into the architecture, as in the self-attention variant of eMD, further boosted performance. The attention mechanism enabled the model to weigh temporal features more flexibly, particularly around potential speaker transitions. Attention visualizations revealed clear distinctions in how the model processed speaker change versus normal segments. In speaker change segments, attention was distributed more broadly across time, suggesting increased uncertainty and the need for wider contextual integration. Normal segments, on the other hand, showed tightly focused attention patterns, indicating greater temporal stability. Higher attention entropy in speaker change segments indicates that the model is actively attending to a broader temporal context, which helps it capture the complex and distributed cues necessary for accurately detecting speaker transitions.
Focal loss optimization played a critical role in improving performance under imbalance. By assigning more weight to challenging examples and reducing the influence of easy negatives, the model achieved better recall and a stronger balance between false alarms and missed detections. This was particularly helpful in cases where standard loss functions failed to adequately address the rarity of speaker change events.
6. Limitations
Despite the effectiveness of our approach, there are several limitations that could impact performance. First, we rely on a hard alignment between text and audio, which may introduce errors when transcription and audio segmentation are not perfectly synchronized. This can lead to misaligned speaker change points, affecting downstream processing. Second, the pre-trained Bert model used in our approach does not inherently contain speaker information. As a result, the textual representation lacks explicit speaker cues, which could otherwise improve the detection of speaker transitions.
A more effective approach would involve a joint model that simultaneously performs automatic speech recognition and speaker change detection, reducing alignment errors. Alternatively, a stacked model architecture, where ASR is followed by SCD in a sequential manner, could improve robustness by leveraging intermediate speech representations. These models can be integrated with pre-trained audio models to further improve performance. Future research should investigate joint modeling of automatic speech recognition and speaker change detection to improve alignments between text and audio, or use hidden embeddings from a pre-trained ASR model to improve SCD performance.
7. Conclusions
In this work, we proposed a multi-modal speaker change detection (SCD) approach that integrates audio and text information to improve detection. The proposed method leverages pre-trained large-scale models for feature extraction and incorporates a self-attention mechanism to refine contextual representations. The features are combined and passed through a fully connected classification network with layer normalization and dropout to make the model more stable. The model also uses focal loss to handle the class imbalance of the data.
We tested the model on a speaker change detection task using audio–text pairs from a multi-talker meeting dataset. The experiments included comparing multi-modal and single-modal models, examining the effect of fine-tuning pre-trained models, and testing the impact of self-attention layers. The results show important findings. The multi-modal approach works better than single-modality models, which shows that audio and text together help detect speaker changes. Fine-tuning the pre-trained models yields much better performance than freezing them: updating the pre-trained model’s parameters during training gives a 21% improvement in the multi-modal setting. Adding a self-attention layer improves feature learning, which boosts performance by about 2% compared to a model without it. We also compared attention entropy between speaker change and non-change segments and found that entropy is higher during speaker changes. This suggests that the model attends to a broader temporal context when detecting speaker transitions, helping it capture subtle and distributed cues. Higher attention entropy in these segments reflects more adaptive and effective attention behavior for speaker change detection in multi-talker meetings.
Despite these improvements, the approach has certain limitations. It relies on a hard alignment between ASR-generated text and audio segments. Moreover, the text information is first represented symbolically and then turned into embeddings, which lack speaker-specific information, reducing their effectiveness in capturing speaker transitions based solely on textual cues.
To address these limitations, future research should explore jointly performing automatic speech recognition and speaker change detection to reduce alignment errors. Alternatively, a stacked ASR-to-SCD pipeline could leverage intermediate ASR representations for better speaker change detection.